Travel Grant - Scaling Latent Topic/Class Models to Big Data Collections and Streams: Applications to Medical Domain

Project Member(s): Xu, G.

Funding or Partner Organisation: Ambassade de France en Australie (Ambassade de France en Australie - Scientific Sect)

Start year: 2014

Summary: Vast amounts of content are exchanged on social media, making it an important form of big data. Searching, filtering, enriching and organizing this information, as well as rapidly discovering and using the knowledge hidden in this kind of big data, are major challenges faced by researchers from different communities, such as information retrieval, data mining and machine learning. Several approaches have been developed to address these challenges, though not at the scale and speed required by current data collections and streams. Among these approaches, those based on latent topic/class analysis are particularly interesting, as they yield state-of-the-art results and allow one to categorize or annotate documents with existing taxonomies (filtering and enriching), to infer new taxonomies or complement existing ones (organizing), and to detect outliers and emerging events (event detection). However, current latent topic models have the following major drawbacks that prevent their use on large-scale collections and high-speed streams: (a) social media data, e.g., Twitter posts, are usually very short, poorly structured and syntactically irregular, which makes traditional information retrieval approaches difficult to apply; (b) the models mainly handle static data and do not take into account the correlation and dynamics of the data; and (c) the inference and learning mechanisms usually rely on Markov chain Monte Carlo (MCMC) methods, which are too slow to be used in the big data era.
The goal of this project is to address these problems by developing novel latent topic models able to reveal semantics in social media, by devising new big data analytics models to handle temporal and dynamic data with implicit coupling relationships and heterogeneity, and by designing new learning and inference methods able to provide good estimates of the parameters of the new models under real-time and one-pass constraints.

Keywords: Big data, data analytics, latent topic model, research collaboration

FOR Codes: Pattern Recognition and Data Mining, Information Processing Services (incl. Data Entry and Capture)