PhD Defense: Probabilistic Bayesian Models - Scalable Inference and Application

Talk
Ke Zhai
Time: 11.07.2014, 12:00 to 14:00
Location: AVW 4172

Unsupervised probabilistic Bayesian models are powerful tools for statistical analysis, especially in information retrieval, document analysis, and text processing. Despite their success, inference in unsupervised models is often slow because of their very large parameter spaces. As "big data" has become increasingly prominent in both academia and industry, this lack of scalability is a critical bottleneck for probabilistic models.
Our primary focus is to speed up the inference process in unsupervised Bayesian models. There are two options for analyzing data that do not fit on a single machine: parallelization and streaming. The former achieves scalability by distributing the data and the computation across multiple machines. The latter assumes data arrive in a stream and updates the model incrementally after each observation; it scales to larger datasets because it takes only one pass over the entire dataset.
In this thesis, we examine both approaches. We first demonstrate the effectiveness of parallelization on a class of unsupervised Bayesian models - topic models, exemplified by latent Dirichlet allocation (LDA). We propose a fast parallel implementation of variational inference in the MapReduce framework, referred to as Mr. LDA. We further show that our implementation, unlike highly tuned and specialized implementations, is easily extensible. We demonstrate two extensions made possible by this scalable framework: informed priors to guide topic discovery and topic extraction from multilingual corpora. We show that parallelization enables Mr. LDA to scale to significantly larger datasets.
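For intuition, here is a minimal sketch of how one iteration of variational inference for LDA decomposes over MapReduce. The function names, data layout, and constants are illustrative assumptions, not the thesis implementation: mappers run the per-document E-step and emit sufficient statistics; reducers aggregate them into the topic-word parameters (the M-step).

```python
import numpy as np
from scipy.special import digamma

def lda_mapper(doc, lam, alpha, n_iter=20):
    """E-step for one document: optimize the document-topic parameters
    (gamma) and token-topic responsibilities (phi), then emit per-topic
    sufficient statistics. `doc` is a sparse bag-of-words."""
    word_ids, counts = doc
    # Expected log topic-word probabilities under current lambda (K x V).
    Elog_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
    gamma = np.ones(lam.shape[0])                 # one entry per topic
    for _ in range(n_iter):
        Elog_theta = digamma(gamma) - digamma(gamma.sum())
        log_phi = Elog_theta[:, None] + Elog_beta[:, word_ids]
        phi = np.exp(log_phi - log_phi.max(axis=0))
        phi /= phi.sum(axis=0)                    # normalize over topics
        gamma = alpha + (phi * counts).sum(axis=1)
    for k in range(lam.shape[0]):                 # emit (topic, statistics)
        yield k, (word_ids, phi[k] * counts)

def lda_reducer(k, stats, vocab_size, eta):
    """M-step for topic k: sum statistics from all mappers into the
    variational Dirichlet parameter lambda_k, with symmetric prior eta."""
    lam_k = np.full(vocab_size, eta)
    for word_ids, weighted_counts in stats:
        lam_k[word_ids] += weighted_counts
    return lam_k
```

Because each mapper touches only one document and each reducer only one topic's aggregated counts, both steps parallelize naturally across machines.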
We further extend multilingual Mr. LDA with tree priors and propose three different inference methods for the latent variables. We examine the effectiveness of these inference methods on machine translation, where we use the proposed model to extract domain knowledge that accounts for both the source and target languages. We apply it to a corpus of 1.6M aligned Chinese-English sentences and show that our model yields significant improvements over strong baselines.
Beyond parallelization, another approach to scalability is to learn parameters in an online streaming setting. Although many online algorithms have been proposed for LDA, they all overlook a fundamental but challenging problem: the vocabulary constantly evolves over time. To address this problem, we propose an online LDA with an infinite vocabulary (infvoc LDA). We use stochastic variational inference and propose heuristics to dynamically order, expand, and contract the set of words in the vocabulary. We show that our algorithm discovers new words and expands the vocabulary on the fly.
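As background, here is a minimal sketch of the stochastic variational update that online LDA algorithms in this family build on; the dynamic-vocabulary bookkeeping that infvoc adds is omitted, and all names and defaults are illustrative assumptions.

```python
import numpy as np

def svi_update(lam, doc_stats, D, t, eta=0.01, tau=64.0, kappa=0.7):
    """One stochastic variational inference step for LDA's topic-word
    parameters (illustrative sketch).

    lam       : (K, V) current variational Dirichlet parameters
    doc_stats : (K, V) expected topic-word counts from one document's E-step
    D         : total (or estimated) number of documents in the stream
    t         : index of this update
    """
    rho = (tau + t) ** (-kappa)        # decaying step size, kappa in (0.5, 1]
    lam_hat = eta + D * doc_stats      # noisy estimate of the batch update
    return (1.0 - rho) * lam + rho * lam_hat
```

The decaying step size lets early documents move the model quickly while later ones make increasingly small corrections, so one pass over the stream suffices.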
We further generalize online infinite-vocabulary LDA by one more level of hierarchy, setting both the number of topics and the vocabulary size in a completely data-driven fashion. We model both the topics and the vocabulary with hierarchical Dirichlet processes, extending LDA to a nonparametric Bayesian model whose parameter space is determined entirely by the data.
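For intuition, a small sketch (illustrative, not the thesis code) of the stick-breaking construction underlying Dirichlet processes, which is what allows the number of components to remain unbounded:

```python
import numpy as np

def stick_breaking(alpha, truncation, rng=None):
    """Draw mixture weights from a Dirichlet process via stick-breaking,
    truncated at `truncation` components for illustration; the unbounded
    version is what lets the number of topics (or word types) grow with
    the data."""
    rng = rng or np.random.default_rng()
    betas = rng.beta(1.0, alpha, size=truncation)    # stick fractions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining                         # component weights
```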
In addition to LDA, we show the full generality of the online hybrid inference approach by applying it to adaptor grammars, a broader class of models that subsumes LDA. With appropriate grammar rules, an adaptor grammar reduces exactly to LDA; more importantly, different grammar rules provide the flexibility to alter or extend LDA. We develop a hybrid online inference scheme and show that our method discovers high-quality structure more quickly than both MCMC and variational inference methods.
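To make the reduction concrete, here is an illustrative sketch of grammar rules under which an adaptor grammar (with no adapted nonterminals) collapses to LDA, loosely following Mark Johnson's well-known encoding; the helper name and rule representation are assumptions for exposition only.

```python
def lda_as_grammar(n_topics, n_docs, vocab):
    """Build CFG rules (lhs, rhs) under which the grammar generates each
    document as a bag of topic draws, recovering LDA. Illustrative sketch,
    not the thesis implementation."""
    rules = []
    for j in range(n_docs):
        rules.append(("Sentence", (f"Doc{j}",)))             # pick a document
        rules.append((f"Doc{j}", ()))                        # stop generating
        for i in range(n_topics):
            rules.append((f"Doc{j}", (f"Doc{j}", f"Topic{i}")))  # draw a topic
    for i in range(n_topics):
        for w in vocab:
            rules.append((f"Topic{i}", (w,)))                # topic emits a word
    return rules
```

Swapping in different rules (for example, adapting nonterminals so whole subtrees are cached) is what gives adaptor grammars their extra flexibility over plain LDA.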
Examining Committee:
Committee Chair: Dr. Jordan Boyd-Graber
Dept. or Dean's Representative: Dr. Philip Resnik
Committee Members: Dr. Hal Daume III, Dr. Jimmy Lin, Dr. Ramani Duraiswami