PhD Proposal: Enriching Spectral Methods for Topic Modeling

Talk
Thang Dai Nguyen
Time: 02.09.2017 11:00 to 12:30
Location: AVW 3258

Topic models have become important tools for uncovering hidden structures in big data. However, the most popular topic model algorithm, Latent Dirichlet Allocation (LDA), and its extensions suffer from sluggish performance on big datasets. Recently, the machine learning community has attacked this problem using spectral learning approaches, such as the method of moments with tensor decomposition or matrix factorization. The anchor algorithm by Arora et al. [2013] has emerged as a more efficient approach to solving a large class of topic modeling problems. The anchor algorithm is very fast and comes with a provable theoretical guarantee: it converges to a global solution given a sufficient number of samples. In this proposal, we present a series of spectral models based on the anchor algorithm to serve a larger class of datasets and to provide richer and more flexible modeling capacity.
First, we improve the anchor algorithm by incorporating various rich priors in the form of appropriate regularization terms. These enhancements yield better topic quality and provide the flexibility to incorporate informed priors, so that the discovered topics are better aligned with external knowledge. Second, we enrich the anchor algorithm with a metadata-based word representation for handling labeled datasets. Experiments on three sentiment datasets show that the new supervised anchor algorithm runs very fast and predicts better than supervised topic models such as Supervised LDA. In addition, sentiment anchor words, which play an important role in generating sentiment topics, provide cues for understanding sentiment datasets. Third, we create a new anchor topic model that can produce hierarchical topics to help with deeper analysis of text corpora. We introduce two new metrics for evaluating the quality of hierarchical topics. In addition, we propose using several synthetic datasets, produced by generative processes similar to that of Hierarchical LDA, to thoroughly evaluate the new model. Lastly, we address the problem of topic inference: how can we recover the document-topic distributions for the anchor algorithms with small recovery errors? We do this by formulating it as a semi-definite programming problem using the infinity norm.

Examining Committee:

Chair: Dr. Philip Resnik

Dept rep: Dr. Thomas Goldstein

Members: Dr. Jordan Boyd-Graber

Dr. Hal Daume III