PhD Proposal: IMPROVING KNOWLEDGE DISCOVERY: ADVANCED TOPIC AND LANGUAGE MODELS
IRB-5105
Knowledge discovery in textual data is a cornerstone of natural language processing (NLP), enabling machines to uncover, interpret, and interact with human language. Topic modeling is a popular method for distilling vast corpora into comprehensible themes. In parallel, Large Language Models (LLMs) have revolutionized NLP, offering versatility and power while encapsulating world knowledge. We improve traditional topic modeling techniques by harnessing the capabilities of LLMs: we refine and extend the utility of time-tested models while exploring the potential of LLMs to perform complex language tasks that traditionally rely on tailor-made models. In doing so, this work addresses the balance between the interpretability and accessibility of topic modeling and the broad yet sometimes opaque knowledge within LLMs.
The core proposition of this research is twofold. First, it introduces enhancements to topic modeling methodologies that increase their adaptability and user engagement, thereby improving the granularity and relevance of extracted topics in an age dominated by LLMs. Second, it demonstrates how the world knowledge embedded within LLMs can be leveraged to perform specialized tasks, such as inferring psychological dispositions from textual data, without constructing bespoke models from scratch.
To achieve the first, we develop I-NTM, the first architecture for neural interactive topic models. By defining topics as embeddings rather than as representations of words, we can move these topic embeddings and adjust the word embedding space around a topic, giving users the control to adjust topics as they see fit. Empirically, users can find more relevant information in less time.
For the second, a limitation of previous work on predicting psychological dispositions is its reliance on regression models, which tend to predict the mean and perform poorly on outliers. These models also require extensive training data to achieve satisfactory accuracy. In contrast, we explore the use of an open-source LLM, LLaMA, in a zero-shot setting for the same task. We find results comparable to traditional methods, with significantly better performance on outliers, all without any training data. This approach not only simplifies the process but also demonstrates the adaptability of LLMs to diverse tasks.
For proposed work, we investigate three directions. First, we combine the world knowledge of LLMs with the interpretability of probabilistic topic models, seeking to offset the weaknesses of each method on its own. Second and third, we develop two new models of language: a topic model of language acquisition in bilingual children, and a transformer-based dialectal model that can translate and change topics between dialects.
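The topic-embedding idea behind I-NTM can be illustrated with a toy sketch. This is not the I-NTM architecture: it assumes, purely for illustration, that a topic's word distribution is a softmax over dot products between the topic embedding and each word embedding, and that a user adjustment is a simple interpolation of the topic embedding toward a target word. The function names, the two-dimensional embeddings, and the vocabulary are all hypothetical, and the sketch does not cover adjusting the word embedding space itself.

```python
import math

def topic_word_distribution(topic_vec, word_vecs):
    """Softmax over dot products between one topic embedding and every word embedding.
    (Assumed scoring rule for illustration; not taken from the proposal.)"""
    logits = [sum(t * w for t, w in zip(topic_vec, vec)) for vec in word_vecs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def move_topic_toward(topic_vec, target_vec, step=0.5):
    """Hypothetical user action: interpolate the topic embedding toward a target vector."""
    return [(1.0 - step) * t + step * g for t, g in zip(topic_vec, target_vec)]

# Toy 2-D embeddings, made up for this example.
vocab = ["goal", "match", "vote", "senate"]
word_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

topic = [1.0, 0.0]  # starts close to the sports-like words
before = topic_word_distribution(topic, word_vecs)

# The user drags the topic most of the way toward the word "vote".
moved = move_topic_toward(topic, word_vecs[vocab.index("vote")], step=0.8)
after = topic_word_distribution(moved, word_vecs)

top_before = vocab[before.index(max(before))]
top_after = vocab[after.index(max(after))]
print(top_before, "->", top_after)
```

In this toy setting, moving the topic embedding changes which words the topic weights most heavily, which is the kind of direct control the abstract attributes to interactive topic models.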