PhD Proposal: Visual Analytics for Open-Ended Tasks in Text Mining

Talk
Deok Gun Park
Time: 01.27.2017 15:00 to 16:30
Location: AVW 4172

Text mining extracts valuable insights from a text corpus. Many interesting problems in text mining, such as identifying characteristics of a group of documents, selecting high-quality comments to promote, or describing an image, are open-ended tasks where no ground truth exists. Humans must still provide world knowledge, reasoning, and context for these tasks. However, relying on human effort does not scale to large corpora, and automating these tasks is proving to be a challenging problem. While sophisticated text mining algorithms are becoming increasingly proficient at extracting themes, identifying insightful documents, or labeling images, the lack of ground truth makes it difficult to evaluate and improve them.

My research proposes a general framework for supporting state-of-the-art text mining algorithms using interactive visual representations. This framework consists of four components that integrate with different parts of the standard text mining pipeline: (1) dataset management, (2) feature selection, (3) statistical modeling, and (4) output exploration. To amplify the human analyst's cognitive ability to manage large datasets, text data is first preprocessed with natural language processing methods to extract features. These features are then summarized using statistical methods to produce high-level abstractions. Finally, the abstractions are presented visually so that users can explore them interactively to answer open-ended questions. The four components give rise to four interactive loops, each one allowing the human analyst to intervene.

I explore the design space of these four interactive loops with several case studies. First, the output of the system can be explored using interactive visualization. ParallelSpaces examines how users understand the results of topic modeling for Yelp business reviews, where businesses and their reviews each constitute a separate visual space, and exploring these spaces enables the characterization of each space using the other. TopicLens is a Magic Lens-type interaction technique, where the documents under the lens are clustered according to topics in real time. Second, based on their understanding of the output, users can directly manipulate the model parameters. CommentIQ is a comment moderation tool where moderators can adjust model parameters according to the context or goals. Third, based on their understanding of the output visualizations, users can sculpt features into concepts that they can then use in document scoring. ConceptVector uses word embedding to support this process. Finally, based on their understanding of the output, users can teach or improve a specific part of the model with a teaching dataset. My planned future project, CaptionViz, visualizes the output of an image caption generation model and lets users improve the model's performance by feeding it a complementary dataset called the "teaching set."

My dissertation will contribute a general framework for integrating the human into text mining loops that are currently non-interactive. The practical implications of this framework are wide and far-reaching. The case studies I present in this dissertation provide concrete and operational techniques for directly improving several state-of-the-art text mining algorithms. On a more abstract level, I will crystallize the lessons learned from applying my framework in multiple studies into design guidelines, which will guide the transformation of any linear algorithmic process into an interactive process. This, in turn, can help open-ended tasks scale and empower human analysts in the future.

Keywords: visual analytics, text mining, topic modeling, feature selection, evaluation.

Examining Committee:

Chair: Dr. Niklas Elmqvist

Dept rep: Dr. Hector Corrada Bravo

Members: Dr. Hal Daume III

Dr. Bongshin Lee

Dr. Jaegul Choo