PhD Proposal: Provenance Management for Collaborative Data Science Workflows

Talk
Hui Miao
Time: 12.15.2016 14:00 to 15:30
Location: AVW 4424

Collaborative data science activities are becoming pervasive in a variety of communities, and are often conducted in teams, with people of different expertise performing back-and-forth modeling and analysis on time-evolving datasets. Current data science systems mainly focus on specific steps in the process, such as training machine learning models, scaling to large data volumes, or serving the data or the models, while the issues of end-to-end data science lifecycle management are largely ignored. Such issues include, for example, tracking the provenance and derivation history of models, identifying data processing pipelines and keeping track of their evolution, analyzing unexpected behaviors and monitoring project health, and providing the ability to reason about specific analysis results. In this proposal, we address these challenges by ingesting, managing, and analyzing the rich provenance information generated during data science projects, and using it to enable users to easily publish, share, and discover data analytics projects.

We first describe the design of our unified provenance and metadata management system, called ProvDB. We adopt a schema-later approach and propose a flexible graph-based provenance representation model that combines the core concepts in version control and provenance management. We describe several ingestion mechanisms for this provenance model and show how heterogeneous data analysis environments can be served with natural extensions to this framework. We also describe a set of novel features of the system, including graph queries for retrospective provenance, fileviews for data transformations, introspective queries for debugging, and continuous monitoring queries for anomaly detection. We then illustrate how to support the deep learning modeling lifecycle via the extensibility mechanism in ProvDB. We describe techniques to compactly store and efficiently query the rich set of data artifacts generated during the deep learning modeling lifecycle. We also describe a high-level domain-specific language that helps raise the abstraction level during model exploration and enumeration and accelerates the modeling process.
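As a rough illustration of what a graph-based provenance model with a retrospective query can look like, the following is a minimal sketch and not ProvDB's actual data model or API. The class name, the vertex kinds, the edge labels, and the example artifacts are all assumptions made up for this example. Vertices carry free-form properties (in the spirit of schema-later), version edges and derivation edges live in the same graph, and a lineage query walks the edges backwards from a result.

```python
from collections import defaultdict, deque


class ProvenanceGraph:
    """Toy property graph (hypothetical; not ProvDB's data model or API)."""

    def __init__(self):
        self.props = {}               # vertex id -> free-form properties, no fixed schema
        self.out = defaultdict(list)  # vertex id -> [(edge label, target id)]

    def add_vertex(self, vid, kind, **props):
        self.props[vid] = {"kind": kind, **props}

    def add_edge(self, src, label, dst):
        # Edges point from a vertex to what it depends on.
        self.out[src].append((label, dst))

    def lineage(self, vid):
        """Return every vertex that vid transitively depends on (breadth-first)."""
        seen, queue = set(), deque([vid])
        while queue:
            cur = queue.popleft()
            for _label, nxt in self.out[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


g = ProvenanceGraph()
g.add_vertex("data_v1", "artifact", path="train.csv")
g.add_vertex("data_v2", "artifact", path="train.csv")
g.add_vertex("train_run", "activity", command="python train.py")
g.add_vertex("model_v1", "artifact", path="model.bin")
g.add_vertex("alice", "agent")

g.add_edge("data_v2", "derived_from", "data_v1")    # version-control style edge
g.add_edge("train_run", "used", "data_v2")          # provenance style edges
g.add_edge("model_v1", "generated_by", "train_run")
g.add_edge("train_run", "associated_with", "alice")

print(sorted(g.lineage("model_v1")))
# ['alice', 'data_v1', 'data_v2', 'train_run']
```

Keeping versions and derivations in one graph means a single traversal can answer both which run produced a model and which earlier dataset version that run ultimately depended on.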

We conclude with the future work we plan to pursue as a continuation of this research, which is divided into three parts: (a) exploring automation techniques to monitor collaborative data science workflows, (b) inferring and summarizing the prospective and retrospective provenance of evolving analytics workflows, and (c) building an online service for collaborative data science projects to publish, share, and discover analytics activities at scale.
Examining Committee:

Chair: Dr. Amol Deshpande

Dept rep: Dr. Larry Davis

Member: Dr. Hector Corrada Bravo