PhD Proposal: Provenance Management for Collaborative Data Science Workflows

Talk

Hui Miao

Time:

12.15.2016 14:00 to 15:30

Location:

AVW 4424

URL:

https://talks.cs.umd.edu/talks/1607

Collaborative data science activities are becoming pervasive in a variety of communities, and are often conducted in teams, with people with different expertise performing back-and-forth modeling and analysis on time-evolving datasets. Current data science systems mainly focus on specific steps in the process, such as training machine learning models, scaling to large data volumes, or serving the data or the models, while the issues of end-to-end data science lifecycle management are largely ignored. Such issues include, for example, tracking the provenance and derivation history of models, identifying data processing pipelines and keeping track of their evolution, analyzing unexpected behaviors and monitoring project health, and providing the ability to reason about specific analysis results. In this proposal, we address these challenges by ingesting, managing, and analyzing the rich provenance information generated during data science projects, and by using it to enable users to easily publish, share, and discover data analytics projects.

We first describe the design of our unified provenance and metadata management system, called ProvDB. We adopt a schema-later approach and propose a flexible graph-based provenance representation model that combines the core concepts in version control and provenance management. We describe several ingestion mechanisms for this provenance model and show how heterogeneous data analysis environments can be served with natural extensions to this framework. We also describe a set of novel features of the system, including graph queries for retrospective provenance, fileviews for data transformations, introspective queries for debugging, and continuous monitoring queries for anomaly detection. We then illustrate how to support the deep learning modeling lifecycle via the extensibility mechanism in ProvDB. We describe techniques to compactly store and efficiently query the rich set of data artifacts generated during the deep learning modeling lifecycle. We also describe a high-level domain-specific language that helps raise the abstraction level during model exploration and enumeration and accelerates the modeling process.
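To make the idea of a graph-based provenance model with retrospective queries more concrete, here is a minimal sketch in Python. It is not ProvDB's data model or API: the class `ProvenanceGraph`, its methods, and the file and run names are all hypothetical, chosen only to illustrate how versioned artifacts and the activities that produced them can be stored as a graph and traversed to recover the derivation history of a model.

```python
class ProvenanceGraph:
    """Toy provenance graph (hypothetical, not ProvDB's model).

    Nodes are versioned artifacts or activities (script runs).
    An edge child -> parent means "child directly depends on parent".
    """

    def __init__(self):
        self.kind = {}      # node id -> "artifact" or "activity"
        self.parents = {}   # node id -> set of ids it directly depends on

    def add_node(self, node_id, kind):
        self.kind[node_id] = kind
        self.parents.setdefault(node_id, set())

    def add_edge(self, child, parent):
        self.parents.setdefault(child, set()).add(parent)

    def lineage(self, node_id):
        """Retrospective query: everything node_id was derived from."""
        seen = set()
        stack = [node_id]
        while stack:
            current = stack.pop()
            for parent in self.parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


def build_example():
    """Build a small, made-up pipeline: raw data -> cleaning -> training."""
    g = ProvenanceGraph()
    g.add_node("data_v1.csv", "artifact")
    g.add_node("clean.py@run1", "activity")
    g.add_node("clean_v1.csv", "artifact")
    g.add_node("train.py@run1", "activity")
    g.add_node("model_v1.pkl", "artifact")

    g.add_edge("clean.py@run1", "data_v1.csv")     # the run read the raw data
    g.add_edge("clean_v1.csv", "clean.py@run1")    # the run wrote the cleaned file
    g.add_edge("train.py@run1", "clean_v1.csv")    # training read the cleaned file
    g.add_edge("model_v1.pkl", "train.py@run1")    # training wrote the model
    return g


if __name__ == "__main__":
    graph = build_example()
    for node in sorted(graph.lineage("model_v1.pkl")):
        print(graph.kind[node], node)
```

Calling `lineage("model_v1.pkl")` walks the dependency edges back through the training run, the cleaned file, and the cleaning run to the raw dataset. A real system would also need persistent storage, richer node and edge properties, and a query language, all of which this sketch leaves out.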

We conclude with the future work we plan to pursue as a continuation of this research, which is divided into three parts: (a) exploring automation techniques to monitor collaborative data science workflows, (b) inferring and summarizing the prospective and retrospective provenance of evolving analytics workflows, and (c) building an online service for collaborative data science projects to publish, share, and discover analytics activities at scale.

Examining Committee:

Chair: Dr. Amol Deshpande

Dept rep: Dr. Larry Davis

Member: Dr. Hector Corrada Bravo
