The goal of this project is to build a complete probabilistic data management
system, called PrDB, that can manage, store, and process large-scale
repositories of uncertain data. PrDB unifies ideas from "large-scale structured
graphical models" like probabilistic relational models (PRMs), developed in the
machine learning literature, and "probabilistic query processing", studied in
the database literature. PrDB framework is based on the notion of "shared
factors", which not only allows us to express and manipulate uncertainties at
various levels of abstractions, but also supports capturing rich correlations
among the uncertain data. PrDB supports a declarative SQL-like language for
specifying uncertain data and the correlations among them. PrDB also supports
exact and approximate evaluation of a wide range of queries including inference
queries, SQL queries, and decision-support queries.
- Approximation Algorithms for Stochastic Boolean Function Evaluation and Stochastic Submodular Set Cover;
Amol Deshpande, Lisa Hellerstein, and Devorah Kletenik;
SODA 2014 (also CoRR Technical Report arXiv:1303.0726).
Stochastic Boolean Function Evaluation is the problem of determining the value of a given Boolean function f on an unknown input x, when each bit
x_i of x can only be determined by paying an associated cost c_i. The assumption is that x is drawn from a given product distribution, and the goal
is to minimize the expected cost. This problem has been studied in Operations Research, where it is known as "sequential testing" of Boolean
functions. It has also been studied in learning theory in the context of learning with attribute costs. We consider the general problem of developing
approximation algorithms for Stochastic Boolean Function Evaluation. We give a 3-approximation algorithm for evaluating Boolean linear threshold
formulas. We also present an approximation algorithm for evaluating CDNF formulas (and decision trees) achieving a factor of O(log kd), where k is
the number of terms in the DNF formula, and d is the number of clauses in the CNF formula. In addition, we present approximation algorithms for
simultaneous evaluation of linear threshold functions, and for ranking of linear functions.
Our function evaluation algorithms are based on reductions to the Stochastic Submodular Set Cover (SSSC) problem. This problem was introduced by
Golovin and Krause. They presented an approximation algorithm for the problem, called Adaptive Greedy. Our main technical contribution is a new
approximation algorithm for the SSSC problem, which we call Adaptive Dual Greedy. It is an extension of the Dual Greedy algorithm for Submodular Set
Cover due to Fujito, which is a generalization of Hochbaum's algorithm for the classical Set Cover Problem. We also give a new bound on the
approximation achieved by the Adaptive Greedy algorithm of Golovin and Krause.
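As a toy illustration of the sequential-testing setting (not the paper's algorithms), consider evaluating an AND function: inspection stops at the first 0 bit, and the classic ratio rule of testing bits in increasing order of c_i/(1-p_i) minimizes the expected cost. A minimal sketch with made-up costs and probabilities, checked against brute force:

```python
from itertools import permutations

def expected_cost_and(order, cost, p):
    """Expected inspection cost for sequentially testing an AND function.
    Bit i equals 1 independently with probability p[i] and costs cost[i]
    to inspect; testing stops at the first 0 (the AND is then known)."""
    total, all_ones_so_far = 0.0, 1.0
    for i in order:
        total += cost[i] * all_ones_so_far  # bit i is reached only if all earlier bits were 1
        all_ones_so_far *= p[i]
    return total

cost = [3.0, 1.0, 2.0]   # illustrative costs
p = [0.9, 0.5, 0.7]      # illustrative product distribution

# Ratio rule for AND functions: inspect in increasing order of c_i / (1 - p_i).
greedy = sorted(range(3), key=lambda i: cost[i] / (1.0 - p[i]))
best = min(permutations(range(3)), key=lambda o: expected_cost_and(o, cost, p))
assert expected_cost_and(greedy, cost, p) == expected_cost_and(best, cost, p)
```

The ratio rule is specific to AND (and, dually, OR) functions; the paper's contribution is approximation guarantees for much richer classes such as linear threshold and CDNF formulas.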
- Local Structure and Determinism in Probabilistic Databases;
Theodoros Rekatsinas, Amol Deshpande, Lise Getoor;
SIGMOD 2012.
[pdf]
While extensive work has been done on evaluating queries under the
tuple-independence assumption, query evaluation over correlated data has
received much less attention even though the support for correlations is
essential for many natural applications of probabilistic databases (e.g.,
information extraction, data integration, computer vision, etc.). In this paper,
we develop a novel approach for efficiently evaluating probabilistic queries
over correlated databases where correlations are represented using a "factor
graph", a class of graphical models widely used for capturing correlations and
performing statistical inference. Our approach exploits the specific values of
the factor parameters and determinism in the correlations, collectively called
"local structure", to reduce the complexity of query evaluation. Our framework
is based on "arithmetic circuits", factorized representations of probability
distributions that can exploit such local structure. Traditionally, arithmetic
circuits are generated following a compilation process and cannot be updated
directly. We introduce a generalization of arithmetic circuits, called
"annotated arithmetic circuits", and a novel algorithm for updating them, which
enables us to answer probabilistic queries efficiently. We present a
comprehensive experimental analysis and show speed-ups of at least one order of
magnitude in many cases.
- Maximizing Expected Utility for Stochastic Combinatorial Optimization Problems;
Jian Li, and Amol Deshpande;
FOCS 2011 (also CoRR Technical Report arXiv:1012.3189).
We study the stochastic versions of a broad class of combinatorial problems
where the weights of the elements in the input dataset are uncertain. The class
of problems that we study includes shortest paths, minimum weight spanning
trees, and minimum weight matchings over probabilistic graphs, and other
combinatorial problems like knapsack. We observe that the expected value is
inadequate in capturing different types of "risk-averse" or "risk-prone"
behaviors, and instead we consider a more general objective which is to maximize
the "expected utility" of the solution for some given utility function, rather
than the expected weight (expected weight becomes a special case). We show that
we can obtain a polynomial time approximation algorithm with "additive error"
$\epsilon$ for any $\epsilon>0$, if there is a pseudopolynomial time algorithm
for the "exact" version of the problem. (This is true for the problems mentioned
above). Our result generalizes several prior results on stochastic shortest
path, stochastic spanning tree, and stochastic knapsack. Our algorithm for
utility maximization makes use of the separability of exponential utility and a
technique to decompose a general utility function into exponential utility
functions, which may be useful in other stochastic optimization problems.
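The separability the abstract refers to is simply that an exponential utility factorizes over independent weights: E[alpha^(w_1+...+w_n)] = prod_i E[alpha^(w_i)]. A small sketch verifying this against brute-force enumeration (the two-point weight distributions are invented for illustration):

```python
from itertools import product

# Each element's weight is an independent two-point distribution: (value, prob).
dists = [((1.0, 0.6), (3.0, 0.4)), ((2.0, 0.5), (5.0, 0.5)), ((0.0, 0.3), (4.0, 0.7))]
alpha = 0.8  # base of the exponential utility u(w) = alpha**w

# Separability: E[alpha**(w_1+...+w_n)] = prod_i E[alpha**w_i] for independent weights.
separable = 1.0
for d in dists:
    separable *= sum(pr * alpha**v for v, pr in d)

# Brute-force expectation over all joint outcomes agrees.
brute = 0.0
for outcome in product(*dists):
    pr, total = 1.0, 0.0
    for v, p in outcome:
        pr *= p
        total += v
    brute += pr * alpha**total

assert abs(separable - brute) < 1e-12
```

This product form is what lets expected exponential utility be tracked element by element inside a combinatorial algorithm; general utilities are then handled by decomposing them into exponential pieces.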
- Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases;
Bhargav Kanagal, Jian Li, Amol Deshpande;
SIGMOD 2011.
[pdf]
Probabilistic database systems have successfully established themselves as a tool for
managing uncertain data. However, much of the research in this area has
focused on efficient query evaluation and has largely ignored two key
issues that commonly arise in uncertain data management:
First, how to provide "explanations" for query
results, e.g., "Why is this tuple in my result?" or "Why does this
output tuple have such high probability?". Second, how to
determine the "sensitive" input tuples for a given query: users
want to know which input tuples can substantially alter the output
when their probabilities are modified (since they may be unsure about the input
probability values).
Existing systems provide, in addition to the output probabilities, the
"lineage/provenance" of each output tuple: a Boolean formula indicating how
the output tuple depends on the input tuples. However, lineage does not
immediately provide a quantitative relationship and it is not informative when
we have multiple output tuples.
In this paper, we propose a unified framework
that can handle both the issues mentioned above to facilitate robust query
processing. We formally define the notions of "influence" and "explanations" and provide algorithms to determine the top-$\ell$ influential set of
variables and the top-$\ell$ set of explanations for a variety of queries,
including "conjunctive" queries, "probabilistic threshold" queries,
"top-k" queries and "aggregation" queries.
Further, our framework naturally enables highly efficient incremental
evaluation when input probabilities are modified (e.g., if uncertainty is
resolved).
Our preliminary experimental results demonstrate the benefits of our framework
for performing robust query processing over probabilistic databases.
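One simple way to make "influence" concrete (a simplified sketch, not the paper's full framework): over a tuple-independent database, the output probability is multilinear in the input probabilities, so the influence of tuple x on a result tuple with lineage phi can be taken as the partial derivative dPr[phi]/dp_x = Pr[phi | x=1] - Pr[phi | x=0]. The lineage and probabilities below are hypothetical:

```python
from itertools import product

def prob(formula, probs, fixed=None):
    """Pr[formula] under independent tuple probabilities, by enumeration."""
    fixed = fixed or {}
    free = [v for v in probs if v not in fixed]
    total = 0.0
    for bits in product([0, 1], repeat=len(free)):
        assign = dict(zip(free, bits), **fixed)
        pr = 1.0
        for v, p in probs.items():
            if v in fixed:
                continue
            pr *= p if assign[v] else 1 - p
        if formula(assign):
            total += pr
    return total

# Hypothetical lineage of one output tuple: phi = x1 AND (x2 OR x3)
phi = lambda a: a["x1"] and (a["x2"] or a["x3"])
probs = {"x1": 0.9, "x2": 0.4, "x3": 0.5}

# Influence of x2 = dPr[phi]/dp_x2, by multilinearity of Pr[phi] in the p_i.
infl = prob(phi, probs, {"x2": 1}) - prob(phi, probs, {"x2": 0})
```

A tuple with high influence is one whose probability estimate deserves scrutiny, since small changes to it move the output probability the most.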
- A Unified Approach to Ranking in Probabilistic Databases;
Jian Li and Barna Saha and Amol Deshpande;
VLDB Journal, 20(2): 249-275, 2011.
[pdf]
Ranking is a fundamental operation in data analysis and decision support, and
plays an even more crucial role if the dataset being explored exhibits
uncertainty. This has led to much work in recent years on understanding how to
rank the tuples in a probabilistic dataset. In this article, we present a
unified approach to ranking and top-$k$ query processing in probabilistic
databases by viewing it as a multi-criteria optimization problem, and by
deriving a set of "features" that capture the key properties of a probabilistic
dataset that dictate the ranked result. We contend that a single, specific
ranking function may not suffice for probabilistic databases, and we instead
propose two "parameterized ranking functions", called PRF-w and PRF-e, that
generalize or
can approximate many of the previously proposed ranking functions. We present novel
"generating functions"-based algorithms for
efficiently ranking large datasets according to these ranking functions, even if the
datasets exhibit complex correlations modeled using "probabilistic and/xor trees"
or "Markov networks". We further propose that the parameters of the ranking function be
"learned" from user preferences, and we develop an approach to learn those parameters.
Finally, we present a comprehensive experimental study that illustrates the effectiveness of
our parameterized ranking functions, especially PRF-e, at approximating other ranking functions and
the scalability of our proposed algorithms for exact or approximate ranking.
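The generating-functions idea can be sketched for the simplest case of independent tuples sorted by decreasing score: the coefficient of x^j in the product over higher-scored tuples s of ((1-p_s) + p_s*x) is the probability that exactly j of them exist, which yields the rank distribution of a tuple and hence any PRF score. The data below is a hypothetical two-tuple example; PRF-e corresponds to weights w(i) = alpha^i:

```python
def rank_distribution(probs_sorted, t):
    """Pr[tuple t lands at each rank], for independent tuples listed in
    decreasing score order. Expands prod_{s above t} ((1-p_s) + p_s*x);
    the coefficient of x**j is Pr[exactly j higher-scored tuples exist]."""
    coeffs = [1.0]
    for s in range(t):                 # only higher-scored tuples can outrank t
        p = probs_sorted[s]
        nxt = [0.0] * (len(coeffs) + 1)
        for j, c in enumerate(coeffs):
            nxt[j] += c * (1 - p)      # tuple s absent
            nxt[j + 1] += c * p        # tuple s present
        coeffs = nxt
    pt = probs_sorted[t]
    # t takes rank j+1 exactly when it exists and j higher tuples exist
    return [pt * c for c in coeffs]

def prf_score(probs_sorted, weights):
    """PRF value of each tuple: sum_i w(i) * Pr[tuple at rank i]."""
    return [sum(w * q for w, q in zip(weights, rank_distribution(probs_sorted, t)))
            for t in range(len(probs_sorted))]

probs = [0.5, 0.8]                     # existence probabilities, sorted by score
alpha = 0.5
weights = [alpha ** i for i in range(1, len(probs) + 1)]  # PRF-e: w(i) = alpha**i
scores = prf_score(probs, weights)     # approx [0.25, 0.30]
```

Here the lower-scored but more probable tuple wins under PRF-e, which is exactly the kind of trade-off a single fixed ranking function cannot capture.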
- Ranking Continuous Probabilistic Datasets;
Jian Li, and Amol Deshpande;
VLDB 2010.
[pdf]
Ranking is a fundamental operation in data analysis and decision support, and
plays an even more crucial role if the dataset being explored exhibits
uncertainty. This has led to much work in recent years on understanding how to
rank uncertain datasets. In this paper, we address the problem of ranking when
the tuple scores are uncertain, and the uncertainty is captured using
"continuous" probability distributions (e.g. Gaussian distributions). We present
a comprehensive solution to compute the values of a "parameterized ranking
function" (PRF) for arbitrary continuous probability distributions (and thus
rank the uncertain dataset); PRF can be used to simulate or approximate many
other ranking functions proposed in prior work. We develop exact polynomial time
algorithms for some continuous probability distribution classes, and efficient
approximation schemes with provable guarantees for arbitrary probability
distributions. Our algorithms can also be used for exact or approximate
evaluation of k-nearest neighbor queries over uncertain objects, whose positions
are modeled using continuous probability distributions. Our experimental
evaluation over several datasets illustrates the effectiveness of our approach
at efficiently ranking uncertain datasets with continuous attribute uncertainty.
- Read-Once Functions and Query Evaluation in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
VLDB 2010.
[pdf]
Probabilistic databases hold promise of being a viable means for large-scale
uncertainty management, increasingly needed in a number of real-world
application domains. However, query evaluation in probabilistic databases
remains a computational challenge. Prior work on efficient "exact" query
evaluation in probabilistic databases has largely concentrated on query-centric
formulations (e.g., "safe plans", "hierarchical queries"), in that, they only
consider characteristics of the query and not the data in the database. It is
easy to construct examples where a supposedly hard query, when run on an
appropriate database, yields a tractable query evaluation problem. In this paper, we
develop efficient query evaluation techniques that leverage characteristics of
both the query and the data in the database. We focus on tuple-independent
databases where the query evaluation problem is equivalent to computing marginal
probabilities of Boolean formulas associated with the result tuples. This latter
task is easy if the Boolean formulas can be factorized into a form that has
every variable appearing at most once (called "read-once"). However, a naive
approach that directly uses previously developed Boolean formula factorization
algorithms is inefficient, because those algorithms require the input formulas
to be in the disjunctive normal form (DNF). We instead develop novel, more
efficient factorization algorithms that directly construct the read-once
expression for a result tuple Boolean formula (if one exists), for a large
subclass of queries (specifically, conjunctive queries without self-joins). We
empirically demonstrate that (1) our proposed techniques are orders of magnitude
faster than generic inference algorithms for queries where the result Boolean
formulas can be factorized into read-once expressions, and (2) for the special
case of hierarchical queries, they rival the efficiency of prior techniques
specifically designed to handle such queries.
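Why read-once factorization makes probability computation easy: once no variable repeats, the children of every gate range over disjoint (hence independent) variable sets, so probabilities compose gate by gate: AND multiplies child probabilities, OR multiplies child failure probabilities. A minimal sketch (the formula and probabilities are illustrative, not from the paper):

```python
def readonce_prob(node, p):
    """Marginal probability of a read-once Boolean formula over independent
    variables. node is a variable name or (op, [children]); since every
    variable appears at most once, sibling subtrees are independent, so:
      AND: product of child probabilities
      OR:  1 - product of (1 - child probability)"""
    if isinstance(node, str):
        return p[node]
    op, children = node
    if op == "and":
        out = 1.0
        for c in children:
            out *= readonce_prob(c, p)
        return out
    if op == "or":
        out = 1.0
        for c in children:
            out *= 1.0 - readonce_prob(c, p)
        return 1.0 - out
    raise ValueError(op)

# x1 AND (x2 OR x3): read-once, since each variable occurs exactly once
formula = ("and", ["x1", ("or", ["x2", "x3"])])
p = {"x1": 0.9, "x2": 0.4, "x3": 0.5}
pr = readonce_prob(formula, p)  # 0.9 * (1 - 0.6 * 0.5) = 0.63
```

The hard part the paper addresses is not this linear-time evaluation but constructing the read-once form directly from the query result, without first materializing a DNF.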
- Lineage Processing on Correlated Probabilistic Databases;
Bhargav Kanagal, Amol Deshpande;
SIGMOD 2010.
[pdf]
In this paper, we address the problem of scalably evaluating conjunctive queries
over correlated probabilistic databases containing tuple or attribute
uncertainties. Like previous work, we adopt a two-phase approach where we first
compute "lineages" of the output tuples, and then compute the probabilities of
the lineage formulas. However, unlike previous work, we allow for arbitrary and
complex correlations to be present in the data, captured via a forest of
"junction trees". We observe that evaluating even read-once (tree-structured)
lineages (e.g., those generated by "hierarchical" conjunctive queries), a task
that is polynomial-time solvable over tuple-independent probabilistic databases,
is #P-complete for lightly correlated probabilistic databases like "Markov
sequences". We characterize the complexity of exact computation of the
probability of the lineage formula on a correlated database using a parameter
called "lwidth" (analogous to the notion of "treewidth"). For lineages that
result in low lwidth, we compute exact probabilities using a novel message
passing algorithm, and for lineages that induce large lwidths, we develop
approximate Monte Carlo algorithms to estimate the result probabilities. We
scale our algorithms to very large correlated probabilistic databases using the
previously proposed INDSEP data structure. To mitigate the complexity of lineage
evaluation, we develop optimization techniques to process a batch of lineages by
sharing computation across formulas, and to exploit any independence
relationships that may exist in the data. Our experimental study illustrates the
benefits of using our algorithms for processing lineage formulas over correlated
probabilistic databases.
- PrDB: Managing and Exploiting Rich Correlations in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
VLDB Journal Special Issue on Uncertain and Probabilistic Databases, 18(6): 1065-1090, 2009.
[pdf]
Due to numerous applications producing noisy data, e.g., sensor data,
experimental data, data from uncurated sources, information extraction,
etc., there has been a surge of interest in the development of
probabilistic databases. Most probabilistic database models proposed to
date, however, fail to meet the challenges of real-world
applications on two counts: (1) they often restrict the kinds of
uncertainty that the user can represent; and (2) the query processing
algorithms often cannot scale up to the needs of the application. In
this work, we define a probabilistic database model, "PrDB", that
uses graphical models, a state-of-the-art probabilistic modeling
technique developed within the statistics and machine learning
community, to model uncertain data. We show how this results in a rich, complex
yet compact probabilistic database model, which can capture the commonly
occurring uncertainty models (tuple uncertainty, attribute uncertainty),
more complex models (correlated tuples and attributes), and allows
compact representation (shared and schema-level correlations). In addition, we show
how query evaluation in PrDB translates into inference in an appropriately
augmented graphical model. This allows us to easily use any of a myriad of
exact and approximate inference algorithms developed within the
graphical modeling community. While probabilistic inference provides a generic approach
to solving queries, we show how the use of shared correlations, together
with a novel inference algorithm that we developed based on bisimulation,
can speed query processing significantly.
We present a comprehensive experimental evaluation of
the proposed techniques and show that even with a few shared
correlations, significant speedups are possible.
- A Unified Approach to Ranking in Probabilistic Databases;
Jian Li and Barna Saha and Amol Deshpande;
VLDB 2009 (also CoRR Technical Report arXiv:0904.1366).
[pdf] [talk] Recipient of the best paper award.
The dramatic growth in the number of application domains that naturally generate
probabilistic, uncertain data has resulted in a need for efficiently supporting
complex querying and decision-making over such data. In this paper, we present a
unified approach to ranking and top-k query processing in probabilistic
databases by viewing it as a multi-criteria optimization problem, and by
deriving a set of features that capture the key properties of a probabilistic
dataset that dictate the ranked result. We contend that a single, specific
ranking function may not suffice for probabilistic databases, and we instead
propose two parameterized ranking functions, called PRF-w and PRF-e, that
generalize or can approximate many of the previously proposed ranking functions.
We present novel generating functions-based algorithms for efficiently ranking
large datasets according to these ranking functions, even if the datasets
exhibit complex correlations modeled using probabilistic and/xor trees or Markov
networks. We further propose that the parameters of the ranking function be
learned from user preferences, and we develop an approach to learn those
parameters. Finally, we present a comprehensive experimental study that
illustrates the effectiveness of our parameterized ranking functions, especially
PRF-e, at approximating other ranking functions and the scalability of our
proposed algorithms for exact or approximate ranking.
- Bisimulation-based Approximate Lifted Inference;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
UAI 2009.
[pdf]
There has been a great deal of recent interest in methods for performing
lifted inference; however, most of this work assumes that the first-order
model is given as input to the system. Here, we describe lifted inference
algorithms that determine symmetries and automatically "lift" the
probabilistic model to speedup inference. In particular, we describe
approximate lifted inference techniques that allow the user to trade
off inference accuracy for computational efficiency by using a
handful of tunable parameters, while keeping the error bounded. Our
algorithms are closely related to the graph-theoretic concept of
bisimulation. We report experiments on both synthetic and real data to
show that in the presence of symmetries, run-times for inference can
be improved significantly with approximate lifted inference providing
speedups of up to two orders of magnitude over ground inference.
- Consensus Answers for Queries over Probabilistic Databases;
Jian Li and Amol Deshpande;
PODS 2009 (also CoRR Technical Report arXiv:0812.2049v1).
[pdf] [talk]
We address the problem of finding a "best" deterministic query answer to a query
over a probabilistic database. For this purpose, we propose the notion of a
consensus world (or a consensus answer) which is a deterministic world (answer)
that minimizes the expected distance to the possible worlds (answers). This
problem can be seen as a generalization of the well-studied inconsistent
information aggregation problems (e.g. rank aggregation) to probabilistic
databases. We consider this problem for various types of queries including SPJ
queries, Top-k queries, group-by aggregate queries, and clustering. For
different distance metrics, we obtain polynomial time optimal or approximation
algorithms for computing the consensus answers (or prove NP-hardness). Most of
our results are for a general probabilistic database model, called "and/xor tree
model", which significantly generalizes previous probabilistic database models
like x-tuples and block-independent disjoint models, and is of independent
interest.
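For intuition, under the symmetric-difference distance between worlds of a tuple-independent database, the consensus world has a closed form: keep exactly the tuples with probability at least 1/2. The data below is made up, and this is only the simplest special case; the paper handles far richer queries, distance metrics, and correlation models:

```python
from itertools import combinations

def expected_symmdiff(world, probs):
    """E[|world symmetric-difference W|] for a random world W of a
    tuple-independent database. By linearity of expectation: a tuple kept in
    `world` costs 1 - p_t (chance W lacks it), a dropped tuple costs p_t."""
    return sum((1 - p) if t in world else p for t, p in probs.items())

probs = {"a": 0.9, "b": 0.2, "c": 0.55}

# Tuple-wise rule: keeping t iff p_t >= 1/2 minimizes the expected distance,
# since each tuple contributes min(p_t, 1 - p_t) independently.
consensus = {t for t, p in probs.items() if p >= 0.5}

# Brute force over all 2^n deterministic worlds agrees.
names = list(probs)
best = min((set(s) for r in range(len(names) + 1) for s in combinations(names, r)),
           key=lambda w: expected_symmdiff(w, probs))
assert consensus == best
```

Because the expected distance decomposes per tuple here, no enumeration of possible worlds is actually needed; the interesting cases in the paper are queries and metrics where no such decomposition exists.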
- Indexing Correlated Probabilistic Databases;
Bhargav Kanagal, Amol Deshpande;
SIGMOD 2009.
[pdf]
With large amounts of correlated probabilistic data being generated in a wide
range of application domains including sensor networks, information extraction,
event detection, etc., effectively managing and querying such data has become an
important research challenge. While there is an extensive body of literature on
querying independent probabilistic data, supporting efficient queries over
large-scale, correlated databases remains a challenge. In this paper, we develop
efficient data structures and indexes for supporting inference and decision
support queries over such databases. Our proposed
hierarchical data structure is suitable both for in-memory and disk-resident
databases. We represent the correlations in the probabilistic database using
a "junction tree" over the tuple-existence or attribute-value random
variables,
and use "tree partitioning" techniques to build an index structure over it.
We show how to efficiently answer inference and aggregation queries using such
an index, resulting in orders of magnitude performance benefits in most cases.
In addition, we develop novel algorithms for efficiently keeping the index
structure up-to-date as changes (inserts, updates) are made to the probabilistic database.
We present a comprehensive experimental study illustrating the benefits of
our approach to query processing in probabilistic databases.
- Efficient Query Evaluation over Temporally Correlated Probabilistic Streams;
Bhargav Kanagal, Amol Deshpande;
ICDE 2009 (short paper).
[pdf]
Many real world applications such as sensor networks and other monitoring
applications naturally generate probabilistic streams that are highly correlated
in both time and space. Query processing over such streaming data must be
cognizant of these correlations, since they can significantly alter the final
query results. Several prior works have suggested approaches to handling
correlations in probabilistic databases. However, those approaches are either
unable to represent the types of correlations that probabilistic streams
exhibit, or cannot be applied directly to our problem because of their
complexity. In this paper, we develop a system for managing and querying such
streams by exploiting the fact that most real-world probabilistic streams exhibit
highly structured Markovian correlations. Our approach is based on the
previously proposed framework of viewing probabilistic query evaluation as
inference over graphical models; we show how to efficiently construct graphical
models for the common stream processing operators, and how to efficiently
perform inference over them in an incremental fashion. Our extensive
experimental evaluation illustrates the advantages of exploiting the structured
nature of correlations in probabilistic streams.
- Graphical Models for Uncertain Data;
Amol Deshpande, Lise Getoor, and Prithviraj Sen;
Book Chapter. Managing and Mining Uncertain Data, ed. C. Aggarwal, Springer, 2009.
[pdf]
Graphical models are a popular and well-studied framework for compact
representation of a joint probability distribution over a large number of
interdependent variables, and for efficient reasoning about such a distribution.
They have proven useful in a wide range of domains, from natural language
processing to computer vision to bioinformatics. In this chapter, we present an
approach to using graphical models for managing and querying large-scale
uncertain databases. We present a unified framework based on the concepts from
graphical models that can model not only tuple-level and attribute-level
uncertainties, but can also handle arbitrary correlations that may be present
among the data; our framework can also naturally capture "shared correlations"
where the same uncertainties and correlations occur repeatedly in the data. We
develop an efficient strategy for query evaluation over such probabilistic
databases by casting the query processing problem as an "inference" problem in
an appropriately constructed graphical model, and present optimizations specific
to probabilistic databases that enable efficient query evaluation. We conclude
the chapter with a discussion of related and future work on these topics.
- Exploiting Shared Correlations in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
VLDB 2008.
[pdf]
There has been a recent surge in work in probabilistic databases, propelled in
large part by the huge increase in noisy data sources --- from sensor data,
experimental data, data from uncurated sources, and many others. There is a
growing need for database management systems that can efficiently represent and
query such data. In this work, we show how data characteristics can be leveraged
to make the query evaluation process more efficient. In particular, we exploit
what we refer to as "shared correlations" where the same uncertainties and
correlations occur repeatedly in the data. Shared correlations occur mainly due
to two reasons: (1) Uncertainty and correlations usually come from general
statistics and rarely vary on a tuple-to-tuple basis; (2) The query evaluation
procedure itself tends to re-introduce the same correlations. Prior work has
shown that the query evaluation problem on probabilistic databases is equivalent
to a probabilistic inference problem on an appropriately constructed
probabilistic graphical model (PGM). We leverage this by introducing a new data
structure, called the "random variable elimination graph" (rv-elim graph) that
can be built from the PGM obtained from query evaluation. We develop techniques
based on bisimulation that can be used to compress the rv-elim graph by exploiting
the presence of shared correlations in the PGM; the compressed rv-elim graph can
then be used to run inference. We validate our methods by evaluating them
empirically and show that even with a few shared correlations, significant
speed-ups are possible.
- Representing and Querying Correlated Tuples in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande;
ICDE 2007.
[pdf]
Probabilistic databases have received considerable attention recently due to the
need for storing uncertain data produced by many real world applications. The
widespread use of probabilistic databases is hampered by two limitations: (1)
current probabilistic databases make simplistic assumptions about the data
(e.g., complete independence among tuples) that make it difficult to use them in
applications that naturally produce correlated data, and (2) most probabilistic
databases can only answer a restricted subset of the queries that can be
expressed using traditional query languages. We address both these limitations
by proposing a framework that can represent not only probabilistic tuples, but
also correlations that may be present among them. Our proposed framework
naturally lends itself to the possible world semantics thus preserving the
precise query semantics extant in current probabilistic databases. We develop an
efficient strategy for query evaluation over such probabilistic databases by
casting the query processing problem as an "inference" problem in an
appropriately constructed "probabilistic graphical model". We present several
optimizations specific to probabilistic databases that enable efficient query
evaluation. We validate our approach by presenting an experimental evaluation
that illustrates the effectiveness of our techniques at answering various
queries using real and synthetic datasets.
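The possible-worlds semantics referred to above can be made concrete with a toy example (hypothetical tuples and numbers, not from the paper): each world fixes which uncertain tuples exist, a joint factor captures their correlation instead of assuming per-tuple independence, and a query's answer probability is the total weight of the worlds in which the answer holds.

```python
# Two correlated tuples t1, t2; their joint existence distribution is given by
# one factor (here: mutually exclusive), rather than independent probabilities.
joint = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.0}

def query_prob(event):
    """Probability, over all possible worlds, that `event(world)` holds."""
    return sum(p for world, p in joint.items() if event(world))

# e.g. a selection whose predicate matches both tuples: P(t1 or t2 in answer)
print(query_prob(lambda w: w[0] or w[1]))   # 0.4 + 0.4 + 0.0 = 0.8
```

Under independence with the same marginals (0.4 each) the answer probability would instead be 0.64, which is why representing the correlation matters.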
- Representing Tuple and Attribute Uncertainty in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
The 1st Workshop on Data Mining of Uncertain Data (DUNE 2007), in conjunction with ICDM 2007.
- MauveDB: Supporting Model-based User Views in Database Systems;
Amol Deshpande, Sam Madden;
SIGMOD 2006.
[pdf] [talk]
Real-world data --- especially when generated by distributed measurement
infrastructures such as sensor networks --- tends to be incomplete, imprecise,
and erroneous, making it impossible to present it to users or feed it directly
into applications. The traditional approach to dealing with this problem is to
first process the data using statistical or probabilistic "models" that can
provide more robust interpretations of the data. Current database systems,
however, do not provide adequate support for applying models to such data,
especially when those models need to be frequently updated as new data arrives
in the system. Hence, most scientists and engineers, who depend on models for
managing their data, do not use database systems for archival or querying at
all; at best, databases serve as a persistent raw data store.
In this
paper we define a new abstraction called "model-based views" and present the
architecture of "MauveDB", the system we are building to support such views.
Just as traditional database views provide logical data independence,
model-based views provide independence from the details of the underlying data
generating mechanism and hide the irregularities of the data by using models to
present a consistent view to the users. MauveDB supports a declarative language
for defining model-based views, allows declarative querying over such views
using SQL, and supports several different materialization strategies and
techniques to efficiently maintain them in the face of frequent updates. We have
implemented a prototype system that currently supports views based on regression
and interpolation, in the Apache Derby open source DBMS, and we present results
that show the utility and performance benefits that can be obtained by
supporting several different types of model-based views in a database system.
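As a rough illustration of the idea (not MauveDB's actual code; the data and helper names are hypothetical), a regression-based view replaces noisy, irregular readings with a fitted function, so queries can be answered at any point, including times where no reading was acquired:

```python
# A hypothetical regression-based "model-based view": noisy, irregular
# temperature readings are replaced by a least-squares line that can answer
# queries at arbitrary time points.

def fit_line(points):
    """Least-squares fit y = a*x + b over (x, y) samples."""
    n = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points); sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

readings = [(0, 20.1), (1, 20.9), (3, 23.2), (4, 24.0)]  # (time, temperature)
view = fit_line(readings)            # the "model-based view" over raw data
print(round(view(2), 2))             # queried at a time with no raw reading
```

When new readings arrive, refitting the line corresponds to the view-maintenance problem the paper studies; which materialization strategy makes that cheap is the interesting part.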
- Using Probabilistic Models for Data Management in Acquisitional Environments;
Amol Deshpande, Carlos Guestrin, Sam Madden;
CIDR 2005.
[pdf]
Traditional database systems, particularly those focused on capturing
and managing data from the real world, are poorly equipped to deal
with the noise, loss, and uncertainty in data. We discuss a suite of
techniques based on probabilistic models that are designed to allow
databases to tolerate noise and loss. These techniques are based on
exploiting correlations to predict missing values and identify outliers.
Interestingly, correlations also provide a way to give
approximate answers to users at a significantly lower cost and enable
a range of new types of queries over the correlation structure itself.
We illustrate a host of applications for our new techniques and
queries, ranging from sensor networks to network monitoring to data
stream management. We also present a unified architecture for
integrating such models into database systems, focusing in particular
on "acquisitional systems" where the cost of capturing data (e.g.,
from sensors) is itself a significant part of the query processing
cost.
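A minimal sketch of the correlation-based prediction idea (hypothetical sensors and readings, not the paper's models): fit the relationship between two sensors from history, use it to impute a reading that was not acquired, and flag observations that deviate far from the prediction as outliers.

```python
# Hypothetical example: sensor B's reading is strongly correlated with
# sensor A's, so B can be predicted when only A was acquired.

def fit(pairs):
    """Fit the linear relation b = my + a*(x - mx) from (A, B) history."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var = sum((x - mx) ** 2 for x, _ in pairs)
    a = cov / var
    return lambda x: my + a * (x - mx)

history = [(10, 15), (12, 17), (14, 19), (16, 21)]   # (sensor A, sensor B)
predict_b = fit(history)

print(predict_b(13))                  # impute B when only A was acquired
residual = abs(predict_b(14) - 30)    # observed B = 30 vs. prediction
print(residual > 3)                   # large residual => flag as outlier: True
```

In an acquisitional setting, the same model also says when acquiring B at all is unnecessary: if the prediction from A is confident enough, the (costly) reading can be skipped.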