The goal of this project is to build a complete probabilistic data management
system, called PrDB, that can manage, store, and process large-scale
repositories of uncertain data. PrDB unifies ideas from "large-scale structured
graphical models" like probabilistic relational models (PRMs), developed in the
machine learning literature, and "probabilistic query processing", studied in
the database literature. PrDB framework is based on the notion of "shared
factors", which not only allows us to express and manipulate uncertainties at
various levels of abstractions, but also supports capturing rich correlations
among the uncertain data. PrDB supports a declarative SQL-like language for
specifying uncertain data and the correlations among them. PrDB also supports
exact and approximate evaluation of a wide range of queries including inference
queries, SQL queries, and decision-support queries.
- Approximation Algorithms for Stochastic Boolean Function Evaluation and Stochastic Submodular Set Cover;
Amol Deshpande, Lisa Hellerstein, and Devorah Kletenik;
SODA 2014 (also CoRR Technical Report arXiv:1303.0726).
Stochastic Boolean Function Evaluation is the problem of determining the value of a given Boolean function f on an unknown input x, when each bit
x_i of x can only be determined by paying an associated cost c_i. The assumption is that x is drawn from a given product distribution, and the goal
is to minimize the expected cost. This problem has been studied in Operations Research, where it is known as "sequential testing" of Boolean
functions. It has also been studied in learning theory in the context of learning with attribute costs. We consider the general problem of developing
approximation algorithms for Stochastic Boolean Function Evaluation. We give a 3-approximation algorithm for evaluating Boolean linear threshold
formulas. We also present an approximation algorithm for evaluating CDNF formulas (and decision trees) achieving a factor of O(log kd), where k is
the number of terms in the DNF formula, and d is the number of clauses in the CNF formula. In addition, we present approximation algorithms for
simultaneous evaluation of linear threshold functions, and for ranking of linear functions.
Our function evaluation algorithms are based on reductions to the Stochastic Submodular Set Cover (SSSC) problem. This problem was introduced by
Golovin and Krause. They presented an approximation algorithm for the problem, called Adaptive Greedy. Our main technical contribution is a new
approximation algorithm for the SSSC problem, which we call Adaptive Dual Greedy. It is an extension of the Dual Greedy algorithm for Submodular Set
Cover due to Fujito, which is a generalization of Hochbaum's algorithm for the classical Set Cover Problem. We also give a new bound on the
approximation achieved by the Adaptive Greedy algorithm of Golovin and Krause.
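As a toy illustration of the sequential-testing setting (not the paper's algorithms), consider evaluating an AND function: inspection stops at the first 0 bit, and the classic ratio rule of testing bits in increasing order of c_i/(1-p_i) minimizes the expected cost. A minimal sketch with made-up costs and probabilities, checked against brute force:

```python
from itertools import permutations

def expected_cost_and(order, cost, p):
    """Expected inspection cost for sequentially testing an AND function.
    Bit i equals 1 independently with probability p[i] and costs cost[i]
    to inspect; testing stops at the first 0 (the AND is then known)."""
    total, all_ones_so_far = 0.0, 1.0
    for i in order:
        total += cost[i] * all_ones_so_far  # bit i is reached only if all earlier bits were 1
        all_ones_so_far *= p[i]
    return total

cost = [3.0, 1.0, 2.0]   # illustrative costs
p = [0.9, 0.5, 0.7]      # illustrative product distribution

# Ratio rule for AND functions: inspect in increasing order of c_i / (1 - p_i).
greedy = sorted(range(3), key=lambda i: cost[i] / (1.0 - p[i]))
best = min(permutations(range(3)), key=lambda o: expected_cost_and(o, cost, p))
assert expected_cost_and(greedy, cost, p) == expected_cost_and(best, cost, p)
```

The ratio rule is specific to AND (and, dually, OR) functions; the paper's contribution is approximation guarantees for much richer classes such as linear threshold and CDNF formulas.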
- Local Structure and Determinism in Probabilistic Databases;
Theodoros Rekatsinas, Amol Deshpande, Lise Getoor;
SIGMOD 2012.
[pdf]
While extensive work has been done on evaluating queries under the
tuple-independence assumption, query evaluation over correlated data has
received much less attention even though the support for correlations is
essential for many natural applications of probabilistic databases (e.g.,
information extraction, data integration, computer vision, etc.). In this paper,
we develop a novel approach for efficiently evaluating probabilistic queries
over correlated databases where correlations are represented using a "factor
graph", a class of graphical models widely used for capturing correlations and
performing statistical inference. Our approach exploits the specific values of
the factor parameters and determinism in the correlations, collectively called
"local structure", to reduce the complexity of query evaluation. Our framework
is based on "arithmetic circuits", factorized representations of probability
distributions that can exploit such local structure. Traditionally, arithmetic
circuits are generated following a compilation process and cannot be updated
directly. We introduce a generalization of arithmetic circuits, called
"annotated arithmetic circuits", and a novel algorithm for updating them, which
enables us to answer probabilistic queries efficiently. We present a
comprehensive experimental analysis and show speed-ups of at least one order of
magnitude in many cases.
- Maximizing Expected Utility for Stochastic Combinatorial Optimization Problems;
Jian Li, and Amol Deshpande;
FOCS 2011 (also CoRR Technical Report arXiv:1012.3189).
We study the stochastic versions of a broad class of combinatorial problems
where the weights of the elements in the input dataset are uncertain. The class
of problems that we study includes shortest paths, minimum weight spanning
trees, and minimum weight matchings over probabilistic graphs, and other
combinatorial problems like knapsack. We observe that the expected value is
inadequate in capturing different types of "risk-averse" or "risk-prone"
behaviors, and instead we consider a more general objective which is to maximize
the "expected utility" of the solution for some given utility function, rather
than the expected weight (expected weight becomes a special case). We show that
we can obtain a polynomial time approximation algorithm with "additive error"
$\epsilon$ for any $\epsilon>0$, if there is a pseudopolynomial time algorithm
for the "exact" version of the problem. (This is true for the problems mentioned
above). Our result generalizes several prior results on stochastic shortest
path, stochastic spanning tree, and stochastic knapsack. Our algorithm for
utility maximization makes use of the separability of exponential utility and a
technique to decompose a general utility function into exponential utility
functions, which may be useful in other stochastic optimization problems.
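The separability the abstract refers to is simply that an exponential utility factorizes over independent weights: E[alpha^(w_1+...+w_n)] = prod_i E[alpha^(w_i)]. A small sketch verifying this against brute-force enumeration (the two-point weight distributions are invented for illustration):

```python
from itertools import product

# Each element's weight is an independent two-point distribution: (value, prob).
dists = [((1.0, 0.6), (3.0, 0.4)), ((2.0, 0.5), (5.0, 0.5)), ((0.0, 0.3), (4.0, 0.7))]
alpha = 0.8  # base of the exponential utility u(w) = alpha**w

# Separability: E[alpha**(w_1+...+w_n)] = prod_i E[alpha**w_i] for independent weights.
separable = 1.0
for d in dists:
    separable *= sum(pr * alpha**v for v, pr in d)

# Brute-force expectation over all joint outcomes agrees.
brute = 0.0
for outcome in product(*dists):
    pr, total = 1.0, 0.0
    for v, p in outcome:
        pr *= p
        total += v
    brute += pr * alpha**total

assert abs(separable - brute) < 1e-12
```

This product form is what lets expected exponential utility be tracked element by element inside a combinatorial algorithm; general utilities are then handled by decomposing them into exponential pieces.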
- Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases;
Bhargav Kanagal, Jian Li, Amol Deshpande;
SIGMOD 2011.
[pdf]
Probabilistic database systems have successfully established themselves as a tool for
managing uncertain data. However, much of the research in this area has
focused on efficient query evaluation and has largely ignored two key
issues that commonly arise in uncertain data management:
First, how to provide "explanations" for query
results, e.g., "Why is this tuple in my result?" or "Why does this
output tuple have such high probability?". Second, how to
determine the "sensitive" input tuples for a given query: users
want to know which input tuples can substantially alter the output
when their probabilities are modified (since they may be unsure about the input
probability values).
Existing systems provide, in addition to the output probabilities, the
"lineage/provenance" of each output tuple: a Boolean formula indicating how
the output tuple depends on the input tuples. However, lineage does not
immediately provide a quantitative relationship and it is not informative when
we have multiple output tuples.
In this paper, we propose a unified framework
that can handle both the issues mentioned above to facilitate robust query
processing. We formally define the notions of "influence" and "explanations" and provide algorithms to determine the top-$\ell$ influential set of
variables and the top-$\ell$ set of explanations for a variety of queries,
including "conjunctive" queries, "probabilistic threshold" queries,
"top-k" queries and "aggregation" queries.
Further, our framework naturally enables highly efficient incremental
evaluation when input probabilities are modified (e.g., if uncertainty is
resolved).
Our preliminary experimental results demonstrate the benefits of our framework
for performing robust query processing over probabilistic databases.
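One simple way to make "influence" concrete (a simplified sketch, not the paper's full framework): over a tuple-independent database, the output probability is multilinear in the input probabilities, so the influence of tuple x on a result tuple with lineage phi can be taken as the partial derivative dPr[phi]/dp_x = Pr[phi | x=1] - Pr[phi | x=0]. The lineage and probabilities below are hypothetical:

```python
from itertools import product

def prob(formula, probs, fixed=None):
    """Pr[formula] under independent tuple probabilities, by enumeration."""
    fixed = fixed or {}
    free = [v for v in probs if v not in fixed]
    total = 0.0
    for bits in product([0, 1], repeat=len(free)):
        assign = dict(zip(free, bits), **fixed)
        pr = 1.0
        for v, p in probs.items():
            if v in fixed:
                continue
            pr *= p if assign[v] else 1 - p
        if formula(assign):
            total += pr
    return total

# Hypothetical lineage of one output tuple: phi = x1 AND (x2 OR x3)
phi = lambda a: a["x1"] and (a["x2"] or a["x3"])
probs = {"x1": 0.9, "x2": 0.4, "x3": 0.5}

# Influence of x2 = dPr[phi]/dp_x2, by multilinearity of Pr[phi] in the p_i.
infl = prob(phi, probs, {"x2": 1}) - prob(phi, probs, {"x2": 0})
```

A tuple with high influence is one whose probability estimate deserves scrutiny, since small changes to it move the output probability the most.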
- A Unified Approach to Ranking in Probabilistic Databases;
Jian Li and Barna Saha and Amol Deshpande;
VLDB Journal, 20(2): 249-275, 2011.
[pdf]
Ranking is a fundamental operation in data analysis and decision support, and
plays an even more crucial role if the dataset being explored exhibits
uncertainty. This has led to much work in recent years on understanding how to
rank the tuples in a probabilistic dataset. In this article, we present a
unified approach to ranking and top-$k$ query processing in probabilistic
databases by viewing it as a multi-criteria optimization problem, and by
deriving a set of "features" that capture the key properties of a probabilistic
dataset that dictate the ranked result. We contend that a single, specific
ranking function may not suffice for probabilistic databases, and we instead
propose two "parameterized ranking functions", called PRF-w and PRF-e, that
generalize or
can approximate many of the previously proposed ranking functions. We present novel
"generating functions"-based algorithms for
efficiently ranking large datasets according to these ranking functions, even if the
datasets exhibit complex correlations modeled using "probabilistic and/xor trees"
or "Markov networks". We further propose that the parameters of the ranking function be
"learned" from user preferences, and we develop an approach to learn those parameters.
Finally, we present a comprehensive experimental study that illustrates the effectiveness of
our parameterized ranking functions, especially PRF-e, at approximating other ranking functions and
the scalability of our proposed algorithms for exact or approximate ranking.
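The generating-functions idea can be sketched for the simplest case of independent tuples sorted by decreasing score: the coefficient of x^j in the product over higher-scored tuples s of ((1-p_s) + p_s*x) is the probability that exactly j of them exist, which yields the rank distribution of a tuple and hence any PRF score. The data below is a hypothetical two-tuple example; PRF-e corresponds to weights w(i) = alpha^i:

```python
def rank_distribution(probs_sorted, t):
    """Pr[tuple t lands at each rank], for independent tuples listed in
    decreasing score order. Expands prod_{s above t} ((1-p_s) + p_s*x);
    the coefficient of x**j is Pr[exactly j higher-scored tuples exist]."""
    coeffs = [1.0]
    for s in range(t):                 # only higher-scored tuples can outrank t
        p = probs_sorted[s]
        nxt = [0.0] * (len(coeffs) + 1)
        for j, c in enumerate(coeffs):
            nxt[j] += c * (1 - p)      # tuple s absent
            nxt[j + 1] += c * p        # tuple s present
        coeffs = nxt
    pt = probs_sorted[t]
    # t takes rank j+1 exactly when it exists and j higher tuples exist
    return [pt * c for c in coeffs]

def prf_score(probs_sorted, weights):
    """PRF value of each tuple: sum_i w(i) * Pr[tuple at rank i]."""
    return [sum(w * q for w, q in zip(weights, rank_distribution(probs_sorted, t)))
            for t in range(len(probs_sorted))]

probs = [0.5, 0.8]                     # existence probabilities, sorted by score
alpha = 0.5
weights = [alpha ** i for i in range(1, len(probs) + 1)]  # PRF-e: w(i) = alpha**i
scores = prf_score(probs, weights)     # approx [0.25, 0.30]
```

Here the lower-scored but more probable tuple wins under PRF-e, which is exactly the kind of trade-off a single fixed ranking function cannot capture.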
- Ranking Continuous Probabilistic Datasets;
Jian Li, and Amol Deshpande;
VLDB 2010.
[pdf]
Ranking is a fundamental operation in data analysis and decision support, and
plays an even more crucial role if the dataset being explored exhibits
uncertainty. This has led to much work in recent years on understanding how to
rank uncertain datasets. In this paper, we address the problem of ranking when
the tuple scores are uncertain, and the uncertainty is captured using
"continuous" probability distributions (e.g. Gaussian distributions). We present
a comprehensive solution to compute the values of a "parameterized ranking
function" (PRF) for arbitrary continuous probability distributions (and thus
rank the uncertain dataset); PRF can be used to simulate or approximate many
other ranking functions proposed in prior work. We develop exact polynomial time
algorithms for some continuous probability distribution classes, and efficient
approximation schemes with provable guarantees for arbitrary probability
distributions. Our algorithms can also be used for exact or approximate
evaluation of k-nearest neighbor queries over uncertain objects, whose positions
are modeled using continuous probability distributions. Our experimental
evaluation over several datasets illustrates the effectiveness of our approach
at efficiently ranking uncertain datasets with continuous attribute uncertainty.
- Read-Once Functions and Query Evaluation in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
VLDB 2010.
[pdf]
Probabilistic databases hold promise of being a viable means for large-scale
uncertainty management, increasingly needed in a number of real-world
application domains. However, query evaluation in probabilistic databases
remains a computational challenge. Prior work on efficient "exact" query
evaluation in probabilistic databases has largely concentrated on query-centric
formulations (e.g., "safe plans", "hierarchical queries"), in that, they only
consider characteristics of the query and not the data in the database. It is
easy to construct examples where a supposedly hard query, when run on an
appropriate database, yields a tractable query evaluation problem. In this paper, we
develop efficient query evaluation techniques that leverage characteristics of
both the query and the data in the database. We focus on tuple-independent
databases where the query evaluation problem is equivalent to computing marginal
probabilities of Boolean formulas associated with the result tuples. This latter
task is easy if the Boolean formulas can be factorized into a form that has
every variable appearing at most once (called "read-once"). However, a naive
approach that directly uses previously developed Boolean formula factorization
algorithms is inefficient, because those algorithms require the input formulas
to be in the disjunctive normal form (DNF). We instead develop novel, more
efficient factorization algorithms that directly construct the read-once
expression for a result tuple Boolean formula (if one exists), for a large
subclass of queries (specifically, conjunctive queries without self-joins). We
empirically demonstrate that (1) our proposed techniques are orders of magnitude
faster than generic inference algorithms for queries where the result Boolean
formulas can be factorized into read-once expressions, and (2) for the special
case of hierarchical queries, they rival the efficiency of prior techniques
specifically designed to handle such queries.
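Why read-once factorization makes probability computation easy: once no variable repeats, the children of every gate range over disjoint (hence independent) variable sets, so probabilities compose gate by gate: AND multiplies child probabilities, OR multiplies child failure probabilities. A minimal sketch (the formula and probabilities are illustrative, not from the paper):

```python
def readonce_prob(node, p):
    """Marginal probability of a read-once Boolean formula over independent
    variables. node is a variable name or (op, [children]); since every
    variable appears at most once, sibling subtrees are independent, so:
      AND: product of child probabilities
      OR:  1 - product of (1 - child probability)"""
    if isinstance(node, str):
        return p[node]
    op, children = node
    if op == "and":
        out = 1.0
        for c in children:
            out *= readonce_prob(c, p)
        return out
    if op == "or":
        out = 1.0
        for c in children:
            out *= 1.0 - readonce_prob(c, p)
        return 1.0 - out
    raise ValueError(op)

# x1 AND (x2 OR x3): read-once, since each variable occurs exactly once
formula = ("and", ["x1", ("or", ["x2", "x3"])])
p = {"x1": 0.9, "x2": 0.4, "x3": 0.5}
pr = readonce_prob(formula, p)  # 0.9 * (1 - 0.6 * 0.5) = 0.63
```

The hard part the paper addresses is not this linear-time evaluation but constructing the read-once form directly from the query result, without first materializing a DNF.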
- Lineage Processing on Correlated Probabilistic Databases;
Bhargav Kanagal, Amol Deshpande;
SIGMOD 2010.
[pdf]
In this paper, we address the problem of scalably evaluating conjunctive queries
over correlated probabilistic databases containing tuple or attribute
uncertainties. Like previous work, we adopt a two-phase approach where we first
compute "lineages" of the output tuples, and then compute the probabilities of
the lineage formulas. However, unlike previous work, we allow for arbitrary and
complex correlations to be present in the data, captured via a forest of
"junction trees". We observe that evaluating even read-once (tree-structured)
lineages (e.g., those generated by "hierarchical" conjunctive queries), a task
that is polynomial-time solvable over tuple-independent probabilistic databases,
is #P-complete for lightly correlated probabilistic databases like "Markov
sequences". We characterize the complexity of exact computation of the
probability of the lineage formula on a correlated database using a parameter
called "lwidth" (analogous to the notion of "treewidth"). For lineages that
result in low lwidth, we compute exact probabilities using a novel message
passing algorithm, and for lineages that induce large lwidths, we develop
approximate Monte Carlo algorithms to estimate the result probabilities. We
scale our algorithms to very large correlated probabilistic databases using the
previously proposed INDSEP data structure. To mitigate the complexity of lineage
evaluation, we develop optimization techniques to process a batch of lineages by
sharing computation across formulas, and to exploit any independence
relationships that may exist in the data. Our experimental study illustrates the
benefits of using our algorithms for processing lineage formulas over correlated
probabilistic databases.
- PrDB: Managing and Exploiting Rich Correlations in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
VLDB Journal Special Issue on Uncertain and Probabilistic Databases, 18(6): 1065-1090, 2009.
[pdf]
Due to numerous applications producing noisy data, e.g., sensor data,
experimental data, data from uncurated sources, information extraction,
etc., there has been a surge of interest in the development of
probabilistic databases. Most probabilistic database models proposed to
date, however, fail to meet the challenges of real-world
applications on two counts: (1) they often restrict the kinds of
uncertainty that the user can represent; and (2) the query processing
algorithms often cannot scale up to the needs of the application. In
this work, we define a probabilistic database model, "PrDB", that
uses graphical models, a state-of-the-art probabilistic modeling
technique developed within the statistics and machine learning
community, to model uncertain data. We show how this results in a rich, complex
yet compact probabilistic database model, which can capture the commonly
occurring uncertainty models (tuple uncertainty, attribute uncertainty),
more complex models (correlated tuples and attributes), and allows
compact representation (shared and schema-level correlations). In addition, we show
how query evaluation in PrDB translates into inference in an appropriately
augmented graphical model. This allows us to easily use any of a myriad of
exact and approximate inference algorithms developed within the
graphical modeling community. While probabilistic inference provides a generic approach
to solving queries, we show how the use of shared correlations, together
with a novel inference algorithm that we developed based on bisimulation,
can speed query processing significantly.
We present a comprehensive experimental evaluation of
the proposed techniques and show that even with a few shared
correlations, significant speedups are possible.
- A Unified Approach to Ranking in Probabilistic Databases;
Jian Li and Barna Saha and Amol Deshpande;
VLDB 2009 (also CoRR Technical Report arXiv:0904.1366).
[pdf] [talk] Recipient of the best paper award.
The dramatic growth in the number of application domains that naturally generate
probabilistic, uncertain data has resulted in a need for efficiently supporting
complex querying and decision-making over such data. In this paper, we present a
unified approach to ranking and top-k query processing in probabilistic
databases by viewing it as a multi-criteria optimization problem, and by
deriving a set of features that capture the key properties of a probabilistic
dataset that dictate the ranked result. We contend that a single, specific
ranking function may not suffice for probabilistic databases, and we instead
propose two parameterized ranking functions, called PRF-w and PRF-e, that
generalize or can approximate many of the previously proposed ranking functions.
We present novel generating functions-based algorithms for efficiently ranking
large datasets according to these ranking functions, even if the datasets
exhibit complex correlations modeled using probabilistic and/xor trees or Markov
networks. We further propose that the parameters of the ranking function be
learned from user preferences, and we develop an approach to learn those
parameters. Finally, we present a comprehensive experimental study that
illustrates the effectiveness of our parameterized ranking functions, especially
PRF-e, at approximating other ranking functions and the scalability of our
proposed algorithms for exact or approximate ranking.
- Bisimulation-based Approximate Lifted Inference;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
UAI 2009.
[pdf]
There has been a great deal of recent interest in methods for performing
lifted inference; however, most of this work assumes that the first-order
model is given as input to the system. Here, we describe lifted inference
algorithms that determine symmetries and automatically "lift" the
probabilistic model to speedup inference. In particular, we describe
approximate lifted inference techniques that allow the user to trade
off inference accuracy for computational efficiency by using a
handful of tunable parameters, while keeping the error bounded. Our
algorithms are closely related to the graph-theoretic concept of
bisimulation. We report experiments on both synthetic and real data to
show that in the presence of symmetries, run-times for inference can
be improved significantly with approximate lifted inference providing
speedups of up to two orders of magnitude over ground inference.
- Consensus Answers for Queries over Probabilistic Databases;
Jian Li and Amol Deshpande;
PODS 2009 (also CoRR Technical Report arXiv:0812.2049v1).
[pdf] [talk]
We address the problem of finding a "best" deterministic query answer to a query
over a probabilistic database. For this purpose, we propose the notion of a
consensus world (or a consensus answer) which is a deterministic world (answer)
that minimizes the expected distance to the possible worlds (answers). This
problem can be seen as a generalization of the well-studied inconsistent
information aggregation problems (e.g. rank aggregation) to probabilistic
databases. We consider this problem for various types of queries including SPJ
queries, Top-k queries, group-by aggregate queries, and clustering. For
different distance metrics, we obtain polynomial time optimal or approximation
algorithms for computing the consensus answers (or prove NP-hardness). Most of
our results are for a general probabilistic database model, called "and/xor tree
model", which significantly generalizes previous probabilistic database models
like x-tuples and block-independent disjoint models, and is of independent
interest.
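For intuition, under the symmetric-difference distance between worlds of a tuple-independent database, the consensus world has a closed form: keep exactly the tuples with probability at least 1/2. The data below is made up, and this is only the simplest special case; the paper handles far richer queries, distance metrics, and correlation models:

```python
from itertools import combinations

def expected_symmdiff(world, probs):
    """E[|world symmetric-difference W|] for a random world W of a
    tuple-independent database. By linearity of expectation: a tuple kept in
    `world` costs 1 - p_t (chance W lacks it), a dropped tuple costs p_t."""
    return sum((1 - p) if t in world else p for t, p in probs.items())

probs = {"a": 0.9, "b": 0.2, "c": 0.55}

# Tuple-wise rule: keeping t iff p_t >= 1/2 minimizes the expected distance,
# since each tuple contributes min(p_t, 1 - p_t) independently.
consensus = {t for t, p in probs.items() if p >= 0.5}

# Brute force over all 2^n deterministic worlds agrees.
names = list(probs)
best = min((set(s) for r in range(len(names) + 1) for s in combinations(names, r)),
           key=lambda w: expected_symmdiff(w, probs))
assert consensus == best
```

Because the expected distance decomposes per tuple here, no enumeration of possible worlds is actually needed; the interesting cases in the paper are queries and metrics where no such decomposition exists.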
- Indexing Correlated Probabilistic Databases;
Bhargav Kanagal, Amol Deshpande;
SIGMOD 2009.
[pdf]
With large amounts of correlated probabilistic data being generated in a wide
range of application domains including sensor networks, information extraction,
event detection, etc., effectively managing and querying such data has become an
important research challenge. While there is an extensive body of literature on
querying independent probabilistic data, supporting efficient queries over
large-scale, correlated databases remains a challenge. In this paper, we develop
efficient data structures and indexes for supporting inference and decision
support queries over such databases. Our proposed
hierarchical data structure is suitable both for in-memory and disk-resident
databases. We represent the correlations in the probabilistic database using
a "junction tree" over the tuple-existence or attribute-value random
variables,
and use "tree partitioning" techniques to build an index structure over it.
We show how to efficiently answer inference and aggregation queries using such
an index, resulting in orders of magnitude performance benefits in most cases.
In addition, we develop novel algorithms for efficiently keeping the index
structure up-to-date as changes (inserts, updates) are made to the probabilistic database.
We present a comprehensive experimental study illustrating the benefits of
our approach to query processing in probabilistic databases.
- Efficient Query Evaluation over Temporally Correlated Probabilistic Streams;
Bhargav Kanagal, Amol Deshpande;
ICDE 2009 (short paper).
[pdf]
Many real world applications such as sensor networks and other monitoring
applications naturally generate probabilistic streams that are highly correlated
in both time and space. Query processing over such streaming data must be
cognizant of these correlations, since they can significantly alter the final
query results. Several prior works have suggested approaches to handling
correlations in probabilistic databases. However, those approaches are either
unable to represent the types of correlations that probabilistic streams
exhibit, or cannot be applied directly to our problem because of their
complexity. In this paper, we develop a system for managing and querying such
streams by exploiting the fact that most real-world probabilistic streams exhibit
highly structured Markovian correlations. Our approach is based on the
previously proposed framework of viewing probabilistic query evaluation as
inference over graphical models; we show how to efficiently construct graphical
models for the common stream processing operators, and how to efficiently
perform inference over them in an incremental fashion. Our extensive
experimental evaluation illustrates the advantages of exploiting the structured
nature of correlations in probabilistic streams.
- Graphical Models for Uncertain Data;
Amol Deshpande, Lise Getoor, and Prithviraj Sen;
Book Chapter. Managing and Mining Uncertain Data, ed. C. Aggarwal, Springer, 2009.
[pdf]
Graphical models are a popular and well-studied framework for compact
representation of a joint probability distribution over a large number of
interdependent variables, and for efficient reasoning about such a distribution.
They have proven useful in a wide range of domains, from natural language
processing to computer vision to bioinformatics. In this chapter, we present an
approach to using graphical models for managing and querying large-scale
uncertain databases. We present a unified framework based on the concepts from
graphical models that can model not only tuple-level and attribute-level
uncertainties, but can also handle arbitrary correlations that may be present
among the data; our framework can also naturally capture "shared correlations"
where the same uncertainties and correlations occur repeatedly in the data. We
develop an efficient strategy for query evaluation over such probabilistic
databases by casting the query processing problem as an "inference" problem in
an appropriately constructed graphical model, and present optimizations specific
to probabilistic databases that enable efficient query evaluation. We conclude
the chapter with a discussion of related and future work on these topics.
- Exploiting Shared Correlations in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
VLDB 2008.
[pdf]
There has been a recent surge in work in probabilistic databases, propelled in
large part by the huge increase in noisy data sources --- from sensor data,
experimental data, data from uncurated sources, and many others. There is a
growing need for database management systems that can efficiently represent and
query such data. In this work, we show how data characteristics can be leveraged
to make the query evaluation process more efficient. In particular, we exploit
what we refer to as "shared correlations" where the same uncertainties and
correlations occur repeatedly in the data. Shared correlations occur mainly due
to two reasons: (1) Uncertainty and correlations usually come from general
statistics and rarely vary on a tuple-to-tuple basis; (2) The query evaluation
procedure itself tends to re-introduce the same correlations. Prior work has
shown that the query evaluation problem on probabilistic databases is equivalent
to a probabilistic inference problem on an appropriately constructed
probabilistic graphical model (PGM). We leverage this by introducing a new data
structure, called the "random variable elimination graph" (rv-elim graph) that
can be built from the PGM obtained from query evaluation. We develop techniques
based on bisimulation that can be used to compress the rv-elim graph by exploiting
the presence of shared correlations in the PGM; the compressed rv-elim graph can
then be used to run inference. We validate our methods by evaluating them
empirically and show that even with a few shared correlations, significant
speed-ups are possible.
- Representing and Querying Correlated Tuples in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande;
ICDE 2007.
[pdf]
Probabilistic databases have received considerable attention recently due to the
need for storing uncertain data produced by many real world applications. The
widespread use of probabilistic databases is hampered by two limitations: (1)
current probabilistic databases make simplistic assumptions about the data
(e.g., complete independence among tuples) that make it difficult to use them in
applications that naturally produce correlated data, and (2) most probabilistic
databases can only answer a restricted subset of the queries that can be
expressed using traditional query languages. We address both these limitations
by proposing a framework that can represent not only probabilistic tuples, but
also correlations that may be present among them. Our proposed framework
naturally lends itself to the possible world semantics thus preserving the
precise query semantics extant in current probabilistic databases. We develop an
efficient strategy for query evaluation over such probabilistic databases by
casting the query processing problem as an "inference" problem in an
appropriately constructed "probabilistic graphical model". We present several
optimizations specific to probabilistic databases that enable efficient query
evaluation. We validate our approach by presenting an experimental evaluation
that illustrates the effectiveness of our techniques at answering various
queries using real and synthetic datasets.
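The possible-worlds semantics referred to above can be made concrete with a toy example (hypothetical tuples and numbers, not from the paper): each world fixes which uncertain tuples exist, a joint factor captures their correlation instead of assuming per-tuple independence, and a query's answer probability is the total weight of the worlds in which the answer holds.

```python
# Two correlated tuples t1, t2; their joint existence distribution is given by
# one factor (here: mutually exclusive), rather than independent probabilities.
joint = {(0, 0): 0.2, (0, 1): 0.4, (1, 0): 0.4, (1, 1): 0.0}

def query_prob(event):
    """Probability, over all possible worlds, that `event(world)` holds."""
    return sum(p for world, p in joint.items() if event(world))

# e.g. a selection whose predicate matches both tuples: P(t1 or t2 in answer)
print(query_prob(lambda w: w[0] or w[1]))   # 0.4 + 0.4 + 0.0 = 0.8
```

Under independence with the same marginals (0.4 each) the answer probability would instead be 0.64, which is why representing the correlation matters.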
- Representing Tuple and Attribute Uncertainty in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
The 1st Workshop on Data Mining of Uncertain Data (DUNE 2007), in conjunction with ICDM 2007.
- MauveDB: Supporting Model-based User Views in Database Systems;
Amol Deshpande, Sam Madden;
SIGMOD 2006.
[pdf] [talk]
Real-world data --- especially when generated by distributed measurement
infrastructures such as sensor networks --- tends to be incomplete, imprecise,
and erroneous, making it impossible to present it to users or feed it directly
into applications. The traditional approach to dealing with this problem is to
first process the data using statistical or probabilistic "models" that can
provide more robust interpretations of the data. Current database systems,
however, do not provide adequate support for applying models to such data,
especially when those models need to be frequently updated as new data arrives
in the system. Hence, most scientists and engineers, who depend on models for
managing their data, do not use database systems for archival or querying at
all; at best, databases serve as a persistent raw data store.
In this
paper we define a new abstraction called "model-based views" and present the
architecture of "MauveDB", the system we are building to support such views.
Just as traditional database views provide logical data independence,
model-based views provide independence from the details of the underlying data
generating mechanism and hide the irregularities of the data by using models to
present a consistent view to the users. MauveDB supports a declarative language
for defining model-based views, allows declarative querying over such views
using SQL, and supports several different materialization strategies and
techniques to efficiently maintain them in the face of frequent updates. We have
implemented a prototype system that currently supports views based on regression
and interpolation, in the Apache Derby open source DBMS, and we present results
that show the utility and performance benefits that can be obtained by
supporting several different types of model-based views in a database system.
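As a rough illustration of the idea (not MauveDB's actual code; the data and helper names are hypothetical), a regression-based view replaces noisy, irregular readings with a fitted function, so queries can be answered at any point, including times where no reading was acquired:

```python
# A hypothetical regression-based "model-based view": noisy, irregular
# temperature readings are replaced by a least-squares line that can answer
# queries at arbitrary time points.

def fit_line(points):
    """Least-squares fit y = a*x + b over (x, y) samples."""
    n = len(points)
    sx = sum(x for x, _ in points); sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points); sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

readings = [(0, 20.1), (1, 20.9), (3, 23.2), (4, 24.0)]  # (time, temperature)
view = fit_line(readings)            # the "model-based view" over raw data
print(round(view(2), 2))             # queried at a time with no raw reading
```

When new readings arrive, refitting the line corresponds to the view-maintenance problem the paper studies; which materialization strategy makes that cheap is the interesting part.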
- Using Probabilistic Models for Data Management in Acquisitional Environments;
Amol Deshpande, Carlos Guestrin, Sam Madden;
CIDR 2005.
[pdf]
Traditional database systems, particularly those focused on capturing
and managing data from the real world, are poorly equipped to deal
with the noise, loss, and uncertainty in data. We discuss a suite of
techniques based on probabilistic models that are designed to allow
databases to tolerate noise and loss. These techniques are based on
exploiting correlations to predict missing values and identify outliers.
Interestingly, correlations also provide a way to give
approximate answers to users at a significantly lower cost and enable
a range of new types of queries over the correlation structure itself.
We illustrate a host of applications for our new techniques and
queries, ranging from sensor networks to network monitoring to data
stream management. We also present a unified architecture for
integrating such models into database systems, focusing in particular
on "acquisitional systems" where the cost of capturing data (e.g.,
from sensors) is itself a significant part of the query processing
cost.
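A minimal sketch of the correlation-based prediction idea (hypothetical sensors and readings, not the paper's models): fit the relationship between two sensors from history, use it to impute a reading that was not acquired, and flag observations that deviate far from the prediction as outliers.

```python
# Hypothetical example: sensor B's reading is strongly correlated with
# sensor A's, so B can be predicted when only A was acquired.

def fit(pairs):
    """Fit the linear relation b = my + a*(x - mx) from (A, B) history."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var = sum((x - mx) ** 2 for x, _ in pairs)
    a = cov / var
    return lambda x: my + a * (x - mx)

history = [(10, 15), (12, 17), (14, 19), (16, 21)]   # (sensor A, sensor B)
predict_b = fit(history)

print(predict_b(13))                  # impute B when only A was acquired
residual = abs(predict_b(14) - 30)    # observed B = 30 vs. prediction
print(residual > 3)                   # large residual => flag as outlier: True
```

In an acquisitional setting, the same model also says when acquiring B at all is unnecessary: if the prediction from A is confident enough, the (costly) reading can be skipped.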