The MauveDB project is motivated by the tremendous increase in
the number of distributed measurement infrastructures such as
wireless sensor networks that continuously generate invaluable
data about our everyday world. However, the potential of this
data has been hard to realize mainly because of the typically
incomplete, imprecise, erroneous, and uncertain nature of the
data generated. The MauveDB project aims to develop abstractions
that make it easy for users and application developers to
continuously apply statistical modeling tools to streaming
sensor data. Such statistical
models can be used for data cleaning, prediction, interpolation, anomaly
detection and for inferring hidden variables from the data, thus addressing
many of the challenges in managing sensor data.
MauveDB supports a new abstraction called "model-based views" to achieve the
above goal. A model-based view is analogous to a traditional database view
in that it can be used to present a consistent "view" of the underlying data
to the user. However, as opposed to a traditional database view, a model-based view
is defined using a statistical model instead of an SQL query. This not only
significantly enriches the user interaction with the sensed data, but also
results in more efficient processing of data.
The publications below provide a brief overview of the MauveDB project and the underlying technology:
- Declarative Analysis of Noisy Information Networks;
Walaa Eldin Moustafa, Galileo Namata, Amol Deshpande, Lise Getoor;
ICDE Workshop on Graph Data Management: Techniques and Applications (GDM 2011).
There is a growing interest in methods for analyzing data describing networks of
all types, including information, biological, physical, and social networks.
Typically the data describing these networks is observational, and thus noisy
and incomplete; it is often at the wrong level of fidelity and abstraction for
meaningful data analysis. This has resulted in a growing body of work on
extracting, cleaning, and annotating network data. Unfortunately, much of this
work is ad hoc and domain-specific. In this paper, we present the architecture
of a data management system that enables efficient, declarative analysis of
large-scale information networks. We identify a set of primitives to support the
extraction and inference of a network from observational data, and describe a
framework that enables a network analyst to easily implement and combine new
extraction and analysis techniques, and efficiently apply them to large
observation networks. The key insight behind our approach is to "decouple", to
the extent possible, (a) the operations that require traversing the graph
structure (typically the computationally expensive step), from (b) the
operations that do the modification and update of the extracted network. We
present an analysis language based on "Datalog", and show how to use it to
cleanly achieve such decoupling. We briefly describe our prototype system that
supports these abstractions. We include a preliminary performance evaluation of
the system and show that our approach scales well and can efficiently handle a
wide spectrum of data cleaning operations on network data.
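The decoupling idea can be illustrated with a toy entity-resolution pass (the names and the similarity predicate are hypothetical, and the real system expresses this declaratively in its Datalog-based language rather than in imperative code): phase (a) only traverses the graph to propose updates, and phase (b) only applies them.

```python
# Toy sketch of decoupling (a) graph traversal from (b) graph mutation.
# All names are hypothetical illustrations, not the system's primitives.

def propose_merges(adj, similar):
    # Phase (a): traversal only -- no mutation. Propose merging node
    # pairs that share a neighbor and look similar (entity resolution).
    proposals = set()
    for neighbors in adj.values():
        ns = sorted(neighbors)
        for i in range(len(ns)):
            for j in range(i + 1, len(ns)):
                if similar(ns[i], ns[j]):
                    proposals.add((ns[i], ns[j]))
    return proposals

def apply_merges(adj, proposals):
    # Phase (b): mutation only -- merge node b into node a.
    for a, b in proposals:
        if b not in adj or a not in adj:
            continue
        adj[a] |= adj[b] - {a, b}
        del adj[b]
        for nbrs in adj.values():
            if b in nbrs:
                nbrs.discard(b)
                nbrs.add(a)
        adj[a].discard(a)

# "J. Smith" and "John Smith" co-cite the same paper -> merge them.
adj = {'paper1': {'J. Smith', 'John Smith'},
       'J. Smith': {'paper1'}, 'John Smith': {'paper1'}}
same_surname = lambda a, b: a.split()[-1] == b.split()[-1]
apply_merges(adj, propose_merges(adj, same_surname))
print(sorted(adj))  # ['J. Smith', 'paper1']
```

Because the proposals are computed before any mutation, the expensive traversal never has to interleave with (or be invalidated by) updates to the extracted network.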
- Lineage Processing on Correlated Probabilistic Databases;
Bhargav Kanagal, Amol Deshpande;
SIGMOD 2010.
[pdf]
In this paper, we address the problem of scalably evaluating conjunctive queries
over correlated probabilistic databases containing tuple or attribute
uncertainties. Like previous work, we adopt a two-phase approach where we first
compute "lineages" of the output tuples, and then compute the probabilities of
the lineage formulas. However, unlike previous work, we allow for arbitrary and
complex correlations to be present in the data, captured via a forest of
"junction trees". We observe that evaluating even read-once (tree structured)
lineages (e.g., those generated by "hierarchical" conjunctive queries),
polynomially computable over tuple independent probabilistic databases, is
#P-complete for lightly correlated probabilistic databases like "Markov
sequences". We characterize the complexity of exact computation of the
probability of the lineage formula on a correlated database using a parameter
called "lwidth" (analogous to the notion of "treewidth"). For lineages that
result in low lwidth, we compute exact probabilities using a novel message
passing algorithm, and for lineages that induce large lwidths, we develop
approximate Monte Carlo algorithms to estimate the result probabilities. We
scale our algorithms to very large correlated probabilistic databases using the
previously proposed INDSEP data structure. To mitigate the complexity of lineage
evaluation, we develop optimization techniques to process a batch of lineages by
sharing computation across formulas, and to exploit any independence
relationships that may exist in the data. Our experimental study illustrates the
benefits of using our algorithms for processing lineage formulas over correlated
probabilistic databases.
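To see why the correlated case is hard, it helps to contrast it with the easy baseline: over a tuple-independent database, the probability of a read-once lineage formula factors directly over its subformulas. A minimal sketch (tuple names illustrative):

```python
# Baseline sketch: probability of a *read-once* lineage formula over
# INDEPENDENT tuples. (The paper's contribution is the much harder
# correlated case, handled via junction trees; this only shows why the
# independent setting is polynomially computable.)

def prob(formula, p):
    """formula: a tuple id (str), or a nested ('and'|'or', child, ...) tuple.
    p: dict mapping tuple id -> existence probability.
    Each tuple id must appear at most once (the read-once property)."""
    if isinstance(formula, str):
        return p[formula]
    op, *kids = formula
    if op == 'and':          # independent AND: product of probabilities
        result = 1.0
        for k in kids:
            result *= prob(k, p)
        return result
    if op == 'or':           # independent OR: 1 - product of complements
        result = 1.0
        for k in kids:
            result *= 1.0 - prob(k, p)
        return 1.0 - result
    raise ValueError(op)

# lineage of one output tuple: (t1 AND t2) OR t3
lineage = ('or', ('and', 't1', 't2'), 't3')
print(round(prob(lineage, {'t1': 0.9, 't2': 0.5, 't3': 0.2}), 2))  # 0.56
```

Once tuples are correlated (e.g., via a Markov sequence), the factorization above is no longer valid, which is exactly where the #P-hardness cited in the abstract comes from.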
- Indexing Correlated Probabilistic Databases;
Bhargav Kanagal, Amol Deshpande;
SIGMOD 2009.
[pdf]
With large amounts of correlated probabilistic data being generated in a wide
range of application domains including sensor networks, information extraction,
event detection etc., effectively managing and querying them has become an
important research challenge. While there is an exhaustive body of literature on
querying independent probabilistic data, supporting efficient queries over
large-scale, correlated databases remains a challenge. In this paper, we develop
efficient data structures and indexes for supporting inference and decision
support queries over such databases. Our proposed
hierarchical data structure is suitable both for in-memory and disk-resident
databases. We represent the correlations in the probabilistic database using
a "junction tree" over the tuple-existence or attribute-value random
variables,
and use "tree partitioning" techniques to build an index structure over it.
We show how to efficiently answer inference and aggregation queries using such
an index, resulting in orders of magnitude performance benefits in most cases.
In addition, we develop novel algorithms for efficiently keeping the index
structure up-to-date as changes (inserts, updates) are made to the probabilistic database.
We present a comprehensive experimental study illustrating the benefits of
our approach to query processing in probabilistic databases.
- Efficient Query Evaluation over Temporally Correlated Probabilistic Streams;
Bhargav Kanagal, Amol Deshpande;
ICDE 2009 (short paper).
[pdf]
Many real world applications such as sensor networks and other monitoring
applications naturally generate probabilistic streams that are highly correlated
in both time and space. Query processing over such streaming data must be
cognizant of these correlations, since they can significantly alter the final
query results. Several prior works have suggested approaches to handling
correlations in probabilistic databases. However, those approaches are either
unable to represent the types of correlations that probabilistic streams
exhibit, or cannot be applied directly to our problem because of their
complexity. In this paper, we develop a system for managing and querying such
streams by exploiting the fact that most real-world probabilistic streams exhibit
highly structured Markovian correlations. Our approach is based on the
previously proposed framework of viewing probabilistic query evaluation as
inference over graphical models; we show how to efficiently construct graphical
models for the common stream processing operators, and how to efficiently
perform inference over them in an incremental fashion. Our extensive
experimental evaluation illustrates the advantages of exploiting the structured
nature of correlations in probabilistic streams.
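As a minimal illustration of the incremental inference this enables (a toy two-state hidden Markov model with made-up numbers, not the paper's operator framework), each new uncertain reading updates the belief over the hidden state in constant time via the classic forward/filtering recursion:

```python
# Hedged sketch: incremental inference over a Markovian stream for a
# two-state hidden Markov model. All probabilities are illustrative.

STATES = ['ok', 'faulty']
TRANS = {('ok', 'ok'): 0.95, ('ok', 'faulty'): 0.05,
         ('faulty', 'ok'): 0.10, ('faulty', 'faulty'): 0.90}
EMIT = {('ok', 'low'): 0.8, ('ok', 'high'): 0.2,
        ('faulty', 'low'): 0.3, ('faulty', 'high'): 0.7}

def step(belief, obs):
    # One incremental update: predict through the transition model,
    # then condition on the new observation and renormalize.
    new = {}
    for s in STATES:
        predicted = sum(belief[r] * TRANS[(r, s)] for r in STATES)
        new[s] = predicted * EMIT[(s, obs)]
    z = sum(new.values())
    return {s: v / z for s, v in new.items()}

belief = {'ok': 0.9, 'faulty': 0.1}
for obs in ['low', 'high', 'high']:   # stream of noisy readings
    belief = step(belief, obs)
print({s: round(v, 3) for s, v in belief.items()})
```

The point of the sketch is the update pattern: the belief state is the only thing carried between arrivals, so the cost per reading does not grow with the length of the stream.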
- Online Filtering, Smoothing and Probabilistic Modeling of Streaming data;
Bhargav Kanagal, Amol Deshpande;
ICDE 2008.
[pdf] (Extended version)
In this paper, we address the problem of extending a relational database system
to facilitate efficient real-time application of dynamic probabilistic models to
streaming data. We use the recently proposed abstraction of model-based views
for this purpose, by allowing users to declaratively specify the model to be
applied, and by presenting the output of the models to the user as a
probabilistic database view. We support declarative querying over such views
using an extended version of SQL that allows for querying probabilistic data.
Underneath we use particle filters, a class of sequential Monte Carlo algorithms
commonly used to implement dynamic probabilistic models, to represent the
present and historical states of the model as sets of weighted samples
(particles) that are kept up-to-date as new readings arrive. We develop novel
techniques to convert the queries on the model-based view directly into queries
over particle tables, enabling highly efficient query processing. Finally, we
present an experimental evaluation demonstrating the benefits of our approach.
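The particle representation can be sketched as follows (a toy bootstrap particle filter; all model parameters and names are illustrative, not the system's actual interface): the model's current state is a set of samples kept up-to-date as readings arrive, and queries over the view reduce to queries over the particles.

```python
# Hedged sketch of a bootstrap particle filter backing a model-based view.
# Parameters (prior, walk noise, observation noise) are made up.
import math
import random

random.seed(0)
N = 1000
particles = [random.gauss(20.0, 2.0) for _ in range(N)]  # prior over temperature

def observe(reading, noise_sd=1.0):
    """One filter step: random-walk transition, weight each particle by
    the likelihood of the noisy reading, then resample."""
    global particles
    moved = [p + random.gauss(0.0, 0.5) for p in particles]
    weights = [math.exp(-((reading - p) ** 2) / (2 * noise_sd ** 2))
               for p in moved]
    # Resampling is effectively a weighted SELECT over the particle table.
    particles = random.choices(moved, weights=weights, k=N)

def query_mean():
    # e.g. SELECT AVG(temp) FROM view  ->  an average over the particle table
    return sum(particles) / len(particles)

for reading in [22.1, 22.4, 22.0]:
    observe(reading)
print(round(query_mean(), 1))  # posterior mean near the readings
```

Because the particles are ordinary tuples, aggregate queries on the view become plain aggregates over the particle table, which is what makes the direct query-translation approach efficient.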
- Model-based Querying in Sensor Networks;
Amol Deshpande, Carlos Guestrin, Samuel Madden;
Chapter in Encyclopedia of Database Systems, Ling Liu and M. Tamer Ozsu, eds., 2009.
[pdf]
- MauveDB: Supporting Model-based User Views in Database Systems;
Amol Deshpande, Sam Madden;
SIGMOD 2006.
[pdf] [talk]
Real-world data --- especially when generated by distributed measurement
infrastructures such as sensor networks --- tends to be incomplete, imprecise,
and erroneous, making it impossible to present it to users or feed it directly
into applications. The traditional approach to dealing with this problem is to
first process the data using statistical or probabilistic "models" that can
provide more robust interpretations of the data. Current database systems,
however, do not provide adequate support for applying models to such data,
especially when those models need to be frequently updated as new data arrives
in the system. Hence, most scientists and engineers, who depend on models for
managing their data, do not use database systems for archival or querying at
all; at best, databases serve as a persistent raw data store.
In this
paper we define a new abstraction called "model-based views" and present the
architecture of "MauveDB", the system we are building to support such views.
Just as traditional database views provide logical data independence,
model-based views provide independence from the details of the underlying data
generating mechanism and hide the irregularities of the data by using models to
present a consistent view to the users. MauveDB supports a declarative language
for defining model-based views, allows declarative querying over such views
using SQL, and supports several different materialization strategies and
techniques to efficiently maintain them in the face of frequent updates. We have
implemented a prototype system that currently supports views based on regression
and interpolation, in the Apache Derby open source DBMS, and we present results
that show the utility and performance benefits that can be obtained by
supporting several different types of model-based views in a database system.
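A regression-based view of the kind described above can be sketched as follows (a toy, self-contained illustration; the class and method names are hypothetical, not MauveDB's actual interface): raw, noisy readings are fit with a simple linear model, and queries are answered against the model rather than the raw tuples.

```python
# Hypothetical sketch of a model-based view: noisy (time, value) readings
# are fit by least-squares regression, and queries read off the model's
# prediction instead of the raw data.

class RegressionView:
    """Presents noisy sensor readings as a smooth, gap-free view."""

    def __init__(self):
        self.times, self.values = [], []
        self.slope, self.intercept = 0.0, 0.0

    def insert(self, t, v):
        # New readings trigger a refit (incremental, in a real system).
        self.times.append(t)
        self.values.append(v)
        self._refit()

    def _refit(self):
        n = len(self.times)
        mt = sum(self.times) / n
        mv = sum(self.values) / n
        sxx = sum((t - mt) ** 2 for t in self.times)
        sxy = sum((t - mt) * (v - mv)
                  for t, v in zip(self.times, self.values))
        self.slope = sxy / sxx if sxx else 0.0
        self.intercept = mv - self.slope * mt

    def query(self, t):
        # The view answers at ANY time t, including gaps in the raw data.
        return self.intercept + self.slope * t

view = RegressionView()
for t, v in [(0, 0.1), (1, 1.9), (2, 4.2), (4, 8.1)]:  # reading at t=3 missing
    view.insert(t, v)
print(round(view.query(3), 2))  # interpolated estimate, ~6.1
```

The same pattern generalizes to the interpolation-based views mentioned above: only the fitted model changes, not the view interface.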
- Model-based Approximate Querying in Sensor Networks;
Amol Deshpande, Carlos Guestrin, Sam Madden, Joseph M. Hellerstein, Wei Hong;
International Journal on Very Large Data Bases (VLDB Journal), 2005.
[pdf]
Declarative queries are proving to be an attractive paradigm for interacting
with networks of wireless sensors. The metaphor that "the sensornet is a
database" is problematic, however, because sensors do not exhaustively represent
the data in the real world. In order to map the raw sensor readings onto
physical reality, a "model" of that reality is required to complement the
readings. In this article, we enrich interactive sensor querying with
statistical modeling techniques. We demonstrate that such models can help
provide answers that are both more meaningful, and, by introducing
approximations with probabilistic confidences, significantly more efficient to
compute in both time and energy. Utilizing the combination of a model and live
data acquisition raises the challenging optimization problem of selecting the
best sensor readings to acquire, balancing the increase in the confidence of our
answer against the communication
and data acquisition costs in the network. We describe an exponential time algorithm for finding the optimal solution
to this optimization problem, and a polynomial-time heuristic for identifying solutions that perform well in practice.
We evaluate our approach on several real-world sensor-network data sets, taking into account the real measured data and
communication quality, demonstrating that our model-based approach provides a high-fidelity representation of the real
phenomena and leads to significant performance gains versus traditional data acquisition techniques.
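The flavor of the acquisition trade-off can be sketched with a deliberately simplified greedy heuristic (all numbers and names are illustrative; the paper's algorithms plan over a full probabilistic model, not this toy variance bookkeeping): pick the sensor that buys the most uncertainty reduction per unit cost until the answer is confident enough.

```python
# Toy sketch: greedy benefit-per-cost sensor selection. In this simplified
# model, observing a sensor drops its variance to zero; real planning
# conditions a joint probabilistic model on each observation.

def plan_acquisition(variance, cost, target_total_variance):
    """variance: per-sensor prior variance; cost: per-sensor acquisition
    cost (communication + sensing). Returns acquisition order and spend."""
    remaining = dict(variance)
    acquired, total_cost = [], 0.0
    while remaining and sum(remaining.values()) > target_total_variance:
        # Greedy choice: largest variance reduction per unit cost.
        best = max(remaining, key=lambda s: remaining[s] / cost[s])
        acquired.append(best)
        total_cost += cost[best]
        del remaining[best]
    return acquired, total_cost

variance = {'s1': 4.0, 's2': 1.0, 's3': 9.0}
cost = {'s1': 2.0, 's2': 1.0, 's3': 10.0}   # s3 is far away: expensive to read
order, spent = plan_acquisition(variance, cost, target_total_variance=9.5)
print(order, spent)  # ['s1', 's2'] 3.0 -- the expensive sensor is avoided
```

Even this toy version shows the key effect described above: with a confidence target in hand, cheap informative sensors are read first and expensive ones may never be contacted at all.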
- Using Probabilistic Models for Data Management in Acquisitional Environments;
Amol Deshpande, Carlos Guestrin, Sam Madden;
CIDR 2005.
[pdf]
Traditional database systems, particularly those focused on capturing
and managing data from the real world, are poorly equipped to deal
with the noise, loss, and uncertainty in data. We discuss a suite of
techniques based on probabilistic models that are designed to allow
databases to tolerate noise and loss. These techniques are based on
exploiting correlations to predict missing values and identify outliers.
Interestingly, correlations also provide a way to give
approximate answers to users at a significantly lower cost and enable
a range of new types of queries over the correlation structure itself.
We illustrate a host of applications for our new techniques and
queries, ranging from sensor networks to network monitoring to data
stream management. We also present a unified architecture for
integrating such models into database systems, focusing in particular
on "acquisitional systems" where the cost of capturing data (e.g.,
from sensors) is itself a significant part of the query processing
cost.
- Model-Driven Data Acquisition in Sensor Networks;
Amol Deshpande, Carlos Guestrin, Sam Madden, Joseph M. Hellerstein, Wei Hong;
VLDB 2004.
[pdf] Recipient of the best paper award.
Declarative queries are proving to be an attractive paradigm for interacting
with networks of wireless sensors. The metaphor that "the sensornet is a
database" is problematic, however, because sensors do not exhaustively represent
the data in the real world. In order to map the raw sensor readings onto
physical reality, a "model" of that reality is required to complement the
readings. In this paper, we enrich interactive sensor querying with statistical
modeling techniques. We demonstrate that such models can help provide
answers that are both more meaningful, and, by introducing approximations
with probabilistic confidences,
significantly more efficient to compute in both time and energy. Utilizing the combination of
a model and live data acquisition raises the challenging optimization
problem of selecting the best sensor readings to acquire, balancing the increase
in the confidence of our answer against the communication and data acquisition costs in the network.
We describe an exponential
time algorithm for finding the optimal solution to this optimization problem, and a
polynomial-time heuristic for identifying solutions that
perform well in practice. We evaluate our approach on several
real-world sensor-network data sets, taking into account the real measured data and communication quality, demonstrating that our model-based approach
provides a high-fidelity representation of the real phenomena and leads to significant performance gains versus traditional data acquisition
techniques.
- Efficient Stepwise Selection in Decomposable Models;
Amol Deshpande, Minos Garofalakis, Mike Jordan;
UAI 2001.
[pdf] [talk]
In this paper, we present an efficient algorithm for performing stepwise
selection in the class of decomposable models. We focus on the forward selection
procedure, but we also discuss how backward selection and the combination of the
... [more]
In this paper, we present an efficient algorithm for performing stepwise
selection in the class of decomposable models. We focus on the forward selection
procedure, but we also discuss how backward selection and the combination of the
two can be performed efficiently. The main contributions of this paper are (1) a
simple characterization for the edges that can be added to a decomposable model
while retaining its decomposability and (2) an efficient algorithm for
enumerating all such edges for a given decomposable model in O(n^2) time, where
n is the number of variables in the model. We also analyze the complexity of the
overall stepwise selection procedure, which includes the complexity of
enumerating eligible edges as well as the complexity of deciding how to
"progress". We use the KL divergence of the model from the saturated model as
our metric, but the results we present here extend to many other metrics as well.
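The outer loop of such a forward selection procedure can be sketched generically (the eligibility test and the score function below are pluggable stand-ins; the paper's actual contribution, the O(n^2) enumeration of decomposability-preserving edges, is not reproduced here):

```python
# Generic forward stepwise selection skeleton. `eligible` would encode the
# decomposability-preserving check and `score` the KL-divergence reduction;
# both are hypothetical stand-ins in this sketch.

def forward_select(variables, eligible, score, max_edges):
    """Greedily add the best-scoring eligible edge until none improves."""
    edges = set()
    while len(edges) < max_edges:
        candidates = [(u, v) for i, u in enumerate(variables)
                      for v in variables[i + 1:]
                      if (u, v) not in edges and eligible(edges, u, v)]
        if not candidates:
            break
        best = max(candidates, key=lambda e: score(edges, *e))
        if score(edges, *best) <= 0:   # no KL reduction left: stop
            break
        edges.add(best)
    return edges

# Toy run with fixed per-edge gains standing in for KL reductions.
gains = {('a', 'b'): 0.5, ('a', 'c'): 0.2, ('b', 'c'): -0.1}
result = forward_select(['a', 'b', 'c'],
                        eligible=lambda E, u, v: True,
                        score=lambda E, u, v: gains[(u, v)],
                        max_edges=3)
print(sorted(result))  # only the positive-gain edges are added
```

The efficiency result in the abstract concerns exactly the `candidates` step: naively re-testing decomposability per edge is expensive, whereas the paper's characterization enumerates all eligible edges in O(n^2) time.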