- ModelHub: Towards Unified Data and Lifecycle Management for Deep Learning;
Hui Miao, Ang Li, Larry Davis, Amol Deshpande;
CoRR Technical Report arXiv:1611.06224, 2016.
- Approximation algorithms for stochastic submodular set cover with applications to Boolean function evaluation and min-knapsack;
Amol Deshpande, Lisa Hellerstein, and Devorah Kletenik;
ACM Transactions on Algorithms 12:3 (42), 2016.
- SourceSight: Enabling Effective Source Selection (Demonstration Proposal);
Theodoros Rekatsinas, Amol Deshpande, Xin Luna Dong, Lise Getoor, Divesh Srivastava;
SIGMOD 2016.
[abstract]
Recently there has been a rapid increase in the number of data sources and data services, such as cloud-based data markets and data portals, that facilitate the collection, publishing and trading of data. Data sources typically exhibit large heterogeneity in the type and quality of data they provide. Unfortunately, when the number of data sources is large, it is difficult for users to reason about the actual usefulness of sources for their applications and the trade-offs between the benefits and costs of acquiring and integrating sources. In this demonstration we present SourceSight, a system that allows users to interactively explore a large number of heterogeneous data sources, and discover valuable sets of sources for diverse integration tasks. SourceSight uses a novel multi-level source quality index that enables effective source selection at different granularity levels, and introduces a collection of new techniques to discover and evaluate relevant sources for integration.
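The cost/benefit trade-off described above can be made concrete with a toy budgeted source-selection sketch. This is not SourceSight's multi-level quality index; the gain model (distinct records covered per unit cost) and the greedy rule are illustrative assumptions only.

```python
# A minimal sketch of cost-aware source selection, assuming a simple
# coverage-based gain model (not the paper's actual algorithm).

def select_sources(sources, budget):
    """Greedily pick sources maximizing marginal coverage per unit cost.

    sources: dict name -> (set_of_records, cost)
    """
    chosen, covered, spent = [], set(), 0.0
    while True:
        best, best_ratio = None, 0.0
        for name, (records, cost) in sources.items():
            if name in chosen or spent + cost > budget:
                continue
            gain = len(records - covered)
            if cost > 0 and gain / cost > best_ratio:
                best, best_ratio = name, gain / cost
        if best is None:
            break
        records, cost = sources[best]
        chosen.append(best)
        covered |= records
        spent += cost
    return chosen, covered

sources = {
    "s1": ({"a", "b", "c"}, 1.0),
    "s2": ({"c", "d"}, 0.5),
    "s3": ({"a", "b", "c", "d", "e"}, 3.0),
}
print(select_sources(sources, budget=2.0))  # picks cheap, complementary sources
```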
- Continuous Detection of Activity-based Subgraph Patterns on Dynamic Graphs;
Jayanta Mondal and Amol Deshpande;
DEBS 2016.
[abstract]
The ability to detect and analyze interesting subgraph patterns on large and dynamic graph-structured data in real time is crucial for many applications. Such patterns often need to reason about how the nodes behave (i.e., the activity component) in the network as well as how they are connected to each other (i.e., the structural component). An example of such an activity-based subgraph pattern is a clique of users in a social network (the structural predicate), who each have posted more than 10 messages in the last 2 hours (the activity-based predicate). Detecting such complex events and analyzing them in real time can help with anomaly detection in phone call networks, advertisement targeting in social networks, malware detection in file download graphs, and so on. In this paper, we present CASQD, a system for continuous detection and analysis of such active subgraph pattern queries over large dynamic graphs. Some of the key challenges in executing such queries include: handling a wide variety of user-specified activities of interest, low selectivities of activity-based predicates and the resultant exponential search space, and high data rates often seen in these application domains. A key abstraction in CASQD is a notion called graph-view, which acts as an independence layer between the query language and the underlying physical representation of the graph and the active attributes. This abstraction is aimed at simplifying the query language, while empowering the query optimizer. Considering the balance between expressibility (i.e., patterns that cover many real-world use cases) and optimizability of such patterns, we primarily focus on efficient continuous detection of active regular structures (specifically, active cliques, active stars, and active bi-cliques). We develop a series of optimization techniques, including model-based neighborhood exploration, lazy evaluation of the activity predicates, and neighborhood-based search space pruning, for efficient query evaluation. We perform a thorough comparative study of the execution strategies under various settings, and show that our system is capable of achieving end-to-end throughputs over 800k/s using a single, powerful machine.
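One optimization mentioned above, evaluating the cheap activity predicate first to prune the structural search space, can be sketched in a few lines. The query shape (k-cliques of users over a message-count threshold) and all data below are invented for illustration.

```python
# Illustrative sketch: filter by the activity predicate first, then check
# the structural predicate (k-cliques, by brute force) among active nodes.
from itertools import combinations

def active_k_cliques(adj, msg_count, k, min_msgs):
    active = {v for v, c in msg_count.items() if c > min_msgs}
    cliques = []
    for cand in combinations(sorted(active), k):
        if all(u in adj[v] for u, v in combinations(cand, 2)):
            cliques.append(cand)
    return cliques

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
msgs = {1: 12, 2: 15, 3: 11, 4: 2}  # messages posted in the last 2 hours
print(active_k_cliques(adj, msgs, k=3, min_msgs=10))  # [(1, 2, 3)]
```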
- NScaleSpark: Subgraph-centric Graph Analytics on Apache Spark;
Abdul Quamar and Amol Deshpande;
NDA 2016 (International Workshop on Network Data Analytics, co-located with SIGMOD 2016).
[abstract]
In this paper, we describe NScaleSpark, a framework for executing large-scale distributed graph analysis tasks on the Apache Spark platform. NScaleSpark is motivated by the increasing interest in executing rich and complex analysis tasks over large graph datasets. There is much recent work on vertex-centric graph programming frameworks for executing such analysis tasks - these systems espouse a "think-like-a-vertex" (TLV) paradigm, with some example systems being Pregel, Apache Giraph, GPS, Grace, and GraphX (built on top of Apache Spark). However, the TLV paradigm is not suitable for many complex graph analysis tasks that typically require processing of information aggregated over neighborhoods or subgraphs in the underlying graph. Instead, NScaleSpark is based on a "think-like-a-subgraph" paradigm (also recently called "think-like-an-embedding"). Here, the users specify computations to be executed against a large number of multi-hop neighborhoods or subgraphs of the data graph. NScaleSpark builds upon our prior work on the NScale system, which was built on top of the Hadoop MapReduce system. This paper describes how we reimplemented NScale on the Apache Spark platform, the key challenges therein, and the design decisions we made. NScaleSpark uses a series of RDD transformations to extract and hold the relevant subgraphs in distributed memory with minimal footprint, using a cost-based optimizer. Our in-memory graph data structure enables efficient graph computations over large-scale graphs. Our experimental results over several real-world datasets and applications show orders-of-magnitude improvement in performance and total cost over GraphX and other vertex-centric approaches.
- Decibel: The Relational Dataset Branching System;
Michael Maddox, David Goehring, Aaron Elmore, Samuel Madden, Aditya Parameswaran, Amol Deshpande;
VLDB 2016.
[abstract]
As scientific endeavors and data analysis become increasingly collaborative, there is a need for data management systems that natively support the versioning or branching of datasets to enable concurrent analysis, cleaning, integration, manipulation, or curation of data across teams of individuals. Common practice for sharing and collaborating on datasets involves creating or storing multiple copies of the dataset, one for each stage of analysis, with no provenance information tracking the relationships between these datasets. This results not only in wasted storage, but also makes it challenging to track and integrate modifications made by different users to the same dataset. In this paper, we introduce the Relational Dataset Branching System, Decibel, a new relational storage system with built-in version control designed to address these shortcomings. We present our initial design for Decibel and provide a thorough evaluation of three versioned storage engine designs that focus on efficient query processing with minimal storage overhead. We also develop an exhaustive benchmark to enable the rigorous testing of these and future versioned storage engine designs.
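As a rough intuition for what a versioned storage engine must support, here is a toy table where each tuple is stored once and tagged with the set of versions containing it, so branches share storage. This membership-set design is only a sketch loosely inspired by the bitmap-style layouts such engines evaluate, not Decibel's actual design.

```python
# Toy versioned table: tuples stored once, with per-tuple version membership,
# so a branch is metadata rather than a physical copy. Illustrative only.

class VersionedTable:
    def __init__(self):
        self.tuples = []          # tuple payloads, stored once
        self.membership = []      # parallel list of sets of version ids

    def insert(self, version, payload):
        self.tuples.append(payload)
        self.membership.append({version})

    def branch(self, parent, child):
        for m in self.membership:  # child starts as a logical copy of parent
            if parent in m:
                m.add(child)

    def delete(self, version, idx):
        self.membership[idx].discard(version)

    def scan(self, version):
        return [t for t, m in zip(self.tuples, self.membership) if version in m]

vt = VersionedTable()
vt.insert("v1", ("alice", 1))
vt.insert("v1", ("bob", 2))
vt.branch("v1", "v2")
vt.delete("v2", 0)  # remove alice only in branch v2
print(vt.scan("v1"), vt.scan("v2"))  # [('alice', 1), ('bob', 2)] [('bob', 2)]
```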
- Storing and Analyzing Historical Graph Data at Scale;
Udayan Khurana, and Amol Deshpande;
EDBT 2016 (also CoRR Technical Report arXiv:1509.08960).
[abstract]
The work on large-scale graph analytics to date has largely focused on the study of static properties of graph snapshots. However, a static view of interactions between entities is often an oversimplification of several complex phenomena like the spread of epidemics, information diffusion, formation of online communities, and so on. Being able to find temporal interaction patterns, visualize the evolution of graph properties, or even simply compare them across time, adds significant value in reasoning over graphs. However, because of a lack of underlying data management support, an analyst today has to manually navigate the added temporal complexity of dealing with large evolving graphs. In this paper, we present a system, called Historical Graph Store, that enables users to store large volumes of historical graph data and to express and run complex temporal graph analytical tasks against that data. It consists of two key components: a Temporal Graph Index (TGI), which compactly stores large volumes of historical graph evolution data in a partitioned and distributed fashion, and provides support for retrieving snapshots of the graph as of any timepoint in the past as well as evolution histories of individual nodes or neighborhoods; and a Spark-based Temporal Graph Analysis Framework (TAF), for expressing complex temporal analytical tasks and executing them in an efficient and scalable manner. Our experiments demonstrate our system's efficient storage, retrieval, and analytics across a wide variety of queries on large volumes of historical graph data.
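For intuition, snapshot retrieval over historical graph data reduces to replaying a timestamped event log, as in the minimal sketch below; the actual Temporal Graph Index partitions and indexes this history precisely so that retrieval does not require a full replay.

```python
# A minimal sketch of snapshot retrieval over an event log of edge changes.
# A real TGI keeps materialized deltas and partitions to avoid full replay.

class HistoricalGraph:
    def __init__(self):
        self.events = []  # (time, op, u, v)

    def add_edge(self, t, u, v):
        self.events.append((t, "+", u, v))

    def delete_edge(self, t, u, v):
        self.events.append((t, "-", u, v))

    def snapshot(self, t):
        edges = set()
        for time, op, u, v in sorted(self.events):
            if time > t:
                break
            (edges.add if op == "+" else edges.discard)((u, v))
        return edges

g = HistoricalGraph()
g.add_edge(1, "a", "b"); g.add_edge(2, "b", "c"); g.delete_edge(3, "a", "b")
print(g.snapshot(2))  # {('a', 'b'), ('b', 'c')}
print(g.snapshot(3))  # {('b', 'c')}
```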
- NScale: Neighborhood-centric Large-Scale Graph Analytics in the Cloud;
Abdul Quamar, Amol Deshpande, Jimmy Lin;
To appear in VLDB Journal (also CoRR Technical Report arXiv:1405.1499).
[pdf]
[abstract]
There is an increasing interest in executing rich and complex analysis tasks over large-scale graphs, many of which require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph. Examples of such tasks include ego network analysis, motif counting, finding social circles, personalized recommendations, analyzing influence cascades, etc. These tasks are not well served by the existing vertex-centric graph processing frameworks, whose computation and execution models limit the user program to directly access the state of a single vertex; this results in high communication, scheduling, and memory overheads in executing such tasks. Further, most existing graph processing frameworks typically ignore the challenges in extracting the relevant portion of the graph that an analysis task needs, and loading it onto distributed memory. In this paper, we describe NScale, a novel end-to-end graph processing framework that enables the distributed execution of complex subgraph-centric analytics over large-scale graphs in the cloud. NScale enables users to write programs at the level of subgraphs, and to specify the subgraphs of interest declaratively. NScale uses Apache YARN, a state-of-the-art framework for efficient and fault-tolerant distribution of data and computation. It features GEL, a novel graph extraction and loading phase, that extracts the relevant portions of the graph and utilizes a cost-based optimizer to partition and load the graph onto distributed memory using as few machines as possible to minimize the communication cost. It utilizes novel techniques for the distributed execution of user computation that minimize memory consumption by exploiting overlap among the subgraphs of interest. Our experimental results show orders-of-magnitude improvements in performance, and drastic reductions in the cost of analytics, over vertex-centric approaches.
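The packing problem the GEL phase solves can be illustrated with a simple greedy placement that charges a subgraph only for the nodes not already resident on a machine. The heuristic and the capacity model below are stand-ins for NScale's cost-based optimizer, chosen only to show why overlap matters.

```python
# Illustrative overlap-aware packing of extracted subgraphs onto machines:
# a subgraph pays only for nodes not already on the machine. Greedy
# first-fit here is an assumption, not the paper's optimizer.

def pack_subgraphs(subgraphs, capacity):
    """subgraphs: list of node sets; capacity: max distinct nodes per machine."""
    bins, assignment = [], []
    for sg in sorted(subgraphs, key=len, reverse=True):
        placed = False
        # prefer the machine where sg adds the fewest new nodes
        for i in sorted(range(len(bins)), key=lambda i: len(sg - bins[i])):
            if len(bins[i] | sg) <= capacity:
                bins[i] |= sg
                assignment.append((sg, i))
                placed = True
                break
        if not placed:
            bins.append(set(sg))
            assignment.append((sg, len(bins) - 1))
    return bins, assignment

subgraphs = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {3, 4, 5}]
bins, _ = pack_subgraphs(subgraphs, capacity=6)
print(bins)  # overlapping neighborhoods end up sharing a machine
```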
- Towards a unified query language for provenance and versioning;
Amit Chavan, Silu Huang, Amol Deshpande, Aaron J. Elmore, Samuel Madden, Aditya Parameswaran;
USENIX TAPP 2015.
[pdf]
[abstract]
Organizations and teams collect and acquire data from various sources, such as social interactions, financial transactions, sensor data, and genome sequencers. Different teams in an organization as well as different data scientists within a team are interested in extracting a variety of insights which require combining and collaboratively analyzing datasets in diverse ways. DataHub is a system that aims to provide robust version control and provenance management for such a scenario. To be truly useful for collaborative data science, one also needs the ability to specify queries and analysis tasks over the versioning and the provenance information in a unified manner. In this paper, we present an initial design of our query language, called VQuel, that aims to support such unified querying over both types of information, as well as the intermediate and final results of analyses. We also discuss some of the key language design and implementation challenges moving forward.
- GraphGen: Exploring Interesting Graphs in Relational Data (Demonstration Proposal);
Konstantinos Xirogiannopoulos, Udayan Khurana, and Amol Deshpande;
VLDB 2015.
[abstract]
Analyzing interconnection structures among the data through the use of "graph algorithms" and "graph analytics" has been shown to provide tremendous value in many application domains. However, graphs are not the primary choice in the way that most data is currently stored, and users who want to employ graph analytics are forced to extract data from their data stores, construct the requisite graphs, and then use a specialized engine to write and execute their graph analysis tasks. This cumbersome and costly process not only raises barriers in using graph analytics, but also makes it hard to "explore" and "identify" hidden or implicit graphs in the data. Here we demonstrate a system, called "GraphGen", that enables users to declaratively specify graph extraction tasks over relational databases, to visually explore the extracted graphs, and to write and execute graph algorithms over them. We also demonstrate how unifying the extraction tasks and the graph algorithms enables significant optimizations that would not be possible otherwise.
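The kind of extraction GraphGen automates declaratively can be emulated by hand with one SQL self-join, e.g., pulling a co-authorship graph out of relational tables. The schema and data below are invented for the example; GraphGen's contribution is making this declarative and optimizing it together with the analysis.

```python
# Extracting a hidden co-authorship graph from relational tables with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors(aid INTEGER, name TEXT);
    CREATE TABLE writes(aid INTEGER, pid INTEGER);
    INSERT INTO authors VALUES (1,'ann'),(2,'bob'),(3,'cal');
    INSERT INTO writes VALUES (1,10),(2,10),(2,11),(3,11);
""")
# Nodes are authors; an edge connects two authors who share a paper.
edges = conn.execute("""
    SELECT DISTINCT w1.aid, w2.aid
    FROM writes w1 JOIN writes w2
      ON w1.pid = w2.pid AND w1.aid < w2.aid
""").fetchall()
print(edges)  # co-authorship edges, e.g. [(1, 2), (2, 3)]
```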
- Collaborative Data Analytics with DataHub (Demonstration Proposal);
Anant Bhardwaj, Amol Deshpande, Aaron Elmore, David Karger, Sam Madden, Aditya Parameswaran, Harihar Subramanyam, Eugene Wu, Rebecca Zhang;
VLDB 2015.
[abstract]
While there have been many solutions proposed for storing and analyzing large volumes of data, all of these solutions have limited support for collaborative data analytics, especially given the many individuals and teams are simultaneously analyzing, modifying and exchanging datasets, employing a number of heterogeneous tools or languages for data analysis, and writing scripts to clean, preprocess, or query data. We demonstrate DataHub, a unified platform with the ability to load, store, query, collaboratively analyze, interactively visualize, interface with external applications, and share datasets. We will demonstrate the following aspects of the DataHub platform: (a) flexible data storage, sharing, and native version- ing capabilities: multiple conference attendees can concurrently update the database and browse the different versions and inspect conflicts; (b) an app ecosystem that hosts apps for various data- processing activities: conference attendees will be able to effortlessly ingest, query, and visualize data using our existing apps; (c) thrift-based data serialization permits data analysis in any combination of 20+ languages, with DataHub as the common data store: conference attendees will be able to analyze datasets in R, Python, and Matlab, while the inputs and the results are still stored in DataHub. In particular, conference attendees will be able to use the DataHub notebook — an IPython-based notebook for analyzing data and storing the results of data analysis.
- Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff;
Souvik Bhattacherjee, Amit Chavan, Silu Huang, Amol Deshpande, Aditya Parameswaran;
PVLDB 2015.
[pdf]
[summary blog post]
[abstract]
The relative ease of collaborative data science and analysis has led to a proliferation of many thousands or millions of versions of the same datasets in many scientific and commercial domains, acquired or constructed at various stages of data analysis across many users, and often over long periods of time. Managing, storing, and recreating these dataset versions is a non-trivial task. The fundamental challenge here is the storage-recreation trade-off: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions. Despite the fundamental nature of this problem, there has been surprisingly little work on it. In this paper, we study this trade-off in a principled manner: we formulate six problems under various settings, trading off these quantities in various ways, demonstrate that most of the problems are intractable, and propose a suite of inexpensive heuristics, drawing from techniques in the delay-constrained scheduling and spanning tree literature, to solve these problems. We have built a prototype version management system that aims to serve as a foundation for our DATAHUB system for facilitating collaborative data science. We demonstrate, via extensive experiments, that our proposed heuristics provide efficient solutions in practical dataset versioning scenarios.
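The trade-off is easy to see on a tiny version graph: materialize one version fully and store the others as deltas along a tree. A shortest-path tree from the materialized version minimizes recreation cost, while a minimum spanning tree minimizes storage; the Dijkstra-based sketch below computes the former. Delta sizes are invented.

```python
# Shortest-path tree over a version graph: store deltas along tree edges so
# that each version's recreation cost (dist) from the materialized root is
# minimized, at the price of extra storage versus an MST. Illustrative only.
import heapq

def shortest_path_tree(deltas, root):
    """deltas: dict (u, v) -> delta size, undirected. Returns (parent, dist)."""
    adj = {}
    for (u, v), w in deltas.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist, parent = {root: 0}, {root: None}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return parent, dist

deltas = {("v0", "v1"): 5, ("v1", "v2"): 4, ("v0", "v2"): 7}
parent, dist = shortest_path_tree(deltas, "v0")
print(parent, dist)
# SPT stores deltas 5 + 7 = 12 with recreation cost 7 for v2; an MST would
# store only 5 + 4 = 9 but recreate v2 at cost 9: storage vs. recreation.
```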
- CrowdGather: Entity Extraction over Structured Domains;
Theodoros Rekatsinas, Amol Deshpande, Aditya Parameswaran;
CoRR Technical Report arXiv:1502.06823, 2015.
[abstract]
Crowdsourced entity extraction is often used to acquire data for many applications, including recommendation systems, construction of aggregated listings and directories, and knowledge base construction. Current solutions focus on entity extraction using a single query, e.g., only using "give me another restaurant", when assembling a list of all restaurants. Due to the cost of human labor, solutions that focus on a single query can be highly impractical. In this paper, we leverage the fact that entity extraction often focuses on structured domains, i.e., domains that are described by a collection of attributes, each potentially exhibiting hierarchical structure. Given such a domain, we enable a richer space of queries, e.g., "give me another Moroccan restaurant in Manhattan that does takeout". Naturally, enabling a richer space of queries comes with a host of issues, especially since many queries return empty answers. We develop new statistical tools that enable us to reason about the gain of issuing additional queries given little to no information, and show how we can exploit the overlaps across the results of queries for different points of the data domain to obtain accurate estimates of the gain. We cast the problem of budgeted entity extraction over large domains as an adaptive optimization problem that seeks to maximize the number of extracted entities, while minimizing the overall extraction costs. We evaluate our techniques with experiments on both synthetic and real-world datasets, demonstrating a yield of up to 4X over competing approaches for the same budget.
- DataHub: Collaborative data science and dataset version management at scale;
Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, Aditya Parameswaran;
CIDR 2015.
[pdf]
[abstract]
Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale.
- VERTEXICA: Your Relational Friend for Graph Analytics!;
Alekh Jindal, Praynaa Rawlani, Eugene Wu, Samuel Madden, Amol Deshpande, Mike Stonebraker;
VLDB Demo 2014.
[pdf]
[abstract]
In this paper, we present Vertexica, a graph analytics tool built on top of a relational database that is user-friendly and yet highly efficient. Instead of constraining programmers to SQL, Vertexica offers a popular vertex-centric query interface, which is more natural for analysts expressing many graph queries. Programmers simply provide their vertex-compute functions and Vertexica takes care of efficiently executing them in a standard SQL engine. The advantage of using Vertexica is its ability to leverage relational features and enable much more sophisticated graph analysis. This includes expressing graph algorithms that are difficult in the vertex-centric model but straightforward in SQL, and the ability to compose end-to-end data processing pipelines, including pre- and post-processing of graphs as well as combining multiple algorithms for deeper insights. Vertexica has a graphical user interface, and we outline several demonstration scenarios, including interactive graph analysis, complex graph analysis, and continuous and time-series analysis.
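To make the "graph analytics on a relational engine" idea tangible, here is a self-contained toy that runs a vertex-centric computation (connected components by min-label propagation) as iterated SQL updates on SQLite. It mirrors only the flavor of Vertexica's execution, not its actual translation or engine; the loop runs the update to a fixpoint regardless of the order in which rows are visited.

```python
# Connected components via min-label propagation, expressed as iterated SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges(src INTEGER, dst INTEGER);
    CREATE TABLE label(node INTEGER PRIMARY KEY, comp INTEGER);
    INSERT INTO edges VALUES (1,2),(2,1),(2,3),(3,2),(4,5),(5,4);
    INSERT INTO label SELECT DISTINCT src, src FROM edges;
""")
while True:
    # each vertex adopts the smallest label among its neighbors, if smaller
    changed = conn.execute("""
        UPDATE label SET comp = (
            SELECT MIN(l2.comp) FROM edges e JOIN label l2 ON l2.node = e.dst
            WHERE e.src = label.node AND l2.comp < label.comp)
        WHERE EXISTS (
            SELECT 1 FROM edges e JOIN label l2 ON l2.node = e.dst
            WHERE e.src = label.node AND l2.comp < label.comp)
    """).rowcount
    if changed == 0:
        break
print(conn.execute("SELECT * FROM label ORDER BY node").fetchall())
# [(1, 1), (2, 1), (3, 1), (4, 4), (5, 4)]
```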
- NScale: Neighborhood-centric Analytics on Large Graphs;
Abdul Quamar, Amol Deshpande, Jimmy Lin;
VLDB Demo 2014.
[pdf]
[abstract]
There is an increasing interest in executing rich and complex analysis tasks over large-scale graphs, many of which require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph. Examples of such tasks include ego network analysis, motif counting in biological networks, finding social circles, personalized recommendations, link prediction, anomaly detection, analyzing influence cascades, and so on. These tasks are not well served by existing vertex-centric graph processing frameworks whose computation and execution models limit the user program to directly access the state of a single vertex, resulting in high communication, scheduling, and memory overheads in executing such tasks. Further, most existing graph processing frameworks also typically ignore the challenges in extracting the relevant portions of the graph that an analysis task is interested in, and loading it onto distributed memory.
In this demonstration proposal, we describe NScale, a novel end-to-end graph processing framework that enables the distributed execution of complex neighborhood-centric analytics over large-scale graphs in the cloud. NScale enables users to write programs at the level of neighborhoods or subgraphs. NScale uses Apache YARN for efficient and fault-tolerant distribution of data and computation; it features GEL, a novel graph extraction and loading phase, that extracts the relevant portions of the graph and loads them into distributed memory using as few machines as possible. NScale utilizes novel techniques for the distributed execution of user computation that minimize memory consumption by exploiting overlap among the neighborhoods of interest. A comprehensive experimental evaluation shows orders-of-magnitude improvements in performance and total cost over vertex-centric approaches.
- READ: Rapid data Exploration, Analysis and Discovery;
Udayan Khurana, Srinivasan Parthasarathy, Deepak Turaga;
EDBT 2015 (demo).
[abstract]
Exploratory data analysis (EDA) is the process of discovering important characteristics of a dataset or finding data-driven insights in the corresponding domain. EDA is a human intensive process involving data management, analytic flow deployment and model creation, and data visualization and interpretation. It involves extensive use of analyst time, effort, and skill in data processing as well as domain expertise. In this paper, we introduce READ, a mixed initiative system for accelerating exploratory data analysis. The key idea behind READ is to decompose the exploration process into components that can be independently specified and automated. These components can be defined, reused or extended using simple choice points that are expressed using inference rules, planning logic, and reactive user interfaces and visualization. READ uses a formal specification of the analytic process for automated model space enumeration, workflow composition, deployment, and model validation and clustering. READ aims to reduce the time required for exploration and understanding of a dataset from days to minutes.
- FAQ: A Framework for Fast Approximate Query Processing on Temporal Data;
Udayan Khurana, Srinivasan Parthasarathy, Deepak Turaga;
BIGMINE 2014.
[abstract]
Temporal queries on time-evolving data are at the heart of a broad range of business and network intelligence applications ranging from consumer behavior analysis, trend analysis, temporal pattern mining, sentiment analysis on social media, cyber security, and network monitoring. In this work, we present an innovative data structure called Fast Approximate Query-able (FAQ), which provides a unified framework for temporal query processing on Big Data. FAQ uses a novel composition of data sketching, wavelet-style differencing for temporal compression, and quantization, and handles diverse kinds of queries including distinct counts, set membership, frequency estimation, top-K, p-norms, empirical entropy, and distance queries such as histogram lp-norm distance (including Euclidean and Manhattan distance), cosine similarity, Jaccard coefficient, and rank correlation. Experiments on real-life multi-dimensional network monitoring datasets demonstrate speedups of 92x achieved by FAQ over a flat representation of the data for a mixed temporal query workload.
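Two of the ingredients FAQ composes, per-window sketches and temporal differencing, can be miniaturized as follows. The count-min parameters and data are toy values, and FAQ's actual structure adds quantization and supports many more query types.

```python
# Count-min sketch per time window + elementwise differencing between
# windows, so slowly changing windows compress well. Illustrative only.
import copy, pickle, zlib

W, D = 64, 4  # width and depth of each count-min sketch

def cms_update(cms, x, c=1):
    for d in range(D):
        cms[d][hash((d, x)) % W] += c

def cms_query(cms, x):  # may only overestimate
    return min(cms[d][hash((d, x)) % W] for d in range(D))

win1 = [[0] * W for _ in range(D)]
for item in ["a", "a", "b", "c"]:
    cms_update(win1, item)

win2 = copy.deepcopy(win1)   # next window differs by only a few items
cms_update(win2, "a")

delta = [[x - y for x, y in zip(r2, r1)] for r2, r1 in zip(win2, win1)]
full = len(zlib.compress(pickle.dumps(win2)))
diff = len(zlib.compress(pickle.dumps(delta)))
print(cms_query(win2, "a"), full, diff)  # count ~3; delta compresses far smaller
```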
- SWORD: Workload-aware Data Placement and Replica Selection for Cloud Data Management Systems;
Ashwin Kumar Kayyoor, Abdul Quamar, Amol Deshpande, Samir Khuller;
VLDB Journal (Special Issue on Data-Intensive Cloud Infrastructure), 23(6): 845-870, 2014.
[pdf]
[abstract]
Cloud computing is increasingly being seen as a way to reduce infrastructure costs and add elasticity, and is being used by a wide range of organizations. Cloud data management systems today need to serve a range of different workloads, from analytical read-heavy workloads to transactional (OLTP) workloads. For both the service providers and the users, it is critical to minimize the consumption of resources like CPU, memory, communication bandwidth, and energy, without compromising on service-level agreements if any. In this article, we develop a workload-aware data placement and replication approach, called SWORD, for minimizing resource consumption in such an environment. Specifically, we monitor and model the expected workload as a hypergraph and develop partitioning techniques that minimize the average query span, i.e., the average number of machines involved in the execution of a query or a transaction. We empirically justify the use of query span as the metric to optimize, for both analytical and transactional workloads, and develop a series of replication and data placement algorithms by drawing connections to several well-studied graph theoretic concepts. We introduce a suite of novel techniques to achieve high scalability by reducing the overhead of partitioning and query routing. To deal with workload changes, we propose an incremental repartitioning technique that modifies data placement in small steps without resorting to complete repartitioning. We propose the use of fine-grained quorums defined at the level of groups of data items to control the cost of distributed updates, improve throughput, and adapt to different workloads. We empirically illustrate the benefits of our approach through a comprehensive experimental evaluation for two classes of workloads. For analytical read-only workloads, we show that our techniques result in significant reduction in total resource consumption. For OLTP workloads, we show that our approach improves transaction latencies and overall throughput by minimizing the number of distributed transactions.
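The metric at the center of the paper, average query span, is simple to compute, and a toy comparison shows why co-location helps. The placements and workload below are invented; SWORD derives placements via hypergraph partitioning rather than by hand.

```python
# Average query span: the mean number of machines a query touches.

def avg_query_span(placement, workload):
    """placement: item -> machine; workload: list of queries (sets of items)."""
    spans = [len({placement[item] for item in q}) for q in workload]
    return sum(spans) / len(spans)

workload = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}]
naive = {"a": 0, "b": 1, "c": 2, "d": 3}        # items spread out
colocated = {"a": 0, "b": 0, "c": 1, "d": 1}    # co-access-aware placement
print(avg_query_span(naive, workload))      # ~2.33
print(avg_query_span(colocated, workload))  # ~1.33
```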
- PStore: An Efficient Storage Framework for Managing Scientific Data;
Souvik Bhattacherjee, Amol Deshpande, and Alan Sussman;
SSDBM 2014.
[pdf]
[abstract]
In this paper, we present the design, implementation, and evaluation of PStore, a no-overwrite storage framework for managing large volumes of array data generated by scientific simulations. PStore comprises two modules, a data ingestion module and a query processing module, that respectively address two of the key challenges in scientific simulation data management. The data ingestion module is geared toward handling the high volumes of simulation data generated at a very rapid rate, which often makes it impossible to offload the data onto storage devices; the module is responsible for selecting an appropriate compression scheme for the data at hand, chunking the data, and then compressing it before sending it to the storage nodes. On the other hand, the query processing module is in charge of efficiently executing different types of queries over the stored data; in this paper, we specifically focus on slice (also called range) queries. PStore provides a suite of compression schemes that leverage existing techniques while extending some of them to provide support for diverse scientific simulation data. To efficiently execute queries over such compressed data, PStore adopts and extends a two-level chunking scheme by incorporating the effect of compression, and hides expensive disk latencies for long running range queries by exploiting chunk prefetching. In addition, we also parallelize the query processing module to further speed up execution. We evaluate PStore on a 140 GB dataset obtained from real-world simulations using the regional climate model CWRF. In this paper, we use both 3D and 4D datasets and demonstrate high performance through extensive experiments.
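The chunk-then-compress layout that makes slice queries cheap can be sketched for a 1-D array: compress fixed-size chunks independently and decompress only the chunks a range query overlaps. PStore's real layout is multi-dimensional, two-level, and compression-scheme-aware; the chunk size here is arbitrary.

```python
# Chunked compression with chunk-granular slice queries. Illustrative only.
import zlib
import numpy as np

CHUNK = 1000

def store(arr):
    return [zlib.compress(arr[i:i + CHUNK].tobytes())
            for i in range(0, len(arr), CHUNK)]

def slice_query(chunks, dtype, lo, hi):
    out = []
    for ci in range(lo // CHUNK, (hi - 1) // CHUNK + 1):
        data = np.frombuffer(zlib.decompress(chunks[ci]), dtype=dtype)
        start, end = max(lo - ci * CHUNK, 0), min(hi - ci * CHUNK, CHUNK)
        out.append(data[start:end])
    return np.concatenate(out)

arr = np.arange(10_000, dtype=np.int64)
chunks = store(arr)
res = slice_query(chunks, np.int64, 2_500, 3_005)  # touches chunks 2 and 3 only
print(res[0], res[-1], len(res))  # 2500 3004 505
```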
- EAGr: Supporting Continuous Ego-centric Aggregate Queries over Large Dynamic Graphs;
Jayanta Mondal and Amol Deshpande;
SIGMOD 2014 (also CoRR Technical Report arXiv:1404.6570).
[pdf]
[abstract]
In this work, we present EAGr, a system for supporting large numbers of continuous neighborhood-based ("ego-centric") aggregate queries over large, highly dynamic, and rapidly evolving graphs. Examples of such queries include computation of personalized, tailored trends in social networks, anomaly/event detection in financial transaction networks, local search and alerts in spatio-temporal networks, to name a few. Key challenges in supporting such continuous queries include high update rates typically seen in these situations, large numbers of queries that need to be executed simultaneously, and stringent low latency requirements. We propose a flexible, general, and extensible in-memory framework for executing different types of ego-centric aggregate queries over large dynamic graphs with low latencies. Our framework is built around the notion of an aggregation overlay graph, a pre-compiled data structure that encodes the computations to be performed when an update/query is received. The overlay graph enables sharing of partial aggregates across multiple ego-centric queries (corresponding to the nodes in the graph), and also allows partial pre-computation of the aggregates to minimize the query latencies. We present several highly scalable techniques for constructing an overlay graph given an aggregation function, and also design incremental algorithms for handling structural changes to the underlying graph. We also present an optimal, polynomial-time algorithm for making the pre-computation decisions given an overlay graph, and evaluate an approach to incrementally adapt those decisions as the workload changes. Although our approach is naturally parallelizable, we focus on a single-machine deployment and show that our techniques can easily handle graphs of size up to 320 million nodes and edges, and achieve update/query throughputs of over 500K/s using a single, powerful machine.
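The core sharing idea behind the aggregation overlay can be shown in miniature: several ego-centric SUM queries reuse one partial aggregate over their common neighbors. The overlay below is hand-built for two egos; EAGr constructs and maintains such overlays automatically and incrementally.

```python
# Sharing a partial aggregate across ego-centric SUM queries. Toy data.

values = {"u1": 3, "u2": 5, "u3": 2, "u4": 7}
shared_group = ["u1", "u2", "u3"]              # common neighbors of both egos
ego_neighbors = {"egoA": ["u4"], "egoB": []}   # neighbors outside the group

partial = sum(values[u] for u in shared_group)  # computed once, shared

def ego_sum(ego):
    return partial + sum(values[u] for u in ego_neighbors[ego])

print(ego_sum("egoA"), ego_sum("egoB"))  # 17 10
```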
- Optimization Techniques for "Scaling Down" Hadoop on Multi-Core, Shared-Memory Systems;
K. Ashwin Kumar, Jonathan Gluck, Amol Deshpande, Jimmy Lin;
EDBT 2014.
[pdf]
[abstract]
The underlying assumption behind Hadoop and, more generally, the need for distributed processing is that the data to be analyzed cannot be held in memory on a single machine. Today, this assumption needs to be re-evaluated. Although petabyte-scale datastores are increasingly common, it is unclear whether "typical" analytics tasks require more than a single high-end server. Additionally, we are seeing increased sophistication in analytics, e.g., machine learning, which generally operates over smaller and more refined datasets. To address these trends, we propose "scaling down" Hadoop to run on shared-memory machines. This paper presents a prototype runtime called Hone, intended to be both API and binary compatible with standard (distributed) Hadoop. That is, Hone can take an existing Hadoop jar and run it, without modification, on a multi-core shared-memory machine. This allows us to take existing Hadoop algorithms and find the most suitable runtime environment for execution on datasets of varying sizes. Our experiments show that Hone is an order of magnitude faster than Hadoop pseudo-distributed mode (PDM); on dataset sizes that fit into memory, Hone outperforms a fully-distributed 15-node Hadoop cluster in some cases as well.
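Hone itself is Java and Hadoop-API-compatible, but the execution model it scales down, a parallel map phase followed by an in-memory shuffle and reduce on one machine, can be mimicked in a few lines of Python with a process pool. This sketch only mirrors the execution model, not Hone's implementation.

```python
# A toy scaled-down MapReduce: parallel map via a process pool, then an
# in-memory shuffle and reduce on the same shared-memory machine.
from collections import defaultdict
from multiprocessing import Pool

def map_words(line):
    return [(w, 1) for w in line.split()]

def run_job(lines, mapper, nworkers=4):
    with Pool(nworkers) as pool:
        mapped = pool.map(mapper, lines)    # parallel map phase
    groups = defaultdict(int)
    for pairs in mapped:                    # in-memory shuffle + reduce
        for k, v in pairs:
            groups[k] += v
    return dict(groups)

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_job(lines, map_words))  # {'the': 3, 'fox': 2, ...}
```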
- Subgraph Pattern Matching over Uncertain Graphs with Identity Linkage Uncertainty;
Walaa Eldin Moustafa, Angelika Kimmig, Amol Deshpande, Lise Getoor;
ICDE 2014 (also CoRR Technical Report arXiv:1305.7006).
[pdf]
[abstract]
There is a growing need for methods which can capture uncertainties and answer queries over graph-structured data. Two common types of uncertainty are uncertainty over the attribute values of nodes and uncertainty over the existence of edges. In this paper, we combine those with identity uncertainty. Identity uncertainty represents uncertainty over the mapping from objects mentioned in the data, or references, to the underlying real-world entities. We propose the notion of a probabilistic entity graph (PEG), a probabilistic graph model that defines a distribution over possible graphs at the entity level. The model takes into account node attribute uncertainty, edge existence uncertainty, and identity uncertainty, and thus enables us to systematically reason about all three types of uncertainties in a uniform manner. We introduce a general framework for constructing a PEG given uncertain data at the reference level and develop highly efficient algorithms to answer subgraph pattern matching queries in this setting. Our algorithms are based on two novel ideas: context-aware path indexing and reduction by join-candidates, which drastically reduce the query search space. A comprehensive experimental evaluation shows that our approach outperforms baseline implementations by orders of magnitude.
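As a semantic baseline (not the paper's algorithm), the probability that a pattern matches an uncertain graph is an expectation over possible worlds, which can be estimated by sampling edge existence independently. Edge probabilities below are invented; PEG additionally models attribute and identity uncertainty and answers such queries without brute-force sampling.

```python
# Monte Carlo over possible worlds for a triangle pattern under
# independent edge-existence uncertainty. Illustrative baseline only.
import random

edge_prob = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.3}

def matches_triangle(edges):
    return {("a", "b"), ("b", "c"), ("a", "c")} <= edges

def estimate_match_prob(trials=100_000):
    hits = 0
    for _ in range(trials):
        world = {e for e, p in edge_prob.items() if random.random() < p}
        hits += matches_triangle(world)
    return hits / trials

print(estimate_match_prob())  # close to 0.9 * 0.8 * 0.3 = 0.216
```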
- Stream Querying and Reasoning on Social Data;
Jayanta Mondal and Amol Deshpande;
Chapter in Encyclopedia of Social Network Analysis and Mining, Springer, 2014.
[pdf]
[abstract]
In this paper, we present an introduction to the new research area of "stream querying and reasoning" over social data. This area combines aspects from several well-studied research areas, chief among them, social network analysis, graph databases, and data streams. We provide a formal definition of the problem, survey the related prior work, and discuss some of the key research challenges that need to be addressed (and some of the solutions that have been proposed). We note that we use the term "stream reasoning" in this paper to encompass a broad range of tasks including various types of analytics, probabilistic reasoning, statistical inference, and logical reasoning. We contrast our use of this term with the recent work where this term has been used more specifically to refer to integration of logical reasoning systems with data streams in the context of the Semantic Web. Given the vast amount of work on this and related topics, it is not our intention to be comprehensive in this brief overview. Rather we aim to cover some of the key ideas and representative work.
- Approximation Algorithms for Stochastic Boolean Function Evaluation and Stochastic Submodular Set Cover;
Amol Deshpande, Lisa Hellerstein, and Devorah Kletenik;
SODA 2014 (also CoRR Technical Report arXiv:1303.0726).
[pdf]
[abstract]
Stochastic Boolean Function Evaluation is the problem of determining the value of a given Boolean function f on an unknown input x, when each bit x_i of x can only be determined by paying an associated cost c_i. The assumption is that x is drawn from a given product distribution, and the goal is to minimize the expected cost. This problem has been studied in Operations Research, where it is known as "sequential testing" of Boolean functions. It has also been studied in learning theory in the context of learning with attribute costs. We consider the general problem of developing approximation algorithms for Stochastic Boolean Function Evaluation. We give a 3-approximation algorithm for evaluating Boolean linear threshold formulas. We also present an approximation algorithm for evaluating CDNF formulas (and decision trees) achieving a factor of O(log kd), where k is the number of terms in the DNF formula, and d is the number of clauses in the CNF formula. In addition, we present approximation algorithms for simultaneous evaluation of linear threshold functions, and for ranking of linear functions. Our function evaluation algorithms are based on reductions to the Stochastic Submodular Set Cover (SSSC) problem. This problem was introduced by Golovin and Krause. They presented an approximation algorithm for the problem, called Adaptive Greedy. Our main technical contribution is a new approximation algorithm for the SSSC problem, which we call Adaptive Dual Greedy. It is an extension of the Dual Greedy algorithm for Submodular Set Cover due to Fujito, which is a generalization of Hochbaum's algorithm for the classical Set Cover Problem. We also give a new bound on the approximation achieved by the Adaptive Greedy algorithm of Golovin and Krause.
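A classic, exactly solvable special case helps fix intuition: sequentially evaluating an AND of independent tests, where ordering by c_i / (1 - p_i) minimizes expected cost because cheap, likely-to-fail tests resolve the function early. The paper's contributions concern much richer classes (linear threshold formulas, CDNF), for which only approximations are known.

```python
# Expected cost of sequentially testing an AND of independent bits,
# using the classic cost/failure-probability ratio rule (assumes p_i < 1).

def expected_cost(tests):
    """tests: list of (cost c_i, prob p_i that the bit is 1)."""
    order = sorted(tests, key=lambda t: t[0] / (1 - t[1]))
    total, reach = 0.0, 1.0  # reach = prob. evaluation is still running
    for c, p in order:
        total += reach * c
        reach *= p           # continue only if this test passed
    return total

tests = [(4.0, 0.5), (1.0, 0.9), (2.0, 0.4)]
print(expected_cost(tests))  # 3.8: cheap, failure-prone tests go first
```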
- SPARSI: Partitioning Sensitive Data amongst Multiple Adversaries;
Theodoros Rekatsinas, Amol Deshpande, and Ashwin Machanavajjhala;
PVLDB 2013, to be presented at VLDB 2014 (also CoRR Technical Report arXiv:1302.6556).
[pdf]
[abstract]
We present SPARSI, a theoretical framework for partitioning sensitive data across multiple non-colluding adversaries. Most work in privacy-aware data sharing has considered disclosing summaries where the aggregate information about the data is preserved, but sensitive user information is protected. Nonetheless, there are applications, including online advertising, cloud computing and crowdsourcing markets, where detailed and fine-grained user-data must be disclosed. We consider a new data sharing paradigm and introduce the problem of privacy-aware data partitioning, where a sensitive dataset must be partitioned among k untrusted parties (adversaries). The goal is to maximize the utility derived by partitioning and distributing the dataset, while minimizing the amount of sensitive information disclosed. The data should be distributed so that an adversary, without colluding with other adversaries, cannot draw additional inferences about the private information, by linking together multiple pieces of information released to her. The assumption of no collusion is both reasonable and necessary in the above application domains that require release of private user information. SPARSI enables us to formally define privacy-aware data partitioning using the notion of sensitive properties for modeling private information and a hypergraph representation for describing the interdependencies between data entries and private information. We show that solving privacy-aware partitioning is, in general, NP-hard, but for specific information disclosure functions, good approximate solutions can be found using relaxation techniques. Finally, we present a local search algorithm applicable to generic information disclosure functions. We apply SPARSI together with the proposed algorithms on data from a real advertising scenario and show that we can partition data with no disclosure to any single advertiser.
- Hone: "Scaling Down" Hadoop on Shared-Memory Systems;
K. Ashwin Kumar, Jonathan Gluck, Amol Deshpande, Jimmy Lin;
VLDB Demo 2013.
[pdf]
[abstract]
The underlying assumption behind Hadoop and, more generally, the need for distributed processing is that the data to be analyzed cannot be held in memory on a single machine. Today, this assumption needs to be re-evaluated. Although petabyte-scale data-stores are increasingly common, it is unclear whether "typical" analytics tasks require more than a single high-end server. Additionally, we are seeing increased sophistication in analytics, e.g., machine learning, which generally operates over smaller and more refined datasets. To address these trends, we propose "scaling down" Hadoop to run on shared-memory machines. This paper presents a prototype runtime called Hone, intended to be both API and binary compatible with standard (distributed) Hadoop. That is, Hone can take an existing Hadoop jar and efficiently execute it, without modification, on a multi-core shared memory machine. This allows us to take existing Hadoop algorithms and find the most suitable run-time environment for execution on datasets of varying sizes. Our experiments show that Hone can be an order of magnitude faster than Hadoop pseudo-distributed mode (PDM); on dataset sizes that fit into memory, Hone can outperform a fully-distributed 15-node Hadoop cluster in some cases as well.
- Data Placement and Replica Selection for Improving Co-location in Distributed Environments;
K. Ashwin Kumar, Amol Deshpande, and Samir Khuller;
CoRR Technical Report arXiv:1302.4168.
[pdf]
[abstract]
Increasing need for large-scale data analytics in a number of application domains has led to a dramatic rise in the number of distributed data management systems, both parallel relational databases, and systems that support alternative frameworks like MapReduce. There is thus an increasing contention on scarce data center resources like network bandwidth; further, the energy requirements for powering the computing equipment are also growing dramatically. As we show empirically, increasing the execution parallelism by spreading out data across a large number of machines may achieve the intended goal of decreasing query latencies, but in most cases, may increase the total resource and energy consumption significantly. For many analytical workloads, however, minimizing query latencies is often not critical; in such scenarios, we argue that we should instead focus on minimizing the average query span, i.e., the average number of machines that are involved in the processing of a query, through colocation of data items that are frequently accessed together. In this work, we exploit the fact that most distributed environments need to use replication for fault tolerance, and we devise workload-driven replica selection and placement algorithms that attempt to minimize the average query span. We model a historical query workload trace as a hypergraph over a set of data items, and formulate and analyze the problem of replica placement by drawing connections to several well-studied graph theoretic concepts. We develop a series of algorithms to decide which data items to replicate, and where to place the replicas. We show the effectiveness of our proposed approach by presenting results on a collection of synthetic and real workloads. Our experiments show that careful data placement and replication can dramatically reduce the average query spans resulting in significant reductions in the resource consumption.
- GrDB: A System for Declarative and Interactive Analysis of Noisy Information Networks;
Walaa Eldin Moustafa, Hui Miao, Amol Deshpande, Lise Getoor;
SIGMOD Demo 2013.
[pdf]
[abstract]
There is a growing interest in methods for analyzing data describing networks of all types, including biological, physical, social, and scientific collaboration networks. Typically the data describing these networks is observational, and thus noisy and incomplete; it is often at the wrong level of fidelity and abstraction for meaningful data analysis. This demonstration presents GrDB, a system that enables data analysts to write declarative programs to specify and combine different network data cleaning tasks, visualize the output, and engage in the process of decision review and correction if necessary. The declarative interface of GrDB makes it very easy to quickly write analysis tasks and execute them over data, while the visual component facilitates debugging the program and performing fine grained corrections.
- HiNGE: Enabling Temporal Network Analytics at Scale;
Udayan Khurana, and Amol Deshpande;
SIGMOD Demo 2013.
[pdf]
[abstract]
In this demonstration proposal, we present HiNGE (Historical Network/Graph Explorer), a system that enables interactive exploration and analytics over large evolving networks through visualization and node-centric metric computations. HiNGE is built on top of a distributed graph database system that stores the entire history of a network, and enables efficiently retrieving and analyzing multiple graph snapshots from arbitrary time points in the past. The cornerstone of our system is a novel hierarchical parallelizable index structure, called DeltaGraph, that enables compact recording of the historical trace of a network on disk, and supports efficient retrieval of historical snapshots for single-site or parallel processing. The other key component of our system is an in-memory graph data structure, called GraphPool, that can maintain hundreds of historical graph snapshots in main memory in a non-redundant manner. We demonstrate the efficiency and usability of our system at performing temporal analytics over large-scale dynamic networks.
- Algorithms for the Thermal Scheduling Problem;
Koyel Mukherjee, Samir Khuller, and Amol Deshpande;
IPDPS 2013.
[pdf]
[abstract]
The energy costs for cooling a data center constitute a significant portion of the overall running costs. Thermal imbalance and hot spots that arise due to imbalanced workloads lead to significant wasted cooling effort -- in order to ensure that no equipment is operating above a certain temperature, the data center may be cooled more than necessary. Therefore it is desirable to schedule the workload in a data center in a "thermally aware" manner, assigning jobs to machines not just based on local load of the machines, but based on the overall thermal profile of the data center. This is challenging because of the spatial cross-interference between machines, where a job assigned to a machine may impact not only that machine's temperature, but also nearby machines. Here, we continue formal analysis of the "thermal scheduling" problem that we initiated recently. There we introduced the notion of "effective load of a machine" which is a function of the local load on the machine as well as the load on nearby machines, and presented optimal scheduling policies for a simple model (where cross-effects are restricted within a rack) under the assumption that jobs can be split among different machines. Here we consider the more realistic problem of "integral" assignment of jobs, and allow for cross-interference among different machines in adjacent racks in the data center. The integral assignment problem with cross-interference is NP-hard, even for a simple two machine model. We consider three different heat flow models, and give constant factor approximation algorithms for maximizing the number (or total profit) of jobs assigned in each model, without violating thermal constraints. We also consider the problem of minimizing the maximum temperature on any machine when all jobs need to be assigned, and give constant factor algorithms for this problem.
- SWORD: Scalable Workload-Aware Data Placement for Transactional Workloads;
Abdul Quamar, K.Ashwin Kumar and Amol Deshpande;
EDBT 2013.
[pdf]
[abstract]
In this paper, we address the problem of transparently scaling out transactional (OLTP) workloads on relational databases, to support "database-as-a-service" in cloud computing environment. The primary challenges in supporting such workloads include choosing how to "partition" the data across a large number of machines, minimizing the number of "distributed transactions", providing high data availability, and tolerating failures gracefully. Capturing and modeling the transactional workload over a period of time, and then exploiting that information for data placement and replication has been shown to provide significant benefits in performance, both in terms of transaction latencies and overall throughput. However, such workload-aware data placement approaches can incur very high overheads, and further, may perform worse than naive approaches if the workload changes. In this work, we propose SWORD, a scalable workload-aware data partitioning and placement approach for OLTP workloads, that incorporates a suite of novel techniques to significantly reduce the overheads incurred both during the initial placement, and during query execution at runtime. We model the workload as a hypergraph over the data items, and propose using a "hypergraph compression" technique to reduce the overheads of partitioning. To deal with workload changes, we propose an incremental data repartitioning technique that modifies data placement in small steps without resorting to complete workload repartitioning. We have built a workload-aware "active replication" mechanism in SWORD to increase availability and enable load balancing. We propose the use of "fine-grained quorums" defined at the level of groups of tuples to control the cost of distributed updates, improve throughput, and provide adaptability to different workloads. To our knowledge, SWORD is the first system that uses fine-grained quorums in this context. The results of our experimental evaluation on SWORD deployed on an Amazon EC2 cluster show that our techniques result in orders-of-magnitude reductions in the partitioning and book-keeping overheads, and improve tolerance to failures and workload changes; we also show that choosing quorums based on the query access patterns enables us to better handle query workloads with different read and write access patterns.
- Efficient Snapshot Retrieval over Historical Graph Data;
Udayan Khurana, and Amol Deshpande;
ICDE 2013 (also CoRR Technical Report arXiv:1207.5777).
[pdf]
[abstract]
We address the problem of managing historical data for large evolving information networks like social networks or citation networks, with the goal to enable temporal and evolutionary queries and analysis. We present the design and architecture of a distributed graph database system that stores the entire history of a network and provides support for efficient retrieval of multiple graphs from arbitrary time points in the past, in addition to maintaining the current state for ongoing updates. Our system exposes a general programmatic API to process and analyze the retrieved snapshots. We introduce DeltaGraph, a novel, extensible, highly tunable, and distributed hierarchical index structure that enables compactly recording the historical information, and that supports efficient retrieval of historical graph snapshots for single-site or parallel processing. Along with the original graph data, DeltaGraph can also maintain and index auxiliary information; this functionality can be used to extend the structure to efficiently execute queries like subgraph pattern matching over historical data. We develop analytical models for both the storage space needed and the snapshot retrieval times to aid in choosing the right parameters for a specific scenario. In addition, we present strategies for materializing portions of the historical graph state in memory to further speed up the retrieval process. We also present an in-memory graph data structure called GraphPool that can maintain hundreds of historical graph instances in main memory in a non-redundant manner. We present a comprehensive experimental evaluation that illustrates the effectiveness of our proposed techniques at managing historical graph information.
- Efficiently Adapting Graphical Models for Selectivity Estimation;
Kostas Tzoumas, Amol Deshpande, and Christian Jensen;
VLDB Journal (Special Issue: Best Papers of VLDB 2011), 22(1): 3-27, February 2013.
[pdf]
[abstract]
Query optimizers rely on statistical models that succinctly describe the underlying data. Models are used to derive cardinality estimates for intermediate relations, which in turn guide the optimizer to choose the best query execution plan. The quality of the resulting plan is highly dependent on the accuracy of the statistical model that represents the data. It is well known that small errors in the model estimates propagate exponentially through joins, and may result in the choice of a highly sub-optimal query execution plan. Most commercial query optimizers make the attribute value independence assumption: all attributes are assumed to be statistically independent. This reduces the statistical model of the data to a collection of one-dimensional synopses (typically in the form of histograms), and it permits the optimizer to estimate the selectivity of a predicate conjunction as the product of the selectivities of the constituent predicates. However, this independence assumption is more often than not wrong, and is considered to be the most common cause of sub-optimal query execution plans chosen by modern query optimizers. We take a step towards a principled and practical approach to performing cardinality estimation without making the independence assumption. By carefully using concepts from the field of graphical models, we are able to factor the joint probability distribution over all the attributes in the database into small, usually two-dimensional distributions, without a significant loss in estimation accuracy. We show how to efficiently construct such a graphical model from the database using only two-way join queries, and we show how to perform selectivity estimation in a highly efficient manner. We integrate our algorithms into the PostgreSQL DBMS. Experimental results indicate that estimation errors can be greatly reduced, leading to orders of magnitude more efficient query execution plans in many cases. Optimization time is kept in the range of tens of milliseconds, making this a practical approach for industrial-strength query optimizers.
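A tiny numeric example (with invented data) shows how badly the independence assumption can miss on correlated attributes, which is exactly the failure mode the paper's graphical-model synopses address.

```python
# Joint selectivity vs. the product of marginals on correlated attributes.
from collections import Counter

rows = [("sedan", "gas")] * 45 + [("sedan", "diesel")] * 5 \
     + [("truck", "gas")] * 5 + [("truck", "diesel")] * 45

n = len(rows)
joint = Counter(rows)
make = Counter(r[0] for r in rows)
fuel = Counter(r[1] for r in rows)

pred = ("truck", "diesel")
true_sel = joint[pred] / n
indep_sel = (make[pred[0]] / n) * (fuel[pred[1]] / n)
print(true_sel, indep_sel)  # 0.45 vs 0.25: independence is off by ~2x here
```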
- Parallel Pipelined Filter Ordering with Precedence Constraints;
Amol Deshpande, Lisa Hellerstein;
ACM Transactions on Algorithms 8, 4, Article 41 (September 2012), 28 pages.
[pdf]
[abstract]
In the parallel pipelined filter ordering problem, we are given a set of n filters that run in parallel. The filters need to be applied to a stream of elements, to determine which elements pass all filters. Each filter has a "rate limit" r_i on the number of elements it can process per unit time, and a "selectivity" p_i, which is the probability that a random element will pass the filter. The goal is to maximize throughput. This problem appears naturally in a variety of settings, including parallel query optimization in databases and query processing over Web services. We present an O(n^3) algorithm for the above problem, given tree-structured precedence constraints on the filters. This extends work of Condon et al. and Kodialam, who presented algorithms for solving the problem without precedence constraints. Our algorithm is combinatorial and produces a sparse solution. Motivated by join operators in database queries, we also give algorithms for versions of the problem in which "filter" selectivities may be greater than or equal to 1. We prove a strong connection between the more classical problem of minimizing total work in sequential filter ordering (A), and the parallel pipelined filter ordering problem (B). More precisely, we prove that A is solvable in polynomial time for a given class of precedence constraints if and only if B is as well. This equivalence allows us to show that B is NP-Hard in the presence of arbitrary precedence constraints (since A is known to be NP-Hard in that setting).
- Managing Large Dynamic Graphs Efficiently;
Jayanta Mondal, Amol Deshpande;
SIGMOD 2012.
[pdf]
[abstract]
There is an increasing need to ingest, manage, and query large volumes of graph-structured data arising in applications like social networks, communication networks, biological networks, etc. Graph databases that can explicitly reason about the graphical nature of the data, that can support flexible schemas and node/edge-centric analysis and querying, are ideal for storing such data. However, although there is much work on single-site graph databases and on efficiently executing specific types of queries over large graphs, to date there is little work on understanding the challenges in distributed dynamic graph data management, needed to handle the large scale of such data. In this paper, we propose the design of an in-memory, distributed graph data management system aimed at managing a large dynamically changing graph and supporting low-latency query processing over it. The key challenge in a distributed graph database is that partitioning a graph across a set of machines inherently results in a large number of distributed traversals across partitions to answer even simple queries. We propose a suite of novel techniques to minimize the communication bandwidth and the storage requirements. We have implemented our framework as a middle-ware on top of an open-source key-value store. We evaluate our system on a social graph, and show that our system is able to handle very large graphs efficiently, and that it reduces the network bandwidth consumption significantly.
- Local Structure and Determinism in Probabilistic Databases;
Theodoros Rekatsinas, Amol Deshpande, Lise Getoor;
SIGMOD 2012.
[pdf]
[abstract]
While extensive work has been done on evaluating queries under the tuple-independence assumption, query evaluation over correlated data has received much less attention even though the support for correlations is essential for many natural applications of probabilistic databases (e.g., information extraction, data integration, computer vision, etc.). In this paper, we develop a novel approach for efficiently evaluating probabilistic queries over correlated databases where correlations are represented using a "factor graph", a class of graphical models widely used for capturing correlations and performing statistical inference. Our approach exploits the specific values of the factor parameters and determinism in the correlations, collectively called "local structure", to reduce the complexity of query evaluation. Our framework is based on "arithmetic circuits", factorized representations of probability distributions that can exploit such local structure. Traditionally, arithmetic circuits are generated following a compilation process and cannot be updated directly. We introduce a generalization of arithmetic circuits, called "annotated arithmetic circuits", and a novel algorithm for updating them, which enables us to answer probabilistic queries efficiently. We present a comprehensive experimental analysis and show speed-ups of at least one order of magnitude in many cases.
- Saving on Cooling: The Thermal Scheduling Problem;
Koyel Mukherjee, Samir Khuller, and Amol Deshpande;
SIGMETRICS 2012 (Short paper).
[pdf]
[abstract]
In this abstract we define some very basic scheduling problems motivated by increasing power density and consequent cooling considerations in data centers and multi-core chips.
- Ego-centric Graph Pattern Census;
Walaa Eldin Moustafa, Amol Deshpande, Lise Getoor;
ICDE 2012.
[abstract]
There is increasing interest in analyzing networks of all types including social, biological, sensor, computer, and transportation networks. Broadly speaking, we may be interested in global network-wide analysis (e.g., centrality analysis, community detection) where the properties of the entire network are of interest, or local ego-centric analysis where the focus is on studying the properties of nodes (egos) by analyzing their neighborhood subgraphs. In this paper we propose and study ego-centric pattern census queries, a new type of graph analysis query, where a given structural pattern is searched in every node's neighborhood and the counts are reported or used in further analysis. This kind of analysis is useful in many social network analysis tasks, including opinion leader identification, node classification, link prediction, and role identification. We propose an SQL-based declarative language to support this class of queries, and develop a series of efficient query evaluation algorithms for it. We evaluate our algorithms on a variety of synthetically generated graphs. We also show an application of our language in a real-world scenario for predicting future collaborations from DBLP data.
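A toy ego-centric census for one fixed pattern, the triangle (this is illustrative only, not the paper's system or query language; it assumes the networkx library):

```python
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)])

# networkx already provides a per-node triangle census:
print(nx.triangles(G))   # {1: 1, 2: 1, 3: 2, 4: 1, 5: 1}

# The same census as an explicit search over each ego's neighborhood,
# the shape a census for a more general pattern would take:
census = {}
for ego in G:
    nbrs = list(G.neighbors(ego))
    census[ego] = sum(1 for i, u in enumerate(nbrs)
                        for v in nbrs[i + 1:] if G.has_edge(u, v))
print(census)
```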
- Maximizing Expected Utility for Stochastic Combinatorial Optimization Problems;
Jian Li, and Amol Deshpande;
FOCS 2011 (also CoRR Technical Report arXiv:1012.3189).
[pdf]
[abstract]
We study the stochastic versions of a broad class of combinatorial problems where the weights of the elements in the input dataset are uncertain. The class of problems that we study includes shortest paths, minimum weight spanning trees, and minimum weight matchings over probabilistic graphs, and other combinatorial problems like knapsack. We observe that the expected value is inadequate in capturing different types of "risk-averse" or "risk-prone" behaviors, and instead we consider a more general objective which is to maximize the "expected utility" of the solution for some given utility function, rather than the expected weight (expected weight becomes a special case). We show that we can obtain a polynomial time approximation algorithm with "additive error" "epsilon" for any "epsilon>0", if there is a pseudopolynomial time algorithm for the "exact" version of the problem. (This is true for the problems mentioned above). Our result generalizes several prior results on stochastic shortest path, stochastic spanning tree, and stochastic knapsack. Our algorithm for utility maximization makes use of the separability of exponential utility and a technique to decompose a general utility function into exponential utility functions, which may be useful in other stochastic optimization problems.
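The separability of exponential utility mentioned above is a standard moment-generating-function fact, sketched here (this is not the paper's derivation, just the identity it builds on):

```latex
% For independent element weights $w_e$ and a solution $S$ with total
% weight $W(S) = \sum_{e \in S} w_e$,
\[
  \mathbb{E}\!\left[e^{\lambda W(S)}\right]
  = \mathbb{E}\!\left[\prod_{e \in S} e^{\lambda w_e}\right]
  = \prod_{e \in S} \mathbb{E}\!\left[e^{\lambda w_e}\right],
\]
% so maximizing expected exponential utility reduces to optimizing a
% product of per-element factors; a general utility function is then
% handled by decomposing it into such exponential pieces.
```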
- Lightweight Graphical Models for Selectivity Estimation Without Independence Assumptions;
Kostas Tzoumas, Amol Deshpande, and Christian Jensen;
VLDB 2011.
[pdf]
[abstract]
As a result of decades of research and industrial development, modern query optimizers are complex software artifacts. However, the quality of the query plan chosen by an optimizer is largely determined by the quality of the underlying statistical summaries. Small selectivity estimation errors, propagated exponentially, can lead to severely sub-optimal plans. Modern optimizers typically maintain one-dimensional statistical summaries and make the attribute value independence and join uniformity assumptions for efficiently estimating selectivities. Therefore, selectivity estimation errors in today's optimizers are frequently caused by missed correlations between attributes. We present a selectivity estimation approach that does not make the independence assumptions. By carefully using concepts from the field of graphical models, we are able to factor the joint probability distribution of all the attributes in the database into small, usually two-dimensional distributions. We describe several optimizations that can make selectivity estimation highly efficient, and we present a complete implementation inside PostgreSQL's query optimizer. Experimental results indicate an order of magnitude better selectivity estimates, while keeping optimization time in the range of tens of milliseconds.
- Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases;
Bhargav Kanagal, Jian Li, Amol Deshpande;
SIGMOD 2011.
[pdf]
[abstract]
Probabilistic database systems have successfully established themselves as a tool for managing uncertain data. However, much of the research in this area has focused on efficient query evaluation and has largely ignored two key issues that commonly arise in uncertain data management: First, how to provide "explanations" for query results, e.g., "Why is this tuple in my result?" or "Why does this output tuple have such high probability?". Second, the problem of determining the "sensitive" input tuples for the given query, e.g., users want to know which input tuples can substantially alter the output when their probabilities are modified (since they may be unsure about the input probability values). Existing systems provide, in addition to the output probabilities, the "lineage/provenance" of each output tuple: a Boolean formula indicating how the output tuple depends on the input tuples. However, lineage does not immediately provide a quantitative relationship and it is not informative when we have multiple output tuples. In this paper, we propose a unified framework that can handle both the issues mentioned above to facilitate robust query processing. We formally define the notions of "influence" and "explanations" and provide algorithms to determine the top-l influential set of variables and the top-l set of explanations for a variety of queries, including "conjunctive" queries, "probabilistic threshold" queries, "top-k" queries and "aggregation" queries. Further, our framework naturally enables highly efficient incremental evaluation when input probabilities are modified (e.g., if uncertainty is resolved). Our preliminary experimental results demonstrate the benefits of our framework for performing robust query processing over probabilistic databases.
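A toy view of "influence" as sensitivity of an output probability to one input probability (the paper's formal definition is richer; the lineage and numbers here are invented). For the lineage x1 OR x2 over independent tuples, P(out) = 1 - (1 - p1)(1 - p2):

```python
def p_out(p1, p2):
    return 1 - (1 - p1) * (1 - p2)

def influence(f, args, i, eps=1e-6):
    # numerical partial derivative of f w.r.t. its i-th argument
    bumped = list(args)
    bumped[i] += eps
    return (f(*bumped) - f(*args)) / eps

p1, p2 = 0.9, 0.2
print(influence(p_out, (p1, p2), 0))  # ~0.8, i.e. 1 - p2
print(influence(p_out, (p1, p2), 1))  # ~0.1, i.e. 1 - p1
# tuple 1 is far more influential: nudging p1 moves the answer 8x more
```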
- A Unified Approach to Ranking in Probabilistic Databases;
Jian Li and Barna Saha and Amol Deshpande;
VLDB Journal, 20(2): 249-275, 2011.
[pdf]
[abstract]
Ranking is a fundamental operation in data analysis and decision support, and plays an even more crucial role if the dataset being explored exhibits uncertainty. This has led to much work in understanding how to rank the tuples in a probabilistic dataset in recent years. In this article, we present a unified approach to ranking and top-k query processing in probabilistic databases by viewing it as a multi-criteria optimization problem, and by deriving a set of "features" that capture the key properties of a probabilistic dataset that dictate the ranked result. We contend that a single, specific ranking function may not suffice for probabilistic databases, and we instead propose two "parameterized ranking functions", called PRF-w and PRF-e, that generalize or can approximate many of the previously proposed ranking functions. We present novel "generating functions"-based algorithms for efficiently ranking large datasets according to these ranking functions, even if the datasets exhibit complex correlations modeled using "probabilistic and/xor trees" or "Markov networks". We further propose that the parameters of the ranking function be "learned" from user preferences, and we develop an approach to learn those parameters. Finally, we present a comprehensive experimental study that illustrates the effectiveness of our parameterized ranking functions, especially PRF-e, at approximating other ranking functions and the scalability of our proposed algorithms for exact or approximate ranking.
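A sketch of the generating-function building block for the independent-tuple case (the paper also handles correlated inputs, which is not shown here). For a tuple t, the probability that exactly j higher-scored tuples appear is the coefficient of z^j in the product of (1 - p_i + p_i z) over the tuples i scored above t; a parameterized ranking function then weights these rank probabilities:

```python
def rank_distribution(higher_probs):
    coeffs = [1.0]                   # coeffs[j] = Pr[j higher tuples present]
    for p in higher_probs:
        nxt = [0.0] * (len(coeffs) + 1)
        for j, c in enumerate(coeffs):
            nxt[j] += c * (1 - p)    # tuple i absent
            nxt[j + 1] += c * p      # tuple i present
        coeffs = nxt
    return coeffs

# tuples scored above t exist with probabilities 0.5 and 0.4
above = rank_distribution([0.5, 0.4])
print(above)                         # [0.3, 0.5, 0.2]
# if t itself exists with probability 0.8, Pr[t at rank j+1] = 0.8 * above[j]
```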
- Declarative Analysis of Noisy Information Networks;
Walaa Eldin Moustafa, Galileo Namata, Amol Deshpande, Lise Getoor;
ICDE Workshop on Graph Data Management: Techniques and Applications (GDM 2011).
[abstract]
There is a growing interest in methods for analyzing data describing networks of all types, including information, biological, physical, and social networks. Typically the data describing these networks is observational, and thus noisy and incomplete; it is often at the wrong level of fidelity and abstraction for meaningful data analysis. This has resulted in a growing body of work on extracting, cleaning, and annotating network data. Unfortunately, much of this work is ad hoc and domain-specific. In this paper, we present the architecture of a data management system that enables efficient, declarative analysis of large-scale information networks. We identify a set of primitives to support the extraction and inference of a network from observational data, and describe a framework that enables a network analyst to easily implement and combine new extraction and analysis techniques, and efficiently apply them to large observational networks. The key insight behind our approach is to "decouple", to the extent possible, (a) the operations that require traversing the graph structure (typically the computationally expensive step), from (b) the operations that do the modification and update of the extracted network. We present an analysis language based on "Datalog", and show how to use it to cleanly achieve such decoupling. We briefly describe our prototype system that supports these abstractions. We include a preliminary performance evaluation of the system and show that our approach scales well and can efficiently handle a wide spectrum of data cleaning operations on network data.
- Energy Efficient Monitoring in Sensor Networks;
Amol Deshpande and Samir Khuller and Azarakhsh Malekian and Mohammed Toossi;
Algorithmica, Volume 59, Number 1, 94-114, 2011.
[pdf]
[abstract]
We study a set of problems related to efficient battery energy utilization for monitoring applications in a wireless sensor network with the goal to increase the sensor network lifetime. We study several generalizations of a basic problem called Set k-Cover. The problem can be described as follows: we are given a set of sensors, and a set of targets to be monitored. Each target can be monitored by a subset of the sensors. To increase the lifetime of the sensor network, we would like to partition the sensors into k sets (or time-slots), and activate each set of sensors in a different time-slot, thus extending the battery life of the sensors by a factor of k. The goal is to find a partitioning that maximizes the total coverage of the targets for a given k. This problem is known to be NP-hard. We develop an improved approximation algorithm for this problem using a reduction to Max k-Cut. Moreover, we are able to demonstrate that this algorithm is efficient, and yields almost optimal solutions in practice. We also consider generalizations of this problem in several different directions. First, we allow each sensor to be active in alpha different sets (time-slots). This means that the battery life is extended by a factor of k/alpha, and allows for a richer space of solutions. We also consider different coverage requirements, such as requiring that all targets, or at least a certain number of targets, be covered in each time slot. In the Set k-Cover formulation, there is no requirement that a target be monitored at all, or in any number of time slots. We develop a randomized rounding algorithm for this problem. We also consider extensions where each sensor can monitor only a bounded number of targets in any time-slot, and not all the targets adjacent to it. This kind of problem may arise when a sensor has a directional camera, or when some other physical constraint prevents it from monitoring all adjacent targets even when it is active. We develop the first approximation algorithms for this problem.
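A minimal baseline for Set k-Cover, useful for fixing intuition (this is a naive random partition, not the paper's Max k-Cut-based algorithm; the instance is invented):

```python
import random

def random_partition_coverage(sensors, k, seed=0):
    # sensors: dict sensor_id -> set of targets it can monitor
    random.seed(seed)
    covered = [set() for _ in range(k)]        # targets covered per time-slot
    for sid, targets in sensors.items():
        covered[random.randrange(k)] |= targets
    return sum(len(s) for s in covered)        # total coverage across slots

sensors = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"a", "c"}, 4: {"c", "d"}}
print(random_partition_coverage(sensors, k=2))
```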
- A Temporal Pattern Search Algorithm for Personal History Event Visualization;
Taowei Wang, Amol Deshpande, Ben Shneiderman;
IEEE TKDE 2010.
[pdf]
[abstract]
We present Temporal Pattern Search (TPS), a novel algorithm for searching for temporal patterns of events in personal histories. The traditional method of searching for such patterns uses an automaton-based approach over a single array of events, sorted by time stamps. Instead, TPS operates on a set of arrays, where each array contains all events of the same type, sorted by time stamps. TPS searches for a particular item in the pattern using a binary search over the appropriate arrays. Although binary search is considerably more expensive per item, it allows TPS to skip many unnecessary events in personal histories. We show that TPS's running time is bounded by O(m^2 n lg(n)), where m is the number of items in a search pattern, and n is the number of events in a history. Although the asymptotic running time of TPS is inferior to that of a non-deterministic finite automaton (NFA) approach (O(mn)), TPS performs better than NFA under our experimental conditions. We also show that TPS is very competitive with Shift-And, a bit-parallelism approach, on real data. Since the experimental conditions we describe here subsume the conditions under which analysts would typically use TPS (i.e., within an interactive visualization program), we argue that TPS is an appropriate design choice for our setting.
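The kernel of the idea, sketched (this is not the paper's full algorithm; event types and timestamps are invented): keep one sorted array per event type and binary-search for the next event of a given type after time t, rather than scanning a single merged array:

```python
import bisect

events_by_type = {
    "admit":    [1, 40, 95],
    "diagnose": [5, 42, 50],
    "treat":    [44, 60, 99],
}

def next_event(event_type, t):
    arr = events_by_type[event_type]
    i = bisect.bisect_right(arr, t)   # first timestamp strictly after t
    return arr[i] if i < len(arr) else None

# greedily match the pattern admit -> diagnose -> treat from time 0
t, matched = 0, []
for step in ["admit", "diagnose", "treat"]:
    t = next_event(step, t)
    matched.append((step, t))
print(matched)   # [('admit', 1), ('diagnose', 5), ('treat', 44)]
```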
- Ranking Continuous Probabilistic Datasets;
Jian Li, and Amol Deshpande;
VLDB 2010.
[pdf]
[abstract]
Ranking is a fundamental operation in data analysis and decision support, and plays an even more crucial role if the dataset being explored exhibits uncertainty. This has led to much work in understanding how to rank uncertain datasets in recent years. In this paper, we address the problem of ranking when the tuple scores are uncertain, and the uncertainty is captured using "continuous" probability distributions (e.g. Gaussian distributions). We present a comprehensive solution to compute the values of a "parameterized ranking function" (PRF) for arbitrary continuous probability distributions (and thus rank the uncertain dataset); PRF can be used to simulate or approximate many other ranking functions proposed in prior work. We develop exact polynomial time algorithms for some continuous probability distribution classes, and efficient approximation schemes with provable guarantees for arbitrary probability distributions. Our algorithms can also be used for exact or approximate evaluation of k-nearest neighbor queries over uncertain objects, whose positions are modeled using continuous probability distributions. Our experimental evaluation over several datasets illustrates the effectiveness of our approach at efficiently ranking uncertain datasets with continuous attribute uncertainty.
- Sharing-Aware Horizontal Partitioning for Exploiting Correlations during Query Processing;
Kostas Tzoumas, Amol Deshpande, and Christian Jensen;
VLDB 2010.
[pdf]
[abstract]
Optimization of join queries based on average selectivities is sub-optimal in highly correlated databases. In such databases, relations are naturally divided into partitions, each partition having substantially different statistical characteristics. It is very compelling to discover such data partitions during query optimization and create multiple plans for a given query, one plan being optimal for a particular combination of data partitions. This scenario calls for the sharing of state among plans, so that common intermediate results are not recomputed. We study this problem in a setting with a routing-based query execution engine based on eddies. Eddies naturally encapsulate horizontal partitioning and maximal state sharing across multiple plans. We define the notion of a conditional join plan, a novel representation of the search space that enables us to address the problem in a principled way. We present a low-overhead greedy algorithm that uses statistical summaries based on graphical models. Experimental results suggest an order-of-magnitude faster execution time compared to traditional optimization for high correlations, while maintaining the same performance for low correlations.
- Read-Once Functions and Query Evaluation in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
VLDB 2010.
[pdf]
[abstract]
Probabilistic databases hold promise of being a viable means for large-scale uncertainty management, increasingly needed in a number of real-world application domains. However, query evaluation in probabilistic databases remains a computational challenge. Prior work on efficient "exact" query evaluation in probabilistic databases has largely concentrated on query-centric formulations (e.g., "safe plans", "hierarchical queries"), in that they only consider characteristics of the query and not the data in the database. It is easy to construct examples where a supposedly hard query run on an appropriate database gives rise to a tractable query evaluation problem. In this paper, we develop efficient query evaluation techniques that leverage characteristics of both the query and the data in the database. We focus on tuple-independent databases where the query evaluation problem is equivalent to computing marginal probabilities of Boolean formulas associated with the result tuples. This latter task is easy if the Boolean formulas can be factorized into a form that has every variable appearing at most once (called "read-once"). However, a naive approach that directly uses previously developed Boolean formula factorization algorithms is inefficient, because those algorithms require the input formulas to be in the disjunctive normal form (DNF). We instead develop novel, more efficient factorization algorithms that directly construct the read-once expression for a result tuple Boolean formula (if one exists), for a large subclass of queries (specifically, conjunctive queries without self-joins). We empirically demonstrate that (1) our proposed techniques are orders of magnitude faster than generic inference algorithms for queries where the result Boolean formulas can be factorized into read-once expressions, and (2) for the special case of hierarchical queries, they rival the efficiency of prior techniques specifically designed to handle such queries.
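To see why read-once forms matter, a small illustrative evaluator (not the paper's factorization algorithm): once each tuple variable appears exactly once, the probability of the lineage formula follows from one bottom-up pass, with no general-purpose inference:

```python
def prob(node, p):
    # node: ("var", name) | ("and", children) | ("or", children)
    kind = node[0]
    if kind == "var":
        return p[node[1]]
    child_probs = [prob(c, p) for c in node[1]]
    out = 1.0
    if kind == "and":                 # independent conjuncts multiply
        for q in child_probs:
            out *= q
        return out
    for q in child_probs:             # "or": complement of the product
        out *= (1 - q)
    return 1 - out

# (x1 OR x2) AND x3 -- every variable appears exactly once
f = ("and", [("or", [("var", "x1"), ("var", "x2")]), ("var", "x3")])
print(prob(f, {"x1": 0.5, "x2": 0.5, "x3": 0.4}))   # 0.75 * 0.4 = 0.3
```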
- Lineage Processing on Correlated Probabilistic Databases;
Bhargav Kanagal, Amol Deshpande;
SIGMOD 2010.
[pdf]
[abstract]
In this paper, we address the problem of scalably evaluating conjunctive queries over correlated probabilistic databases containing tuple or attribute uncertainties. Like previous work, we adopt a two-phase approach where we first compute "lineages" of the output tuples, and then compute the probabilities of the lineage formulas. However, unlike previous work, we allow for arbitrary and complex correlations to be present in the data, captured via a forest of "junction trees". We observe that evaluating even read-once (tree-structured) lineages (e.g., those generated by "hierarchical" conjunctive queries), polynomially computable over tuple-independent probabilistic databases, is #P-complete for lightly correlated probabilistic databases like "Markov sequences". We characterize the complexity of exact computation of the probability of the lineage formula on a correlated database using a parameter called "lwidth" (analogous to the notion of "treewidth"). For lineages that result in low lwidth, we compute exact probabilities using a novel message passing algorithm, and for lineages that induce large lwidths, we develop approximate Monte Carlo algorithms to estimate the result probabilities. We scale our algorithms to very large correlated probabilistic databases using the previously proposed INDSEP data structure. To mitigate the complexity of lineage evaluation, we develop optimization techniques to process a batch of lineages by sharing computation across formulas, and to exploit any independence relationships that may exist in the data. Our experimental study illustrates the benefits of using our algorithms for processing lineage formulas over correlated probabilistic databases.
- On Computing Compression Trees for Data Collection in Wireless Sensor Networks;
Jian Li, Amol Deshpande, and Samir Khuller;
INFOCOM 2010 (also CoRR Technical Report arXiv:0907.5442).
[pdf]
[abstract]
We address the problem of efficiently gathering correlated data from a wired or a wireless sensor network, with the aim of designing algorithms with provable optimality guarantees, and understanding how close we can get to the known theoretical lower bounds. Our proposed approach is based on finding an optimal or a near-optimal "compression tree" for a given sensor network: a compression tree is a directed tree over the sensor network nodes such that the value of a node is compressed using the value of its parent. We consider this problem under different communication models, including the "broadcast communication" model that enables many new opportunities for energy-efficient data collection. We draw connections between the data collection problem and a previously studied graph concept, called "weakly connected dominating sets", and we use this to develop novel approximation algorithms for the problem. We present comparative results on several synthetic and real-world datasets showing that our algorithms construct near-optimal compression trees that yield a significant reduction in the data collection cost.
- PrDB: Managing and Exploiting Rich Correlations in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
VLDB Journal Special Issue on Uncertain and Probabilistic Databases, 18(6): 1065-1090, 2009.
[pdf]
[abstract]
Due to numerous applications producing noisy data, e.g., sensor data, experimental data, data from uncurated sources, information extraction, etc., there has been a surge of interest in the development of probabilistic databases. Most probabilistic database models proposed to date, however, fail to meet the challenges of real-world applications on two counts: (1) they often restrict the kinds of uncertainty that the user can represent; and (2) the query processing algorithms often cannot scale up to the needs of the application. In this work, we define a probabilistic database model, "PrDB", that uses graphical models, a state-of-the-art probabilistic modeling technique developed within the statistics and machine learning community, to model uncertain data. We show how this results in a rich, complex yet compact probabilistic database model, which can capture the commonly occurring uncertainty models (tuple uncertainty, attribute uncertainty), more complex models (correlated tuples and attributes) and allows compact representation (shared and schema-level correlations). In addition, we show how query evaluation in PrDB translates into inference in an appropriately augmented graphical model. This allows us to easily use any of a myriad of exact and approximate inference algorithms developed within the graphical modeling community. While probabilistic inference provides a generic approach to solving queries, we show how the use of shared correlations, together with a novel inference algorithm that we developed based on bisimulation, can speed query processing significantly. We present a comprehensive experimental evaluation of the proposed techniques and show that even with a few shared correlations, significant speedups are possible.
- A Unified Approach to Ranking in Probabilistic Databases;
Jian Li and Barna Saha and Amol Deshpande;
VLDB 2009 (also CoRR Technical Report arXiv:0904.1366).
[pdf]
[abstract]
The dramatic growth in the number of application domains that naturally generate probabilistic, uncertain data has resulted in a need for efficiently supporting complex querying and decision-making over such data. In this paper, we present a unified approach to ranking and top-k query processing in probabilistic databases by viewing it as a multi-criteria optimization problem, and by deriving a set of features that capture the key properties of a probabilistic dataset that dictate the ranked result. We contend that a single, specific ranking function may not suffice for probabilistic databases, and we instead propose two parameterized ranking functions, called PRF-w and PRF-e, that generalize or can approximate many of the previously proposed ranking functions. We present novel generating functions-based algorithms for efficiently ranking large datasets according to these ranking functions, even if the datasets exhibit complex correlations modeled using probabilistic and/xor trees or Markov networks. We further propose that the parameters of the ranking function be learned from user preferences, and we develop an approach to learn those parameters. Finally, we present a comprehensive experimental study that illustrates the effectiveness of our parameterized ranking functions, especially PRF-e, at approximating other ranking functions and the scalability of our proposed algorithms for exact or approximate ranking.
- Bisimulation-based Approximate Lifted Inference;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
UAI 2009.
[pdf]
[abstract]
There has been a great deal of recent interest in methods for performing lifted inference; however, most of this work assumes that the first-order model is given as input to the system. Here, we describe lifted inference algorithms that determine symmetries and automatically "lift" the probabilistic model to speedup inference. In particular, we describe approximate lifted inference techniques that allow the user to trade off inference accuracy for computational efficiency by using a handful of tunable parameters, while keeping the error bounded. Our algorithms are closely related to the graph-theoretic concept of bisimulation. We report experiments on both synthetic and real data to show that in the presence of symmetries, run-times for inference can be improved significantly, with approximate lifted inference providing speedups of up to two orders of magnitude over ground inference.
- Consensus Answers for Queries over Probabilistic Databases;
Jian Li and Amol Deshpande;
PODS 2009 (also CoRR Technical Report arXiv:0812.2049v1).
[pdf]
[abstract]
We address the problem of finding a "best" deterministic query answer to a query over a probabilistic database. For this purpose, we propose the notion of a consensus world (or a consensus answer) which is a deterministic world (answer) that minimizes the expected distance to the possible worlds (answers). This problem can be seen as a generalization of the well-studied inconsistent information aggregation problems (e.g. rank aggregation) to probabilistic databases. We consider this problem for various types of queries including SPJ queries, Top-k queries, group-by aggregate queries, and clustering. For different distance metrics, we obtain polynomial time optimal or approximation algorithms for computing the consensus answers (or prove NP-hardness). Most of our results are for a general probabilistic database model, called "and/xor tree model", which significantly generalizes previous probabilistic database models like x-tuples and block-independent disjoint models, and is of independent interest.
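One concrete instance that is easy to verify (the paper covers many more query types and the correlated and/xor tree model; this sketch assumes independent tuples): under the symmetric-difference distance between answer sets, the expected distance of a set S from the random world W is E[|S Δ W|] = sum over t in S of (1 - p_t) plus sum over t not in S of p_t, so the consensus answer keeps exactly the tuples with probability at least 1/2:

```python
def consensus_answer(tuple_probs):
    # minimizes expected symmetric difference to the possible worlds
    return {t for t, p in tuple_probs.items() if p >= 0.5}

probs = {"t1": 0.9, "t2": 0.45, "t3": 0.6, "t4": 0.1}
print(sorted(consensus_answer(probs)))   # ['t1', 't3']
```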
- Indexing Correlated Probabilistic Databases;
Bhargav Kanagal, Amol Deshpande;
SIGMOD 2009.
[pdf]
[abstract]
With large amounts of correlated probabilistic data being generated in a wide range of application domains including sensor networks, information extraction, event detection, etc., effectively managing and querying them has become an important research challenge. While there is an extensive body of literature on querying independent probabilistic data, supporting efficient queries over large-scale, correlated databases remains a challenge. In this paper, we develop efficient data structures and indexes for supporting inference and decision support queries over such databases. Our proposed hierarchical data structure is suitable both for in-memory and disk-resident databases. We represent the correlations in the probabilistic database using a "junction tree" over the tuple-existence or attribute-value random variables, and use "tree partitioning" techniques to build an index structure over it. We show how to efficiently answer inference and aggregation queries using such an index, resulting in orders of magnitude performance benefits in most cases. In addition, we develop novel algorithms for efficiently keeping the index structure up-to-date as changes (inserts, updates) are made to the probabilistic database. We present a comprehensive experimental study illustrating the benefits of our approach to query processing in probabilistic databases.
- Minimizing Communication Cost in Distributed Multi-query Processing;
Jian Li, Amol Deshpande, and Samir Khuller;
ICDE 2009.
[pdf]
[abstract]
Increasing prevalence of large-scale distributed monitoring and computing environments such as sensor networks, scientific federations, Grids etc., has led to a renewed interest in the area of distributed query processing and optimization. In this paper we address a general, distributed multi-query processing problem motivated by the need to minimize the communication cost in these environments. Specifically, we address the problem of optimally sharing data movement across the communication edges in the communication network given a set of overlapping queries and query plans for them (specifying the join orders to be used to execute the queries). Most variations of our general problem can be shown to be NP-Hard via a reduction from the Steiner tree problem. However, we show that the problem can be solved optimally if the communication network is a tree, and present a novel algorithm for finding the optimal sharing plan. For general communication networks, we present efficient approximation algorithms for several variations of the problem. Finally, we present a preliminary experimental study over synthetic datasets showing both the need for exploiting sharing of data movement and the effectiveness of our algorithms at finding such plans.
- Efficient Query Evaluation over Temporally Correlated Probabilistic Streams;
Bhargav Kanagal, Amol Deshpande;
ICDE 2009 (short paper).
[pdf]
[abstract]
Many real world applications such as sensor networks and other monitoring applications naturally generate probabilistic streams that are highly correlated in both time and space. Query processing over such streaming data must be cognizant of these correlations, since they can significantly alter the final query results. Several prior works have suggested approaches to handling correlations in probabilistic databases. However, those approaches are either unable to represent the types of correlations that probabilistic streams exhibit, or cannot be applied directly to our problem because of their complexity. In this paper, we develop a system for managing and querying such streams by exploiting the fact that most real-world probabilistic streams exhibit highly structured Markovian correlations. Our approach is based on the previously proposed framework of viewing probabilistic query evaluation as inference over graphical models; we show how to efficiently construct graphical models for the common stream processing operators, and how to efficiently perform inference over them in an incremental fashion. Our extensive experimental evaluation illustrates the advantages of exploiting the structured nature of correlations in probabilistic streams.
- Graphical Models for Uncertain Data;
Amol Deshpande, Lise Getoor, and Prithviraj Sen;
Book Chapter. Managing and Mining Uncertain Data, ed. C. Aggarwal, Springer, 2009.
[pdf]
[abstract]
Graphical models are a popular and well-studied framework for compact representation of a joint probability distribution over a large number of interdependent variables, and for efficient reasoning about such a distribution. They have been proven useful in a wide range of domains from natural language processing to computer vision to bioinformatics. In this chapter, we present an approach to using graphical models for managing and querying large-scale uncertain databases. We present a unified framework based on the concepts from graphical models that can model not only tuple-level and attribute-level uncertainties, but can also handle arbitrary correlations that may be present among the data; our framework can also naturally capture "shared correlations" where the same uncertainties and correlations occur repeatedly in the data. We develop an efficient strategy for query evaluation over such probabilistic databases by casting the query processing problem as an "inference" problem in an appropriately constructed graphical model, and present optimizations specific to probabilistic databases that enable efficient query evaluation. We conclude the chapter with a discussion of related and future work on these topics.
- Algorithms for Distributional and Adversarial Pipelined Filter Ordering Problems;
Anne Condon, Amol Deshpande, Lisa Hellerstein, and Ning Wu;
ACM Transactions on Algorithms, 5(2), Article 24, March 2009.
[pdf]
[abstract]
Pipelined filter ordering is a central problem in database query optimization. The problem is to determine the optimal order in which to apply a given set of commutative filters (predicates) to a set of elements (the tuples of a relation), so as to find, as efficiently as possible, the tuples that satisfy all of the filters. Optimization of pipelined filter ordering has recently received renewed attention in the context of environments such as the web, continuous high-speed data streams, and sensor networks. Pipelined filter ordering problems are also studied in areas such as fault detection and machine learning under names such as learning with attribute costs, minimum-sum set cover, and satisficing search. We present algorithms for two natural extensions of the classical pipelined filter ordering problem: (1) a "distributional type" problem where the filters run in parallel and the goal is to maximize throughput, and (2) an "adversarial type" problem where the goal is to minimize the expected value of "multiplicative regret". We present two related algorithms for solving (1), both running in time O(n^2), which improve on the O(n^3 log(n)) algorithm of Kodialam. We use techniques from our algorithms for (1) to obtain an algorithm for (2).
- Model-based Querying in Sensor Networks;
Amol Deshpande, Carlos Guestrin, Samuel Madden;
Chapter in Encyclopedia of Database Systems, Ling Liu and M. Tamer Ozsu, eds., 2009.
[pdf]
- Data Compression in Sensor Networks;
Amol Deshpande;
Chapter in Encyclopedia of Database Systems, Ling Liu and M. Tamer Ozsu, eds., 2009.
[pdf]
- Exploiting Shared Correlations in Probabilistic Databases;
Prithviraj Sen, Amol Deshpande, and Lise Getoor;
VLDB 2008.
[pdf]
[abstract]
There has been a recent surge in work in probabilistic databases, propelled in large part by the huge increase in noisy data sources --- from sensor data, experimental data, data from uncurated sources, and many others. There is a growing need for database management systems that can efficiently represent and query such data. In this work, we show how data characteristics can be leveraged to make the query evaluation process more efficient. In particular, we exploit what we refer to as "shared correlations" where the same uncertainties and correlations occur repeatedly in the data. Shared correlations occur mainly due to two reasons: (1) Uncertainty and correlations usually come from general statistics and rarely vary on a tuple-to-tuple basis; (2) The query evaluation procedure itself tends to re-introduce the same correlations. Prior work has shown that the query evaluation problem on probabilistic databases is equivalent to a probabilistic inference problem on an appropriately constructed probabilistic graphical model (PGM). We leverage this by introducing a new data structure, called the "random variable elimination graph" (rv-elim graph) that can be built from the PGM obtained from query evaluation. We develop techniques based on bisimulation that can be used to compress the rv-elim graph exploiting the presence of shared correlations in the PGM; the compressed rv-elim graph can then be used to run inference. We validate our methods by evaluating them empirically and show that even with a few shared correlations, significant speed-ups are possible.
- Energy Efficient Monitoring in Sensor Networks;
Amol Deshpande and Samir Khuller and Azarakhsh Malekian and Mohammed Toossi;
LATIN 2008.
[pdf]
[abstract]
In this paper we study a set of problems related to efficient energy management for monitoring applications in wireless sensor networks. We study several generalizations of a basic problem called Set "k"-Cover. The problem can be described as follows: we are given a set of sensors, and a set of regions to be monitored. Each region can be monitored by a subset of sensors. The goal is to partition the sensors into "k" sets (or time-slots), so that by activating the set of sensors in a time-slot, we can maximize coverage of the regions. By activating each sensor in only one of the "k" time slots, we decrease its battery consumption and extend its battery life significantly (by a factor of k). This problem is known to be NP-hard. We develop an improved approximation algorithm for this problem, and we are able to demonstrate that this algorithm is practical, and yields almost optimal solutions in practice.
We also consider generalizations of this problem in several different directions. First, we allow each sensor to be active in alpha different sets (time-slots). This means that the battery life is extended by a factor of k/alpha, and allows for a richer space of solutions. We also consider different coverage requirements, such as requiring that all regions, or at least a certain number of regions, be covered in each time slot. In the Set k-Cover formulation, there is no requirement that a region be monitored at all, or in any number of time slots. We develop a randomized rounding algorithm for this problem.
We also consider extensions where each sensor can monitor only a bounded number of regions, and not all the regions adjacent to it. This kind of problem may arise when a sensor has a directional camera, or some other physical constraint might prevent it from monitoring all adjacent regions even when it is active. We develop the first approximation algorithms for this problem.
- Predictive Modeling-based Data Collection in Sensor Networks;
Lidan Wang and Amol Deshpande;
EWSN 2008.
[pdf]
[abstract]
We address the problem of designing practical, energy-efficient protocols for data collection in wireless sensor networks using predictive modeling. Prior work has suggested several approaches to capture and exploit the rich spatio-temporal correlations prevalent in WSNs during data collection. Although shown to be effective in reducing the data collection cost, those approaches use simplistic correlation models and, further, ignore many idiosyncrasies of WSNs, in particular the broadcast nature of communication. Our proposed approach is based on approximating the joint probability distribution over the sensors using "undirected graphical models", ideally suited to exploit both the spatial correlations and the broadcast nature of communication. We present algorithms for optimally using such a model for data collection under different communication models, and for identifying an appropriate model to use for a given sensor network. Experiments over synthetic and real-world datasets show that our approach significantly reduces the data collection cost.
- Flow Algorithms for Parallel Query Optimization;
Amol Deshpande, Lisa Hellerstein;
ICDE 2008.
[pdf]
[abstract]
In this paper we address the problem of minimizing the response time of a multi-way join query using pipelined (inter-operator) parallelism, in a parallel or a distributed environment. We observe that in order to fully exploit the parallelism in the system, we must consider a new class of "interleaving" plans, where multiple query plans are used simultaneously to minimize the response time of a query (or maximize the tuple-throughput of the system). We cast the query planning problem in this environment as a "flow maximization problem", and present polynomial-time algorithms that (statically) find the optimal set of plans to use for a large class of multi-way join queries. Our proposed algorithms also naturally extend to query optimization over web services. Finally we present an extensive experimental evaluation that demonstrates both the need to consider such plans in parallel query processing and the effectiveness of our proposed algorithms.
- Online Filtering, Smoothing and Probabilistic Modeling of Streaming Data;
Bhargav Kanagal, Amol Deshpande;
ICDE 2008.
[pdf]
[abstract]
In this paper, we address the problem of extending a relational database system to facilitate efficient real-time application of dynamic probabilistic models to streaming data. We use the recently proposed abstraction of model-based views for this purpose, by allowing users to declaratively specify the model to be applied, and by presenting the output of the models to the user as a probabilistic database view. We support declarative querying over such views using an extended version of SQL that allows for querying probabilistic data. Underneath we use particle filters, a class of sequential Monte Carlo algorithms, to represent the present and historical states of the model as sets of weighted samples (particles) that are kept up-to-date as new data arrives. We develop novel techniques to convert the queries on the model-based view directly into queries over particle tables, enabling highly efficient query processing. Finally, we present experimental evaluation of our prototype implementation over several synthetic and real datasets, that demonstrates the feasibility of online modeling of streaming data using our system and establishes the advantages of tight integration between dynamic probabilistic models and databases.
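A bare-bones particle filter of the kind such a model-based view maintains (a sketch under invented parameters; the actual operators, SQL extensions, and particle-table layout are the paper's and are not shown here):

```python
import math, random

N = 1000
particles = [0.0] * N                   # samples of the hidden state
weights = [1.0 / N] * N

def step(observation, proc_std=1.0, obs_std=0.5):
    global particles, weights
    # 1. propagate each particle through the transition model
    particles = [x + random.gauss(0, proc_std) for x in particles]
    # 2. reweight by the likelihood of the new observation
    weights = [math.exp(-((observation - x) ** 2) / (2 * obs_std ** 2))
               for x in particles]
    total = sum(weights)
    weights = [w / total for w in weights]
    # 3. resample to avoid weight degeneracy
    particles = random.choices(particles, weights, k=N)
    weights = [1.0 / N] * N

for obs in [0.2, 0.7, 1.1, 1.6]:        # stream of noisy readings
    step(obs)
print(sum(particles) / N)               # posterior mean after four readings
```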
- Network-Aware Join Processing in Global-Scale Database Federations;
Xiaodan Wang, Randal Burns, Andreas Terzis, Amol Deshpande;
ICDE 2008.
[pdf]
[abstract]
We introduce join scheduling algorithms that employ a balanced network utilization metric to optimize the use of all network paths in a global-scale database federation. This metric allows algorithms to exploit excess capacity in the network, while avoiding narrow, long-haul paths. We give a two-approximate, polynomial-time algorithm for serial (left-deep) join schedules. We also present extensions to this algorithm that explore parallel schedules, reduce resource usage, and define trade-offs between computation and network utilization. We evaluate these techniques within the SkyQuery federation of Astronomy databases using spatial-join queries submitted by SkyQuery's users. Experiments show that our algorithms realize near-optimal network utilization with minor computational overhead.
- Model-based Approximate Querying in Sensor Networks;
Amol Deshpande, Carlos Guestrin, Sam Madden, Joseph M. Hellerstein, Wei Hong;
International Journal on Very Large Data Bases (VLDB Journal), 2005.
[pdf]
[abstract]
Declarative queries are proving to be an attractive paradigm for interacting with networks of wireless sensors. The metaphor that "the sensornet is a database" is problematic, however, because sensors do not exhaustively represent the data in the real world. In order to map the raw sensor readings onto physical reality, a "model" of that reality is required to complement the readings. In this article, we enrich interactive sensor querying with statistical modeling techniques. We demonstrate that such models can help provide answers that are both more meaningful, and, by introducing approximations with probabilistic confidences, significantly more efficient to compute in both time and energy. Utilizing the combination of a model and live data acquisition raises the challenging optimization problem of selecting the best sensor readings to acquire, balancing the increase in the confidence of our answer against the communication and data acquisition costs in the network. We describe an exponential time algorithm for finding the optimal solution to this optimization problem, and a polynomial-time heuristic for identifying solutions that perform well in practice. We evaluate our approach on several real-world sensor-network data sets, taking into account the real measured data and communication quality, demonstrating that our model-based approach provides a high-fidelity representation of the real phenomena and leads to significant performance gains versus traditional data acquisition techniques.
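The core model-based trick in miniature, assuming numpy and a two-sensor Gaussian model (invented numbers; this is not code from the paper's system): maintain a joint distribution over the sensors, and answer a query about one sensor by conditioning on a reading acquired from another, with the posterior variance indicating whether the acquisition was worthwhile:

```python
import numpy as np

mu = np.array([20.0, 22.0])             # prior means of two sensors
Sigma = np.array([[4.0, 3.0],
                  [3.0, 4.0]])          # strongly correlated readings

def condition(mu, Sigma, i, value):
    # posterior of the other sensor given sensor i = value (2-D case)
    j = 1 - i
    mean = mu[j] + Sigma[j, i] / Sigma[i, i] * (value - mu[i])
    var = Sigma[j, j] - Sigma[j, i] ** 2 / Sigma[i, i]
    return mean, var

# acquiring sensor 0's reading shrinks sensor 1's variance from 4 to 1.75,
# so one cheap acquisition may suffice to answer a query about sensor 1
print(condition(mu, Sigma, 0, 24.0))    # (25.0, 1.75)
```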
- Toward On-line Schema Evolution for Non-stop Systems;
Amol Deshpande, Michael Hicks;
HPTS 2005.
- Resource-Aware Wireless Sensor-Actuator Networks;
Amol Deshpande, Carlos Guestrin, Sam Madden;
IEEE Data Engineering Bulletin March 2005.
[pdf]
[abstract]
Innovations in wireless sensor networks (WSNs) have dramatically expanded the applicability of control technology in day-to-day life, by enabling the cost-effective deployment of large scale sensor-actuator systems. In this paper, we discuss the issues and challenges involved in deploying control-oriented applications over unreliable, resource-constrained WSNs, and describe the design of our planned Sensor Control System (SCS) that can enable the rapid development and deployment of such applications.
- Exploiting Correlated Attributes in Acquisitional Query Processing;
Amol Deshpande, Carlos Guestrin, Sam Madden, Wei Hong;
ICDE 2005.
[pdf]
[abstract]
Sensor networks and other distributed information systems (such as the Web) must frequently access data that has a high per-attribute "acquisition" cost, in terms of energy, latency, or computational resources. When executing queries that contain several predicates over such expensive attributes, we observe that it can be beneficial to use correlations to automatically introduce low-cost attributes whose observation will allow the query processor to better estimate the selectivity of these expensive predicates. In particular, we show how to build "conditional plans" that branch into one or more sub-plans, each with a different ordering for the expensive query predicates, based on the runtime observation of low-cost attributes. We frame the problem of constructing the optimal conditional plan for a given user query and set of candidate low-cost attributes as an optimization problem. We describe an exponential time algorithm for finding such optimal plans, and describe a polynomial-time heuristic for identifying conditional plans that perform well in practice. We also show how to compactly model conditional probability distributions needed to identify correlations and build these plans. We evaluate our algorithms against several real-world sensor-network data sets, showing severalfold performance increases for a variety of queries versus traditional optimization techniques.
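A toy conditional plan (purely illustrative; the attribute names, selectivities, and costs are invented, and the paper's plan-construction algorithm is not shown): observe a cheap attribute first, then choose the expensive-predicate order from conditional selectivities:

```python
# conditional selectivity of each expensive predicate given whether the
# cheap attribute "hour" falls in day or night
COND_SEL = {
    "day":   {"light>500": 0.9,  "temp>30": 0.4},
    "night": {"light>500": 0.05, "temp>30": 0.2},
}
COST = {"light>500": 10.0, "temp>30": 25.0}   # acquisition costs

def plan_for(branch):
    # cheapest-rejection-first: order by cost / (1 - selectivity)
    sels = COND_SEL[branch]
    return sorted(sels, key=lambda pred: COST[pred] / (1 - sels[pred]))

print(plan_for("day"))     # ['temp>30', 'light>500']
print(plan_for("night"))   # ['light>500', 'temp>30']
# the observed cheap attribute flips which expensive predicate goes first
```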
- Using Probabilistic Models for Data Management in Acquisitional Environments;
Amol Deshpande, Carlos Guestrin, Sam Madden;
CIDR 2005.
[pdf]
[abstract]
Traditional database systems, particularly those focused on capturing and managing data from the real world, are poorly equipped to deal with the noise, loss, and uncertainty in data. We discuss a suite of techniques based on probabilistic models that are designed to allow databases to tolerate noise and loss. These techniques are based on exploiting correlations to predict missing values and identify outliers. Interestingly, correlations also provide a way to give approximate answers to users at a significantly lower cost and enable a range of new types of queries over the correlation structure itself. We illustrate a host of applications for our new techniques and queries, ranging from sensor networks to network monitoring to data stream management. We also present a unified architecture for integrating such models into database systems, focusing in particular on "acquisitional systems" where the cost of capturing data (e.g., from sensors) is itself a significant part of the query processing cost.