Robust Uncertain Data Management
The goal of this project is to develop a systematic framework to enable "robust" query processing
in the presence of data uncertainties, which arise naturally in a wide variety of application
domains including social media analysis, scientific and biological data management, sensor data
management, and text analytics. Data uncertainties may take the form of missing or incomplete data,
inherent noise in the data, trust or reputation scores assigned to data based on their sources of origin,
or confidences in predictions made using automated modeling tools. The input uncertainties
naturally lead to uncertainties in the results of any queries or analysis performed on such data.
To enable robust and systematic reasoning over such uncertain query results, we are designing algorithms
and tools to identify the input uncertainties to which the query results are most sensitive, to decide how to use
scarce resources like subject matter experts to resolve uncertainties in query results, and to
incorporate user feedback to improve the robustness of the input uncertainty parameters themselves.
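To make the setting concrete, the sketch below shows, under a simple tuple-independence assumption, how per-tuple uncertainty propagates into the answer of an aggregate query via possible-worlds sampling. The relation, probabilities, and query are hypothetical and only illustrate the problem the project addresses; they are not project code.

```python
import random

# Hypothetical uncertain relation: each tuple is (value, probability of existing).
tuples = [(10, 0.9), (25, 0.5), (7, 0.8), (40, 0.3)]

def sample_world(rng):
    """Draw one possible world: each tuple appears independently."""
    return [v for v, p in tuples if rng.random() < p]

rng = random.Random(0)
# A toy aggregate query (SUM): input uncertainty induces a *distribution*
# over query answers rather than a single value.
answers = [sum(sample_world(rng)) for _ in range(100_000)]
print(f"estimated E[SUM] = {sum(answers) / len(answers):.2f}")
```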
Project Participants
Publications
- Solving Zero-sum Games using Best Response Oracles with Applications to Search Games;
Lisa Hellerstein, Thomas Lidbetter, Daniel Pirutinsky;
CoRR Technical Report arXiv:1704.02657, 2017.
- Submodular goal value of Boolean functions;
Eric Bach, Jeremie Dusart, Lisa Hellerstein, Devorah Kletenik;
Discrete Applied Mathematics 238: 1-13 (also CoRR Technical Report arXiv:1702.04067), 2018.
[pdf]
- The Stochastic Score Classification Problem;
Dimitrios Gkenosis, Nathaniel Grammel, Lisa Hellerstein, Devorah Kletenik;
Proceedings of the European Symposium on Algorithms (ESA) 36:1-14, 2018 (also CoRR Technical Report arXiv:1806.10660).
[pdf]
- Approximation algorithms for stochastic submodular set cover with applications to Boolean function evaluation and min-knapsack;
Amol Deshpande, Lisa Hellerstein, and Devorah Kletenik;
ACM Transactions on Algorithms 12:3 (42), 2016.
- Scenario Submodular Cover;
Nathaniel Grammel, Lisa Hellerstein, Devorah Kletenik, and Patrick Lin;
Workshop on Approximation and Online Algorithms (WAOA), 2016, Revised Selected Papers (also CoRR Technical Report arXiv:1603.03158).
- Evaluation of monotone DNF Formulas;
Sarah R. Allen, Lisa Hellerstein, Devorah Kletenik, and Tonguc Unluyurt;
Algorithmica 77(3): 661-685, 2017.
- CrowdGather: Entity Extraction over Structured Domains;
Theodoros Rekatsinas, Amol Deshpande, Aditya Parameswaran;
CoRR Technical Report arXiv:1502.06823, 2015.
[abstract] Crowdsourced entity extraction is often used to acquire data for many applications, including recommendation systems, construction of aggregated listings and directories, and knowledge base construction. Current solutions focus on entity extraction using a single query, e.g., only using "give me another restaurant", when assembling a list of all restaurants. Due to the cost of human labor, solutions that focus on a single query can be highly impractical. In this paper, we leverage the fact that entity extraction often focuses on "structured domains", i.e., domains that are described by a collection of attributes, each potentially exhibiting hierarchical structure. Given such a domain, we enable a richer space of queries, e.g., "give me another Moroccan restaurant in Manhattan that does takeout". Naturally, enabling a richer space of queries comes with a host of issues, especially since many queries return empty answers. We develop new statistical tools that enable us to reason about the gain of issuing "additional queries" given little to no information, and show how we can exploit the overlaps across the results of queries for different points of the data domain to obtain accurate estimates of the gain. We cast the problem of "budgeted entity extraction" over large domains as an adaptive optimization problem that seeks to maximize the number of extracted entities, while minimizing the overall extraction costs. We evaluate our techniques with experiments on both synthetic and real-world datasets, demonstrating a yield of up to 4X over competing approaches for the same budget.
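As a rough illustration of the "gain of issuing additional queries" idea in the abstract above, the sketch below uses a classic Good-Turing estimate of the unseen mass: the fraction of entities observed exactly once. The paper's estimators are more sophisticated and exploit domain structure and overlaps across queries, so this is only a baseline for intuition, and the example answers are made up.

```python
from collections import Counter

def new_entity_probability(extractions):
    """Good-Turing estimate: Pr[next answer is new] ~ (#entities seen once) / (#answers)."""
    counts = Counter(extractions)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(extractions) if extractions else 1.0

# Hypothetical crowd answers to repeated "give me another restaurant ..." queries.
answers = ["Cafe Mogador", "Zou Zou's", "Cafe Mogador", "Boqueria", "Nish Nush"]
print(f"P(next answer is new) ~ {new_entity_probability(answers):.2f}")  # 0.60
```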
- Discrete Stochastic Submodular Maximization: Adaptive vs. Non-Adaptive vs. Offline;
Lisa Hellerstein, Devorah Kletenik, and Patrick Lin;
International Conference on Algorithms and Complexity (CIAC), 2015 (also CoRR Technical Report arXiv:1504.02146).
[pdf]
- Evaluation of DNF Formulas (on-line extended abstract);
Sarah R. Allen, Lisa Hellerstein, Devorah Kletenik, and Tonguc Unluyurt;
International Symposium on Artificial Intelligence and Mathematics (also CoRR Technical Report arXiv:1310.3673), 2014.
- Boolean function evaluation over a sample;
Lisa Hellerstein, Devorah Kletenik, and Patrick Lin;
NIPS Workshop on Discrete and Combinatorial Problems in Machine Learning (DISCML), 2014.
[pdf]
- Algorithms and structural results for DNF and set cover problems;
Devorah Kletenik;
PhD Dissertation, Polytechnic University, 2014.
[pdf]
- Approximation Algorithms for Stochastic Boolean Function Evaluation and Stochastic Submodular Set Cover;
Amol Deshpande, Lisa Hellerstein, and Devorah Kletenik;
SODA 2014 (also CoRR Technical Report arXiv:1303.0726).
[pdf]
[abstract] Stochastic Boolean Function Evaluation is the problem of determining the value of a given Boolean function f on an unknown input x, when each bit x_i of x can only be determined by paying an associated cost c_i. The assumption is that x is drawn from a given product distribution, and the goal is to minimize the expected cost. This problem has been studied in Operations Research, where it is known as "sequential testing" of Boolean functions. It has also been studied in learning theory in the context of learning with attribute costs. We consider the general problem of developing approximation algorithms for Stochastic Boolean Function Evaluation. We give a 3-approximation algorithm for evaluating Boolean linear threshold formulas. We also present an approximation algorithm for evaluating CDNF formulas (and decision trees) achieving a factor of O(log kd), where k is the number of terms in the DNF formula, and d is the number of clauses in the CNF formula. In addition, we present approximation algorithms for simultaneous evaluation of linear threshold functions, and for ranking of linear functions. Our function evaluation algorithms are based on reductions to the Stochastic Submodular Set Cover (SSSC) problem. This problem was introduced by Golovin and Krause. They presented an approximation algorithm for the problem, called Adaptive Greedy. Our main technical contribution is a new approximation algorithm for the SSSC problem, which we call Adaptive Dual Greedy. It is an extension of the Dual Greedy algorithm for Submodular Set Cover due to Fujito, which is a generalization of Hochbaum's algorithm for the classical Set Cover Problem. We also give a new bound on the approximation achieved by the Adaptive Greedy algorithm of Golovin and Krause.
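For intuition about sequential testing, the sketch below implements only the classical special case of evaluating f = AND(x_1, ..., x_n) with independent bits, where testing in nondecreasing order of c_i / Pr[x_i = 0] is the well-known optimal rule. The paper's algorithms handle much richer classes (linear threshold formulas, CDNF) via stochastic submodular set cover, which this toy example does not attempt; the instance data are hypothetical.

```python
def evaluate_and(costs, p_zero, reveal):
    """Adaptively evaluate AND of n bits; reveal(i) pays costs[i] and returns bit i."""
    order = sorted(range(len(costs)), key=lambda i: costs[i] / p_zero[i])
    spent = 0
    for i in order:
        spent += costs[i]
        if reveal(i) == 0:      # a single 0 certifies that f(x) = 0
            return 0, spent
    return 1, spent             # every bit was 1

# Hypothetical instance: hidden input x, testing costs, and Pr[x_i = 0].
x = [1, 0, 1]
value, cost = evaluate_and(costs=[3, 1, 2], p_zero=[0.1, 0.5, 0.2],
                           reveal=lambda i: x[i])
print(value, cost)  # cheap, likely-zero bits are tested first
```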
- Approximation algorithms for reducing classification costs in ensembles of classifiers;
Sarah R. Allen, and Lisa Hellerstein;
NIPS Workshop on Discrete and Combinatorial Problems in Machine Learning (DISCML), 2013.
[pdf]
- Maximizing Expected Utility for Stochastic Combinatorial Optimization Problems;
Jian Li, and Amol Deshpande;
FOCS 2011 (also CoRR Technical Report arXiv:1012.3189).
[pdf]
[abstract] We study the stochastic versions of a broad class of combinatorial problems where the weights of the elements in the input dataset are uncertain. The class of problems that we study includes shortest paths, minimum weight spanning trees, and minimum weight matchings over probabilistic graphs, and other combinatorial problems like knapsack. We observe that the expected value is inadequate in capturing different types of "risk-averse" or "risk-prone" behaviors, and instead we consider a more general objective which is to maximize the "expected utility" of the solution for some given utility function, rather than the expected weight (expected weight becomes a special case). We show that we can obtain a polynomial time approximation algorithm with additive error epsilon, for any epsilon > 0, if there is a pseudopolynomial time algorithm for the "exact" version of the problem. (This is true for the problems mentioned above.) Our result generalizes several prior results on stochastic shortest path, stochastic spanning tree, and stochastic knapsack. Our algorithm for utility maximization makes use of the separability of exponential utility and a technique to decompose a general utility function into exponential utility functions, which may be useful in other stochastic optimization problems.
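The separability property that the algorithm exploits is easy to check numerically: for independent weights and exponential utility u(w) = exp(-lam*w), E[u(sum_i w_i)] factors into a product of per-element expectations. The sketch below verifies this identity on made-up two-point weight distributions; it illustrates the identity only, not the paper's full decomposition technique.

```python
import math, random

lam = 0.8
weights = [(1.0, 4.0, 0.5), (2.0, 3.0, 0.7)]  # hypothetical (low, high, Pr[low]) per element

# Exact product form: prod_i E[exp(-lam * w_i)], using independence.
product = 1.0
for lo, hi, p in weights:
    product *= p * math.exp(-lam * lo) + (1 - p) * math.exp(-lam * hi)

# Direct Monte Carlo estimate of E[exp(-lam * sum_i w_i)].
rng = random.Random(1)
n = 200_000
total = sum(math.exp(-lam * sum(lo if rng.random() < p else hi
                                for lo, hi, p in weights))
            for _ in range(n))
print(f"product form: {product:.4f}  direct estimate: {total / n:.4f}")
```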
- Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases;
Bhargav Kanagal, Jian Li, Amol Deshpande;
SIGMOD 2011.
[pdf]
[abstract] Probabilistic database systems have successfully established themselves as a tool for managing uncertain data. However, much of the research in this area has focused on efficient query evaluation and has largely ignored two key issues that commonly arise in uncertain data management: First, how to provide "explanations" for query results, e.g., "Why is this tuple in my result?" or "Why does this output tuple have such high probability?". Second, the problem of determining the "sensitive" input tuples for the given query, e.g., users may want to know which input tuples can substantially alter the output when their probabilities are modified (since they may be unsure about the input probability values). Existing systems provide, in addition to the output probabilities, the "lineage/provenance" of each output tuple: a Boolean formula indicating how the output tuple depends on the input tuples. However, lineage does not immediately provide a quantitative relationship, and it is not informative when we have multiple output tuples. In this paper, we propose a unified framework that handles both of these issues to facilitate robust query processing. We formally define the notions of "influence" and "explanations" and provide algorithms to determine the top-l influential set of variables and the top-l set of explanations for a variety of queries, including "conjunctive" queries, "probabilistic threshold" queries, "top-k" queries and "aggregation" queries. Further, our framework naturally enables highly efficient incremental evaluation when input probabilities are modified (e.g., if uncertainty is resolved). Our preliminary experimental results demonstrate the benefits of our framework for performing robust query processing over probabilistic databases.
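Under the usual tuple-independence assumption, the output probability is multilinear in the input probabilities, so one natural influence measure (in the spirit of the paper, though simplified here) is the derivative d Pr[phi] / d p_i = Pr[phi | x_i=1] - Pr[phi | x_i=0]. The brute-force sketch below computes it for a hypothetical lineage formula; the paper gives efficient algorithms rather than this exponential enumeration.

```python
from itertools import product

def prob(phi, probs, fixed=None):
    """Pr[phi] by enumerating assignments of independent Boolean tuple variables."""
    fixed = fixed or {}
    total = 0.0
    for bits in product([0, 1], repeat=len(probs)):
        if any(bits[i] != v for i, v in fixed.items()):
            continue
        w = 1.0
        for i, b in enumerate(bits):
            if i not in fixed:
                w *= probs[i] if b else 1 - probs[i]
        total += w * phi(bits)
    return total

phi = lambda x: (x[0] and x[1]) or x[2]   # hypothetical lineage of one output tuple
probs = [0.9, 0.4, 0.2]                   # input tuple probabilities
for i in range(len(probs)):
    influence = prob(phi, probs, {i: 1}) - prob(phi, probs, {i: 0})
    print(f"influence of tuple {i}: {influence:.3f}")
```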
Acknowledgments
This material is based upon work supported in part by the National Science Foundation under Grants 0916736, 0917153, 1217968, and 1218367.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the National Science Foundation.