Robust Uncertain Data Management
The goal of this project is to develop a systematic framework for "robust" query processing
in the presence of data uncertainties, which arise naturally in a wide variety of application
domains including social media analysis, scientific and biological data management, sensor data
management, and text analytics. Data uncertainties may take the form of missing or incomplete data,
inherent noise in the data, trust or reputation scores assigned to data based on their sources of origin,
or confidences in predictions made using automated modeling tools. These input uncertainties
naturally lead to uncertainties in the results of any queries or analysis performed on such data.
To enable robust and systematic reasoning over such uncertain query results, we are designing algorithms
and tools to identify the input uncertainties to which the query results are most sensitive, to decide how best to use
scarce resources such as subject-matter experts to resolve uncertainties in query results, and to
incorporate user feedback to improve the robustness of the input uncertainty parameters themselves.
Project Participants
Publications
- Submodular goal value of Boolean functions [pdf];
Eric Bach, Jérémie Dusart, Lisa Hellerstein, Devorah Kletenik;
Discrete Applied Mathematics 238: 1-13 (also CoRR Technical Report arXiv:1702.04067), 2018.
- The Stochastic Score Classification Problem [pdf];
Dimitrios Gkenosis, Nathaniel Grammel, Lisa Hellerstein, Devorah Kletenik;
Proceedings of the European Symposium on Algorithms (ESA) 36:1-14 (also CoRR Technical Report arXiv:1806.10660), 2018.
- Evaluation of DNF Formulas (on-line extended abstract);
Sarah R. Allen, Lisa Hellerstein, Devorah Kletenik, and Tonguc Unluyurt;
International Symposium on Artificial Intelligence and Mathematics (also CoRR Technical Report arXiv:1310.3673), 2014.
- Boolean function evaluation over a sample;
Lisa Hellerstein, Devorah Kletenik, and Patrick Lin;
NIPS Workshop on Discrete and Combinatorial Problems in Machine Learning (DISCML), 2014.
- Algorithms and structural results for DNF and set cover problems;
Devorah Kletenik;
PhD Dissertation, Polytechnic University, 2014.
- Approximation Algorithms for Stochastic Boolean Function Evaluation and Stochastic Submodular Set Cover;
Amol Deshpande, Lisa Hellerstein, and Devorah Kletenik;
SODA 2014 (also CoRR Technical Report arXiv:1303.0726).
[abstract] Stochastic Boolean Function Evaluation is the problem of determining the value of a given Boolean function f on an unknown input x, when each bit x_i of x can only be determined by paying an associated cost c_i. The assumption is that x is drawn from a given product distribution, and the goal is to minimize the expected cost. This problem has been studied in Operations Research, where it is known as "sequential testing" of Boolean functions. It has also been studied in learning theory in the context of learning with attribute costs. We consider the general problem of developing approximation algorithms for Stochastic Boolean Function Evaluation. We give a 3-approximation algorithm for evaluating Boolean linear threshold formulas. We also present an approximation algorithm for evaluating CDNF formulas (and decision trees) achieving a factor of O(log kd), where k is the number of terms in the DNF formula, and d is the number of clauses in the CNF formula. In addition, we present approximation algorithms for simultaneous evaluation of linear threshold functions, and for ranking of linear functions. Our function evaluation algorithms are based on reductions to the Stochastic Submodular Set Cover (SSSC) problem. This problem was introduced by Golovin and Krause. They presented an approximation algorithm for the problem, called Adaptive Greedy. Our main technical contribution is a new approximation algorithm for the SSSC problem, which we call Adaptive Dual Greedy. It is an extension of the Dual Greedy algorithm for Submodular Set Cover due to Fujito, which is a generalization of Hochbaum's algorithm for the classical Set Cover Problem. We also give a new bound on the approximation achieved by the Adaptive Greedy algorithm of Golovin and Krause.
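To make the SBFE objective concrete, here is a minimal sketch (not the paper's 3-approximation, and with invented costs and probabilities) for the special case of an AND function: bits are queried in a fixed order, evaluation stops at the first 0, and we brute-force the order minimizing expected cost under a product distribution.

```python
import itertools

def expected_cost_and(order, costs, probs):
    """Expected cost of evaluating AND(x_0, ..., x_{n-1}) by querying bits
    in the given order, stopping at the first 0.

    probs[i] = P[x_i = 1], bits independent (product distribution)."""
    total, p_alive = 0.0, 1.0
    for i in order:
        total += p_alive * costs[i]   # pay c_i only if the value is still undetermined
        p_alive *= probs[i]           # evaluation continues only if bit i turned out 1
    return total

# Hypothetical instance: 3 bits with differing costs and success probabilities.
costs = [3.0, 1.0, 2.0]
probs = [0.9, 0.5, 0.8]

best = min(itertools.permutations(range(3)),
           key=lambda o: expected_cost_and(o, costs, probs))
```

For AND under a product distribution, the classical sequential-testing result is that querying bits in increasing order of c_i / (1 - p_i) is optimal; on this instance that rule and the brute force agree on the order (1, 2, 0).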
- Approximation algorithms for reducing classification costs in ensembles of classifiers;
Sarah R. Allen, and Lisa Hellerstein;
NIPS Workshop on Discrete and Combinatorial Problems in Machine Learning (DISCML), 2013.
- Maximizing Expected Utility for Stochastic Combinatorial Optimization Problems;
Jian Li, and Amol Deshpande;
FOCS 2011 (also CoRR Technical Report arXiv:1012.3189).
[abstract] We study the stochastic versions of a broad class of combinatorial problems where the weights of the elements in the input dataset are uncertain. The class of problems that we study includes shortest paths, minimum weight spanning trees, and minimum weight matchings over probabilistic graphs, and other combinatorial problems like knapsack. We observe that the expected value is inadequate in capturing different types of "risk-averse" or "risk-prone" behaviors, and instead we consider a more general objective which is to maximize the "expected utility" of the solution for some given utility function, rather than the expected weight (expected weight becomes a special case). We show that we can obtain a polynomial time approximation algorithm with "additive error" "epsilon" for any "epsilon>0", if there is a pseudopolynomial time algorithm for the "exact" version of the problem. (This is true for the problems mentioned above). Our result generalizes several prior results on stochastic shortest path, stochastic spanning tree, and stochastic knapsack. Our algorithm for utility maximization makes use of the separability of exponential utility and a technique to decompose a general utility function into exponential utility functions, which may be useful in other stochastic optimization problems.
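The gap between expected weight and expected utility that motivates this paper can be seen on a toy instance (all numbers invented): a risk-averse exponential disutility prefers a deterministic route over a lower-mean but high-variance alternative.

```python
import math

def expected_utility(dist, lam=0.5):
    """Expected exponential disutility -exp(lam * w) of a cost distribution
    given as [(weight, prob), ...]; higher is better, lam > 0 is risk-averse."""
    return sum(p * -math.exp(lam * w) for w, p in dist)

def expected_weight(dist):
    return sum(p * w for w, p in dist)

route_a = [(10.0, 1.0)]               # deterministic cost 10
route_b = [(2.0, 0.5), (17.0, 0.5)]   # expected cost 9.5, but high variance
```

Minimizing expected weight picks route_b (9.5 < 10), while maximizing expected exponential utility picks route_a; this is the sense in which expected value fails to capture risk-averse behavior, with expected weight recovered as the special case of a linear utility.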
- Sensitivity Analysis and Explanations for Robust Query Evaluation in Probabilistic Databases;
Bhargav Kanagal, Jian Li, Amol Deshpande;
SIGMOD 2011.
[abstract] Probabilistic database systems have successfully established themselves as a tool for managing uncertain data. However, much of the research in this area has focused on efficient query evaluation and has largely ignored two key issues that commonly arise in uncertain data management: First, how to provide "explanations" for query results, e.g., ``Why is this tuple in my result?'' or ``Why does this output tuple have such high probability?''. Second, the problem of determining the "sensitive" input tuples for the given query, e.g., users may want to know which input tuples can substantially alter the output when their probabilities are modified (since they may be unsure about the input probability values). Existing systems provide, in addition to the output probabilities, the "lineage/provenance" of each output tuple: a Boolean formula indicating the dependence of the output tuple on the input tuples. However, lineage does not immediately provide a quantitative relationship, and it is not informative when we have multiple output tuples. In this paper, we propose a unified framework that can handle both of the issues mentioned above to facilitate robust query processing. We formally define the notions of "influence" and "explanations" and provide algorithms to determine the top-l influential set of variables and the top-l set of explanations for a variety of queries, including "conjunctive" queries, "probabilistic threshold" queries, "top-k" queries and "aggregation" queries. Further, our framework naturally enables highly efficient incremental evaluation when input probabilities are modified (e.g., if uncertainty is resolved). Our preliminary experimental results demonstrate the benefits of our framework for performing robust query processing over probabilistic databases.
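One natural way to quantify such sensitivity (a sketch, not the paper's exact algorithms; the lineage formula and probabilities are invented) is to treat the output probability as a function of the input tuple probabilities and measure each tuple's influence as the partial derivative:

```python
def out_prob(p):
    """P[output tuple] for the lineage formula x1 OR (x2 AND x3),
    assuming the three input tuples are independent."""
    p1, p2, p3 = p
    return 1 - (1 - p1) * (1 - p2 * p3)

def influence(p, i, eps=1e-6):
    """Numerical partial derivative of the output probability w.r.t. p_i:
    how much the result moves if input tuple i's probability is perturbed."""
    q = list(p)
    q[i] += eps
    return (out_prob(q) - out_prob(p)) / eps

# Hypothetical input probabilities; rank tuples by influence to find
# the inputs to which the query result is most sensitive.
p = [0.3, 0.9, 0.5]
infl = [influence(p, i) for i in range(3)]
top = max(range(3), key=lambda i: infl[i])
```

Here the most influential tuple is x3: although x2 has the highest probability, perturbing p3 moves the output probability the most, which is exactly the kind of quantitative ranking that raw lineage formulas do not provide.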
Acknowledgments
This material is based upon work supported in part by the National Science Foundation under
Grants 0916736, 0917153, 1217968, and 1218367.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the National Science Foundation.