Our work on cloud computing has spanned a spectrum of issues, including workload-aware data placement and replication, large-scale graph analytics, progressive analytics, dataset versioning, scientific data management, privacy-aware data partitioning, and thermally-aware scheduling.
Please contact me if you would like more information about any of this work.
- NScale: Neighborhood-centric Large-Scale Graph Analytics in the Cloud;
Abdul Quamar, Amol Deshpande, Jimmy Lin;
To appear in VLDB Journal (also CoRR Technical Report arXiv:1405.1499).
[pdf]
[abstract] There is an increasing interest in executing rich and complex analysis tasks over large-scale graphs, many of which require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph. Examples of such tasks include ego network analysis, motif counting, finding social circles, personalized recommendations, analyzing influence cascades, etc. These tasks are not well served by the existing vertex-centric graph processing frameworks, whose computation and execution models limit the user program to directly access the state of a single vertex; this results in high communication, scheduling, and memory overheads in executing such tasks. Further, most existing graph processing frameworks typically ignore the challenges in extracting the relevant portion of the graph that an analysis task needs, and loading it onto distributed memory. In this paper, we describe NScale, a novel end-to-end graph processing framework that enables the distributed execution of complex subgraph-centric analytics over large-scale graphs in the cloud. NScale enables users to write programs at the level of subgraphs, and to specify the subgraphs of interest declaratively. NScale uses Apache YARN, a state-of-the-art framework for efficient and fault-tolerant distribution of data and computation. It features GEL, a novel graph extraction and loading phase that extracts the relevant portions of the graph and utilizes a cost-based optimizer to partition and load the graph onto distributed memory using as few machines as possible to minimize the communication cost. It utilizes novel techniques for the distributed execution of user computation that minimize memory consumption by exploiting overlap among the subgraphs of interest. Our experimental results show orders-of-magnitude improvements in performance, and drastic reductions in the cost of analytics, over vertex-centric approaches.
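[code sketch] To make the subgraph-centric programming model concrete, here is a minimal, self-contained Python sketch (all names and the extraction step are invented for illustration; this is not NScale's actual API) of a user computation that operates on an extracted 1-hop ego network, computing its local clustering coefficient, something a vertex-centric program could only do with extra rounds of neighbor-state exchange.

    from itertools import combinations

    EDGES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (1, 5)]

    def adjacency(edges):
        adj = {}
        for u, v in edges:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        return adj

    def extract_ego_network(adj, center):
        """Extract the 1-hop neighborhood subgraph around `center`
        (the extraction step that NScale performs declaratively)."""
        nodes = {center} | adj[center]
        return {u: adj[u] & nodes for u in nodes}

    def local_clustering(ego, center):
        """User's subgraph-centric program: it sees the whole ego network,
        not just the state of a single vertex."""
        nbrs = ego[center]
        if len(nbrs) < 2:
            return 0.0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in ego[a])
        return 2.0 * links / (len(nbrs) * (len(nbrs) - 1))

    adj = adjacency(EDGES)
    for v in sorted(adj):
        ego = extract_ego_network(adj, v)
        print(v, round(local_clustering(ego, v), 3))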
- DataHub: Collaborative data science and dataset version management at scale;
Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, Aditya Parameswaran;
CIDR 2015.
[pdf]
[abstract] Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale.
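[code sketch] The version-control primitives mentioned above (create, branch, commit, diff) can be illustrated with a toy in-memory sketch; the class and method names below are hypothetical and are not DataHub's API or storage model.

    import copy

    class DatasetRepo:
        def __init__(self, records):
            self.versions = {0: copy.deepcopy(records)}   # version id -> records
            self.parents = {0: None}                      # version lineage
            self.branches = {"master": 0}                 # branch -> head version
            self._next = 1

        def commit(self, branch, records):
            vid, self._next = self._next, self._next + 1
            self.parents[vid] = self.branches[branch]
            self.versions[vid] = copy.deepcopy(records)
            self.branches[branch] = vid
            return vid

        def branch(self, src, dst):
            self.branches[dst] = self.branches[src]

        def diff(self, v1, v2):
            a, b = self.versions[v1], self.versions[v2]
            return {"added": [r for r in b if r not in a],
                    "removed": [r for r in a if r not in b]}

    repo = DatasetRepo([{"id": 1, "x": 10}])
    repo.branch("master", "cleaning")
    v = repo.commit("cleaning", [{"id": 1, "x": 10}, {"id": 2, "x": 7}])
    print(repo.diff(repo.branches["master"], v))   # {'added': [{'id': 2, 'x': 7}], 'removed': []}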
- Scalable Progressive Analytics on Big Data in the Cloud;
Badrish Chandramouli, Jonathan Goldstein, Abdul Quamar;
VLDB 2014.
[abstract] Analytics over the increasing quantity of data stored in the Cloud has become very expensive, particularly due to the pay-as-you-go Cloud computation model. Data scientists typically manually extract samples of increasing data size (progressive samples) using domain-specific sampling strategies for exploratory querying. This provides them with user-control, repeatable semantics, and result provenance. However, such solutions result in tedious workflows that preclude the reuse of work across samples. On the other hand, existing approximate query processing systems report early results, but do not offer the above benefits for complex ad-hoc queries. We propose a new progressive analytics system based on a progress model called Prism that (1) allows users to communicate progressive samples to the system; (2) allows efficient and deterministic query processing over samples; and (3) provides repeatable semantics and provenance to data scientists. We show that one can realize this model for atemporal relational queries using an unmodified temporal streaming engine, by re-interpreting temporal event fields to denote progress. Based on Prism, we build Now!, a progressive data-parallel computation framework for Windows Azure, where progress is understood as a first-class citizen in the framework. Now! works with 'progress-aware reducers'; in particular, it works with streaming engines to support progressive SQL over big data. Extensive experiments on Windows Azure with real and synthetic workloads validate the scalability and benefits of Now! and its optimizations, over current solutions for progressive analytics.
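[code sketch] The following toy sketch illustrates the progress model: each tuple carries an explicit progress tag (playing the role a temporal field plays in the streaming realization), and an aggregate is re-reported at every progress point over all tuples seen so far, so the sequence of early results is deterministic and repeatable. The field names and the aggregate are invented for illustration; this is not the Prism/Now! interface.

    data = [
        # (progress_point, value): the progress tag plays the role of event time
        (1, 4.0), (1, 6.0),
        (2, 5.0), (2, 9.0),
        (3, 7.0),
    ]

    def progressive_average(tagged_tuples):
        """Report the running average once per progress point, over all tuples
        whose progress tag is <= that point (a 'progressive sample')."""
        points = sorted({p for p, _ in tagged_tuples})
        seen, total, results = 0, 0.0, []
        for point in points:
            batch = [v for p, v in tagged_tuples if p == point]
            seen += len(batch)
            total += sum(batch)
            results.append((point, total / seen))
        return results

    for point, avg in progressive_average(data):
        print(f"progress {point}: avg = {avg:.2f}")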
- VERTEXICA: Your Relational Friend for Graph Analytics!;
Alekh Jindal, Praynaa Rawlani, Eugene Wu, Samuel Madden, Amol Deshpande, Mike Stonebraker;
VLDB Demo 2014.
[pdf]
[abstract] In this paper, we present Vertexica, a graph analytics tool on top of a relational database, which is user-friendly and yet highly efficient. Instead of constraining programmers to SQL, Vertexica offers a popular vertex-centric query interface, which is more natural for analysts to express many graph queries. The programmers simply provide their vertex-compute functions and Vertexica takes care of efficiently executing them in the standard SQL engine. The advantage of using Vertexica is its ability to leverage the relational features and enable much more sophisticated graph analysis. These include expressing graph algorithms which are difficult in vertex-centric but straightforward in SQL and the ability to compose end-to-end data processing pipelines, including pre- and post-processing of graphs as well as combining multiple algorithms for deeper insights. Vertexica has a graphical user interface and we outline several demonstration scenarios, including interactive graph analysis, complex graph analysis, and continuous and time series analysis.
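[code sketch] As a rough illustration of running a vertex-centric computation inside a stock SQL engine, the sketch below iterates a PageRank-style update over vertex and edge tables in SQLite; the schema and queries are made up for this example and are not Vertexica's.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.executescript("""
        CREATE TABLE edge (src INTEGER, dst INTEGER);
        CREATE TABLE vertex (id INTEGER PRIMARY KEY, rank REAL, outdeg INTEGER);
        INSERT INTO edge VALUES (1,2),(1,3),(2,3),(3,1);
        INSERT INTO vertex SELECT src, 1.0, COUNT(*) FROM edge GROUP BY src;
    """)

    for _ in range(10):                      # each loop iteration is one superstep
        cur.execute("""
            CREATE TABLE new_rank AS
            SELECT e.dst AS id,
                   0.15 + 0.85 * SUM(v.rank / v.outdeg) AS rank
            FROM edge e JOIN vertex v ON e.src = v.id
            GROUP BY e.dst
        """)
        cur.execute("""
            UPDATE vertex
            SET rank = (SELECT rank FROM new_rank WHERE new_rank.id = vertex.id)
            WHERE id IN (SELECT id FROM new_rank)
        """)
        cur.execute("DROP TABLE new_rank")

    print(cur.execute("SELECT id, ROUND(rank, 3) FROM vertex ORDER BY id").fetchall())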
- NScale: Neighborhood-centric Analytics on Large Graphs;
Abdul Quamar, Amol Deshpande, Jimmy Lin;
VLDB Demo 2014.
[pdf]
[abstract] There is an increasing interest in executing rich and complex analysis tasks over large-scale graphs, many of which require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph. Examples of such tasks include ego network analysis, motif counting in biological networks, finding social circles, personalized recommendations, link prediction, anomaly detection, analyzing influence cascades, and so on. These tasks are not well served by existing vertex-centric graph processing frameworks whose computation and execution models limit the user program to directly access the state of a single vertex, resulting in high communication, scheduling, and memory overheads in executing such tasks. Further, most existing graph processing frameworks also typically ignore the challenges in extracting the relevant portions of the graph that an analysis task is interested in, and loading it onto distributed memory.
In this demonstration proposal, we describe NScale, a novel end-to-end graph processing framework that enables the distributed execution of complex neighborhood-centric analytics over large-scale graphs in the cloud. NScale enables users to write programs at the level of neighborhoods or subgraphs. NScale uses Apache YARN for efficient and fault-tolerant distribution of data and computation; it features GEL, a novel graph extraction and loading phase that extracts the relevant portions of the graph and loads them into distributed memory using as few machines as possible. NScale utilizes novel techniques for the distributed execution of user computation that minimize memory consumption by exploiting overlap among the neighborhoods of interest. A comprehensive experimental evaluation shows orders-of-magnitude improvements in performance and total cost over vertex-centric approaches.
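[code sketch] The sketch below illustrates, in simplified form, the two GEL steps highlighted above: a declarative specification of the neighborhoods of interest, followed by a greedy packing of the extracted subgraphs onto as few machines as possible, charging only vertices not already resident (exploiting overlap). The spec format and the first-fit heuristic are inventions for this sketch; NScale's optimizer is cost-based and more sophisticated.

    ADJ = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
    VERTEX_LABEL = {1: "user", 2: "user", 3: "ad", 4: "user", 5: "user"}

    # Declarative subgraph spec: centers matching a predicate, plus a radius.
    spec = {"center_pred": lambda v: VERTEX_LABEL[v] == "user", "radius": 1}

    def extract(adj, spec):
        subgraphs = []
        for c in adj:
            if spec["center_pred"](c):
                nodes = {c} | adj[c]            # radius-1 neighborhood
                subgraphs.append((c, nodes))
        return subgraphs

    def pack(subgraphs, capacity):
        """First-fit packing of subgraphs onto machines with a vertex budget,
        charging only vertices not already resident on the machine."""
        machines = []                            # each machine: set of resident vertices
        for _, nodes in sorted(subgraphs, key=lambda s: -len(s[1])):
            for m in machines:
                if len(nodes - m) + len(m) <= capacity:
                    m |= nodes
                    break
            else:
                machines.append(set(nodes))
        return machines

    subs = extract(ADJ, spec)
    print([sorted(m) for m in pack(subs, capacity=4)])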
- SWORD: Workload-aware Data Placement and Replica Selection for Cloud Data Management Systems;
Ashwin Kumar Kayyoor, Abdul Quamar, Amol Deshpande, Samir Khuller;
VLDB Journal (Special Issue on Data-Intensive Cloud Infrastructure), 23(6): 845-870, 2014.
[pdf]
[abstract] Cloud computing is increasingly being seen as a way to reduce infrastructure costs and add elasticity, and is being used by a wide range of organizations. Cloud data management systems today need to serve a range of different workloads, from analytical read-heavy workloads to transactional (OLTP) workloads. For both the service providers and the users, it is critical to minimize the consumption of resources like CPU, memory, communication bandwidth, and energy, without compromising on service-level agreements if any. In this article, we develop a workload-aware data placement and replication approach, called SWORD, for minimizing resource consumption in such an environment. Specifically, we monitor and model the expected workload as a hypergraph and develop partitioning techniques that minimize the average query span, i.e., the average number of machines involved in the execution of a query or a transaction. We empirically justify the use of query span as the metric to optimize, for both analytical and transactional workloads, and develop a series of replication and data placement algorithms by drawing connections to several well-studied graph theoretic concepts. We introduce a suite of novel techniques to achieve high scalability by reducing the overhead of partitioning and query routing. To deal with workload changes, we propose an incremental repartitioning technique that modifies data placement in small steps without resorting to complete repartitioning. We propose the use of fine-grained quorums defined at the level of groups of data items to control the cost of distributed updates, improve throughput, and adapt to different workloads. We empirically illustrate the benefits of our approach through a comprehensive experimental evaluation for two classes of workloads. For analytical read-only workloads, we show that our techniques result in significant reduction in total resource consumption. For OLTP workloads, we show that our approach improves transaction latencies and overall throughput by minimizing the number of distributed transactions.
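[code sketch] The metric at the heart of SWORD, the average query span, is easy to illustrate: the small Python sketch below computes it for a workload (one set of data items per query) under two hypothetical placements, showing how co-locating frequently co-accessed items lowers the average span. The workload and placements are made up for illustration.

    def avg_query_span(workload, placement):
        """workload: list of sets of data items (one set per query/transaction);
           placement: item -> machine id."""
        spans = [len({placement[item] for item in query}) for query in workload]
        return sum(spans) / len(spans)

    workload = [{"a", "b"}, {"a", "b", "c"}, {"c", "d"}, {"a", "b"}]

    spread_out = {"a": 0, "b": 1, "c": 2, "d": 3}   # round-robin style placement
    co_located = {"a": 0, "b": 0, "c": 1, "d": 1}   # workload-aware placement

    print(avg_query_span(workload, spread_out))     # 2.25
    print(avg_query_span(workload, co_located))     # 1.25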
- PStore: An Efficient Storage Framework for Managing Scientific Data;
Souvik Bhattacherjee, Amol Deshpande, and Alan Sussman;
SSDBM 2014.
[abstract] In this paper, we present the design, implementation, and evaluation of PStore, a no-overwrite storage framework for managing large volumes of array data generated by scientific simulations. PStore comprises two modules, a data ingestion module and a query processing module, that respectively address two of the key challenges in scientific simulation data management. The data ingestion module is geared toward handling the high volumes of simulation data generated at a very rapid rate, which often makes it impossible to offload the data onto storage devices; the module is responsible for selecting an appropriate compression scheme for the data at hand, chunking the data, and then compressing it before sending it to the storage nodes. On the other hand, the query processing module is in charge of efficiently executing different types of queries over the stored data; in this paper, we specifically focus on slice (also called range) queries. PStore provides a suite of compression schemes that leverage existing techniques while extending some of them to provide support for diverse scientific simulation data. To efficiently execute queries over such compressed data, PStore adopts and extends a two-level chunking scheme by incorporating the effect of compression, and hides expensive disk latencies for long running range queries by exploiting chunk prefetching. In addition, we also parallelize the query processing module to further speed up execution. We evaluate PStore on a 140 GB dataset obtained from real-world simulations using the regional climate model CWRF. In this paper, we use both 3D and 4D datasets and demonstrate high performance through extensive experiments.
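[code sketch] Below is a simplified sketch of the chunk-then-compress layout and the slice (range) query path described above: a 1-D array is split into fixed-size chunks, each chunk is compressed independently (zlib here), and a range query decompresses only the chunks it overlaps. The chunk size, codec, and layout are illustrative choices, not PStore's actual design.

    import zlib, struct

    CHUNK = 4                       # chunk length along the (1-D) array

    def store(values):
        """Split into chunks and compress each chunk independently."""
        chunks = []
        for i in range(0, len(values), CHUNK):
            part = values[i:i + CHUNK]
            raw = struct.pack(f"{len(part)}d", *part)
            chunks.append(zlib.compress(raw))
        return chunks

    def slice_query(chunks, lo, hi):
        """Return values[lo:hi], touching only the overlapping chunks."""
        out = []
        first, last = lo // CHUNK, (hi - 1) // CHUNK
        for ci in range(first, last + 1):
            raw = zlib.decompress(chunks[ci])
            vals = list(struct.unpack(f"{len(raw) // 8}d", raw))
            start = max(lo - ci * CHUNK, 0)
            end = min(hi - ci * CHUNK, len(vals))
            out.extend(vals[start:end])
        return out

    data = [float(i) for i in range(20)]
    chunks = store(data)
    print(slice_query(chunks, 5, 11))    # [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]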
- Optimization Techniques for "Scaling Down" Hadoop on Multi-Core, Shared-Memory Systems;
K. Ashwin Kumar, Jonathan Gluck, Amol Deshpande, Jimmy Lin;
EDBT 2014.
[pdf]
[abstract] The underlying assumption behind Hadoop and, more generally, the need for distributed processing is that the data to be analyzed cannot be held in memory on a single machine. Today, this assumption needs to be re-evaluated. Although petabyte-scale datastores are increasingly common, it is unclear whether "typical" analytics tasks require more than a single high-end server. Additionally, we are seeing increased sophistication in analytics, e.g., machine learning, which generally operates over smaller and more refined datasets. To address these trends, we propose "scaling down" Hadoop to run on shared-memory machines. This paper presents a prototype runtime called Hone, intended to be both API and binary compatible with standard (distributed) Hadoop. That is, Hone can take an existing Hadoop jar and run it, without modification, on a multi-core shared-memory machine. This allows us to take existing Hadoop algorithms and find the most suitable runtime environment for execution on datasets of varying sizes. Our experiments show that Hone is an order of magnitude faster than Hadoop pseudo-distributed mode (PDM); on dataset sizes that fit into memory, Hone outperforms a fully-distributed 15-node Hadoop cluster in some cases as well.
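[code sketch] The execution model behind "scaling down" can be illustrated with a generic word-count sketch: the same map/shuffle/reduce dataflow as Hadoop, but run with a thread pool on one shared-memory machine and an in-memory shuffle. This is a plain-Python illustration of the idea, not Hone, which is API and binary compatible with Hadoop jars on the JVM.

    from collections import defaultdict
    from concurrent.futures import ThreadPoolExecutor

    def map_phase(chunk):
        pairs = []
        for line in chunk:
            for word in line.split():
                pairs.append((word, 1))
        return pairs

    def reduce_phase(word, counts):
        return word, sum(counts)

    def local_mapreduce(lines, workers=4):
        chunks = [lines[i::workers] for i in range(workers)]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            mapped = list(pool.map(map_phase, chunks))
        shuffle = defaultdict(list)                 # in-memory 'shuffle'
        for pairs in mapped:
            for k, v in pairs:
                shuffle[k].append(v)
        return dict(reduce_phase(k, vs) for k, vs in shuffle.items())

    corpus = ["the quick brown fox", "the lazy dog", "the fox"]
    print(local_mapreduce(corpus))   # e.g. {'the': 3, 'fox': 2, ...}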
- SPARSI: Partitioning Sensitive Data amongst Multiple Adversaries;
Theodoros Rekatsinas, Amol Deshpande, and Ashwin Machanavajjhala;
PVLDB 2013, to be presented at VLDB 2014 (also CoRR Technical Report arXiv:1302.6556).
[pdf]
[abstract] We present SPARSI, a theoretical framework for partitioning sensitive data across multiple non-colluding adversaries. Most work in privacy-aware data sharing has considered disclosing summaries where the aggregate information about the data is preserved, but sensitive user information is protected. Nonetheless, there are applications, including online advertising, cloud computing and crowdsourcing markets, where detailed and fine-grained user-data must be disclosed. We consider a new data sharing paradigm and introduce the problem of privacy-aware data partitioning, where a sensitive dataset must be partitioned among k untrusted parties (adversaries). The goal is to maximize the utility derived by partitioning and distributing the dataset, while minimizing the amount of sensitive information disclosed. The data should be distributed so that an adversary, without colluding with other adversaries, cannot draw additional inferences about the private information, by linking together multiple pieces of information released to her. The assumption of no collusion is both reasonable and necessary in the above application domains that require release of private user information. SPARSI enables us to formally define privacy-aware data partitioning using the notion of sensitive properties for modeling private information and a hypergraph representation for describing the interdependencies between data entries and private information. We show that solving privacy-aware partitioning is, in general, NP-hard, but for specific information disclosure functions, good approximate solutions can be found using relaxation techniques. Finally, we present a local search algorithm applicable to generic information disclosure functions. We apply SPARSI together with the proposed algorithms on data from a real advertising scenario and show that we can partition data with no disclosure to any single advertiser.
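[code sketch] A toy local-search sketch of the privacy-aware partitioning problem defined above: data items are assigned to k non-colluding parties to trade off utility against disclosure, where a sensitive property counts as disclosed if a single party receives every item that jointly reveals it. The utilities, penalty, and search strategy are invented for illustration and are not SPARSI's disclosure functions.

    UTILITY = {                     # UTILITY[party][item]
        0: {"age": 3, "zip": 2, "purchases": 5},
        1: {"age": 1, "zip": 4, "purchases": 4},
    }
    SENSITIVE = [{"age", "zip"}]    # together these items reveal a sensitive property
    PENALTY = 10

    def objective(assign):
        util = sum(UTILITY[p][item] for item, p in assign.items())
        leaked = sum(1 for prop in SENSITIVE
                     if len({assign[i] for i in prop}) == 1)
        return util - PENALTY * leaked

    def local_search(items, parties, steps=50):
        assign = {i: 0 for i in items}             # start: everything at party 0
        for _ in range(steps):
            improved = False
            for i in items:
                for p in parties:
                    trial = dict(assign, **{i: p})
                    if objective(trial) > objective(assign):
                        assign, improved = trial, True
            if not improved:
                break
        return assign

    best = local_search(["age", "zip", "purchases"], [0, 1])
    print(best, objective(best))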
- Hone: "Scaling Down" Hadoop on Shared-Memory Systems;
K. Ashwin Kumar, Jonathan Gluck, Amol Deshpande, Jimmy Lin;
VLDB Demo 2013.
[pdf]
[abstract] The underlying assumption behind Hadoop and, more generally, the need for distributed processing is that the data to be analyzed cannot be held in memory on a single machine. Today, this assumption needs to be re-evaluated. Although petabyte-scale data-stores are increasingly common, it is unclear whether "typical" analytics tasks require more than a single high-end server. Additionally, we are seeing increased sophistication in analytics, e.g., machine learning, which generally operates over smaller and more refined datasets. To address these trends, we propose "scaling down" Hadoop to run on shared-memory machines. This paper presents a prototype runtime called Hone, intended to be both API and binary compatible with standard (distributed) Hadoop. That is, Hone can take an existing Hadoop jar and efficiently execute it, without modification, on a multi-core shared memory machine. This allows us to take existing Hadoop algorithms and find the most suitable run-time environment for execution on datasets of varying sizes. Our experiments show that Hone can be an order of magnitude faster than Hadoop pseudo-distributed mode (PDM); on dataset sizes that fit into memory, Hone can outperform a fully-distributed 15-node Hadoop cluster in some cases as well.
- Data Placement and Replica Selection for Improving Co-location in Distributed Environments;
K. Ashwin Kumar, Amol Deshpande, and Samir Khuller;
CoRR Technical Report arXiv:1302.4168.
[pdf]
[abstract] Increasing need for large-scale data analytics in a number of application domains has led to a dramatic rise in the number of distributed data management systems, both parallel relational databases, and systems that support alternative frameworks like MapReduce. There is thus an increasing contention on scarce data center resources like network bandwidth; further, the energy requirements for powering the computing equipment are also growing dramatically. As we show empirically, increasing the execution parallelism by spreading out data across a large number of machines may achieve the intended goal of decreasing query latencies, but in most cases, may increase the total resource and energy consumption significantly. For many analytical workloads, however, minimizing query latencies is often not critical; in such scenarios, we argue that we should instead focus on minimizing the average query span, i.e., the average number of machines that are involved in the processing of a query, through co-location of data items that are frequently accessed together. In this work, we exploit the fact that most distributed environments need to use replication for fault tolerance, and we devise workload-driven replica selection and placement algorithms that attempt to minimize the average query span. We model a historical query workload trace as a hypergraph over a set of data items, and formulate and analyze the problem of replica placement by drawing connections to several well-studied graph theoretic concepts. We develop a series of algorithms to decide which data items to replicate, and where to place the replicas. We show the effectiveness of our proposed approach by presenting results on a collection of synthetic and real workloads. Our experiments show that careful data placement and replication can dramatically reduce the average query span, resulting in significant reductions in resource consumption.
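[code sketch] Replica selection, one of the two decisions studied above, can be sketched as a greedy set-cover style choice: given which machines hold a copy of each item, a query picks machines so that a few machines cover all the items it touches. The replica map and the greedy rule below are illustrative only, not the paper's algorithms.

    def greedy_replica_selection(query_items, replicas):
        """replicas: item -> set of machines holding a copy.
           Returns the set of machines this query will touch."""
        remaining = set(query_items)
        chosen = set()
        while remaining:
            # pick the machine whose replicas cover the most still-uncovered items
            best = max({m for i in remaining for m in replicas[i]},
                       key=lambda m: len([i for i in remaining if m in replicas[i]]))
            chosen.add(best)
            remaining -= {i for i in remaining if best in replicas[i]}
        return chosen

    replicas = {"a": {0, 1}, "b": {1, 2}, "c": {2, 3}, "d": {0, 3}}
    print(greedy_replica_selection({"a", "b", "c"}, replicas))   # span 2, e.g. {1, 2}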
- Algorithms for the Thermal Scheduling Problem;
Koyel Mukherjee, Samir Khuller, and Amol Deshpande;
IPDPS 2013.
[pdf]
[abstract] The energy costs for cooling a data center constitute a significant portion of the overall running costs. Thermal imbalance and hot spots that arise due to imbalanced workloads lead to significant wasted cooling effort -- in order to ensure that no equipment is operating above a certain temperature, the data center may be cooled more than necessary. Therefore it is desirable to schedule the workload in a data center in a "thermally aware" manner, assigning jobs to machines not just based on local load of the machines, but based on the overall thermal profile of the data center. This is challenging because of the spatial cross-interference between machines, where a job assigned to a machine may impact not only that machine's temperature, but also nearby machines. Here, we continue formal analysis of the "thermal scheduling" problem that we initiated recently. There we introduced the notion of "effective load of a machine" which is a function of the local load on the machine as well as the load on nearby machines, and presented optimal scheduling policies for a simple model (where cross-effects are restricted within a rack) under the assumption that jobs can be split among different machines. Here we consider the more realistic problem of "integral" assignment of jobs, and allow for cross-interference among different machines in adjacent racks in the data center. The integral assignment problem with cross-interference is NP-hard, even for a simple two machine model. We consider three different heat flow models, and give constant factor approximation algorithms for maximizing the number (or total profit) of jobs assigned in each model, without violating thermal constraints. We also consider the problem of minimizing the maximum temperature on any machine when all jobs need to be assigned, and give constant factor algorithms for this problem.
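[code sketch] The "effective load" notion referenced above can be sketched as follows: a machine's effective load (a proxy for temperature) is its own load plus a fraction of the load on nearby machines, given by a cross-interference matrix, and a job is placed only where no machine's effective load would exceed the thermal threshold. The coefficients, threshold, and greedy rule are made up for illustration; they are not the paper's approximation algorithms.

    CROSS = [            # CROSS[i][j]: how much machine j's load heats machine i
        [1.0, 0.3, 0.0],
        [0.3, 1.0, 0.3],
        [0.0, 0.3, 1.0],
    ]
    THRESHOLD = 10.0

    def effective_loads(loads):
        return [sum(CROSS[i][j] * loads[j] for j in range(len(loads)))
                for i in range(len(loads))]

    def greedy_thermal_assign(jobs, n_machines=3):
        loads = [0.0] * n_machines
        placement = []
        for job in jobs:
            feasible = []
            for m in range(n_machines):
                trial = loads[:]
                trial[m] += job
                peak = max(effective_loads(trial))
                if peak <= THRESHOLD:
                    feasible.append((peak, m))
            if not feasible:
                placement.append((job, None))      # job rejected: thermal constraint
                continue
            _, m = min(feasible)                   # coolest feasible machine
            loads[m] += job
            placement.append((job, m))
        return placement, effective_loads(loads)

    print(greedy_thermal_assign([6.0, 6.0, 6.0, 6.0]))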
- SWORD: Scalable Workload-Aware Data Placement for Transactional Workloads;
Abdul Quamar, K. Ashwin Kumar, and Amol Deshpande;
EDBT 2013.
[pdf]
[abstract] In this paper, we address the problem of transparently scaling out transactional (OLTP) workloads on relational databases, to support "database-as-a-service" in a cloud computing environment. The primary challenges in supporting such workloads include choosing how to "partition" the data across a large number of machines, minimizing the number of "distributed transactions", providing high data availability, and tolerating failures gracefully. Capturing and modeling the transactional workload over a period of time, and then exploiting that information for data placement and replication has been shown to provide significant benefits in performance, both in terms of transaction latencies and overall throughput. However, such workload-aware data placement approaches can incur very high overheads, and further, may perform worse than naive approaches if the workload changes. In this work, we propose SWORD, a scalable workload-aware data partitioning and placement approach for OLTP workloads, that incorporates a suite of novel techniques to significantly reduce the overheads incurred both during the initial placement, and during query execution at runtime. We model the workload as a hypergraph over the data items, and propose using a "hypergraph compression" technique to reduce the overheads of partitioning. To deal with workload changes, we propose an incremental data repartitioning technique that modifies data placement in small steps without resorting to complete workload repartitioning. We have built a workload-aware "active replication" mechanism in SWORD to increase availability and enable load balancing. We propose the use of "fine-grained quorums" defined at the level of groups of tuples to control the cost of distributed updates, improve throughput, and provide adaptability to different workloads. To our knowledge, SWORD is the first system that uses fine-grained quorums in this context. The results of our experimental evaluation on SWORD deployed on an Amazon EC2 cluster show that our techniques result in orders-of-magnitude reductions in the partitioning and book-keeping overheads, and improve tolerance to failures and workload changes; we also show that choosing quorums based on the query access patterns enables us to better handle query workloads with different read and write access patterns.
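[code sketch] The "fine-grained quorums" idea can be illustrated with a small sketch: each group of tuples gets its own read/write quorum sizes (with R + W > N so that reads and writes intersect), chosen from that group's observed read/write mix. The selection rule below is a simplification for illustration, not SWORD's cost model.

    N = 3   # replicas per tuple group

    def choose_quorum(reads, writes):
        """Pick (R, W) with R + W = N + 1, minimizing expected replica accesses."""
        best = min(((reads * r + writes * w, r, w)
                    for r in range(1, N + 1)
                    for w in range(1, N + 1) if r + w == N + 1),
                   key=lambda t: t[0])
        return best[1], best[2]

    workload_per_group = {
        "hot_readonly": (1000, 10),    # (read count, write count)
        "write_heavy":  (50, 800),
        "balanced":     (400, 400),
    }

    for group, (reads, writes) in workload_per_group.items():
        r, w = choose_quorum(reads, writes)
        print(f"{group}: R={r}, W={w}")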
- Saving on Cooling: The Thermal Scheduling Problem;
Koyel Mukherjee, Samir Khuller, and Amol Deshpande;
SIGMETRICS 2012 (Short paper).
[abstract] In this abstract we define some very basic scheduling problems motivated by increasing power density and consequent cooling considerations in data centers and multi-core chips.
This material is based upon work supported in part by the National Science Foundation.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the National Science Foundation.