Over the last decade, information networks have become ubiquitous. These include social
networks, communication networks, financial transaction networks, citation networks, gene regulatory
networks, disease transmission networks, ecological food networks, sensor networks, social contact
graphs, and many more. Network data arises even in mundane applications like phone call data,
IP traffic data, or parcel shipment data. Social contact graphs are expected to become available for analysis in the near future, and
can potentially be used to gain insights into various social phenomena as well as to aid in disease
outbreak detection and prevention. There is a growing need for data management systems that can support
real-time ingest, storage, querying, and complex analytics over such network data.
Network data is most naturally represented as a graph, with nodes representing the entities and edges denoting the
interactions between them. However, despite much work on graph querying algorithms and graph
programming frameworks in recent years, there is still a lack of established data management systems that
provide declarative frameworks for querying and analyzing such graph-structured data, especially
very large volumes of heterogeneous, complex-structured, and rapidly changing data.
Raw observational network data is also often noisy, and needs to be cleaned and annotated through the use
of statistical models before querying and analysis.
The increasing availability of historical traces of time-evolving graphs
has also opened up opportunities in temporal
evolutionary analysis as well as in data mining and comparative analytics over historical
information. Similarly, there is increasing interest in continuous query processing and real-time
analytics, especially anomaly or event detection, on streaming graph data. Further, graph sizes and the
number of operations that need to be supported are growing at an unprecedented pace, necessitating
the use of parallel and distributed solutions, both for efficiency and for better fault-tolerance;
however, graph operations are notoriously hard to parallelize.
In this project, we are building a graph data management system and a suite of tools aimed at supporting real-time and
historical querying and analytics over very large, dynamic, heterogeneous, and noisy graphs. Some of the key
components of our overall system are described in the publications below:
- GraphGen: Adaptive Graph Extraction and Analytics over Relational Databases;
Konstantinos Xirogiannopoulos, Virinchi Srinivas, and Amol Deshpande;
GRADES SIGMOD Workshop 2017.
[pdf]
- Extracting and Analyzing Hidden Graphs from Relational Databases;
Konstantinos Xirogiannopoulos, and Amol Deshpande;
SIGMOD 2017.
[abstract] Analyzing interconnection structures among underlying entities or objects in a dataset through the use of graph analytics has been shown to provide tremendous value in many application domains. However, graphs are not the primary representation choice for storing most data today, and in order to have access to these analyses, users are forced to manually extract data from their data stores, construct the requisite graphs, and then load them into some graph engine in order to execute their graph analysis task. Moreover, in many cases (i.e. when the graphs are dense) these graphs can turn out to be significantly larger than the initial input stored in the database, making it infeasible to construct or analyze such graphs in-memory. In this paper we address both of these challenges by building a system that enables users to "declaratively" specify graph extraction tasks over a relational database schema and then execute graph algorithms on the extracted graphs. We propose a declarative domain specific language for this purpose, and pair it up with a novel "condensed", in-memory representation that significantly reduces the memory footprint of these graphs, permitting analysis of larger-than-memory graphs. We present a general algorithm for creating such a condensed representation for a large class of graph extraction queries against arbitrary schemas. We observe that the condensed representation suffers from a "duplication" issue, that results in inaccuracies for most graph algorithms. We then present a suite of in-memory representations that handle this duplication in different ways and allow "trading off" the memory required and the computational cost for executing different graph algorithms. We also introduce several novel "deduplication" algorithms for removing this duplication in the graph, which are of independent interest for graph compression, and provide a comprehensive experimental evaluation over several real-world and synthetic datasets illustrating these trade-offs.
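As a rough illustration of the condensed-representation idea above (a sketch under assumed table and function names, not GraphGen's actual API or storage format), the following snippet contrasts materializing every co-author edge with keeping only the bipartite author-paper links and expanding neighbors lazily; the savings become significant when the extracted graph is dense.

```python
# Illustrative sketch (not GraphGen's API): an "expanded" co-authorship graph
# materializes one edge per co-author pair, while a "condensed" bipartite
# representation keeps only author-paper links and expands neighbors on demand.
from collections import defaultdict
from itertools import combinations

# Toy AuthorPaper relation: (author, paper) rows, standing in for an RDBMS table.
author_paper = [("a1", "p1"), ("a2", "p1"), ("a3", "p1"),
                ("a1", "p2"), ("a4", "p2")]

# Expanded representation: materialize every co-author edge (quadratic per paper).
paper_to_authors = defaultdict(set)
for a, p in author_paper:
    paper_to_authors[p].add(a)
expanded_edges = {frozenset(pair)
                  for authors in paper_to_authors.values()
                  for pair in combinations(sorted(authors), 2)}

# Condensed representation: keep the bipartite author<->paper links only and
# compute a node's neighbors on demand, deduplicating shared papers.
author_to_papers = defaultdict(set)
for a, p in author_paper:
    author_to_papers[a].add(p)

def condensed_neighbors(author):
    return {a for p in author_to_papers[author]
              for a in paper_to_authors[p] if a != author}

print("expanded edges:", len(expanded_edges))            # 4 materialized edges
print("condensed links:", len(author_paper))             # 5 links, grows linearly
print("neighbors of a1:", sorted(condensed_neighbors("a1")))   # ['a2', 'a3', 'a4']
```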
- Parallel SPARQL Query Optimization;
Buwen Wu, Yongluan Zhou, Hai Jin and Amol Deshpande;
ICDE 2017.
- Continuous Detection of Activity-based Subgraph Patterns on Dynamic Graphs;
Jayanta Mondal and Amol Deshpande;
DEBS 2016.
[abstract] The ability to detect and analyze interesting subgraph patterns on large and dynamic graph-structured data in real time is crucial for many applications. Such patterns often need to reason about how the nodes behave (i.e., the activity component) in the network as well as how they are connected to each other (i.e., the structural component). An example of such an activity-based subgraph pattern is a clique of users in a social network (the structural predicate), who each have posted more than 10 messages in the last 2 hours (the activity-based predicate). Detecting such complex events and analyzing them in real time can help with anomaly detection in phone call networks, advertisement targeting in social networks, malware detection in file download graphs, and so on. In this paper, we present CASQD, a system for continuous detection and analysis of such active subgraph pattern queries over large dynamic graphs. Some of the key challenges in executing such queries include: handling a wide variety of user-specified activities of interest, low selectivities of activity-based predicates and the resultant exponential search space, and high data rates often seen in these application domains. A key abstraction in CASQD is a notion called graph-view, which acts as an independence layer between the query language and the underlying physical representation of the graph and the active attributes. This abstraction is aimed at simplifying the query language, while empowering the query optimizer. Considering the balance between expressibility (i.e., patterns that cover many real-world use cases) and optimizability of such patterns, we primarily focus on efficient continuous detection of the active regular structures (specifically, active cliques, active stars, and active bi-cliques). We develop a series of optimization techniques including model-based neighborhood explorations, lazy evaluation of the activity predicates, neighborhood-based search space pruning, and others, for efficient query evaluation. We perform a thorough comparative study of the execution strategies under various settings, and show that our system is capable of achieving end-to-end throughputs over 800k/s using a single, powerful machine.
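The snippet below sketches only the semantics of such an activity-based clique query (the "more than 10 messages in the last 2 hours" example above), using a naive triangle check over currently active nodes; it is an illustration, not CASQD's incremental, optimized evaluation strategy.

```python
# Minimal sketch of an activity-based clique query's semantics (not CASQD):
# a node is "active" if it posted more than 10 messages in the last 2 hours;
# report triangles (3-cliques) whose members are all currently active.
import time
from collections import defaultdict
from itertools import combinations

WINDOW = 2 * 3600          # activity window in seconds
THRESHOLD = 10             # messages required to be "active"

edges = defaultdict(set)   # undirected adjacency lists
posts = defaultdict(list)  # node -> timestamps of recent messages

def add_edge(u, v):
    edges[u].add(v); edges[v].add(u)

def record_post(node, ts):
    posts[node].append(ts)

def is_active(node, now):
    recent = [t for t in posts[node] if now - t <= WINDOW]
    posts[node] = recent                      # evict expired timestamps
    return len(recent) > THRESHOLD

def active_triangles(now):
    active = {n for n in edges if is_active(n, now)}
    return [c for c in combinations(sorted(active), 3)
            if c[1] in edges[c[0]] and c[2] in edges[c[0]] and c[2] in edges[c[1]]]

# Toy usage: three mutually connected users, each posting 11 messages.
now = time.time()
for u, v in [("u1", "u2"), ("u2", "u3"), ("u1", "u3")]:
    add_edge(u, v)
for n in ("u1", "u2", "u3"):
    for _ in range(11):
        record_post(n, now)
print(active_triangles(now))   # [('u1', 'u2', 'u3')]
```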
- NScaleSpark: Subgraph-centric Graph Analytics on Apache Spark;
Abdul Quamar and Amol Deshpande;
NDA 2016 (International Workshop on Network Data Analytics, co-located with SIGMOD 2016).
[abstract] In this paper, we describe NScaleSpark, a framework for executing large-scale distributed graph analysis tasks on the Apache Spark platform. NScaleSpark is motivated by the increasing interest in executing rich and complex analysis tasks over large graph datasets. There is much recent work on vertex-centric graph programming frameworks for executing such analysis tasks - these systems espouse a "think-like-a-vertex" (TLV) paradigm, with some example systems being Pregel, Apache Giraph, GPS, Grace, and GraphX (built on top of Apache Spark). However, the TLV paradigm is not suitable for many complex graph analysis tasks that typically require processing of information aggregated over neighborhoods or subgraphs in the underlying graph. Instead, NScaleSpark is based on a "think-like-a-subgraph" paradigm (also recently called "think-like-an-embedding"). Here, the users specify computations to be executed against a large number of multi-hop neighborhoods or subgraphs of the data graph. NScaleSpark builds upon our prior work on the NScale system, which was built on top of the Hadoop MapReduce system. This paper describes how we reimplemented NScale on the Apache Spark platform, the key challenges therein, and the design decisions we made. NScaleSpark uses a series of RDD transformations to extract and hold the relevant subgraphs in distributed memory with minimal footprint using a cost-based optimizer. Our in-memory graph data structure enables efficient graph computations over large-scale graphs. Our experimental results over several real world data sets and applications show orders-of-magnitude improvement in performance and total cost over GraphX and other vertex-centric approaches.
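To make the "think-like-a-subgraph" contrast concrete, here is a minimal single-machine sketch (plain Python with hypothetical names, not the NScaleSpark API): each query vertex's 1-hop neighborhood is extracted as a subgraph and a user function is applied to the subgraph as a whole, something a vertex-centric program would need extra message-passing rounds to do.

```python
# Single-machine sketch of the "think-like-a-subgraph" idea (illustrative only):
# extract a 1-hop neighborhood subgraph around each vertex and run a user
# function over the whole subgraph, here the local clustering coefficient.
from itertools import combinations

graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}

def ego_subgraph(graph, center):
    nodes = {center} | graph[center]
    return {n: graph[n] & nodes for n in nodes}

def local_clustering(sub, center):
    nbrs = sorted(sub[center])
    if len(nbrs) < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in sub[u])
    return 2.0 * links / (len(nbrs) * (len(nbrs) - 1))

# "Subgraph program": one call per extracted neighborhood.
for v in graph:
    print(v, round(local_clustering(ego_subgraph(graph, v), v), 2))
# a 0.33, b 1.0, c 1.0, d 0.0
```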
- Storing and Analyzing Historical Graph Data at Scale;
Udayan Khurana, and Amol Deshpande;
EDBT 2016 (also CoRR Technical Report arXiv:1509.08960).
[abstract] The work on large-scale graph analytics to date has largely focused on the study of static properties of graph snapshots. However, a static view of interactions between entities is often an oversimplification of several complex phenomena like the spread of epidemics, information diffusion, formation of online communities, and so on. Being able to find temporal interaction patterns, visualize the evolution of graph properties, or even simply compare them across time, adds significant value in reasoning over graphs. However, because of the lack of underlying data management support, an analyst today has to manually navigate the added temporal complexity of dealing with large evolving graphs. In this paper, we present a system, called Historical Graph Store, that enables users to store large volumes of historical graph data and to express and run complex temporal graph analytical tasks against that data. It consists of two key components: a Temporal Graph Index (TGI), that compactly stores large volumes of historical graph evolution data in a partitioned and distributed fashion; it provides support for retrieving snapshots of the graph as of any timepoint in the past or evolution histories of individual nodes or neighborhoods; and a Spark-based Temporal Graph Analysis Framework (TAF), for expressing complex temporal analytical tasks and for executing them in an efficient and scalable manner. Our experiments demonstrate our system's efficient storage, retrieval and analytics across a wide variety of queries on large volumes of historical graph data.
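The following sketch illustrates the kind of node-evolution query the system targets, using an assumed flat event-log layout rather than the actual TGI/TAF interfaces: given timestamped edge additions and deletions, it returns the degree history of a single node.

```python
# Illustrative sketch of a temporal "evolution history" query (data layout and
# names are assumptions, not the TGI/TAF API): compute how one node's degree
# changes over time from a log of edge additions (+) and deletions (-).
events = [  # (timestamp, op, u, v)
    (1, "+", "a", "b"), (2, "+", "a", "c"),
    (5, "-", "a", "b"), (7, "+", "a", "d"),
]

def degree_history(events, node):
    """Return [(timestamp, degree)] for `node`, one point per change."""
    deg, history = 0, []
    for ts, op, u, v in sorted(events):
        if node in (u, v):
            deg += 1 if op == "+" else -1
            history.append((ts, deg))
    return history

print(degree_history(events, "a"))   # [(1, 1), (2, 2), (5, 1), (7, 2)]
```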
- NScale: Neighborhood-centric Large-Scale Graph Analytics in the Cloud;
Abdul Quamar, Amol Deshpande, Jimmy Lin;
To appear in VLDB Journal (also CoRR Technical Report arXiv:1405.1499).
[pdf]
[abstract] There is an increasing interest in executing rich and complex analysis tasks over large-scale graphs, many of which require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph. Examples of such tasks include ego network analysis, motif counting, finding social circles, personalized recommendations, analyzing influence cascades, etc. These tasks are not well served by the existing vertex-centric graph processing frameworks, whose computation and execution models limit the user program to directly access the state of a single vertex; this results in high communication, scheduling, and memory overheads in executing such tasks. Further, most existing graph processing frameworks typically ignore the challenges in extracting the relevant portion of the graph that an analysis task needs, and loading it onto distributed memory. In this paper, we describe NScale, a novel end-to-end graph processing framework that enables the distributed execution of complex subgraph-centric analytics over large-scale graphs in the cloud. NScale enables users to write programs at the level of subgraphs, and to specify the subgraphs of interest declaratively. NScale uses Apache YARN, a state-of-the-art framework for efficient and fault-tolerant distribution of data and computation. It features GEL, a novel graph extraction and loading phase, that extracts the relevant portions of the graph and utilizes a cost-based optimizer to partition and load the graph onto distributed memory using as few machines as possible to minimize the communication cost. It utilizes novel techniques for the distributed execution of user computation that minimize memory consumption by exploiting overlap among the subgraphs of interest. Our experimental results show orders-of-magnitude improvements in performance, and drastic reductions in the cost of analytics, over vertex-centric approaches.
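The graph extraction and loading (GEL) phase can be thought of as a packing problem; the greedy sketch below (an illustration only, not NScale's cost-based optimizer) places subgraphs of interest onto machines so that overlapping subgraphs share storage, subject to a per-machine budget on distinct nodes.

```python
# Simplified, greedy sketch of GEL-style packing (not NScale's actual optimizer):
# place each subgraph on a machine so that overlapping subgraphs share storage,
# subject to a per-machine memory budget expressed as a node count.
def pack_subgraphs(subgraphs, capacity):
    """subgraphs: list of node sets; capacity: max distinct nodes per machine."""
    machines = []                       # each machine holds a set of nodes
    assignment = []                     # subgraph index -> machine index
    for sg in subgraphs:
        # Prefer the machine where this subgraph adds the fewest new nodes.
        best, best_new = None, None
        for i, nodes in enumerate(machines):
            new = len(sg - nodes)
            if len(nodes) + new <= capacity and (best_new is None or new < best_new):
                best, best_new = i, new
        if best is None:                # no machine fits: open a new one
            machines.append(set())
            best = len(machines) - 1
        machines[best] |= sg
        assignment.append(best)
    return machines, assignment

subgraphs = [{"a", "b", "c"}, {"b", "c", "d"}, {"x", "y"}, {"c", "d", "e"}]
machines, assignment = pack_subgraphs(subgraphs, capacity=5)
print(assignment)                        # [0, 0, 1, 0]
print([sorted(m) for m in machines])     # [['a', 'b', 'c', 'd', 'e'], ['x', 'y']]
```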
- GraphGen: Exploring Interesting Graphs in Relational Data (Demonstration Proposal);
Konstantinos Xirogiannopoulos, Udayan Khurana, and Amol Deshpande;
VLDB 2015.
[abstract] Analyzing interconnection structures among the data through the use of "graph algorithms" and "graph analytics" has been shown to provide tremendous value in many application domains. However, graphs are not the primary choice in the way that most data is currently stored, and users who want to employ graph analytics are forced to extract data from their data stores, construct the requisite graphs, and then use a specialized engine to write and execute their graph analysis tasks. This cumbersome and costly process not only raises barriers in using graph analytics, but also makes it hard to "explore" and "identify" hidden or implicit graphs in the data. Here we demonstrate a system, called "GraphGen", that enables users to declaratively specify graph extraction tasks over relational databases, to visually explore the extracted graphs, and to write and execute graph algorithms over them. We also demonstrate how unifying the extraction tasks and the graph algorithms enables significant optimizations that would not be possible otherwise.
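As a toy end-to-end example of the extraction step (the mini specification and its SQL translation are assumptions for illustration, not GraphGen's actual DSL), the snippet below compiles a co-authorship graph definition into SQL and runs it inside SQLite, so the graph never has to be exported by hand.

```python
# Hypothetical sketch of declarative graph extraction over a relational table
# (not GraphGen's DSL): nodes = authors; edge(u, v) iff u and v share a paper.
# The "spec" is hand-translated to SQL and executed in the database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE AuthorPaper(author TEXT, paper TEXT);
    INSERT INTO AuthorPaper VALUES
        ('a1','p1'), ('a2','p1'), ('a3','p1'), ('a1','p2'), ('a4','p2');
""")

nodes_sql = "SELECT DISTINCT author FROM AuthorPaper"
edges_sql = """
    SELECT DISTINCT ap1.author, ap2.author
    FROM AuthorPaper ap1 JOIN AuthorPaper ap2 ON ap1.paper = ap2.paper
    WHERE ap1.author < ap2.author
"""
nodes = sorted(r[0] for r in conn.execute(nodes_sql))
edges = sorted(conn.execute(edges_sql))
print(nodes)   # ['a1', 'a2', 'a3', 'a4']
print(edges)   # [('a1', 'a2'), ('a1', 'a3'), ('a1', 'a4'), ('a2', 'a3')]
```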
- VERTEXICA: Your Relational Friend for Graph Analytics!;
Alekh Jindal, Praynaa Rawlani, Eugene Wu, Samuel Madden, Amol Deshpande, Mike Stonebraker;
VLDB Demo 2014.
[pdf]
[abstract] In this paper, we present Vertexica, a graph analytics tool on top of a relational database, which is user friendly and yet highly efficient. Instead of constraining programmers to SQL, Vertexica offers a popular vertex-centric query interface, which is more natural for analysts to express many graph queries. The programmers simply provide their vertex-compute functions and Vertexica takes care of efficiently executing them in the standard SQL engine. The advantage of using Vertexica is its ability to leverage the relational features and enable much more sophisticated graph analysis. These include expressing graph algorithms which are difficult in vertex-centric but straightforward in SQL and the ability to compose end-to-end data processing pipelines, including pre- and post-processing of graphs as well as combining multiple algorithms for deeper insights. Vertexica has a graphical user interface and we outline several demonstration scenarios including interactive graph analysis, complex graph analysis, and continuous and time series analysis.
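The sketch below shows the flavor of running a vertex-centric-style computation inside a relational engine (illustrative only; it uses SQLite and hand-written SQL, not Vertexica's interface): single-source BFS distances are computed by repeatedly joining the current frontier with the edge table, one SQL statement per superstep.

```python
# Sketch of a vertex-centric-style computation expressed in SQL (not Vertexica):
# single-source BFS distances via repeated frontier-edge joins in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges(src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('a','b'), ('b','c'), ('c','d'), ('a','c');
    CREATE TABLE dist(node TEXT PRIMARY KEY, d INTEGER);
    INSERT INTO dist VALUES ('a', 0);
""")

level = 0
while True:
    # One "superstep": every vertex reached at this level relaxes its neighbors.
    inserted = conn.execute("""
        INSERT OR IGNORE INTO dist(node, d)
        SELECT e.dst, ? + 1
        FROM dist JOIN edges e ON dist.node = e.src
        WHERE dist.d = ?
    """, (level, level)).rowcount
    if inserted == 0:          # no new vertices reached: fixpoint
        break
    level += 1

print(sorted(conn.execute("SELECT node, d FROM dist")))
# [('a', 0), ('b', 1), ('c', 1), ('d', 2)]
```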
- NScale: Neighborhood-centric Analytics on Large Graphs;
Abdul Quamar, Amol Deshpande, Jimmy Lin;
VLDB Demo 2014.
[pdf]
[abstract] There is an increasing interest in executing rich and complex analysis tasks over large-scale graphs, many of which require processing and reasoning about a large number of multi-hop neighborhoods or subgraphs in the graph. Examples of such tasks include ego network analysis, motif counting in biological networks, finding social circles, personalized recommendations, link prediction, anomaly detection, analyzing influence cascades, and so on. These tasks are not well served by existing vertex-centric graph processing frameworks whose computation and execution models limit the user program to directly access the state of a single vertex, resulting in high communication, scheduling, and memory overheads in executing such tasks. Further, most existing graph processing frameworks also typically ignore the challenges in extracting the relevant portions of the graph that an analysis task is interested in, and loading it onto distributed memory.
In this demonstration proposal, we describe NScale, a novel end-to-end graph processing framework that enables the distributed execution of complex neighborhood-centric analytics over large-scale graphs in the cloud. NScale enables users to write programs at the level of neighborhoods or subgraphs. NScale uses Apache YARN for efficient and fault-tolerant distribution of data and computation; it features GEL, a novel graph extraction and loading phase, that extracts the relevant portions of the graph and loads them into distributed memory using as few machines as possible. NScale utilizes novel techniques for the distributed execution of user computation that minimize memory consumption by exploiting overlap among the neighborhoods of interest. A comprehensive experimental evaluation shows orders-of-magnitude improvements in performance and total cost over vertex-centric approaches.
- EAGr: Supporting Continuous Ego-centric Aggregate Queries over Large Dynamic Graphs;
Jayanta Mondal and Amol Deshpande;
SIGMOD 2014 (also CoRR Technical Report arXiv:1404.6570).
[pdf]
[abstract] In this work, we present EAGr, a system for supporting large numbers of continuous neighborhood-based ("ego-centric") aggregate queries over large, highly dynamic, and rapidly evolving graphs. Examples of such queries include computation of personalized, tailored trends in social networks, anomaly/event detection in financial transaction networks, local search and alerts in spatio-temporal networks, to name a few. Key challenges in supporting such continuous queries include high update rates typically seen in these situations, large numbers of queries that need to be executed simultaneously, and stringent low latency requirements. We propose a flexible, general, and extensible in-memory framework for executing different types of ego-centric aggregate queries over large dynamic graphs with low latencies. Our framework is built around the notion of an aggregation overlay graph, a pre-compiled data structure that encodes the computations to be performed when an update/query is received. The overlay graph enables sharing of partial aggregates across multiple ego-centric queries (corresponding to the nodes in the graph), and also allows partial pre-computation of the aggregates to minimize the query latencies. We present several highly scalable techniques for constructing an overlay graph given an aggregation function, and also design incremental algorithms for handling structural changes to the underlying graph. We also present an optimal, polynomial-time algorithm for making the pre-computation decisions given an overlay graph, and evaluate an approach to incrementally adapt those decisions as the workload changes. Although our approach is naturally parallelizable, we focus on a single-machine deployment and show that our techniques can easily handle graphs of size up to 320 million nodes and edges, and achieve update/query throughputs of over 500K/s using a single, powerful machine.
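A stripped-down sketch of the aggregation-overlay idea follows (the overlay structure and names are assumptions, not EAGr's implementation): two ego-centric sum queries whose neighborhoods overlap share one partial-aggregate node, so an update from a shared neighbor is applied once and reused by both queries.

```python
# Stripped-down sketch of an aggregation overlay (illustrative, not EAGr):
# ego-centric query = "sum of post counts over my neighbors"; the overlapping
# subset {u2, u3} is aggregated once and shared by both egos.
from collections import defaultdict

neighbors = {"ego1": {"u1", "u2", "u3"}, "ego2": {"u2", "u3", "u4"}}

# Overlay: one shared partial for the common subset, plus per-ego residuals.
partials = {"shared": {"u2", "u3"}, "ego1_rest": {"u1"}, "ego2_rest": {"u4"}}
overlay = {"ego1": ["shared", "ego1_rest"], "ego2": ["shared", "ego2_rest"]}

partial_sums = defaultdict(int)
producer_of = {u: p for p, us in partials.items() for u in us}

def on_update(user, count):
    """A user posts `count` new messages: update exactly one partial aggregate."""
    partial_sums[producer_of[user]] += count

def query(ego):
    return sum(partial_sums[p] for p in overlay[ego])

for user, c in [("u1", 2), ("u2", 5), ("u3", 1), ("u4", 3)]:
    on_update(user, c)
print(query("ego1"), query("ego2"))   # 8 9
```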
- Subgraph Pattern Matching over Uncertain Graphs with Identity Linkage Uncertainty;
Walaa Eldin Moustafa, Angelika Kimmig, Amol Deshpande, Lise Getoor;
ICDE 2014 (also CoRR Technical Report arXiv:1305.7006).
[pdf]
[abstract] There is a growing need for methods which can capture uncertainties and answer queries over graph-structured data. Two common types of uncertainty are uncertainty over the attribute values of nodes and uncertainty over the existence of edges. In this paper, we combine those with identity uncertainty. Identity uncertainty represents uncertainty over the mapping from objects mentioned in the data, or references, to the underlying real-world entities. We propose the notion of a probabilistic entity graph (PEG), a probabilistic graph model that defines a distribution over possible graphs at the entity level. The model takes into account node attribute uncertainty, edge existence uncertainty, and identity uncertainty, and thus enables us to systematically reason about all three types of uncertainties in a uniform manner. We introduce a general framework for constructing a PEG given uncertain data at the reference level and develop highly efficient algorithms to answer subgraph pattern matching queries in this setting. Our algorithms are based on two novel ideas: context-aware path indexing and reduction by join-candidates, which drastically reduce the query search space. A comprehensive experimental evaluation shows that our approach outperforms baseline implementations by orders of magnitude.
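To make the query semantics concrete, the following Monte Carlo sketch estimates the probability that a tiny pattern matches in an uncertain graph with edge-existence and label uncertainty; this is only the possible-worlds semantics spelled out by sampling, not the paper's index-based algorithms, and it omits identity uncertainty.

```python
# Monte Carlo sketch of querying an uncertain graph (semantics only; the paper
# uses indexing, not sampling, and also models identity uncertainty): estimate
# the probability that some 'A'-labeled node is adjacent to a 'B'-labeled node.
import random

random.seed(7)
edge_prob = {("n1", "n2"): 0.9, ("n2", "n3"): 0.4}
label_dist = {"n1": {"A": 0.8, "B": 0.2},
              "n2": {"B": 1.0},
              "n3": {"A": 0.5, "B": 0.5}}

def sample_world():
    edges = [e for e, p in edge_prob.items() if random.random() < p]
    labels = {n: random.choices(list(d), weights=d.values())[0]
              for n, d in label_dist.items()}
    return edges, labels

def pattern_matches(edges, labels):
    return any({labels[u], labels[v]} == {"A", "B"} for u, v in edges)

trials = 20000
hits = sum(pattern_matches(*sample_world()) for _ in range(trials))
print(f"estimated match probability: {hits / trials:.3f}")
# close to the exact value 1 - (1 - 0.9*0.8)*(1 - 0.4*0.5) = 0.776
```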
- Stream Querying and Reasoning on Social Data;
Jayanta Mondal and Amol Deshpande;
Chapter in Encyclopedia of Social Network Analysis and Mining, Springer, 2014.
[pdf]
[abstract] In this paper, we present an introduction to the new research area of "stream querying and reasoning" over social data. This area combines aspects from several well-studied research areas, chief among them, social network analysis, graph databases, and data streams. We provide a formal definition of the problem, survey the related prior work, and discuss some of the key research challenges that need to be addressed (and some of the solutions that have been proposed). We note that we use the term "stream reasoning" in this paper to encompass a broad range of tasks including various types of analytics, probabilistic reasoning, statistical inference, and logical reasoning. We contrast our use of this term with the recent work where this term has been used more specifically to refer to integration of logical reasoning systems with data streams in the context of the Semantic Web. Given the vast amount of work on this and related topics, it is not our intention to be comprehensive in this brief overview. Rather we aim to cover some of the key ideas and representative work.
- GrDB: A System for Declarative and Interactive Analysis of Noisy Information Networks;
Walaa Eldin Moustafa, Hui Miao, Amol Deshpande, Lise Getoor;
SIGMOD Demo 2013.
[abstract] There is a growing interest in methods for analyzing data describing networks of all types, including biological, physical, social, and scientific collaboration networks. Typically the data describing these networks is observational, and thus noisy and incomplete; it is often at the wrong level of fidelity and abstraction for meaningful data analysis. This demonstration presents GrDB, a system that enables data analysts to write declarative programs to specify and combine different network data cleaning tasks, visualize the output, and engage in the process of decision review and correction if necessary. The declarative interface of GrDB makes it very easy to quickly write analysis tasks and execute them over data, while the visual component facilitates debugging the program and performing fine grained corrections.
- HiNGE: Enabling Temporal Network Analytics at Scale;
Udayan Khurana, and Amol Deshpande;
SIGMOD Demo 2013.
[abstract] In this demonstration proposal, we present HiNGE (Historical Network/Graph Explorer), a system that enables interactive exploration and analytics over large evolving networks through visualization and node-centric metric computations. HiNGE is built on top of a distributed graph database system that stores the entire history of a network, and enables efficiently retrieving and analyzing multiple graph snapshots from arbitrary time points in the past. The cornerstone of our system is a novel hierarchical parallelizable index structure, called DeltaGraph, that enables compact recording of the historical trace of a network on disk, and supports efficient retrieval of historical snapshots for single-site or parallel processing. The other key component of our system is an in-memory graph data structure, called GraphPool, that can maintain hundreds of historical graph snapshots in main memory in a non-redundant manner. We demonstrate the efficiency and usability of our system at performing temporal analytics over large-scale dynamic networks.
- Efficient Snapshot Retrieval over Historical Graph Data;
Udayan Khurana, and Amol Deshpande;
ICDE 2013 (also CoRR Technical Report arXiv:1207.5777).
[pdf]
[abstract] We address the problem of managing historical data for large evolving information networks like social networks or citation networks, with the goal to enable temporal and evolutionary queries and analysis. We present the design and architecture of a distributed graph database system that stores the entire history of a network and provides support for efficient retrieval of multiple graphs from arbitrary time points in the past, in addition to maintaining the current state for ongoing updates. Our system exposes a general programmatic API to process and analyze the retrieved snapshots. We introduce DeltaGraph, a novel, extensible, highly tunable, and distributed hierarchical index structure that enables compactly recording the historical information, and that supports efficient retrieval of historical graph snapshots for single-site or parallel processing. Along with the original graph data, DeltaGraph can also maintain and index auxiliary information; this functionality can be used to extend the structure to efficiently execute queries like subgraph pattern matching over historical data. We develop analytical models for both the storage space needed and the snapshot retrieval times to aid in choosing the right parameters for a specific scenario. In addition, we present strategies for materializing portions of the historical graph state in memory to further speed up the retrieval process. Secondly, we present an in-memory graph data structure called GraphPool that can maintain hundreds of historical graph instances in main memory in a non-redundant manner. We present a comprehensive experimental evaluation that illustrates the effectiveness of our proposed techniques at managing historical graph information.
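The core retrieval idea can be sketched in a few lines (a single-machine illustration; DeltaGraph itself is hierarchical, distributed, and tunable): keep a few materialized snapshots plus the event deltas between them, and reconstruct the graph as of any past time point by rolling deltas forward from the nearest earlier snapshot.

```python
# Single-machine sketch of delta-based snapshot retrieval (illustration of the
# basic idea only; DeltaGraph is hierarchical and distributed): materialized
# snapshots plus ordered deltas allow reconstructing the graph as of time t.
import bisect

# Materialized snapshots: time -> set of edges at that time.
snapshots = {0: set(), 10: {("a", "b"), ("b", "c")}}
snap_times = sorted(snapshots)

# Deltas: chronologically ordered (time, op, edge) events since time 0.
deltas = [(3, "+", ("a", "b")), (6, "+", ("b", "c")),
          (12, "-", ("a", "b")), (15, "+", ("c", "d"))]

def snapshot_at(t):
    # Start from the latest materialized snapshot at or before t ...
    base_time = snap_times[bisect.bisect_right(snap_times, t) - 1]
    edges = set(snapshots[base_time])
    # ... then roll forward the deltas in (base_time, t].
    for ts, op, e in deltas:
        if base_time < ts <= t:
            if op == "+":
                edges.add(e)
            else:
                edges.discard(e)
    return edges

print(sorted(snapshot_at(7)))    # [('a', 'b'), ('b', 'c')]
print(sorted(snapshot_at(13)))   # [('b', 'c')]
```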
- Managing Large Dynamic Graphs Efficiently;
Jayanta Mondal, Amol Deshpande;
SIGMOD 2012.
[pdf]
[abstract] There is an increasing need to ingest, manage, and query large volumes of graph-structured data arising in applications like social networks, communication networks, biological networks, etc. Graph databases that can explicitly reason about the graphical nature of the data, that can support flexible schemas and node/edge-centric analysis and querying, are ideal for storing such data. However, although there is much work on single-site graph databases and on efficiently executing specific types of queries over large graphs, to date there is little work on understanding the challenges in distributed dynamic graph data management, needed to handle the large scale of such data. In this paper, we propose the design of an in-memory, distributed graph data management system aimed at managing a large dynamically changing graph and supporting low-latency query processing over it. The key challenge in a distributed graph database is that partitioning a graph across a set of machines inherently results in a large number of distributed traversals across partitions to answer even simple queries. We propose a suite of novel techniques to minimize the communication bandwidth and the storage requirements. We have implemented our framework as a middle-ware on top of an open-source key-value store. We evaluate our system on a social graph, and show that our system is able to handle very large graphs efficiently, and that it reduces the network bandwidth consumption significantly.
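The sketch below illustrates the central difficulty (using in-process dictionaries as stand-ins for key-value store shards; it is not the paper's middleware): once adjacency lists are hash-partitioned across machines, even a 2-hop query around a single node fans out to several partitions, which is what the proposed replication and co-location techniques aim to mitigate.

```python
# Toy sketch of the distributed-traversal problem (dictionaries stand in for
# key-value store shards): hash-partitioned adjacency lists force a 2-hop query
# to contact several partitions. Partition counts vary run to run because
# Python's string hash is randomized; the point is the fan-out itself.
NUM_PARTITIONS = 3
adjacency = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}

partitions = [dict() for _ in range(NUM_PARTITIONS)]
def owner(node):
    return hash(node) % NUM_PARTITIONS
for node, nbrs in adjacency.items():
    partitions[owner(node)][node] = nbrs

def two_hop(node):
    """Fetch the 2-hop neighborhood, counting distinct partitions touched."""
    touched = {owner(node)}
    frontier = partitions[owner(node)][node]
    result = set(frontier)
    for n in frontier:
        touched.add(owner(n))
        result.update(partitions[owner(n)][n])
    result.discard(node)
    return result, len(touched)

nbrs, touched = two_hop("a")
print(sorted(nbrs), "partitions contacted:", touched)   # ['b', 'c', 'd'] ...
```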
- Ego-centric Graph Pattern Census;
Walaa Eldin Moustafa, Amol Deshpande, Lise Getoor;
ICDE 2012.
[abstract] There is increasing interest in analyzing networks of all types including social, biological, sensor, computer, and transportation networks. Broadly speaking, we may be interested in global network-wide analysis (e.g., centrality analysis, community detection) where the properties of the entire network are of interest, or local ego-centric analysis where the focus is on studying the properties of nodes (egos) by analyzing their neighborhood subgraphs. In this paper we propose and study ego-centric pattern census queries, a new type of graph analysis query, where a given structural pattern is searched in every node's neighborhood and the counts are reported or used in further analysis. This kind of analysis is useful in many domains in social network analysis including opinion leader identification, node classification, link prediction, and role identification. We propose an SQL-based declarative language to support this class of queries, and develop a series of efficient query evaluation algorithms for it. We evaluate our algorithms on a variety of synthetically generated graphs. We also show an application of our language in a real-world scenario for predicting future collaborations from DBLP data.
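A brute-force sketch of the query semantics follows (the paper's contribution is the declarative language and efficient evaluation, not this naive enumeration): for every node, count the injective mappings of a small pattern into its 1-hop ego network, with pattern vertex 0 pinned to the ego.

```python
# Brute-force sketch of an ego-centric pattern census (semantics only, not the
# paper's SQL-based evaluation): count injective mappings of a small pattern
# into each node's 1-hop ego network, with pattern vertex 0 pinned to the ego.
from itertools import permutations

graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"a", "c"}}
pattern = [(0, 1), (1, 2), (0, 2)]          # a triangle through the ego
k = 1 + max(max(e) for e in pattern)        # number of pattern vertices

def ego_census(graph, pattern, ego):
    nodes = {ego} | graph[ego]
    count = 0
    for mapping in permutations(nodes - {ego}, k - 1):
        assign = (ego,) + mapping
        if all(assign[v] in graph[assign[u]] for u, v in pattern):
            count += 1
    return count

for v in graph:
    print(v, ego_census(graph, pattern, v))
# a 4, b 2, c 4, d 2   (each triangle counted once per ordering of the
#                       two non-ego vertices)
```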
- Declarative Analysis of Noisy Information Networks;
Walaa Eldin Moustafa, Galileo Namata, Amol Deshpande, Lise Getoor;
ICDE Workshop on Graph Data Management: Techniques and Applications (GDM 2011).
[abstract] There is a growing interest in methods for analyzing data describing networks of all types, including information, biological, physical, and social networks. Typically the data describing these networks is observational, and thus noisy and incomplete; it is often at the wrong level of fidelity and abstraction for meaningful data analysis. This has resulted in a growing body of work on extracting, cleaning, and annotating network data. Unfortunately, much of this work is ad hoc and domain-specific. In this paper, we present the architecture of a data management system that enables efficient, declarative analysis of large-scale information networks. We identify a set of primitives to support the extraction and inference of a network from observational data, and describe a framework that enables a network analyst to easily implement and combine new extraction and analysis techniques, and efficiently apply them to large observation networks. The key insight behind our approach is to "decouple", to the extent possible, (a) the operations that require traversing the graph structure (typically the computationally expensive step), from (b) the operations that do the modification and update of the extracted network. We present an analysis language based on "Datalog", and show how to use it to cleanly achieve such decoupling. We briefly describe our prototype system that supports these abstractions. We include a preliminary performance evaluation of the system and show that our approach scales well and can efficiently handle a wide spectrum of data cleaning operations on network data.
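The sketch below illustrates the decoupling idea on a toy entity-resolution task (plain Python with made-up data; the paper uses a Datalog-based language): a read-only traversal phase scores candidate node merges by neighborhood overlap, and a separate update phase applies the chosen merges to the extracted network.

```python
# Toy sketch of decoupling graph traversal from graph update (illustrative;
# the paper's framework uses a Datalog-based language): phase 1 only reads the
# graph to propose merges, phase 2 only applies the chosen merges.
graph = {"jdoe": {"p1", "p2"}, "j_doe": {"p1", "p2", "p3"}, "asmith": {"p3"}}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Phase 1 (read-only traversal): propose reference pairs that likely refer to
# the same underlying entity, based on neighborhood similarity.
candidates = [(u, v) for u in graph for v in graph
              if u < v and jaccard(graph[u], graph[v]) >= 0.5]

# Phase 2 (updates only): merge each proposed pair into a single entity node.
for keep, drop in candidates:
    graph[keep] |= graph.pop(drop)

print(sorted(graph))   # ['asmith', 'j_doe']  ('jdoe' merged into 'j_doe')
```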
This material is based upon work supported in part by the National Science Foundation under
Grants
.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the National Science Foundation.