Graph is an abstract data structure that has high expressive power. Many real-life applications can be modeled by graphs, including biological networks, semantic web and social networks. Thus, a graph engine is important to many applications. Currently, there are several players in this field, including Neo4j, HyperGraphDB, InfiniteGraph, etc. Neo4j is a disk-based transactional graph database. HyperGraphDB is based on key/value pair store Berkeley DB. InfiniteGraph is a distributed system for large graph data analysis.
In 2009, Google announced Pregel as its large scale graph processing platform. Pregel is a batch system, and it does not support online query processing or graph serving. In comparison, Trinity supports both online query and offline batch processing. Furthermore, batch processing in Pregel is strictly synchronized, while Trinity supports asynchronized computation for better performance.
Features of Trinity
- Data model: hypergraph.
- Distributed: Trinity can be deployed on one machine or hundreds of machines.
- A graph database: Trinity is a memory-based graph store with rich database features, including highly concurrent online query processing, ACI transaction support, etc. Currently, Trinity provides C# APIs to the user for graph processing.
- A parallel graph processing system: Trinity supports large scale, offline batch processing. Both Synchronous and Asynchronous batch computation is supported.
Graph Model
Trinity adopts the hypergraph model. The difference between a simple graph and a hypergraph is that an edge in a hypergraph (called hyperedge) connects an arbitrary number of nodes, while an edge in a simple graph connects two nodes only.Hypergraphs are more general than simple graphs:
- A hypergraph model is more intuitive to many applications, because many relationships are not one-one relationships.
- Some multilateral relationships cannot easily be modeled by simple graphs. Naïve modeling by simple graphs often leads to information loss.
Trinity is a Distributed Graph Database
A graph database should support some essential database features, such as indexing for query, transactions, concurrency control and consistency maintenance.Trinity supports content-rich graphs. Each node (or edge) is associated with a set of data, or a set of key/value pairs. In other words, nodes and edges in Trinity are of heterogeneous types.
Trinity is optimized for concurrent online query processing. When deployed on a single machine, Trinity can access 1,000,000 nodes in one second (e.g., when performing BFS). When deployed over a network, the speed is affected by network latency. Trinity provides a graph partitioning mechanism to minimize latency. We are deploying Trinity on infiniband networks, and we will report results soon.
To support highly efficient online query processing, Trinity deploys various types of indices. Currently, we provide trie and hash for accessing node/edge names and key/value pairs associated with nodes/edges. We are implementing structural index for subgraph matching.
Trinity also provides support for concurrent updates on graphs. It implements transaction, concurrency control, and consistency.
Currently, Trinity does not have a graph query language yet. Graph accesses are performed through C# APIs. We are designing a high level query language for Trinity.
Trinity is a Distributed Parallel Platform for Graph Data
Many operations on graphs are carried out in batch mode, for example, PageRank, shortest path discovery, frequent subgraph mining, random walk, graph partitioning, etc.Like Google's Pregel, Trinity supports node-based parallel processing on graphs. Through a web portal, the user provides a script (currently C# code or a DLL) to specify the computation to be carried out on a single node, including what messages it passes to its neighbors. The system will carry out the computation in parallel.
Unlike Google's Pregel, operations on nodes do not have to be conducted in strictly synchronous manner. Certain operations (e.g., shortest path discovery) can be performed in an asynchronous mode for better performance.
As an example, here is the code for synchronous shortest path search (pseudocode, C# code), and here is the code for asynchronous shortest path search (pseudocode, C# code).
We are also designing a high level language so that users can write their scrips with ease.
Trinity Architecture
Trinity is based on memory cloud. It uses memory as the main storage and disk is only used as the backup storage.Applications
As more and more applications handle graph data, we expect Trinity will have many applications. Currently, Trinity is supporting the following two applications: Probase (a research prototype) and AEther (a production system). If your applications require graph engine support, please let us know.Trinity is the infrastructure of Probase, a large-scale knowledgebase automatically acquired from the web. Probase has millions of nodes (representing concepts) and edges (represent relationships). Hypergraphs are more appropriate than simple graphs for modeling knowledge. Trinity is used for: 1) taxonomy building; 2) data integration (e.g. adding Freebase data into Probase); 3) querying Probase.Microsoft Bing’s AEther project now uses Trinity for managing AEther’s experimental data, which consists of large number of workflows, and the evolutions among the workflows. Trinity is the backend graph storage engine of AEther's workflow management system. We are adding more functionalities, in particular, subgraph matching and frequent subgraph mining, to support the project.
Project Contact
Bin Shao(binshao@microsoft.com)Haixun Wang ((haixunw@microsoft.com) Sphere: Related Content
No hay comentarios:
Publicar un comentario