## ABSTRACT

In this systems paper, we present MillenniumDB: a novel graph database engine that is modular, persistent, and open source. MillenniumDB is based on a graph data model, which we call domain graphs, that provides a simple abstraction upon which a variety of popular graph models can be supported, thus providing a flexible data management engine for diverse types of knowledge graph. The engine itself is founded on a combination of tried and tested techniques from relational data management, state-of-the-art algorithms for worst-case-optimal joins, as well as graph-specific algorithms for evaluating path queries. In this paper, we present the main design principles underlying MillenniumDB, describing the abstract graph model and query semantics supported, the concrete data model and query syntax implemented, as well as the storage, indexing, query planning and query evaluation techniques used. We evaluate MillenniumDB over real-world data and queries from the Wikidata knowledge graph, where we find that it outperforms other popular persistent graph database engines (including both enterprise and open source alternatives) that support similar query features.

## 1. INTRODUCTION

Recent years have seen growing interest in graph databases [1], wherein nodes represent entities of interest, and edges represent relations between those entities. In comparison with alternative data models, graphs offer a flexible and often more intuitive representation of particular domains [2]. Graphs forgo the need to define a fixed (e.g., relational) schema for the domain upfront, and allow for modeling and querying cyclical relations between entities that are not well-supported in other data models (e.g., tree-based models, such as XML and JSON). Graphs have long been used as an intuitive way to model data in domains such as social networks, transport networks, genealogy, biological networks, etc. Graph databases further enable specific forms of querying, such as path queries that find entities related by arbitrary-length paths in the graph. Graph databases have become popular in the context of NoSQL [3], where alternatives to relational databases are sought for specialized scenarios; Linked Data [4], where graph-structured data are published and interlinked on the Web; and recently Knowledge Graphs [5], where diverse data are integrated at large scale into a graph.

Recent years have also seen a growing number of models, languages, techniques and systems emerge for managing and querying graph databases [1, 2]. In the context of NoSQL systems, Neo4j [6], which uses the query language Cypher [7] and the property graph data model [2], is a leading graph database system in practice.^{①} Other popular graph database systems include ArangoDB [8], JanusGraph [9], OrientDB [10], TigerGraph [11], etc., which support Gremlin [12] and other custom graph query languages. We also find graph database systems supporting the RDF data model and SPARQL query language [13], including Allegrograph [14], Amazon Neptune [15], Blazegraph [16], GraphDB [17], Jena TDB [18], Stardog [19], Virtuoso [20], and (many) more besides [13]. There are now many graph database systems to choose from.

Many of these graph databases implement their own graph model, query language, etc. An open challenge is to design a graph database engine that is both *interoperable,* i.e., able to seamlessly support the diverse graph data models now popular in practice; and *efficient,* i.e., achieving query performance comparable to (or ideally better than) that of systems built with a specific graph data model in mind. These goals are key to use-cases involving diverse, large-scale knowledge graphs. One concrete example is that of the Wikidata knowledge graph [21], which is composed of billions of statements, tens of thousands of node types, thousands of edge types, complex meta-data on edges, etc. Wikidata's query service (currently powered by Blazegraph [16]) receives on the order of millions of queries per day [22]. Interoperability in this setting would allow clients to seamlessly query Wikidata using their preferred syntax. However, as we will argue in Section 2.2, none of the graph models implemented by the aforementioned engines is well-suited for knowledge graphs like Wikidata; for example, for representing meta-data on edges (per Wikidata's *qualifiers* [21]), RDF requires reification [23], which adds bloat and indirection to the graph, while property graphs cannot link edges to nodes [24], requiring syntactic workarounds. Our abstract graph model, which we call *domain graphs*, is designed with such knowledge graphs in mind [24].^{②} Efficiency then involves supporting more clients posing increasingly complex queries in less time.

Inspired by use-cases such as Wikidata, we propose MillenniumDB: an open-source graph database engine designed from the ground up to achieve both interoperability and efficiency. To achieve *interoperability,* rather than implement the different data models from scratch, we adopt a common graph data model, called domain graphs, which can represent popular graph models in practice, and upon which MillenniumDB is founded. We then abstract the key query features common to different graph database engines, capturing them first as a formal query language over which concrete query syntax can be layered. Within this combination, we cover novel features, such as returning shortest paths (like Cypher) that match regular expressions (like SPARQL). In terms of *efficiency,* it is not trivial to know, *a priori,* which techniques are well-suited for evaluating real-world workloads over our model. To achieve efficient query processing, we thus incorporate a mix of both traditional and state-of-the-art techniques for evaluating graph patterns and path queries, adapting them to the domain graph model, and evaluating them empirically in a real-world setting to ascertain which offer the best performance in practice.

*Contributions.* The contributions of this paper are as follows:

- the domain graph and property domain graph data models, which allow for succinctly representing graph data models popular in practice, including RDF graphs, RDF-star graphs, property graphs, and the Wikidata knowledge graph [21];
- a formal query language based on domain graphs that captures key features of popular query languages for graph databases, along with a concrete query syntax;
- an indexing scheme and query engine designed for domain graphs that incorporates both traditional and state-of-the-art techniques, with optimizations dedicated to the evaluation of graph patterns and path queries;
- experiments over the Wikidata knowledge graph [21], involving real-world graph data and queries, comparing algorithms internal to MillenniumDB as well as other graph database engines.

Our experimental results highlight the benefits, for example, of incorporating worst-case-optimal join algorithms when evaluating complex graph patterns (with many joins) versus a more traditional approach based on applying binary joins with a Selinger-based query engine. We further compare the performance associated with different graph search algorithms in the context of path queries. On a more practical note, we show that MillenniumDB, under optimal configurations, clearly outperforms prominent graph database systems—namely Blazegraph, Neo4j, Jena and Virtuoso—and discuss why. We further publish a first release of MillenniumDB as an open source graph database engine [25], which we plan to extend in future in order to support more query syntax, query features, transactional updates, index structures, and more.

*Paper structure.* The rest of this paper is structured as follows: In Section 2 we describe existing graph data models and their limitations. In Section 3, we propose *domain graphs* as an abstraction of these models used in MillenniumDB. In Section 4, we describe the query language of MillenniumDB, and how it takes advantage of domain graphs. In Section 5, we explain how MillenniumDB stores data and evaluates queries. In Section 6, we provide an experimental evaluation of the proposed methods on a large body of queries over the Wikidata knowledge graph. In Section 7, we provide some concluding remarks and ideas for future research.

## 2. EXISTING GRAPH DATA MODELS AND THEIR LIMITATIONS

In this section we briefly recap the popular graph data models in use today, and discuss their limitations when modeling real-world datasets.

### 2.1 Graph Data Models

*RDF and RDF∗.* One of the simplest models used for representing knowledge graphs is based on *directed labeled graphs,* composed of a set of edges of the form (*a*, *b*, *c*), where *a* is called the source node, *b* the edge label, and *c* the target node. In the context of knowledge graphs, nodes are used to represent entities and edges represent binary relations between pairs of entities. For example, with the edge in Figure 1 we can state that Michelle Bachelet was (or is) the president of Chile.

Such graphs are the basis of RDF [27], where the source node, edge label and target node are called *subject, predicate* and *object,* respectively. Given a universe Obj of objects (ids, strings, numbers, IRIs, etc.)^{③} the RDF data model is defined as follows:

**Definition 1.** *An* RDF triple *is an element* $(s, p, o) \in \mathsf{Obj} \times \mathsf{Obj} \times \mathsf{Obj}$. *An* RDF graph *is a finite set of RDF triples.*

Upon analyzing this definition, we can immediately notice that the RDF data model lacks the ability to directly refer to the edge (*s*, *p*, *o*) itself. For instance, if we wanted to add the information about when the presidency represented by the above edge starts and when it ends, we would have to resort to some sort of *reification,* which would introduce an artificial object representing the edge that can then be linked to the start and end date information. For example, the (reified) triples representing the duration of this presidency could be represented as shown in Figure 2. The reification is given by the use of the edges labeled as source, label and target.
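The reification step described above can be sketched in a few lines of Python. This is only an illustration: the helper `reify` and the edge id `e1` are hypothetical names, and the dates are borrowed from the RDF∗ example later in this section rather than read off Figure 2.

```python
def reify(triple, edge_id, annotations):
    """Replace one edge by an artificial object linked via source/label/target.

    `edge_id` is the artificial object standing for the edge; `annotations`
    maps edge labels to values that should be attached to that object.
    """
    s, p, o = triple
    triples = {
        (edge_id, "source", s),
        (edge_id, "label", p),
        (edge_id, "target", o),
    }
    # Annotations are ordinary triples whose subject is the artificial object.
    triples |= {(edge_id, key, value) for key, value in annotations.items()}
    return triples


edge = ("Michelle Bachelet", "position held", "President of Chile")
# Illustrative dates (assumption): taken from the RDF-star example in the text.
reified = reify(edge, "e1", {"start date": "2014-03-11", "end date": "2018-03-11"})
```

One edge with two annotations thus becomes five triples, which is the bloat and indirection that the alternatives discussed next try to avoid.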

In order to avoid the need for reification, an extension of the RDF data model called *RDF∗* (or *RDF-star)* was proposed [28]. Intuitively, in RDF∗ an entire triple can appear as a subject or an object in another triple. For example, in Figure 3, we are modeling the fact that Michelle Bachelet was the president of Chile from 2014-03-11 to 2018-03-11.

The node representing the edge is called a *quoted triple* [29]. To distinguish edges that originate in a quoted triple, in Figure 3 we denote them with a dotted line. Formally, the RDF∗ data model can be defined as follows:

**Definition 2** ([28]). *An* RDF∗ triple *is defined recursively as follows:*

- *An RDF triple* $(s, p, o)$ *is an RDF∗ triple; and*
- *If* $s, o$ *are RDF∗ triples or elements of* $\mathsf{Obj}$*, and* $p \in \mathsf{Obj}$*, then* $(s, p, o)$ *is an RDF∗ triple.*

*An* RDF∗ graph *is a finite set of RDF∗ triples.*
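As a rough illustration of the recursion in Definition 2, the following Python sketch checks whether a nested tuple is an RDF∗ triple. It assumes, purely for the example, that elements of Obj are Python strings or numbers and that triples are 3-tuples; the function name is hypothetical.

```python
def is_obj(x):
    """Assumption for this sketch: objects are strings or numbers."""
    return isinstance(x, (str, int, float))


def is_rdf_star_triple(t):
    """Recursive check following Definition 2."""
    if not (isinstance(t, tuple) and len(t) == 3):
        return False
    s, p, o = t
    # The predicate must be an object; subject and object may be
    # objects or (recursively) RDF-star triples.
    return (
        is_obj(p)
        and (is_obj(s) or is_rdf_star_triple(s))
        and (is_obj(o) or is_rdf_star_triple(o))
    )


quoted = ("Michelle Bachelet", "position held", "President of Chile")
graph = {
    (quoted, "start date", "2014-03-11"),
    (quoted, "end date", "2018-03-11"),
}
all_valid = all(is_rdf_star_triple(t) for t in graph)
```

A triple whose predicate is itself a triple is rejected, since only the subject and object positions may be nested.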

Another model extending RDF is that of *RDF datasets* [27], which are typically used to represent and manage multiple named RDF graphs. This model can be defined in two manners. The first, most general, definition permits empty graphs.

**Definition 3.** *An* RDF dataset *is defined as a pair* $D = (G, \{(n_1, G_1), \ldots, (n_k, G_k)\})$ *where:*

- $G, G_1, \ldots, G_k$ *are RDF graphs; and*
- $n_1, \ldots, n_k$ *are objects such that* $n_i \neq n_j$ *for* $1 \leq i < j \leq k$.

*The graph* $G$ *is called the* default graph, *while each pair* $(n_i, G_i)$ *is called a* named graph, *composed of the* name $n_i$ *and its corresponding RDF graph* $G_i$.

For example, letting $G$ denote the RDF graph of Figure 1, and letting $G_1$ denote the RDF graph of Figure 2, then we can capture both graphs separately in an RDF dataset of the form $D = (G, \{(n_1, G_1)\})$, where $n_1$ is a name (e.g., reified) used to reference the graph $G_1$. It is common to simply represent RDF datasets as a set of quads of the form $(s, p, o, g) \in \mathsf{Obj} \times \mathsf{Obj} \times \mathsf{Obj} \times \mathsf{Obj}$ [30], which indicates that the RDF triple $(s, p, o)$ is in the RDF graph with name $g$. In this quad-based view, for example, $D$ would then contain the quad ($e_1$, source, Michelle Bachelet, reified). A special name can be reserved in order to denote the default graph; for example (Michelle Bachelet, position held, President of Chile, default). This quad-based definition cannot directly support naming empty RDF graphs (though it could be extended to incorporate a set of names for empty graphs).
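The relationship between the two views of an RDF dataset can be made concrete with a small Python sketch. The function names and the graph name `empty` are hypothetical; `default` is used as the reserved name for the default graph, as in the example above.

```python
DEFAULT = "default"  # reserved name for the default graph


def dataset_to_quads(default_graph, named_graphs):
    """Flatten D = (G, {(n_1, G_1), ...}) into a set of quads (s, p, o, g)."""
    quads = {(s, p, o, DEFAULT) for (s, p, o) in default_graph}
    for name, graph in named_graphs.items():
        quads |= {(s, p, o, name) for (s, p, o) in graph}
    return quads


def graph_names(quads):
    """Graph names recoverable from the quad-based view."""
    return {g for (_, _, _, g) in quads}


G = {("Michelle Bachelet", "position held", "President of Chile")}
G1 = {
    ("e1", "source", "Michelle Bachelet"),
    ("e1", "label", "position held"),
    ("e1", "target", "President of Chile"),
}
# "empty" is a hypothetical named graph with no triples.
quads = dataset_to_quads(G, {"reified": G1, "empty": set()})
names = graph_names(quads)
```

Here `names` contains `default` and `reified` but not `empty`: a named graph without triples leaves no trace in the quads, which is the limitation noted above.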

*Property graphs.* Finally, one of the more popular graph data models is that of *property graphs* [7]. Property graphs extend the simple edge-labeled directed graph with two additional features: (i) they assign explicit identifiers to nodes and edges, so that one can refer to them; and (ii) they allow for annotating both nodes and edges with a set of property-value pairs. For example, the information from Figure 8 can be equivalently represented by the property graph in Figure 4.

Here the nodes have identifiers ($n_1$, $n_2$) as well as labels (human, public office). Similarly, edges have both identifiers ($e_1$) and labels (position held). A node can have multiple labels, while an edge always has a single label (often referred to as its *type*). The edge $e_1$ has two *properties,* namely start date and end date, each with an associated value. Formally, if Obj is a set of objects, ℒ is a set of labels, $L$ a set of properties, and $P$ a set of values, we define the property graph data model as follows:

**Definition 4.** *A* property graph *is a tuple* $G = (V, E, \mathit{src}, \mathit{tgt}, \mathit{lab}, \mathit{prop})$*, where:*

- $V \subset \mathsf{Obj}$ *is a finite set of node identifiers;*
- $E \subset \mathsf{Obj}$ *is a finite set of edge identifiers disjoint from* $V$*;*
- $\mathit{src}: E \to V$ *assigns a source node to each edge;*
- $\mathit{tgt}: E \to V$ *assigns a target node to each edge;*
- $\mathit{lab}: (V \cup E) \to 2^{\mathcal{L}}$ *is a function assigning a finite set of labels to nodes and edges, with* $|\mathit{lab}(e)| = 1$ *for all* $e \in E$*; and*
- $\mathit{prop}: (V \cup E) \times L \to P$ *is a partial function assigning a value to a certain property of a node or an edge.*

*Moreover, we assume that for each object* $o \in V \cup E$*, there exists a finite number of properties* $p \in L$ *such that* $\mathit{prop}(o, p)$ *is defined.*
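To make the conditions of Definition 4 tangible, here is a minimal Python sketch of a property graph with a validity check. The class and method names are hypothetical, and the property values are illustrative assumptions rather than a transcription of Figure 4.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    V: set                                     # node identifiers
    E: set                                     # edge identifiers
    src: dict                                  # edge -> source node
    tgt: dict                                  # edge -> target node
    lab: dict                                  # object -> set of labels
    prop: dict = field(default_factory=dict)   # (object, property) -> value

    def is_valid(self):
        """Check the structural conditions of Definition 4."""
        objects = self.V | self.E
        return (
            self.V.isdisjoint(self.E)
            and set(self.src) == self.E
            and set(self.tgt) == self.E
            and all(self.src[e] in self.V and self.tgt[e] in self.V for e in self.E)
            # every edge carries exactly one label (its type)
            and all(len(self.lab.get(e, set())) == 1 for e in self.E)
            and all(obj in objects for (obj, _) in self.prop)
        )


pg = PropertyGraph(
    V={"n1", "n2"},
    E={"e1"},
    src={"e1": "n1"},
    tgt={"e1": "n2"},
    lab={"n1": {"human"}, "n2": {"public office"}, "e1": {"position held"}},
    # Illustrative values (assumption), not read from the figure.
    prop={("e1", "start date"): "2014-03-11", ("e1", "end date"): "2018-03-11"},
)
```

Note that property values live in a separate space from `V` and `E`, so a value can never itself be the endpoint of an edge; this point matters for the limitations discussed in the next subsection.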

### 2.2 Limitations of Existing Models

While all of the described data models have great expressive power, they are sometimes cumbersome to use when representing real-world datasets that contain higher-arity relations. To illustrate this, we will use the Wikidata [21, 31, 5] knowledge graph. Consider the two Wikidata statements shown in Figure 5. Both statements claim that Michelle Bachelet was a president of Chile, and both are associated with nested *qualifiers* that provide additional information: in this case a start date, an end date, who replaced her, and whom she was replaced by. There are two statements for two distinct presidencies. Also the ids for objects (for example, Q320 and P39) are shown; any positional element can have an id and be viewed as a node in the knowledge graph.

As aforementioned, representing statements like this in RDF graphs requires reification to decompose *n*-ary relations into binary relations [23]. Figure 6 shows a graph where *e*_{1} and *e*_{2} are nodes representing two distinct *n*-ary relationships (an extended version of Figure 2). For greater readability, we use human-readable nodes and labels, where in practice, the node will rather be given as the identifier , and the edge type “replaces” will rather be given as “P155”.

Since property graphs allow labels and property-value pairs to be associated with both nodes and edges, reification can be avoided in our example. For instance, the statements of Figure 5 can be represented as the property graph in Figure 7. Though more concise than reification, labels, properties and values are considered to be simple strings, which are disjoint with nodes; for example, Ricardo Lagos is neither a node nor a pointer to a node, but a string, which would complicate, for example, querying for the parties of presidents that Michelle Bachelet replaced.

On the other hand, RDF∗ allows an edge to be a node. For example, the first statement of Figure 5 can be represented in RDF∗ as shown in Figure 8. However, we can only represent one of the statements (without reification), as we can only have one distinct node per edge; if we add the qualifiers for both statements, then we would not know which start date pairs with which end date, for example.^{④}
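The pairing problem can be seen in a short Python sketch. The values `start-1`, `end-1`, `start-2` and `end-2` are placeholders (assumptions) standing in for the actual dates of the two presidencies, and the second half uses explicit edge ids only to show the contrast.

```python
statement = ("Michelle Bachelet", "position held", "President of Chile")

# RDF-star: both statements quote the same triple, hence the same node.
rdf_star = {
    (statement, "start date", "start-1"),
    (statement, "end date", "end-1"),
    (statement, "start date", "start-2"),
    (statement, "end date", "end-2"),
}
starts = {o for (s, p, o) in rdf_star if s == statement and p == "start date"}
ends = {o for (s, p, o) in rdf_star if s == statement and p == "end date"}
# Two start dates and two end dates hang off one node: the pairing is lost.

# With one id per edge, each statement keeps its own qualifiers.
edges = {
    "e1": statement,
    "e2": statement,
    "e3": ("e1", "start date", "start-1"),
    "e4": ("e1", "end date", "end-1"),
    "e5": ("e2", "start date", "start-2"),
    "e6": ("e2", "end date", "end-2"),
}


def qualifiers(edges, edge_id):
    """Collect the annotations whose source is the given edge id."""
    return {t: o for (s, t, o) in edges.values() if s == edge_id}


first = qualifiers(edges, "e1")
second = qualifiers(edges, "e2")
```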

Regarding RDF datasets, we could model both statements by creating two named graphs, each with a copy of the statement that Michelle Bachelet has been President of Chile, thereafter defining the start date, end date, replaced by and replaces annotations in another graph using the graph name. The resulting quads could thus be as follows if we define the latter information in the default graph (for example):

This is quite a concise way to model the aforementioned Wikidata statements, wherein we effectively use graph names to assign each edge a unique id that serves as a graph node elsewhere. Indeed, the data model we propose follows a similar idea. However, RDF datasets were defined in the context of managing several (named) graphs, where using them to define edge ids gives rise to several complications; for example, SPARQL does not support evaluating path queries that span different named graphs.

## 3. DATA MODEL UNDERLYING MILLENNIUMDB

In this section, we present the graph data model upon which MillenniumDB is based, called domain graphs, and discuss how it generalizes existing graph data models such as RDF and property graphs. We also show its utility in concisely modeling real-world knowledge graphs that contain higher-arity relations, such as Wikidata [21].

### 3.1 Domain Graphs

The structure of knowledge graphs is captured in MillenniumDB via *domain graphs,* which follow the natural idea of assigning ids to edges in order to capture higher-arity relations within graphs [23, 32, 33, 24]. Formally, assume a universe Obj of objects (ids, strings, numbers, IRIs, etc.). We define domain graphs as follows:

**Definition 5.** *A* domain graph $G = (O, \gamma)$ *consists of a finite set of objects* $O \subseteq \mathsf{Obj}$ *and a partial mapping* $\gamma: O \to O \times O \times O$.

Intuitively, $O$ is the set of database objects and $\gamma$ models edges between objects. If $\gamma(e) = (n_1, t, n_2)$, this states that the edge $(n_1, t, n_2)$ has id $e$, type $t$, and links the source node $n_1$ to the target node $n_2$.^{⑤} We can analogously define our model as a relation:

DomainGraph(source, type, target, eid),

where eid (edge id) is a primary key of the relation.

The domain graph model of MillenniumDB already subsumes the RDF graph model [27]. Recall that an RDF graph is a set of triples of the form (*a, b, c*). To show how RDF is modeled in domain graphs, consider again the RDF triple from Figure 1. We can encode this triple in a domain graph by storing the tuple (Michelle Bachelet, position held, President of Chile, e) in the DomainGraph relation, where e denotes a unique (potentially auto-generated) edge id, or equivalently stating that:

$$\gamma(e) = (\text{Michelle Bachelet}, \text{position held}, \text{President of Chile}).$$

The id of the edge itself is not needed in the RDF data model, but it can be used for modeling RDF-star (RDF∗) graphs [28, 29]. For example, to represent the RDF∗ graph from Figure 8, we can extend the function γ with two additional statements:

Here we use two new edges, e1 and e2, which have the edge e as their starting node.
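A minimal in-memory sketch of Definition 5 in Python may help to fix ideas. The class below is hypothetical and is not MillenniumDB's storage layer; the dates attached to `e1` and `e2` are illustrative assumptions borrowed from the RDF∗ example of Section 2.1.

```python
class DomainGraph:
    """Domain graph G = (O, gamma): gamma maps an edge id to (source, type, target)."""

    def __init__(self):
        self.objects = set()   # O
        self.gamma = {}        # partial mapping O -> O x O x O

    def add_edge(self, eid, source, etype, target):
        if eid in self.gamma:
            raise ValueError(f"edge id {eid!r} is already defined")
        # Every component of an edge, including its id, is an object.
        self.objects |= {eid, source, etype, target}
        self.gamma[eid] = (source, etype, target)

    def relation(self):
        """The same graph as a set of (source, type, target, eid) tuples."""
        return {(s, t, o, e) for e, (s, t, o) in self.gamma.items()}


dg = DomainGraph()
dg.add_edge("e", "Michelle Bachelet", "position held", "President of Chile")
# Edges whose source is itself an edge id (illustrative dates, assumption).
dg.add_edge("e1", "e", "start date", "2014-03-11")
dg.add_edge("e2", "e", "end date", "2018-03-11")
```

Because the edge id `e` is an ordinary object, it can serve as the source of `e1` and `e2` without any reserved vocabulary.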

For stricter backwards compatibility with legacy property graphs (where desired), MillenniumDB implements a simple extension of the domain graph model, called *property domain graphs,* which allows for *external annotation,* i.e., adding labels and property-value pairs to nodes and edges without creating new nodes and edges. Formally, if ℒ is a set of labels, $L$ a set of properties, and $P$ a set of values, we define a property domain graph as follows:

**Definition 6.** *A* property domain graph *is defined as a tuple* $G = (O, \gamma, \mathit{lab}, \mathit{prop})$*, where:*

- $(O, \gamma)$ *is a domain graph;*
- $\mathit{lab}: O \to 2^{\mathcal{L}}$ *is a function assigning a finite set of labels to an object; and*
- $\mathit{prop}: O \times L \to P$ *is a partial function assigning a value to a certain property of an object.*

*Moreover, we assume that for each object* $o \in O$*, there exists a finite number of properties* $p \in L$ *such that* $\mathit{prop}(o, p)$ *is defined.*

Domain graphs (without properties) can directly capture property graphs. For example, the property-value pair (gender, “female”) on node $n_1$ can be represented by an edge $\gamma(e_3) = (n_1, \text{gender}, \text{female})$, the property-value pair (order, “2”) on an edge $e_2$ becomes $\gamma(e_4) = (e_2, \text{order}, 2)$, the label on $n_1$ becomes $\gamma(e_5) = (n_1, \text{label}, \text{human})$, the label on $e_1$ becomes the type of the edge $\gamma(e_1) = (n_1, \text{father}, n_2)$, etc.^{⑥} However, this can generate “incompatibilities” between the legacy property graph and the resulting domain graph; for example, strings like “male”, labels like human, etc., now become nodes in the graph, generating new paths through them that may affect query results. Property domain graphs thus offer an extra layer of flexibility, and interoperability with legacy property graphs, where needed for a given use-case.
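The translation just described, and the side effect it has, can be sketched in Python as follows. The function name, the input layout and the fresh edge ids `x1`, `x2`, … are hypothetical choices for this example, which assumes those ids do not clash with existing objects.

```python
from itertools import count


def property_graph_to_domain_graph(edges, labels, props):
    """Encode a property graph as a plain domain graph (no external annotation).

    edges:  {edge id: (source, edge label, target)}
    labels: {node: set of labels}
    props:  {(object, property): value}
    """
    gamma = dict(edges)                      # the edge label becomes the edge type
    fresh = (f"x{i}" for i in count(1))      # assumed not to clash with existing ids
    for node in sorted(labels):
        for label in sorted(labels[node]):
            gamma[next(fresh)] = (node, "label", label)
    for (obj, key), value in sorted(props.items()):
        gamma[next(fresh)] = (obj, key, value)
    objects = set(gamma) | {x for triple in gamma.values() for x in triple}
    return objects, gamma


objects, gamma = property_graph_to_domain_graph(
    edges={"e1": ("n1", "father", "n2")},
    labels={"n1": {"human"}, "n2": {"human"}},
    props={("n1", "gender"): "female", ("n2", "gender"): "male"},
)
# Nodes reachable from n1 in one hop now include former strings and labels.
neighbours_of_n1 = {o for (s, _, o) in gamma.values() if s == "n1"}
```

After the translation, `human` and `female` are objects of the graph, and `n1` and `n2` are additionally connected through the shared `human` node. This is the kind of new path that external annotation in property domain graphs avoids.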

To illustrate how property domain graphs work, consider the property graph (as introduced in Definition 4) from Figure 9. To model this information via property domain graphs, we use the domain graph part to capture the graph structure of our model, while property domain graphs also permit annotating that graph structure with labels and property-value pairs. The property graph in Figure 9 can be represented with the following property domain graph *G* = (*O*, γ, lab, prop), where the graph structure is as follows:

and the annotations of the graph structure are as follows:

lab(n_{1}) = human | prop(n_{1}, last name) = “Bachelet” |
lab(n_{2}) = human | prop(n_{2}, gender) = “male” |
prop(e_{2}, order) = “2” | prop(n_{2}, children) = “2” |
prop(n_{1}, gender) = “female” | prop(n_{2}, first name) = “Alberto” |
prop(n_{1}, children) = “3” | prop(n_{2}, last name) = “Bachelet” |
prop(n_{1}, first name) = “Michelle” | prop(n_{2}, death) = “12 March 1974” |


The relational representation of a property domain graph then adds two new relations alongside DOMAINGRAPH:

Labels(object, label),

Properties(object, property, value),

where (object, property) is a primary key of the second relation, while the first relation allows multiple labels per object.
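As a rough sketch, the three relations could be declared as follows using Python's built-in `sqlite3` module. This is only meant to illustrate the keys; it says nothing about how MillenniumDB actually stores or indexes data. The column names of `DomainGraph`, the composite key on `Labels` and the label `politician` are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DomainGraph (
    source TEXT NOT NULL,
    type   TEXT NOT NULL,
    target TEXT NOT NULL,
    eid    TEXT PRIMARY KEY              -- the edge id identifies the tuple
);
CREATE TABLE Labels (
    object TEXT NOT NULL,
    label  TEXT NOT NULL,
    PRIMARY KEY (object, label)          -- assumption: set semantics, many labels per object
);
CREATE TABLE Properties (
    object   TEXT NOT NULL,
    property TEXT NOT NULL,
    value    TEXT,
    PRIMARY KEY (object, property)       -- at most one value per (object, property)
);
""")

conn.execute("INSERT INTO DomainGraph VALUES ('n1', 'father', 'n2', 'e1')")
conn.executemany(
    "INSERT INTO Labels VALUES (?, ?)",
    [("n1", "human"), ("n1", "politician"), ("n2", "human")],  # 'politician' is hypothetical
)
conn.execute("INSERT INTO Properties VALUES ('n1', 'gender', 'female')")

# A second value for the same (object, property) pair violates the key.
try:
    conn.execute("INSERT INTO Properties VALUES ('n1', 'gender', 'other')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```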

### 3.2 Domain Graphs Compared with Other Graph Data Models

Why did we choose (property) domain graphs as the model of MillenniumDB? As discussed in the previous section, it can be used to model both directed labeled graphs (like RDF) as well as property graphs. It also has a natural relational expression, which facilitates its implementation in a query engine. But it is also heavily inspired by the needs of real-world knowledge graphs like Wikidata [21, 31, 5]. To illustrate its versatility, consider again the Wikidata statements shown in Figure 5. Note that edges that originate in another edge are drawn with a dotted line.

As discussed in Section 2, neither RDF nor RDF∗ can represent these statements without resorting to reification, while property graphs cannot take nodes as values for properties. The domain graph model allows us to capture higher-arity relations more directly. In Figure 10 we present one possible representation of the statements from Figure 5. We only show edge ids as needed (all edges have ids). We do not use the “property part” of our data model for external annotation, considering that the elements of Wikidata statements shown can form nodes in the graph itself.

Domain graphs are similar to named graphs in RDF datasets. Both domain graphs and RDF datasets can be represented as quads. However, the edge ids of domain graphs identify each quad, which, as we will discuss in Section 5, necessitates fewer index permutations. RDF datasets were proposed to represent multiple RDF graphs for publishing and querying. SPARQL thus does not support querying paths that span different named graphs; to support path queries over singleton named graphs, all edges would need to be duplicated (virtually or physically) into a single graph [23]. Named graphs could be supported in domain graphs using a reserved term graph, and edges of the form $\gamma(e_3) = (e_1, \text{graph}, g_1)$, $\gamma(e_4) = (e_2, \text{graph}, g_1)$; optionally, *named domain graphs* could be considered in the future to support multiple domain graphs.

The idea of assigning ids to edges/triples for similar purposes as described here is a natural one, and not new to this work. Hernandez et al. [23] explored using singleton named graphs in order to represent Wikidata qualifiers, placing one triple in each named graph, such that the name acts as an id for the triple. In parallel with our work, recently a data model analogous to domain graphs has been independently proposed for use in Amazon Neptune, which the authors call 1G [33]. Their proposal does not discuss a formal definition for the model, nor a query language, storage and indexing, implementation, etc., but the reasoning and justification that they put forward for the model is similar to ours. Similar such models have been generalized as *multilayer graphs* [24], where the appearance of edge ids within the graph induces different layers of reference. Our work proposes a novel query language, storage and indexing schemes, query planner—and ultimately a fully-fledged graph database engine—built specifically for this model. Furthermore, with property domain graphs, we support annotation external to the graph, which we believe to be a useful extension that enables better compatibility with property graphs.

Table 1 summarizes the features that are directly supported by the respective graph models themselves without requiring *reserved terms,* which would include, for example, source, label and target in Figure 6 (all features except *External annotation* can be supported in all models with reserved vocabulary). Reserved terms can add indirection to modeling (e.g., reification [23]), and can clutter the data, necessitating more tuples or higher-arity tuples to store, leading to more joins and/or index permutations. The features are then defined as follows, considering directed (labeled) edges:

Feature | RDF | RDF∗ | RD | PG | DG | PDG |
---|---|---|---|---|---|---|
Edge type/label | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Node label | X | X | X | ✓ | X | ✓ |
Edge annotation | X | ✓ | ✓ | ✓ | ✓ | ✓ |
Node annotation | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
External annotation | X | X | X | ✓ | X | ✓ |
Edge as node | X | ✓ | ✓ | X | ✓ | ✓ |
Edge as nodes | X | X | ✓ | X | ✓ | ✓ |
Nested edge nodes | X | ✓ | ✓ | X | ✓ | ✓ |
Graph as node | X | X | ✓ | X | X | X |


- *Edge type/label:* assign a type or label to an edge.
- *Node label:* assign labels to nodes.
- *Edge annotation:* assign property-value pairs to an edge.
- *Node annotation:* assign property-value pairs to a node.
- *External annotation:* nodes/edges can be annotated without adding new nodes or edges.
- *Edge as node:* an edge can be referenced as a node (this allows edges to be connected to nodes of the graph).
- *Edge as nodes:* a single unique edge can be referenced as multiple nodes.
- *Nested edge nodes:* an edge involving an edge node can itself be referenced as a node, and so on, recursively.
- *Graph as node:* a graph can be referenced as a node.

Some unsupported features in Table 1 are more benign than others; for example, *Node label* requires a reserved term (e.g., rdf:type), but no extra tuples; on the other hand, *Edge as node* requires reification, using at least one extra tuple, and also a reserved term.

Wikidata requires *Edge as nodes* as per Figure 5, where values like Ricardo Lagos are themselves nodes. Only RDF datasets, domain graphs and property domain graphs can model such examples without reserved terms; however, the use of RDF datasets requires co-opting graph names, which are typically used to manage multiple graphs, to rather serve as edge ids. Comparing RDF datasets and domain graphs, the latter sacrifices the *“Graph as node”* feature without reserved vocabulary to reduce indexing permutations (discussed in Section 5). Property domain graphs further support external annotation, and better compatibility with legacy property graphs.

## 4. QUERY LANGUAGE

Per our goal of supporting multiple graph models, MillenniumDB aims to support a number of graph query languages. However, no existing query language would take full advantage of the property domain graph model defined in the previous section. We have thus implemented a base query language, called DGQL, which closely resembles Cypher [7], but is designed for the property domain graph model, and adds features of other query languages, such as SPARQL, that are commonly used for querying knowledge graphs [34, 5]. Herein we provide a guided tour of the syntax of DGQL. A full formal specification of the language can be found in the appendix of this paper.

To introduce the features of the query language, in Figure 11 we present (a snippet of) a bibliographical knowledge graph representing data about publications, authors, institutions, etc. The knowledge graph is represented as a property domain graph, where, for authorship relations, we use properties on the edge to indicate the author order, but directly link the edges (via their ids) to the organization node with which the author was affiliated for that particular paper (something not directly possible in property graphs). We further use abstract node and edge ids ($n_1, \ldots, n_{15}$, $e_1, \ldots, e_{21}$) for brevity, though these may be instantiated with application ids; for example, in Wikidata, the node $n_{15}$ denoting the U.S. might rather have the id Q30.

We will use this knowledge graph as a running example in order to illustrate the MillenniumDB query language in the context of a bibliographical use-case, where we wish to analyze citations, find possible collaborators, etc.

### 4.1 Domain Graph Queries

A DGQL query takes the following high-level form:

When evaluated over a property domain graph, such a query will return a multiset of mappings binding Variables to database objects (or values) that satisfy the Pattern specified in the MATCH clause and the Filters specified in the WHERE clause.

*Querying objects.* The most basic query will return all the objects (or more precisely, their ids) in our property domain graph. In MillenniumDB we can achieve this via the following query:

Over the knowledge graph of Figure 11, this would return a table of all node and edge ids: $n_1, \ldots, n_{15}$, $e_1, \ldots, e_{21}$. Of course, one usually wants to select objects with a certain label, or a certain value in a specific property, as illustrated in the following example.

EXAMPLE 4.1. *The following DGQL query returns articles published in 1967 from Figure 11:*

This returns the ids of nodes with label article and value 1967 for the property year, along with their value for the property name, i.e., we return two results as follows:

| ?x | ?x.name |
|---|---|
| *n*_{3} | “A Turing M. Sim.” |
| *n*_{4} | “Pr. Lang. for Aut.” |

*If, for example, n_{3} did not have a name, we would still return n_{3} as a result, leaving the corresponding value for ?x.name blank.*

If we wish to specify a range, we can rather use the WHERE clause, which allows us to specify conditions on the results returned.

EXAMPLE 4.2. *If we want to find articles published before 1990, we can use the following query:*

*This returns the same solutions over Figure 11 as in Example 4.1. If we were to replace “<” with “<=”, we would receive a third result for n_{5} and “Add. Machines”.*
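To make the selection semantics of Examples 4.1 and 4.2 concrete, the following Python sketch mimics them over an in-memory excerpt of the running example. It is an illustration only and not MillenniumDB code: the dictionaries `LABELS` and `PROPS` and the function `select` are hypothetical names, and the data are limited to the values quoted in the examples above.

```python
# Hypothetical in-memory excerpt of the running example (values quoted in Examples 4.1 and 4.2).
LABELS = {"n3": {"article"}, "n4": {"article"}, "n5": {"article"}}
PROPS = {
    "n3": {"name": "A Turing M. Sim.", "year": 1967},
    "n4": {"name": "Pr. Lang. for Aut.", "year": 1967},
    "n5": {"name": "Add. Machines", "year": 1990},
}


def select(label, predicate, returned_property):
    """Return (id, property value) pairs for objects carrying `label` whose properties satisfy `predicate`."""
    rows = []
    for obj in sorted(LABELS):
        props = PROPS.get(obj, {})
        if label in LABELS[obj] and predicate(props):
            # A missing property yields None, mirroring a blank cell in the result table.
            rows.append((obj, props.get(returned_property)))
    return rows


def year_is(test):
    """Build a predicate that applies `test` to the year, rejecting objects without a year."""
    return lambda props: props.get("year") is not None and test(props["year"])


equal_1967 = select("article", year_is(lambda y: y == 1967), "name")
before_1990 = select("article", year_is(lambda y: y < 1990), "name")
up_to_1990 = select("article", year_is(lambda y: y <= 1990), "name")
```

Under these assumptions, `equal_1967` and `before_1990` contain the same two rows, while `up_to_1990` additionally contains the row for n5.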

*Querying edges.* In order to query over edges, we can write the following query, which returns γ, i.e., the relation DOMAINGRAPH:

The RETURN ∗ operation projects all variables specified in the MATCH pattern, while the construct (?x)-[?e:?t]->(?y) specifies that we want to connect the object in ?x with an object in ?y, via an edge with type ?t and id ?e. This is akin to a query DOMAINGRAPH(?x,?t,?y,?e) over the domain graph relation. Variable or constant edge types (e.g. ?t above) are prefixed by a colon.

EXAMPLE 4.3. *Over the graph of Figure 11, the aforementioned query would return results of the following form, with 21 results in total (one for each edge):*

| ?x | ?e | ?t | ?y |
|---|---|---|---|
| *n*_{3} | *e*_{1} | venue | *n*_{1} |
| *n*_{4} | *e*_{2} | venue | *n*_{1} |
| *n*_{5} | *e*_{3} | venue | *n*_{2} |
| … | … | … | … |
| *e*_{6} | *e*_{11} | org | *n*_{6} |
| … | … | … | … |
| *n*_{14} | *e*_{21} | loc | *n*_{15} |

This is akin to returning the DOMAINGRAPH relation.
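The correspondence between an edge pattern and the DOMAINGRAPH relation can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the list `DOMAIN_GRAPH` contains only the rows visible in Example 4.3 (reading the target of the org edge as n6), and `match_edge` is a hypothetical helper, not part of MillenniumDB.

```python
# Excerpt of DOMAINGRAPH as (source, type, target, eid) tuples; only rows shown in Example 4.3.
DOMAIN_GRAPH = [
    ("n3", "venue", "n1", "e1"),
    ("n4", "venue", "n1", "e2"),
    ("n5", "venue", "n2", "e3"),
    ("e6", "org", "n6", "e11"),    # the source of this edge is itself an edge id
    ("n14", "loc", "n15", "e21"),
]


def match_edge(pattern, graph=DOMAIN_GRAPH):
    """Match one (source, type, target, eid) pattern; terms starting with '?' are variables."""
    results = []
    for row in graph:
        binding = {}
        matched = True
        for term, value in zip(pattern, row):
            if term.startswith("?"):
                # A repeated variable must bind to the same value everywhere.
                if binding.setdefault(term, value) != value:
                    matched = False
                    break
            elif term != value:
                matched = False
                break
        if matched:
            results.append(binding)
    return results


all_edges = match_edge(("?x", "?t", "?y", "?e"))       # every row, one mapping per edge
venue_edges = match_edge(("?x", "venue", "?y", "?e"))  # constant edge type
```

With all four positions as variables, each row of the relation yields one mapping, which is why the query above returns the relation itself.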

We can also restrict which edges are matched, as shown in the following example.

EXAMPLE 4.4. *The following query in DGQL will return the ids and names of articles that cite an article of the same year:*

*Here we choose to omit the edge id variable as we do not need it (e.g., in the WHERE or RETURN clause). Over the knowledge graph of the running example, this returns a single result:*

In the next example, we illustrate two features together: the ability to return and specify conditions on edge properties, and the ability to query known objects.

EXAMPLE 4.5. *We look for the names of papers where Donald Knuth (n_{11}) is second or third author, returning his position in the author list:*

This query returns a single result:

As per the previous example, the WHERE clause may use Boolean combinations. We recall that the running example uses abstract node ids for brevity. In practice, the node id *n*_{11} could rather be an id such as Q17457, which identifies Donald Knuth on Wikidata.

*Path queries.* A key feature of graph databases is their ability to explore paths of arbitrary length. DGQL supports two-way regular path queries (2RPQs), which specify regular expressions over edge types, including concatenation (/), disjunction (|), inverse (^), optional (?), Kleene star (∗) and Kleene plus (+). We use =[]=> (rather than -[]->) to signal a path query in DGQL.

EXAMPLE 4.6. *If we wish to find all of the citations of the article named “Add. Machines”, and their respective citations, and so on transitively, we can use the regular expression :cites+ in the following way, further returning the name and year of the articles where available:*

*This returns:*

| ?y | ?y.name | ?y.year |
|---|---|---|
| *n*_{3} | “A Turing M. Sim.” | 1967 |
| *n*_{4} | “Pr. Lang. for Aut.” | 1967 |

DGQL can also return a single shortest path witnessing the query result by binding such a path to a variable.

EXAMPLE 4.7. *The following query extends that of Example 4.6 by binding paths witnessing each result to a variable ?p:*

*This returns a string representation of a shortest path for each result, as follows:*

We can also combine different features to capture more complex paths.

EXAMPLE 4.8. *The following query looks for publications of staff and students of U.S. institutions, further including the direct citations of these publications:*

*This query returns:*

*Notice that a shortest path to each node is returned, so an additional path to e.g. n_{4} using cites is not returned.*

The final example for paths illustrates operators nested inside a Kleene star.

EXAMPLE 4.9. *In order to find potential collaborators for a researcher, the following query finds shortest paths from that researcher (in this case Donald Knuth: n_{11}) to other authors via (transitive) citation or coauthorship relations:*

*This query returns:*

*A path that cycles back to Donald Knuth (n_{11}) is included. If we wished to filter such results, we could use WHERE to require the inequality ?y != n_{11}. Also, there are two possible shortest paths for the n_{11} result (via n_{4} or via n_{5}), where the first such path to be found is returned.*

If we wished to return *all shortest paths,* we could use the DGQL keyword ALL before the path variable. For instance, in the previous example, we can write [ALL ?p …], which returns a second result for *n*_{11} indicating the other shortest path.

Unlike Cypher, we can return paths matching 2RPQs, not just Kleene star. Unlike SPARQL, we can return paths, not just pairs of nodes. No manipulation of path variables, apart from outputting the result, is currently supported in MillenniumDB, but a full path algebra will be supported in future versions.

*Basic graph patterns.* Basic graph patterns [2] lie at the core of many graph query languages, including DGQL. Such patterns follow the same structure as the data model, but allow variables in any position. They can be seen as expressing natural (multi)joins over sets of atomic edge patterns. In DGQL, they are given in the MATCH clause. Basic graph patterns are evaluated under *homomorphism-based semantics* [2], which allows multiple variables in a result to map to the same element of the data.

EXAMPLE 4.10. *Illustrating basic graph patterns, the following query finds pairs of co-authors:*

*If we evaluate this query over the running example, we get the following results:*

| ?x | ?y |
|---|---|
| *n*_{9} | *n*_{9} |
| *n*_{10} | *n*_{10} |
| *n*_{10} | *n*_{11} |
| *n*_{11} | *n*_{10} |
| *n*_{11} | *n*_{11} |
| *n*_{11} | *n*_{12} |
| *n*_{12} | *n*_{11} |
| *n*_{12} | *n*_{12} |

*Given the homomorphism-based semantics, results are returned that map both variables to the same author. If we wished to filter such results, we could stipulate the desired inequalities with WHERE ?x != ?y, which would filter the first, second, fifth and eighth results.*
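The effect of homomorphism-based semantics can be reproduced with a small self-join in Python. The sketch below is illustrative only: the `AUTHORSHIP` pairs are hypothetical, chosen to be consistent with the results of Example 4.10, and the results are collected in a set, so a pair that arises from two different articles appears once.

```python
# Hypothetical (author, article) pairs, chosen to be consistent with the results of Example 4.10.
AUTHORSHIP = [("n9", "n3"), ("n10", "n4"), ("n11", "n4"), ("n11", "n5"), ("n12", "n5")]


def coauthor_pairs(authorship, distinct_only=False):
    """Self-join authorship on the article; ?x and ?y may bind to the same author."""
    pairs = set()
    for x, article_x in authorship:
        for y, article_y in authorship:
            if article_x != article_y:
                continue
            if distinct_only and x == y:
                continue  # corresponds to WHERE ?x != ?y
            pairs.add((x, y))
    # Sort by the numeric part of the ids so that n9 precedes n10.
    return sorted(pairs, key=lambda p: (int(p[0][1:]), int(p[1][1:])))


homomorphic = coauthor_pairs(AUTHORSHIP)
filtered = coauthor_pairs(AUTHORSHIP, distinct_only=True)
```

Here `homomorphic` contains the eight pairs of the table, including the four pairs that map both variables to the same author, while `filtered` keeps only the four pairs of distinct authors.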

The previous example could equivalently be expressed as a path of the form ?x=[:author/^:author]=>?y. However, with basic graph patterns, we can also capture branches and cycles, as illustrated in the following example.

EXAMPLE 4.11. *If we wanted to detect self-citations among journals in the same year, we could use the following DGQL query:*

*This would return:*

| ?x | ?x.name | ?y | ?y.name | ?z | ?z.name |
|---|---|---|---|---|---|
| *n*_{4} | “Pr. Lang. for Aut.” | *n*_{3} | “A Turing M. Sim.” | *n*_{1} | “J. ACM” |

The DGQL query language allows us to take full advantage of domain graphs by allowing joins between edges, types, etc., as illustrated by the following example.

EXAMPLE 4.12. *The following query looks for articles with an affiliation that is current, i.e., where an author is still staff at the indicated organization:*

*The variable ?e invokes a join between an edge and a node, returning:*

*Navigational graph patterns.* If we further allow path queries within basic graph patterns, we arrive at *navigational graph patterns* [2].

EXAMPLE 4.13. *Suppose we wish to find, for example, instances of self-citation of staff at U.S. institutions; we could write this as follows:*

*This query would return:*

| ?w | ?w.name | ?y | ?y.name | ?z | ?z.name |
|---|---|---|---|---|---|
| *n*_{11} | “D. Knuth” | *n*_{5} | “Add. Machines” | *n*_{4} | “Pr. Lang. for Aut.” |

*Optional graph patterns.* Since the graph patterns seen previously extract tables from graphs, a way to enrich a graph query language is to support relational operators over these tables [2]; this gives rise to the notion of *relational graph patterns.* Thus far we have seen the ability to *project* results with RETURN, and apply *selections* over results with WHERE. We have also spoken about how basic and navigational graph patterns can be interpreted as natural joins in the relational algebra. DGQL also supports optional graph patterns, which behave akin to left outer joins in the relational algebra, i.e., they allow for extending solutions with data that may or may not be available; in case the data of the optional pattern are not available, the solution is still returned and the optional data are left blank.

EXAMPLE 4.14. *Assume we want to find the authors who have published articles in the Journal of the ACM, their affiliation in those articles, and, if available, the organization at which they are currently staff. The following DGQL query achieves this:*

*This query would return:*

| ?v | ?v.name | ?y | ?y.name | ?z | ?z.name |
|---|---|---|---|---|---|
| *n*_{9} | “M. Curtis” | *n*_{6} | “Wesleyan U.” | | |
| *n*_{10} | “R. Bigelow” | *n*_{7} | “Cal. Tech.” | | |
| *n*_{11} | “D. Knuth” | *n*_{7} | “Cal. Tech.” | *n*_{8} | “Stanford U.” |

*Results for n_{9} and n_{10} are still returned though they are not currently staff at any organization; the corresponding variables are left blank.*

Nested optional patterns are also supported. However, optional patterns must form *well-designed patterns* [35].
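The left-outer-join behaviour of optional patterns can be illustrated in Python over the ids reported in Example 4.14. The sketch assumes two hypothetical intermediate tables, `AFFILIATION` (author, affiliation in the article) and `CURRENT_STAFF` (author, current organization), whose rows are read off the result table; it does not reflect how MillenniumDB evaluates the query internally.

```python
# Rows read off the result table of Example 4.14.
AFFILIATION = [("n9", "n6"), ("n10", "n7"), ("n11", "n7")]   # mandatory part of the pattern
CURRENT_STAFF = [("n11", "n8")]                              # optional part of the pattern


def left_outer_join(left, right):
    """Extend each left row with matching right values, or with None when nothing matches."""
    out = []
    for key, left_value in left:
        matches = [value for k, value in right if k == key]
        if matches:
            out.extend((key, left_value, m) for m in matches)
        else:
            out.append((key, left_value, None))   # the solution is kept, optional part left blank
    return out


rows = left_outer_join(AFFILIATION, CURRENT_STAFF)
```

The rows for n9 and n10 survive with `None` in the last position, which corresponds to the blank cells in the table.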

*Limits and ordering.* Some additional operators that MillenniumDB supports are LIMIT and ORDER BY. These allow us to limit the number of output mappings, and sort the obtained results, as illustrated by the following example.

EXAMPLE 4.15. *In order to find the most recent paper by Donald Knuth, we can use the following DGQL query:*

*The result returned is as follows:*

Ordering is always applied before limiting results.
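The reason why the order of the two operators matters can be seen in a short Python sketch. The rows below are hypothetical and unrelated to Figure 11; the functions `order_then_limit` and `limit_then_order` are illustrative names only.

```python
# Hypothetical (id, year) rows in storage order; not taken from the running example.
ROWS = [("p1", 1968), ("p2", 1997), ("p3", 1973)]


def order_then_limit(rows, limit):
    """Sort by year (most recent first), then truncate: the behaviour described in the text."""
    return sorted(rows, key=lambda r: r[1], reverse=True)[:limit]


def limit_then_order(rows, limit):
    """Truncate first, then sort: shown only for contrast."""
    return sorted(rows[:limit], key=lambda r: r[1], reverse=True)


most_recent = order_then_limit(ROWS, 1)
arbitrary = limit_then_order(ROWS, 1)
```

Ordering first yields the row with the largest year, whereas limiting first would simply return whichever row happened to come first.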

### 4.2 Formal Definitions for DGQL

For readers interested in a formal specification, we provide the full definition of DGQL, together with the associated semantics, in the appendix to this paper. Specifically, in Appendix A we provide a grammar for DGQL queries, and in Appendix B we define (an equivalent) abstract syntax of DGQL and formal semantics of the language.

### 4.3 Comparing Graph Query Languages

A variety of query languages for graphs have been proposed in recent years [1, 2]. This section compares DGQL with six prominent query languages: Cypher [7] (Neo4j), SPARQL [30] (the standard query language for RDF Triple Stores), G-CORE [36] (LDBC), GSQL [11] (TigerGraph), Gremlin [12] (supported by several systems like Amazon Neptune and JanusGraph) and nGQL [37] (NebulaGraph). For each query language, we evaluate its support (total or partial) for six query features, namely: basic graph patterns, relational graph patterns, querying edges, regular path queries, navigational graph patterns, and full path recovery. Our comparison is shown in Table 2.

| Query language | BGP | RGP | QE | RPQ | NGP | FPR |
|---|---|---|---|---|---|---|
| DGQL | ✓ | ~ | ✓ | ✓ | ✓ | ~ |
| Cypher | ✓ | ✓ | ~ | ~ | ~ | ~ |
| SPARQL | ✓ | ✓ | ~ | ✓ | ✓ | X |
| G-CORE | ✓ | ✓ | ~ | ✓ | ✓ | ✓ |
| GSQL | ✓ | ~ | ~ | ✓ | ~ | X |
| Gremlin | ✓ | ~ | ~ | ~ | ~ | ~ |
| nGQL | ✓ | ✓ | ~ | ~ | ~ | ~ |

Every query language considered in Table 2 supports the notion of a basic graph pattern (BGP), which, in its most general form, is a graph pattern structured like the data, but allowing variables to replace constants. In most cases, the result of a basic graph pattern is a relation (or table) consisting of results, and in some cases it is possible to construct/return a graph (like in G-CORE and SPARQL).

Considering that a graph pattern extracts a table from a graph (as seen in the examples of Section 4.1), *relational graph patterns (RGPs)* allow the use of relational-based operators to combine the results of one or more graph patterns into a single relation. Full support of this feature in Table 2 indicates that a language provides join, optional, union and negation of graph patterns. Partial support indicates that a language supports some of these operators, usually join and optional graph patterns, as is the case for DGQL (we plan to extend this in future to support more relational operators).

*Querying edges (QE)* is a particular feature of DGQL, allowing for querying relationships involving edges. Notably, DGQL allows an id to be extracted as an edge in one part of the query, and then used as a node in another part. Other query languages provide partial support for querying edges, as they are restricted to query the labels and properties of the edges, require reserved vocabulary (reification), or have other restrictions (e.g., using named graphs in SPARQL over which paths cannot be resolved).

The *regular path queries (RPQs)* feature refers to matching paths based on (2-way) regular expressions, with concatenation, disjunction, inverse, optional and Kleene star. Partial support indicates that a language offers a restricted group of such operators, such as in the case of Cypher, which supports only Kleene star on top of a single edge type, and not over a subexpression, thus supporting an expression such as cites+, but not a more complex expression, such as (author/^author)+.

We use the term *navigational graph patterns (NGPs)* to represent the combination of basic graph patterns and regular path queries. These queries are akin to conjunctive (2-way) regular path queries. NGPs are supported by DGQL, SPARQL and G-CORE.

Finally, a query language with *full path recovery* allows us not only to search for paths, but also to return such paths as objects that can be manipulated (with the nodes and edges in a path). This is a particular feature of G-CORE as it supports path construction operations, and the data model permits storing paths. In Cypher, the resulting paths can be assigned to a variable, so the elements of each path can be accessed by using ad-hoc functions, although with reduced facilities. SPARQL does not support this feature as the output of a path expression is only the start and end nodes of each path. In GSQL and Gremlin, the result of a path query is a set of objects, so the resulting paths must be processed by using a programming language. Currently, DGQL partially supports this feature by returning a path as a string; however, MillenniumDB has been designed to support path manipulation in the future.

Table 2 focuses on core features for querying graphs [2], and thus omits features (e.g., borrowed from SQL) that are supported by some of the languages, and that are potentially very useful in practice, such as aggregations, solution modifiers, federation, etc. Such features can be layered atop the features mentioned.

## 5. SYSTEM ARCHITECTURE

In this section, we describe the internals of the MillenniumDB engine, which have been designed to efficiently support the domain graph model. The overall architecture of the system is presented in Figure 12, and will be explained in the following.

MillenniumDB is founded on tried and tested relational techniques: it stores the (property) domain graph model as several relations indexed in B+ trees, loading parts into main memory as needed using a fixed-size buffer. It also uses algorithmic techniques recently suggested in the theoretical literature for evaluating queries [38, 39] (techniques not typically implemented in graph database systems) for supporting the domain graph model in practice. Specifically, we combine three different techniques that are new to the architecture of graph database systems when used in conjunction. First, the data model is encoded as basic relations, indexed following different attribute orders, wherein data objects (e.g., nodes, strings) are represented by ids. Second, we translate the evaluation of any query to several joins between basic relations, which we manage using *worst-case optimal join algorithms* [40]: an evaluation technique recently proposed for relational database systems. Last, we combine join algorithms with the evaluation of path queries by compiling the path pattern into an automaton and running the query on the fly. These techniques, together, are at the heart of how MillenniumDB optimizes queries over the domain graph model in practice.

In what follows, we explain how one can store (property) domain graphs and index them. We then outline the query evaluation process and the algorithmic techniques it uses, like the worst-case optimal query plan and the evaluation of path queries.

*Storage and indexing.* Let us start by explaining the Disk and Storage Manager part of the MillenniumDB architecture from Figure 12. The main components of the domain graph data model are objects. Objects are represented internally as 8-byte identifiers. To optimize query execution, identifiers are divided into classes, and the first byte of the identifier specifies the class it belongs to. The main classes in a property domain graph *G* = (O, γ, lab, prop) are:

- *Nodes,* which are objects in the range of γ. They are divided into two subclasses: *named nodes,* which are objects in the domain graph for which an explicit name is available (e.g. Q320 in Wikidata), and *anonymous nodes,* which are internally generated objects without an explicit name available to the user (similar to blank nodes in RDF [41]).
- *Edges,* which are objects in the domain and range of γ, and are always anonymous, internally generated objects.
- *Values,* which are data objects like strings, integers, etc. These values are classified in two subclasses: *inlined values,* which are values that fit into 7 bytes of the identifier after the mask (e.g. 7-byte strings, integers, etc.), and *external values,* which are values longer than 7 bytes (e.g. long strings).

All records stored in MillenniumDB are composed of these identifiers. We will explain later how long strings for external values are handled.
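The identifier layout described above (one class byte followed by seven payload bytes) can be sketched in Python as follows. The concrete tag values, the zero-padding of short strings, and all function names are assumptions made for illustration; the text does not specify the encoding used by MillenniumDB at this level of detail.

```python
# Hypothetical class tags; the actual tag values are not given in the text.
TAG_NAMED_NODE, TAG_ANON_NODE, TAG_EDGE, TAG_INLINE_STR, TAG_EXTERNAL_STR = 1, 2, 3, 4, 5

PAYLOAD_BITS = 56                       # 7 bytes of payload after the 1-byte class tag
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1


def make_id(tag, payload):
    """Pack a class tag (first byte) and a 7-byte payload into one 8-byte integer."""
    if not (0 <= tag < 256 and 0 <= payload <= PAYLOAD_MASK):
        raise ValueError("tag or payload out of range")
    return (tag << PAYLOAD_BITS) | payload


def tag_of(identifier):
    """Read the class of an identifier from its first byte."""
    return identifier >> PAYLOAD_BITS


def encode_string(text, external_offset=None):
    """Inline strings of up to 7 bytes; longer strings need an offset into an external store."""
    data = text.encode("utf-8")
    if len(data) <= 7:
        # Assumption: short strings are right-padded with zero bytes.
        return make_id(TAG_INLINE_STR, int.from_bytes(data.ljust(7, b"\0"), "big"))
    if external_offset is None:
        raise ValueError("external values need an offset")
    return make_id(TAG_EXTERNAL_STR, external_offset)


def decode_inline_string(identifier):
    """Recover an inlined string directly from the identifier, without any lookup."""
    payload = identifier & PAYLOAD_MASK
    return payload.to_bytes(7, "big").rstrip(b"\0").decode("utf-8")


short_id = encode_string("Q30")
long_id = encode_string("Stanford University", external_offset=42)
```

Because the class is stored in the identifier itself, a record can be interpreted without consulting any other structure; only external values require a further lookup.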

To store property domain graphs, MillenniumDB deploys B+ trees [42]. For this purpose, we build a B+ tree template for fixed sized records, which store all classes of identifiers. To store a property domain graph *G* = (O, γ, lab, prop), we simply store and index in B+ trees the four components defining it:

OBJECTS(id) stores the identifiers of all the objects in the database (i.e., O).

DOMAINGRAPH(source,type,target,eid) contains all information on edges in the graph (i.e., γ), where eid is an edge identifier, and source, type, and target can be ids of any class (i.e., node, edge, or value). By default, four permutations of the attributes are indexed in order to aid query evaluation. These are: source-target-type-eid, target-type-source-eid, type-source-target-eid and type-target-source-eid.

LABELS(object,label) stores object labels (i.e., lab). The value of object can be any identifier, and the values of label are stored as ids. Both permutations are indexed.

PROPERTIES(object,property,value) stores the property-value pairs associated with each object (i.e., prop). The object column can contain any id, and property and value are value ids. Aside from indexing the primary key, an additional permutation is added to search objects by property-value pairs.

All the B+ trees are created through a bulk-import phase, which loads multiple tuples of sorted data, rather than inserting records one by one. In order to enable fast lookups by edge identifier, we use the fact that this attribute is the key for the relation. Therefore, we also store a table called EdgeTable, which contains triples of the form (source,type,target), such that the position of a triple in the table equals the identifier of the object *e* for which γ(*e*) = (source,type,target). This implies that edge identifiers must be assigned consecutive ids starting from zero internally by MillenniumDB (they are not specified by the user). In total, we use ten B+ trees for storing the data.
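The role of the index permutations and of the EdgeTable can be illustrated with sorted Python lists standing in for B+ trees. This is a simplified sketch under stated assumptions: the rows use hypothetical integer ids, a sorted list with binary search replaces the on-disk B+ tree, and the names `INDEXES`, `prefix_scan` and `EDGE_TABLE` are illustrative.

```python
from bisect import bisect_left

# Hypothetical DOMAINGRAPH rows (source, type, target, eid) with integer ids.
ROWS = [(3, 100, 1, 0), (4, 100, 1, 1), (5, 100, 2, 2), (5, 101, 4, 3)]

# The four attribute orders named in the text, given as column positions in a row.
ORDERS = {
    "source-target-type-eid": (0, 2, 1, 3),
    "target-type-source-eid": (2, 1, 0, 3),
    "type-source-target-eid": (1, 0, 2, 3),
    "type-target-source-eid": (1, 2, 0, 3),
}

# One sorted copy of the relation per order (a stand-in for one B+ tree per permutation).
INDEXES = {
    name: sorted(tuple(row[c] for c in cols) for row in ROWS)
    for name, cols in ORDERS.items()
}


def prefix_scan(index_name, prefix):
    """Return all records of the chosen permutation that start with `prefix`."""
    index = INDEXES[index_name]
    start = bisect_left(index, prefix)   # first record >= prefix
    out = []
    for record in index[start:]:
        if record[:len(prefix)] != prefix:
            break
        out.append(record)
    return out


# EdgeTable: position i holds (source, type, target) of the edge with id i.
EDGE_TABLE = [None] * len(ROWS)
for source, edge_type, target, eid in ROWS:
    EDGE_TABLE[eid] = (source, edge_type, target)

by_type = prefix_scan("type-source-target-eid", (100,))
into_node_1 = prefix_scan("target-type-source-eid", (1, 100))
```

A lookup with some attributes fixed becomes a range scan over the permutation whose leading attributes are exactly the fixed ones, while a lookup by edge id is a direct positional access into `EDGE_TABLE`.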

To transform external strings and values (longer than 7 bytes) to database object ids and values, we have a single binary file called ObjectFile, which contains all such strings concatenated together. The internal id of an external value is then equal to the position where it is written in the ObjectFile, thus allowing efficient lookups of a value via its id. The identifiers are generated upon loading, and an additional hash table is kept to map a string to its identifier; we use this to ensure that no value is inserted twice, and to transform explicit values given in a query to their internal ids. Only strings are currently supported, but the implementation interface allows for adding support for different value types in a relatively simple manner.
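The idea that an external value is identified by its position in a single concatenated file can be sketched as follows. The class below is a hypothetical in-memory stand-in for the ObjectFile; in particular, the 4-byte length prefix used to delimit strings is an assumption, since the text does not say how string boundaries are recorded.

```python
class ObjectFileSketch:
    """Append-only byte store in which the id of a string is the offset where it was written."""

    def __init__(self):
        self.data = bytearray()   # stands in for the single binary file
        self.ids = {}             # hash table: string -> id, so no value is stored twice

    def get_or_create_id(self, text):
        if text in self.ids:
            return self.ids[text]
        encoded = text.encode("utf-8")
        offset = len(self.data)
        # Assumption: each string is preceded by its length as a 4-byte little-endian integer.
        self.data += len(encoded).to_bytes(4, "little") + encoded
        self.ids[text] = offset
        return offset

    def lookup(self, offset):
        """Resolve an id back to its string by reading at the given position."""
        length = int.from_bytes(self.data[offset:offset + 4], "little")
        return bytes(self.data[offset + 4:offset + 4 + length]).decode("utf-8")


store = ObjectFileSketch()
first = store.get_or_create_id("Stanford University")
second = store.get_or_create_id("California Institute of Technology")
again = store.get_or_create_id("Stanford University")
```

The hash table serves both purposes mentioned above: it prevents duplicates during loading, and it translates a string constant appearing in a query into its internal id.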

All of the stored relations are accessed through linear iterators which provide access to one tuple at a time. All of the data is stored on pages of fixed (but parametrized) size (currently 4kB). The data from disk is loaded into a shared main memory buffer, whose size can be specified upon initializing the MillenniumDB server. The buffer uses the standard clock page replacement policy [42]. Additionally, for improved performance, upon initializing the server, it can be specified that the ObjectFile be loaded into main memory in order to quickly convert internal identifiers to string and integer values that do not fit into 7 bytes.
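For readers unfamiliar with the clock page replacement policy, a minimal Python sketch is given below. It is a textbook-style illustration rather than the MillenniumDB buffer manager: pages are not pinned, dirty pages are not written back, and the class `ClockBuffer` and the callback `read_page` are hypothetical names.

```python
class ClockBuffer:
    """Fixed-size page buffer with clock (second-chance) replacement."""

    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page   # callable: page_id -> page contents (the "disk" access)
        self.frames = []             # each frame is [page_id, contents, referenced_bit]
        self.where = {}              # page_id -> frame index
        self.hand = 0

    def get(self, page_id):
        if page_id in self.where:                 # buffer hit: mark the frame as referenced
            frame = self.frames[self.where[page_id]]
            frame[2] = True
            return frame[1]
        contents = self.read_page(page_id)        # buffer miss: load from disk
        if len(self.frames) < self.capacity:      # a free frame is still available
            self.where[page_id] = len(self.frames)
            self.frames.append([page_id, contents, True])
            return contents
        # Advance the hand, clearing reference bits, until an unreferenced frame is found.
        while self.frames[self.hand][2]:
            self.frames[self.hand][2] = False
            self.hand = (self.hand + 1) % self.capacity
        del self.where[self.frames[self.hand][0]]  # evict the victim page
        self.frames[self.hand] = [page_id, contents, True]
        self.where[page_id] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return contents


disk_reads = []


def read_page(page_id):
    disk_reads.append(page_id)
    return f"page-{page_id}"


buffer = ClockBuffer(capacity=2, read_page=read_page)
for page_id in [1, 2, 1, 3, 2, 1]:
    buffer.get(page_id)
```

In this trace, six page requests cause four disk reads; the second requests for pages 1 and 2 are served from the buffer, whereas the last request for page 1 misses because that page was evicted to make room for page 3.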

*Evaluating a query.* In MillenniumDB, the execution pipeline follows the standard database template where the string of the query is parsed and translated into a logical plan, which is then analyzed and converted into a physical plan, and finally evaluated, as illustrated by the Query Processor component of Figure 12.

A key part is how the *patterns* and *filters* of a DGQL query (see Section 4.1) are evaluated. Specifically, patterns and filters are grouped together into a list of relations that can be edges, labels, properties, or path queries, forming a large *multi-way join* query. In essence, evaluating these joins is analogous to selecting an appropriate join plan for the relations representing the different elements. This goes hand in hand with selecting the appropriate join algorithm for each of the joins. Given that edges, labels, and properties are all indexed, this will most commonly be index nested-loop join. Paths, on the other hand, are not directly indexed. For this reason, they are pushed to the end of the join plan and joined via nested-loop with the rest of the multi-way join^{⑦}.

MillenniumDB supports different mechanisms for evaluating the multi-way join formed by the pattern and filter of a DGQL query.

A worst-case optimal query plan as described in [43] is used whenever possible. This approach implements a modified leapfrog algorithm [38] in order to minimize the number of intermediate results that are generated.

The classical relational optimizer, which is based on cost estimation, and tries to order base relations in such a way as to minimize the amount of (intermediate) results. We currently support two modes of execution here:

Two particular points of interest are the worst-case optimal query planner, and the way that paths are evaluated. Both of these deploy state-of-the-art research ideas that are usually not implemented in practical graph database systems (though some prototypes exist [43, 46, 47]). We provide some additional details on these next.

*Worst-case optimal query plan.* Evaluating multiple joins in a worst-case optimal way is done using a modified leapfrog algorithm [43]. While a classical join plan does a nested for-loop over relations, leapfrog performs a nested for-loop over variables [38]. Specifically, the algorithm first selects a variable order for the query, say (?x, ?y, ?z). It then intersects all relations where the first variable ?x appears, and over each solution for ?x returned, it intersects all relations where ?y appears (replacing ?x in its current solution), and so on to ?z, until all variables are processed and the final solutions are generated. We refer the reader to [38] and [43] for a detailed explanation. Two critical aspects for supporting this approach are indexes and variable ordering, explained next.
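The variable-at-a-time strategy can be sketched in Python as follows. The sketch deliberately substitutes plain hash-set intersection for the leapfrog seeks over sorted index iterators, so it conveys the loop structure but not the worst-case optimality guarantees of the actual algorithm; the function `variable_at_a_time_join` and the example relation are hypothetical, and every variable is assumed to occur in at least one relation.

```python
def variable_at_a_time_join(relations, variable_order):
    """relations: list of (variables, rows); returns all bindings satisfying every relation."""
    results = []

    def candidates(rel_vars, rows, var, binding):
        # Values of `var` in rows that agree with the variables already bound.
        pos = rel_vars.index(var)
        return {
            row[pos]
            for row in rows
            if all(row[i] == binding[v] for i, v in enumerate(rel_vars) if v in binding)
        }

    def extend(depth, binding):
        if depth == len(variable_order):
            results.append(dict(binding))
            return
        var = variable_order[depth]
        # Intersect the candidate values from every relation in which `var` appears.
        domains = [candidates(rv, rows, var, binding) for rv, rows in relations if var in rv]
        for value in sorted(set.intersection(*domains)):
            binding[var] = value
            extend(depth + 1, binding)
            del binding[var]

    extend(0, {})
    return results


# Hypothetical edge relation and the triangle query E(?x,?y), E(?y,?z), E(?z,?x).
E = {(1, 2), (2, 3), (3, 1), (1, 4)}
query = [(("?x", "?y"), E), (("?y", "?z"), E), (("?z", "?x"), E)]
triangles = variable_at_a_time_join(query, ["?x", "?y", "?z"])
```

Note that the edge (1, 4) is discarded as soon as ?y is processed, because 4 has no outgoing edge; a pairwise join plan would first materialize it as part of an intermediate result.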

To support the leapfrog algorithm over traditional relational indexes such as B+ trees, we would need to index all relations in all possible orders of their attributes in order to ensure efficient intersections, which greatly increases disk storage [43]. In MillenniumDB, we include four orders for DOMAINGRAPH, and all orders for LABELS and PROPERTIES. With these orders we can cover the most common join types that appear in practice [34] by a worst-case optimal query plan. We use the classical relational optimizer if the plan needs an unsupported order or one of the relations uses a path query.

The leapfrog algorithm further requires choosing a variable ordering, which is crucial for its performance. The heuristic we deploy for selecting the variable ordering mixes a greedy approach and the ideas of the Graham-Yu-Özsoyoglu (GYO) reduction [48]. More precisely, we first order the variables based on the minimal cost of the relations they appear in and resolve ties by selecting the variable that appears in more distinct relations. The variables “connected” to the first one chosen are then processed in the same manner (where connected means appearing in the same relation) until the process cannot continue. The isolated variables are then treated last.
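One possible reading of this heuristic is sketched below in Python. It is only an interpretation of the prose: the cost estimates are hypothetical numbers, ties beyond those described are broken by variable name, and all variables not reachable from the first one are simply appended at the end by rank.

```python
def choose_variable_order(relations):
    """relations: list of (variables, estimated_cost); returns one variable order."""
    variables = {v for rel_vars, _ in relations for v in rel_vars}

    def rank(var):
        costs = [cost for rel_vars, cost in relations if var in rel_vars]
        # Cheapest relation first; ties go to the variable in more relations, then by name.
        return (min(costs), -len(costs), var)

    def neighbours(var):
        # Variables sharing at least one relation with `var`.
        return {u for rel_vars, _ in relations if var in rel_vars for u in rel_vars} - {var}

    remaining = set(variables)
    first = min(remaining, key=rank)
    order = [first]
    remaining.remove(first)
    frontier = neighbours(first) & remaining
    while frontier:
        nxt = min(frontier, key=rank)
        order.append(nxt)
        remaining.remove(nxt)
        frontier = (frontier | neighbours(nxt)) & remaining
    # Variables not connected to the chosen ones are treated last.
    order.extend(sorted(remaining, key=rank))
    return order


# Hypothetical relations with estimated costs; ?w does not share a relation with the others.
relations = [
    (("?x", "?y"), 1000),
    (("?y", "?z"), 10),
    (("?z", "?x"), 500),
    (("?w",), 5000),
]
order = choose_variable_order(relations)
```

In this example the cheapest relation binds ?y and ?z, so these are processed first, followed by the connected variable ?x, and finally the isolated variable ?w.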

*Evaluating path queries.* For evaluating a path query (2RPQ), the path pattern is compiled into an automaton. Then a “virtual” cross-product of this automaton and the graph is constructed on-the-fly, and navigated via breadth-first search (BFS), as commonly suggested in the theoretical literature [49, 50, 39] (our experiments in Section 6 will also test a depth-first search (DFS) variant). Our assumption is that each path pattern will have at least one of the endpoints assigned before evaluation. This can be done either explicitly in the pattern, or via the remainder of the query. For instance, a path pattern (Q1)=[P31∗]=>(?x) has the starting point of our search assigned to Q1. On the other hand, (?x)=[P31∗]=>(?y:Person) does not have any of the endpoints assigned; however, the (?y:Person) allows us to instantiate ?y with any node with the label :Person.

Intuitively, from a starting node (tagged with the initial state of the automaton), all edges with the type specified by the outgoing transitions from this state are followed. The process is repeated until reaching an end state of the automaton, upon which a result can be returned. This allows a fully pipelined evaluation of path queries, while only requiring at most a fixed amount of memory (the neighbors of the node on the top of the BFS queue). Additionally, the BFS algorithm also allows us to return a single shortest path between each pair of endpoints (see Section 4 for an example). Returning a single shortest path comes almost for free, given that it can be reconstructed using the set of visited nodes as used for bookkeeping in the BFS algorithm. The algorithm can also be extended to return all shortest paths (as supported by DGQL) by keeping a list of predecessors that reach the node via a path of shortest length.

The implemented algorithm only requires two permutations of the DOMAINGRAPH relation: one for retrieving all of a node's successors via an edge of a specified type; and another for retrieving all such predecessors of a given node.
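The search over the product of the automaton and the graph can be sketched in Python as follows. This is a simplified illustration rather than the engine's implementation: the edges are hypothetical, the automaton for cites+ is written by hand instead of being compiled from the expression, results are collected rather than pipelined, and the edge list is scanned linearly where the engine would use the two DOMAINGRAPH permutations mentioned above.

```python
from collections import deque

# Hypothetical (source, type, target) edges: n5 cites n4, and n4 cites n3.
EDGES = [("n5", "cites", "n4"), ("n4", "cites", "n3")]

# Hand-written automaton for cites+ as (state, edge type, inverse?, next state) transitions.
CITES_PLUS = [(0, "cites", False, 1), (1, "cites", False, 1)]


def evaluate_path(start, transitions, initial, finals, edges):
    """BFS over (node, state) pairs; returns {end node: one shortest path as a list of nodes}."""
    parent = {(start, initial): None}   # doubles as the visited set
    queue = deque([(start, initial)])
    answers = {}
    while queue:
        node, state = queue.popleft()
        if state in finals and node not in answers:
            # Reconstruct a shortest path by following parent pointers back to the start.
            path, current = [], (node, state)
            while current is not None:
                path.append(current[0])
                current = parent[current]
            answers[node] = path[::-1]
        for src_state, edge_type, inverse, dst_state in transitions:
            if src_state != state:
                continue
            for source, etype, target in edges:
                if etype != edge_type:
                    continue
                if not inverse and source == node:
                    nxt = (target, dst_state)     # follow the edge forwards
                elif inverse and target == node:
                    nxt = (source, dst_state)     # follow the edge backwards
                else:
                    continue
                if nxt not in parent:
                    parent[nxt] = (node, state)
                    queue.append(nxt)
    return answers


paths = evaluate_path("n5", CITES_PLUS, initial=0, finals={1}, edges=EDGES)
```

Since BFS visits product states in order of distance from the start, the first time a node is reached in a final state the parent pointers describe a shortest witnessing path, which is why returning one such path adds little overhead.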

## 6. BENCHMARKING

In this section, we provide an experimental evaluation of the core graph querying features of MillenniumDB addressing two key questions: (Q1) *Which join and path algorithms provide the best performance over domain graphs? (Q2) How does MillenniumDB's performance compare with existing graph database engines?*

We base our experiments on the Wikidata knowledge graph [21], which is one of the largest and most diverse real-world knowledge graphs that is publicly available, and also provides a public log of real-world queries posted by Wikidata users that we can use for experiments [22, 34]. The experiments focus on two fundamental query features: (i) basic graph patterns (BGPs); and (ii) path queries. Regarding (Q1), we compare the performance of different join and path algorithms within MillenniumDB. Regarding (Q2), we also provide a side by side comparison with several popular persistent graph database engines that support BGPs and at least the Kleene star feature for paths. We publish the data, queries, scripts, and configuration files for each engine online, together with the scripts used to load the data and run the experiments [26].

*Internal baselines.* The base of our comparison is the MillenniumDB implementation available at [25]. For comparing the performance of different join and path algorithms in MillenniumDB (per Q1), we include internal baselines, where we test: (i) MillenniumDB LF, which is the default version implementing the leapfrog triejoin algorithm; (ii) MillenniumDB GR, which implements the greedy algorithm for selecting the join order; and (iii) MillenniumDB SL, implementing the Selinger join planner. Similarly, for path queries, we test (a) MillenniumDB BFS, the default version of the engine; and (b) MillenniumDB DFS, which evaluates path queries using a depth-first traversal.

*Other engines.* We also compare the performance of MillenniumDB with five persistent graph query engines (per Q2). First, we include three popular RDF engines: Jena TDB version 4.1.0 [18], Blazegraph (BlazeG for short) version 2.1.6 [16], and Virtuoso version 7.2.6 [20]. We further include a property graph engine: Neo4j community edition 4.3.5 [6].^{⑧} Finally, we also compare with Jena Leapfrog (Jena LF, for short), a version of Jena TDB implementing a leapfrog-style algorithm [43], in order to compare with an external graph database using a worst-case optimal algorithm.

*The machine.* All experiments described were run on a single commodity server with an Intel® Xeon® Silver 4110 CPU, and 128GB of DDR4/2666MHz RAM, running the Linux Debian 10 operating system with the kernel version 5.10. The hard disk used to store the data was a SEAGATE ST14000NM001G with 14TB of storage.

*The data.* The base for our experiments is the Wikidata dataset. In particular, we used the truthy dump version 20210623-truthy-BETA [51], keeping only triples in which (i) the subject position is a Wikidata entity, and (ii) the predicate is a direct property. We call this dataset *Wikidata Truthy.* The size of the dataset after this process was 1,257,169,959 triples. The simplification of the dataset is done to facilitate comparison across multiple engines, specifically to keep data loading times across all engines manageable while keeping the nodes and edges necessary for testing the performance of BGPs and property paths. The size of the *Wikidata Truthy* dataset, when loaded into the respective systems, is summarized in Table 3. Default indices were used on Jena TDB, Blazegraph and Virtuoso. Jena LF stores three additional permutations of the stored triples to efficiently support the leapfrog algorithm for any join query, thus using more space. Neo4j by default creates an index for edge types (as of version 4.3.5). To speed up searches for particular entities and properties, we also created an index linking a Wikidata identifier (e.g., Q510) to its internal id in Neo4j. We also tried to index literal values in Neo4j, but the process failed (the literals are still stored). MillenniumDB uses extra disk space because of the additional indices needed to support worst-case optimal joins over domain graphs (similar to the case of Jena LF).

Table 3. Size on disk of the *Wikidata Truthy* dataset when loaded into each engine.

| MillenniumDB | BlazeG | Jena | Jena LF | Virtuoso | Neo4j |
|---|---|---|---|---|---|
| 203GB | 70GB | 110GB | 195GB | 70GB | 112GB |


*How we ran the queries.* We detail the query sets used for the experiments in their respective subsections. To simulate a realistic database load, we do not split queries into cold/hot run segments. Rather, we run them in succession, one after another, after a cold start of each system (and after cleaning the OS cache). This reflects the fact that query performance can vary significantly based on the state of the system buffer, the state of the hard drive, or the state of the OS's virtual memory. For each system, queries were run in the same order. We record the execution time of each individual query, which includes iterating over all results. We set a limit of 100,000 distinct results for each query, again in order to enable comparability, as some engines showed instability when returning larger results.
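The measurement procedure described above could be sketched as follows. This is an illustrative harness under stated assumptions rather than the one used in our experiments: `execute` is a hypothetical callable that submits a query to an engine and returns an iterator over its results, and the result limit is enforced on the client side for simplicity.

```python
import itertools
import time

RESULT_LIMIT = 100_000  # per-query limit on results, as in the setup described above


def time_query(execute, query, limit=RESULT_LIMIT):
    """Run one query and return (elapsed seconds, number of results consumed).

    The elapsed time includes iterating over the results, not only
    receiving the first one.
    """
    start = time.perf_counter()
    count = sum(1 for _ in itertools.islice(execute(query), limit))
    return time.perf_counter() - start, count


def run_benchmark(execute, queries, limit=RESULT_LIMIT):
    """Run queries in succession, in the given order, with no warm-up runs."""
    return [time_query(execute, query, limit) for query in queries]
```

Running the queries in a single pass, without separating cold and hot runs, means each measurement depends on the buffer state left behind by the preceding queries, which is why the same order is used for every system.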

*Memory usage.* Blazegraph, Jena and Virtuoso were assigned 64GB of RAM, as recommended. Neo4j was run with default settings^{⑨}, while MillenniumDB had access to 32GB for its main-memory buffer, and uses an additional 10GB for in-memory dictionaries. The systems tested are buffer-based: they reserve a fixed amount of scrap space (the buffer) in main memory for their operation, and do not exceed this memory (except perhaps for a small amount used for internal operations). Since they also tend to use the buffer available, their maximum memory usage corresponds to these settings. Thus, in the rest of this section we focus on comparing runtimes.

*Handling timeouts.* We defined a timeout of 10 minutes per query for each system. We note that most systems had to be restarted upon a timeout as they often showed instability, particularly while evaluating path queries. The restart was done without cleaning the OS cache in order to preserve some of the virtual memory mapping that the OS had built up to that point. In comparison, MillenniumDB managed to return a non-trivial number of results on each query, and did not need to be restarted, thus handling timeouts gracefully.
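The idea of handling a timeout gracefully, i.e., keeping the results found so far rather than failing, can be illustrated with a small sketch. It is a client-side approximation only, not how any of the engines implement timeouts internally: the function name `collect_with_timeout` and the injectable `clock` parameter are hypothetical, and the deadline is only checked between results, so a single long-blocking step cannot be interrupted.

```python
import time


def collect_with_timeout(results, timeout_s, clock=time.monotonic):
    """Consume an iterator until it ends or the deadline passes.

    Returns (rows collected so far, whether the timeout was hit).
    A row fetched after the deadline is discarded.
    """
    deadline = clock() + timeout_s
    collected = []
    for row in results:
        if clock() >= deadline:
            return collected, True  # partial answer, but still usable
        collected.append(row)
    return collected, False
```

Passing the clock as a parameter makes the behavior easy to test deterministically with a fake clock.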

### 6.1 Basic Graph Patterns

We focus first on basic graph pattern queries. To test different query execution strategies of MillenniumDB, we use two benchmarks: *Real-world BGPs* and *Complex BGPs,* which are described next.

*Real-world BGPs.* The Wikidata SPARQL query log contains millions of queries [22], but many are trivial to evaluate. We thus generate our benchmark from more challenging cases, i.e., a smaller log of queries that timed out on the Wikidata public endpoint [22]. From these queries we extracted their BGPs, removing duplicates (modulo isomorphism on query variables). We distinguish queries consisting of a single triple pattern (*Single*) from those containing more than one triple pattern (*Multiple*). The former set tests the triple matching capabilities of the systems, whereas the latter set tests join performance. *Single* contains 399 queries, whereas *Multiple* has 436 queries.

*Real-world Single.* Table 4 (top) summarizes the query times on this set, whereas Figure 13 (left) shows boxplots with more detailed statistics on the distributions of runtimes. Since these queries do not require joins, we show one variant for MillenniumDB. MillenniumDB is the fastest overall (median of 0.05 s), followed by Blazegraph (median of 0.09 s). In terms of average times and higher percentiles, MillenniumDB more clearly outperforms other engines, being able to enumerate up to 100,000 results (the limit) more quickly because it decodes internal ids more quickly. In MillenniumDB, values such as P12 or Q10 that fit within 7 bytes are inlined, and do not need to be dictionary decoded. The remaining dictionary fits entirely in available memory (~8 GB of RAM). In the other systems, dictionary decoding generates random accesses to the disk. The four SPARQL engines tested must store IDs as IRIs within the RDF model, which include relatively long prefixes. However, since RDF datasets typically have few prefixes repeated often, we could support full IRIs within MillenniumDB with minimal overhead by encoding a prefix id for the top *k* prefixes in ⌈log_{2}(k)⌉ bits within the object identifier, keeping the small mapping from prefix id to string in memory. The Wikidata query service lists 32 prefixes, which would require 5 bits that would fit “for free” in the class byte (essentially considering each prefix to be a class).
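The inlining of short values and the prefix-bit calculation above can be illustrated with a short sketch. The exact bit layout is an assumption made for illustration only (one class byte followed by a 7-byte payload in a 64-bit identifier), and the names `encode_inline`, `decode_inline`, `prefix_id_bits` and the tag value `INLINE_STRING_CLASS` are hypothetical, not MillenniumDB's actual encoding.

```python
PAYLOAD_BYTES = 7           # assumed: low 7 bytes of a 64-bit id hold the inlined value
INLINE_STRING_CLASS = 0x01  # hypothetical tag stored in the class (top) byte


def prefix_id_bits(k: int) -> int:
    """Bits needed to number k prefixes, i.e., ceil(log2(k)) for k >= 1."""
    return (k - 1).bit_length()


def encode_inline(value: str, class_tag: int = INLINE_STRING_CLASS):
    """Pack a short string into a 64-bit id, or return None if it does not fit."""
    raw = value.encode("utf-8")
    if len(raw) > PAYLOAD_BYTES:
        return None  # too long: such a value would go through the dictionary instead
    payload = int.from_bytes(raw.ljust(PAYLOAD_BYTES, b"\x00"), "big")
    return (class_tag << 56) | payload


def decode_inline(object_id: int):
    """Recover (class tag, string) from an inlined id without any dictionary lookup."""
    class_tag = object_id >> 56
    raw = (object_id & ((1 << 56) - 1)).to_bytes(PAYLOAD_BYTES, "big")
    return class_tag, raw.rstrip(b"\x00").decode("utf-8")
```

Under these assumptions, `decode_inline(encode_inline("Q10"))` recovers the string using only bit operations, and `prefix_id_bits(32)` evaluates to 5, matching the count given above for the 32 prefixes of the Wikidata query service. The sketch does not handle values that themselves end in NUL bytes.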

Table 4. Query times (in seconds) on the basic graph pattern benchmarks.

| Engine | Supported | Error | Timeouts | Average (s) | Median (s) |
|---|---|---|---|---|---|
| *Real-world Single (399 queries)* | | | | | |
| MillenniumDB | 399 | 0 | 0 | 0.07 | 0.05 |
| Blazegraph | 399 | 0 | 0 | 2.21 | 0.09 |
| Jena | 399 | 0 | 0 | 14.10 | 0.34 |
| Jena LF | 395 | 4 | 0 | 10.08 | 0.44 |
| Virtuoso | 399 | 0 | 0 | 2.22 | 0.32 |
| Neo4j | 394 | 5 | 0 | 28.00 | 1.33 |
| *Real-world Multiple (436 queries)* | | | | | |
| MillenniumDB LF | 436 | 0 | 0 | 4.84 | 0.24 |
| MillenniumDB GR | 436 | 0 | 1 | 10.19 | 0.30 |
| MillenniumDB SL | 436 | 0 | 1 | 10.04 | 0.27 |
| Blazegraph | 436 | 0 | 3 | 31.79 | 2.42 |
| Jena | 426 | 10 | 0 | 35.43 | 4.90 |
| Jena LF | 418 | 18 | 0 | 16.78 | 3.39 |
| Virtuoso | 436 | 0 | 0 | 7.87 | 5.11 |
| Neo4j | 405 | 31 | 0 | 75.55 | 6.84 |
| *Complex (850 queries)* | | | | | |
| MillenniumDB LF | 850 | 0 | 0 | 0.38 | 0.10 |
| MillenniumDB GR | 850 | 0 | 1 | 3.30 | 0.17 |
| MillenniumDB SL | 850 | 0 | 1 | 3.51 | 0.17 |
| Blazegraph | 850 | 0 | 2 | 4.63 | 0.34 |
| Jena | 850 | 2 | 0 | 3.37 | 0.16 |
| Jena LF | 850 | 0 | 0 | 0.88 | 0.14 |
| Virtuoso | 850 | 0 | 0 | 1.00 | 0.19 |
| Neo4j | 850 | 10 | 0 | 17.92 | 0.66 |


*Real-world Multiple.* Table 4 (middle) and Figure 13 (middle) show the results for this set. Comparing different join execution strategies within MillenniumDB, we can see the superiority of the leapfrog triejoin variant (particularly on average, i.e., for more complex queries). The Selinger variant of MillenniumDB outperforms the greedy algorithm for join selection, but only marginally. Compared with existing graph engines, MillenniumDB clearly outperforms other systems on this query set. Its medians are an order of magnitude faster than those of Blazegraph, the next best contender. The difference is less sharp for averages, but MillenniumDB LF still takes 60% of the time of Virtuoso, the next best contender.

*Complex BGPs.* This is a benchmark used to test the performance of worst-case optimal joins [43]. Here, 17 different complex join patterns were selected, and 50 different queries generated for each pattern, resulting in a total of 850 queries. Figure 13 (right) and Table 4 (bottom) show the resulting query times. In this case, the difference between the join algorithms of MillenniumDB is clearer. The worst-case optimal version (MillenniumDB LF) is not only considerably more stable than the other two versions, but also roughly twice as fast in the median. We can also observe that MillenniumDB GR wins out over MillenniumDB SL on average (but not in the median case). When comparing with other engines, the next-best competitor after MillenniumDB LF is Jena LF, showing the benefits of worst-case optimal joins. Virtuoso follows not far behind, while MillenniumDB GR, Jena, Blazegraph and Neo4j are considerably slower. Overall, MillenniumDB LF offers the best performance for every statistic shown in the plot.
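To give some intuition for the leapfrog-style algorithms compared here, the following sketch shows the core "leapfrogging" step on its own: intersecting several sorted lists by repeatedly seeking each list forward to the largest value seen so far. It is a simplified illustration, assuming in-memory, duplicate-free sorted Python lists; it is not the implementation used in MillenniumDB or Jena LF, where seeks would be performed on index iterators and the step is applied one variable at a time.

```python
from bisect import bisect_left


def leapfrog_intersect(lists):
    """Intersect sorted, duplicate-free lists by seeking each one to the current target."""
    if not lists or any(not lst for lst in lists):
        return []
    pos = [0] * len(lists)                   # current position in each list
    target = max(lst[0] for lst in lists)    # no smaller value can be in the intersection
    agree = 0                                # lists known to be positioned at target
    out = []
    i = 0
    while True:
        lst = lists[i]
        p = bisect_left(lst, target, pos[i])  # seek: first position with value >= target
        if p == len(lst):
            return out                        # one list is exhausted: nothing more to find
        pos[i] = p
        if lst[p] == target:
            agree += 1
            if agree >= len(lists):           # every list contains target
                out.append(target)
                pos[i] = p + 1                # move past the match
                if pos[i] == len(lst):
                    return out
                target = lst[pos[i]]
                agree = 1
        else:
            target = lst[p]                   # overshoot: this value becomes the new target
            agree = 1
        i = (i + 1) % len(lists)


print(leapfrog_intersect([[1, 3, 4, 7], [3, 4, 5, 7], [0, 3, 7]]))  # [3, 7]
```

The benefit over pairwise joins is that values missing from any one list are skipped by all lists at once, so no large intermediate result is materialized.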

### 6.2 Path Queries

To test the performance of path queries, we extracted 2RPQ expressions from a log of queries that timed out on the Wikidata endpoint [22]. The original log has 2110 queries. After removing queries that do not use direct properties (which are absent in the *Wikidata Truthy* dataset), we ended up with 1683 queries. These were run in succession, each restricted to return at most 100,000 results. In the case of SPARQL engines, we added the DISTINCT keyword to remove duplicates caused by the rewriting of fixed-length path queries to unions of BGPs that are then evaluated under bag semantics. To make the comparison fair, the DISTINCT keyword was also added in MillenniumDB queries. Each system was started after cleaning the system cache, and with a timeout of 10 minutes. Since these are originally SPARQL queries, not all of them were supported by Neo4j given the restricted regular-expression syntax it supports. MillenniumDB and Neo4j were the only systems able to handle timeouts without being restarted.^{⑩} In this comparison we do not include Jena LF since it uses the same execution strategy as Jena for property paths. Likewise, for MillenniumDB, we introduce two internal baselines for breadth-first search (BFS) and depth-first search (DFS). The experimental results for these path queries are summarized in Table 5 and Figure 14.

Table 5. Query times (in seconds) on path queries.

| Engine | Supported | Error | Timeouts | Average (s) | Median (s) |
|---|---|---|---|---|---|
| MillenniumDB BFS | 1683 | 0 | 0 | 1.1 | 0.095 |
| MillenniumDB DFS | 1683 | 0 | 0 | 1.1 | 0.072 |
| Blazegraph | 1683 | 2 | 44 | 27.6 | 0.396 |
| Jena | 1683 | 14 | 46 | 22.8 | 0.207 |
| Virtuoso | 1683 | 55 | 4 | 5.8 | 0.325 |
| Neo4j | 1622 | 0 | 42 | 23.3 | 0.328 |


In terms of our internal comparison, we can see that the DFS algorithm slightly outperforms BFS. The reason for keeping BFS as the default algorithm is twofold: (i) it significantly outperforms DFS when paths are also returned; and (ii) it supports returning all shortest paths between any pair of nodes. To illustrate point (i), we ran our experiments again, but now also returning a single path witnessing each query answer. In this case, the average for BFS is 5.9 sec, and median is 0.086 sec. On the other hand, when paths are returned, DFS takes 7.9 sec on average, and 0.1 sec median time.
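To illustrate why BFS lends itself to returning a shortest witnessing path, the following sketch evaluates a regular path query by breadth-first search over pairs of (graph node, automaton state), keeping a parent pointer per pair. It is a simplified illustration under stated assumptions, not MillenniumDB's implementation: the graph is an in-memory adjacency dictionary, the automaton is given directly as a transition table, and the names `bfs_rpq` and `rebuild_path` are hypothetical. Inverse edges, as allowed in 2RPQs, could be handled in this sketch by adding reversed edges under a separate label.

```python
from collections import deque


def rebuild_path(parent, product_state):
    """Follow parent pointers back to the source and return [node, label, node, ...]."""
    path = [product_state[0]]
    while parent[product_state] is not None:
        product_state, label = parent[product_state]
        path.append(label)
        path.append(product_state[0])
    path.reverse()
    return path


def bfs_rpq(graph, delta, start_state, final_states, source):
    """Return {target: one shortest witnessing path} for all targets reachable from source.

    graph: node -> list of (label, neighbor)
    delta: (automaton state, label) -> set of next automaton states
    """
    start = (source, start_state)
    parent = {start: None}  # visited product states and how each was first reached
    queue = deque([start])
    answers = {}
    while queue:
        current = queue.popleft()
        node, state = current
        if state in final_states and node not in answers:
            # BFS order guarantees the first accepting visit uses a shortest path.
            answers[node] = rebuild_path(parent, current)
        for label, neighbor in graph.get(node, ()):
            for next_state in delta.get((state, label), ()):
                nxt = (neighbor, next_state)
                if nxt not in parent:
                    parent[nxt] = (current, label)
                    queue.append(nxt)
    return answers


# Toy graph and an automaton for the expression p+ (one or more p-edges).
graph = {"a": [("p", "b"), ("q", "c")], "b": [("p", "c")], "c": [("p", "a")]}
delta = {(0, "p"): {1}, (1, "p"): {1}}
print(bfs_rpq(graph, delta, 0, {1}, "a")["c"])  # ['a', 'p', 'b', 'p', 'c']
```

Each product state is visited at most once, so the search terminates even on cyclic graphs, and the parent pointers add only constant overhead per visited state.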

Compared with other engines, MillenniumDB is generally the fastest, and has the most stable performance. Its average is near a second, i.e., five times faster than the next best contender (Virtuoso). Its median, below 0.1 seconds, is half the next one (Jena's). Even after removing the queries that timed out on the other systems, they are considerably slower than MillenniumDB. In particular, if we only consider the queries that run successfully on Virtuoso (i.e., excluding the 59 queries that timed out or gave an error), we get an average time of 0.85 seconds and a median time of 0.086 seconds on MillenniumDB: less than half the times of Virtuoso with these queries excluded. The boxplots further show the stability of MillenniumDB: the medians of other engines are above the third quartile of MillenniumDB. Their third quartile is 5 to 10 times higher than MillenniumDB's, and higher than its topmost whisker.

To further test robustness, we also ran all of the queries *without limiting the output size* on MillenniumDB. In this test, the engine timed out in only 15 queries, each returning between 800 thousand and 44 million results before timing out. When running queries to completion, MillenniumDB BFS averaged 13.4 seconds per query (8 seconds excluding timeouts), with a median of 0.1 seconds (both with and without timeouts).

### 6.3 Wikidata Complete

To show the scalability of MillenniumDB, and to properly leverage its domain graph, we ran experiments with a full version of Wikidata. We call this dataset *Wikidata Complete*, and base it on the Wikidata JSON dump^{⑪} version 20201102-all.json, which is preprocessed and mapped to our data model. In *Wikidata Complete*, we model qualifiers (i.e., edges on edges), put labels on objects, and assign them properties with values. We use properties to store the language value of each string in Wikidata, and also to model elements of complex data values (e.g., for coordinates we would have objects with properties latitude and longitude, and similarly for amounts, date/time, limits, etc.). Each object representing a complex data value also has a label specifying its data type (e.g., coord for geographical coordinates). All qualifiers were loaded. The only elements excluded from the full Wikidata data dump were sitelinks and references. This full version of Wikidata resulted in a knowledge graph with roughly 300 million objects, participating in 4.3 billion edges. The total size on disk of this data was 827GB in MillenniumDB, i.e., more than four times larger than *Wikidata Truthy*. More details about this dataset can be found in the online material accompanying this paper [26].
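The modeling choices described above (qualifiers as edges on edges, labels on objects, and properties with values) can be sketched with a small in-memory structure. This is purely illustrative and unrelated to MillenniumDB's on-disk storage: the class `PropertyDomainGraph`, its methods, and all identifiers and values in the example (`Q_item`, `P_located_at`, `e1`, and so on) are hypothetical placeholders rather than actual Wikidata content.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyDomainGraph:
    edges: dict = field(default_factory=dict)       # edge id -> (source, type, target)
    labels: dict = field(default_factory=dict)      # object -> set of labels
    properties: dict = field(default_factory=dict)  # object -> {property: value}

    def add_edge(self, edge_id, source, edge_type, target):
        """Add an edge; source and target may be any object, including an edge id."""
        if edge_id in self.edges:
            raise ValueError(f"edge id {edge_id!r} already in use")
        self.edges[edge_id] = (source, edge_type, target)

    def annotate(self, obj, label=None, **props):
        """Attach a label and/or property values to an object without adding edges."""
        if label is not None:
            self.labels.setdefault(obj, set()).add(label)
        self.properties.setdefault(obj, {}).update(props)

    def qualifiers(self, edge_id):
        """Edges whose source is the given edge, i.e., edges on an edge."""
        return {eid: e for eid, e in self.edges.items() if e[0] == edge_id}


g = PropertyDomainGraph()
g.add_edge("e1", "Q_item", "P_located_at", "c1")              # a statement
g.annotate("c1", label="coord", latitude=0.0, longitude=0.0)  # complex value as an object
g.add_edge("e2", "e1", "P_point_in_time", "d1")               # qualifier on statement e1
print(g.qualifiers("e1"))  # {'e2': ('e1', 'P_point_in_time', 'd1')}
```

Because the qualifier refers to the edge id `e1` directly, no intermediate statement node or reserved vocabulary is needed to attach it.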

We ran the same queries from the benchmarks (*Single*, *Multiple* and *Complex* BGPs, as well as *Paths*). The number of outputs on the two versions of the data, while not the same, was within the same order of magnitude when averaged over all the queries. The results are presented in Table 6. As we can observe, MillenniumDB shows no deterioration in performance when a larger database is considered for similar queries. This is mostly due to the fact that the buffer only loads the necessary pages into main memory, which likely requires a similar effort in both cases. We also note that, again, no queries resulted in a timeout over the larger dataset.

### 6.4 Discussion

Regarding (Q1), i.e., which join and path algorithms provide the best performance in this setting, we consider join and path algorithms in turn. For joins, we can conclude that the worst-case optimal join algorithm consistently outperforms the greedy and Selinger variants, most notably in the case of more complex graph patterns with many joins (wherein Jena LF, also worst-case optimal, was the next best competitor). Worst-case optimal joins use more space for indexing, but provide superior query runtimes. Regarding path algorithms, we see less difference between BFS and DFS: DFS is slightly faster for returning pairs of nodes connected by paths, while BFS is faster for returning paths.

Regarding (Q2)—i.e., how existing graph database systems compare with MillenniumDB—we found that MillenniumDB, when equipped with the best join and path algorithms, consistently outperforms other competitors in all query sets tested.

## 7. CONCLUSIONS AND LOOKING AHEAD

This paper presents MillenniumDB, an open-source graph database system with persistent storage implementing the novel (property) domain graph model.

Domain graphs adopt the natural idea of adding edge ids to directed labeled edges in order to concisely model higher-arity relations in graphs, as needed in Wikidata, without the need for reserved vocabulary or reification. They can naturally represent popular graph models, such as RDF and property graphs, and allow for combining the features of both models in a novel way. While the idea of using edge ids as a hook for modeling higher-arity relations in graphs is far from new (see, e.g., [23, 32, 33]), it is an idea that is garnering increased attention as a more flexible and concise alternative to reification. Our work proposes a formal data model that incorporates edge ids, a query language that can take advantage of them, and a fully-fledged graph database engine that supports them by design. We also propose to optionally allow (external) annotations on top of the graph structure, thus facilitating better compatibility with property graphs, whereby labels and property-values can be added to graph objects without adding new nodes and edges to the graph itself.

We have also proposed a new query language with a syntax inspired by Cypher, but that additionally enables users to take full advantage of the domain graph model by (optionally) referencing edge ids in their queries, and performing joins on any element of the domain graph. We further combine useful features present in both Cypher and SPARQL, in order to provide additional expressivity, such as returning the shortest path witnessing a result for a path query (as captured by a 2RPQ expression).

In the implementation of MillenniumDB, we combine tried-and-trusted techniques that have been successfully used in relational database pipelines for decades [42] (e.g., B+ trees, buffer managers, etc.) with promising state-of-the-art algorithms for computing worst-case optimal joins (leapfrog [38]) and evaluating path queries (guided by an automaton [49, 50]). Our experiments over Wikidata, considering real-world queries and data at large scale, show that this combination outperforms other persistent graph database engines that are commonly found in practice.

*Limitations.* Many of the current limitations of MillenniumDB relate to the fact that it is still under development. For example, at the moment, MillenniumDB only supports bulk loading of data; support for (incremental) updates is currently under investigation and development. Currently only the core features of query languages such as Cypher and SPARQL are supported; we are working on adding support for other features, including negation, value assignment, functions on datatypes, etc. MillenniumDB lacks some of the advanced features supported by other graph database systems, such as geographic, temporal and federated queries, keyword search features, etc. Finally, MillenniumDB does not yet support partitioning the graph over multiple machines in order to achieve horizontal scaling. We do not see such limitations as fundamental, but rather as features that can be added to the engine over time.

*Future work.* Looking to the future, we foresee extensions such as returning entire graphs, supporting more complex path constraints, returning sets of paths, and path algebra, to name just a few. Regarding more practical features, we aim to add support for full transactions, keyword search, a graph update language, existing graph query languages, and more besides. More importantly, given that MillenniumDB is published as an open source engine, we hope that the research community can view the MillenniumDB code base as a sandbox for incorporating their novel algorithms and ideas into a modern graph database, without the need to remake storage, indexing, access methods, or query parsers. Along these lines, we are currently working on adding an in-memory storage option to MillenniumDB using the ring [47]: a data structure based on the Burrows-Wheeler transform that supports worst-case optimal joins (over triples) in space similar to representing the graph itself. Initial tests show that the ring can store *Wikidata Truthy* in 50GB of space and improve median query times by a factor of 3, with average query times remaining similar. We are working on extending the ring to support edge ids and thus work with domain graphs. We also wish to explore the deployment of MillenniumDB for key use-cases; for example, we plan to provide and host an alternative query service for Wikidata, which may help to prioritize the addition of novel features and optimizations as needed in practice.

## ACKNOWLEDGEMENTS

This work was supported by ANID—Millennium Science Initiative Program—Code ICN17_002.

## AUTHOR CONTRIBUTIONS

Vrgoč acted as the project leader, designed the architecture of MillenniumDB, and adapted the theoretical algorithms used in this work for implementation purposes. Rojas was the lead engineer, working in designing the architecture of MillenniumDB, and implementing the majority of the system. Angles participated in writing the paper, designing the data model and designing the query language. Arenas participated in writing the paper, designing the data model and designing the query language. Arroyuelo participated in designing and analyzing the experiments, and writing the paper. Buil-Aranda participated in writing the paper, designing the data model, and designing experiments. Hogan participated in writing the paper, designing the data model, and designing experiments. Navarro participated in writing the paper and analyzing experiments. Riveros participated in writing the paper and designing the architecture of MillenniumDB. Romero helped in implementing path queries.

^{①} See, e.g., https://db-engines.com/en/ranking/graph+dbms

^{②} Domain graphs provide a concrete data model to use for representing knowledge graphs, and are thus an alternative to RDF graphs, property graphs, etc., but one that, as we will argue in Section 3, encapsulates such models while addressing some of their key limitations in this setting.

^{③} For simplicity, we will often not distinguish restrictions on different types of objects when not pertinent.

^{④} A proposed workaround involves adding intermediate nodes to denote different *occurrences* of quoted triples, but this requires a reserved term [29].

^{⑤} Herein, we say *“edge type”* rather than *“edge label”* to highlight that the type forms part of the edge, rather than being an annotation on the edge, as in property graphs.

^{⑥} To represent edges in property graphs that permit multiple labels, multiple edges with different types can be added (or the labels can be added on the edge ids).

^{⑦} While not the best option, based on empirical evidence (see Section 6), this solution seems to be adequate in practice.

^{⑧} Though TigerGraph meets the technical requirements, its license currently restricts benchmarking and thus it is excluded.

^{⑨} We also tried increasing the dbms.memory.pagecache.size parameter manually to 64GB, and setting dbms.memory.heap.initial_size and dbms.memory.heap.max_size to 30GB each, but the variation in the runtimes between the two settings was less than 0.5%. We believe that this is because in both cases, Neo4j manages to run in main memory without swapping, so varying these configurations has little effect.

^{⑩} In fact, MillenniumDB did not give any timeouts. However, we re-ran the experiments with a lower timeout, and observed that the system could recover gracefully from interrupting the query and was able to return the results found before being interrupted.

^{⑪} It is important to note that JSON and RDF dumps of Wikidata do not result in precisely the same knowledge graph due to some restrictions of the particular reification used in RDF; however, they do result in very similar knowledge graphs.