MillenniumDB: An Open-Source Graph Database System

ABSTRACT In this systems paper, we present MillenniumDB: a novel graph database engine that is modular, persistent, and open source. MillenniumDB is based on a graph data model, which we call domain graphs, that provides a simple abstraction upon which a variety of popular graph models can be supported, thus providing a flexible data management engine for diverse types of knowledge graph. The engine itself is founded on a combination of tried and tested techniques from relational data management, state-of-the-art algorithms for worst-case-optimal joins, as well as graph-specific algorithms for evaluating path queries. In this paper, we present the main design principles underlying MillenniumDB, describing the abstract graph model and query semantics supported, the concrete data model and query syntax implemented, as well as the storage, indexing, query planning and query evaluation techniques used. We evaluate MillenniumDB over real-world data and queries from the Wikidata knowledge graph, where we find that it outperforms other popular persistent graph database engines (including both enterprise and open source alternatives) that support similar query features.


Introduction
Recent years have seen growing interest in graph databases [4], wherein nodes represent entities of interest, and edges represent relations between those entities. In comparison with alternative data models, graphs offer a flexible and often more intuitive representation of particular domains [3]. Graphs forgo the need to define a fixed (e.g., relational) schema for the domain upfront, and allow for modeling and querying cyclical relations between entities that are not well-supported in other data models (e.g., tree-based models, such as XML and JSON). Graphs have long been used as an intuitive way to model data in domains such as social networks, transport networks, genealogy, biological networks, etc. Graph databases further enable specific forms of querying, such as path queries that find entities related by arbitrary-length paths in the graph. Graph databases have become popular in the context of NoSQL [11], where alternatives to relational databases are sought for specialized scenarios; Linked Data [20], [...]

Contributions. The contributions of this paper are as follows:
• the domain graph and property domain graph data models, which allow for succinctly representing graph data models popular in practice, including RDF graphs, RDF-star graphs, property graphs, and the Wikidata knowledge graph [49];
• a formal query language based on domain graphs that captures key features of popular query languages for graph databases, along with a concrete query syntax;
• an indexing scheme and query engine designed for domain graphs that incorporates both traditional and state-of-the-art techniques, with optimizations dedicated to the evaluation of graph patterns and path queries;
• experiments over the Wikidata knowledge graph [49], involving real-world graph data and queries, comparing algorithms internal to MillenniumDB as well as other graph database engines.
Our experimental results highlight the benefits, for example, of incorporating worst-case-optimal join algorithms when evaluating complex graph patterns (with many joins) versus a more traditional approach based on applying binary joins with a Selinger-based query engine. We further compare the performance associated with different graph search algorithms in the context of path queries. On a more practical note, we show that MillenniumDB, under optimal configurations, clearly outperforms prominent graph database systems (namely Blazegraph, Neo4j, Jena and Virtuoso), and we discuss why. We further publish a first release of MillenniumDB as an open source graph database engine [41], which we plan to extend in the future in order to support more query syntax, query features, transactional updates, index structures, and more.
Paper structure. The rest of this paper is structured as follows. In Section 2, we describe existing graph data models and their limitations. In Section 3, we propose domain graphs as an abstraction of these models used in MillenniumDB. In Section 4, we describe the query language of MillenniumDB, and how it takes advantage of domain graphs. In Section 5, we explain how MillenniumDB stores data and evaluates queries. In Section 6, we provide an experimental evaluation of the proposed methods on a large body of queries over the Wikidata knowledge graph. In Section 7, we provide some concluding remarks and ideas for future research.

Data Availability & Supplementary Material Statement
The source code of MillenniumDB is provided in full at [41]. Experimental data is given at [42].

Existing graph data models and their limitations
In this section we briefly recap the popular graph data models in use today, and discuss their limitations when modeling real-world datasets.

Graph data models
RDF and RDF*. One of the simplest models used for representing knowledge graphs is based on directed labeled graphs, composed of a set of edges, each of which links a source node to a target node and carries an edge label. Such graphs are the basis of RDF [12], where the source node, edge label and target node are called subject, predicate and object, respectively. Given a universe Obj of objects (ids, strings, numbers, IRIs, etc.), the RDF data model is defined as follows:

Definition 1. An RDF triple is an element $(s, p, o) \in \mathbf{Obj} \times \mathbf{Obj} \times \mathbf{Obj}$. An RDF graph is a finite set of RDF triples.
Upon analyzing this definition, we can immediately notice that the RDF data model lacks the ability to directly refer to the edge (s, p, o) itself. For instance, if we wanted to add the information about when the presidency represented by the above edge starts and when it ends, we would have to resort to some sort of reification, which would introduce an artificial object representing the edge that can then be linked to the start and end date information. For example, the (reified) triples representing the duration of this presidency could be represented as shown in Figure 2. The reification is given by the use of the edges labeled as source, label and target.

In order to avoid the need for reification, an extension of the RDF data model called RDF* (or RDF-star) was proposed [18]. Intuitively, in RDF* an entire triple can appear as a subject or an object in another triple. For example, in Figure 3, we are modeling the fact that Michelle Bachelet was the president of Chile from 2014-03-11 to 2018-03-11. The node representing the edge is called a quoted triple [19]. To distinguish edges that originate in a quoted triple, in Figure 3 we denote them with a dotted line. Formally, the RDF* data model can be defined as follows:

Definition 2 ([18]). An RDF* triple is defined recursively as follows:

• An RDF triple (s, p, o) is an RDF* triple; and
• If s, o are RDF* triples or elements of Obj, and p ∈ Obj, then (s, p, o) is an RDF* triple.

Another model extending RDF is that of RDF datasets [12], which are typically used to represent and manage multiple named RDF graphs. This model can be defined in two manners. The first, most general, definition permits empty graphs.

Definition 3. An RDF dataset is defined as a pair $D = (G, \{(n_1, G_1), \ldots, (n_k, G_k)\})$ where:
• $G, G_1, \ldots, G_k$ are RDF graphs; and
• $n_1, \ldots, n_k$ are objects such that $n_i \neq n_j$ for $1 \leq i < j \leq k$.
The graph G is called the default graph, while each pair $(n_i, G_i)$ is called a named graph, composed of the name $n_i$ and its corresponding RDF graph $G_i$.
For example, letting $G$ denote the RDF graph of Figure 1, and letting $G_1$ denote the RDF graph of Figure 2, we can capture both graphs separately in an RDF dataset of the form $D = (G, \{(n_1, G_1)\})$, where $n_1$ is a name (e.g., reified) used to reference the graph $G_1$. It is common to simply represent RDF datasets as a set of quads of the form $(s, p, o, g) \in \mathbf{Obj} \times \mathbf{Obj} \times \mathbf{Obj} \times \mathbf{Obj}$ [17], which indicates that the RDF triple (s, p, o) is in the RDF graph with name g. In this quad-based view, for example, $D$ would then contain the quad ($e_1$, source, Michelle Bachelet, reified). A special name can be reserved in order to denote the default graph; for example (Michelle Bachelet, position held, President of Chile, default). This quad-based definition cannot directly support naming empty RDF graphs (though it could be extended to incorporate a set of names for empty graphs).
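The relationship between the two definitions of RDF datasets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the reserved name "default", and the extra named graph called "empty" are hypothetical and do not come from any RDF library.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]
Quad = Tuple[str, str, str, str]

# Reserved graph name for the default graph (an assumption of this sketch).
DEFAULT = "default"


def dataset_to_quads(default_graph: Set[Triple],
                     named_graphs: Dict[str, Set[Triple]]) -> Set[Quad]:
    """Flatten a dataset (G, {(n_1, G_1), ...}) into a set of quads."""
    quads = {(s, p, o, DEFAULT) for (s, p, o) in default_graph}
    for name, graph in named_graphs.items():
        quads |= {(s, p, o, name) for (s, p, o) in graph}
    return quads


def graph_names(quads: Set[Quad]) -> Set[str]:
    """Graph names that can be recovered from the quad-based view."""
    return {g for (_s, _p, _o, g) in quads}


# Toy data taken from the running text; "empty" is a hypothetical empty graph.
G = {("Michelle Bachelet", "position held", "President of Chile")}
G1 = {("e1", "source", "Michelle Bachelet")}
quads = dataset_to_quads(G, {"reified": G1, "empty": set()})
```

Here the name "empty" contributes no quads, so it cannot be recovered from the flattened representation, which is the limitation of the quad-based definition noted above.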
Property graphs. Finally, one of the more popular graph data models is that of property graphs [15]. Property graphs extend the simple edge-labeled directed graph with two additional features: (i) they assign explicit identifiers to nodes and edges, so that one can refer to them; and (ii) they allow for annotating both nodes and edges with a set of property-value pairs. For example, the information from Figure 3 can be equivalently represented by the property graph in Figure 4. Here the nodes have identifiers ($n_1$, $n_2$) as well as labels (human, public office). Similarly, edges have both identifiers ($e_1$) and labels (position held). A node can have multiple labels, while an edge always has a single label (often referred to as its type). The edge $e_1$ has two properties, namely start date and end date, each with an associated value. Formally, if Obj is a set of objects, $\mathbf{L}$ is a set of labels, $\mathbf{P}$ a set of properties, and $\mathbf{V}$ a set of values (not to be confused with the set $V$ of nodes below), we define the property graph data model as follows:

Definition 4. A property graph is a tuple $G = (V, E, \mathrm{src}, \mathrm{tgt}, \mathrm{lab}, \mathrm{prop})$, where:
• $V \subset \mathbf{Obj}$ is a finite set of node identifiers;
• $E \subset \mathbf{Obj}$ is a finite set of edge identifiers disjoint from $V$;
• $\mathrm{src} : E \to V$ assigns a source node to each edge;
• $\mathrm{tgt} : E \to V$ assigns a target node to each edge;
• $\mathrm{lab} : (V \cup E) \to 2^{\mathbf{L}}$ is a function assigning a finite set of labels to nodes and edges, with $|\mathrm{lab}(e)| = 1$ for all $e \in E$; and
• $\mathrm{prop} : (V \cup E) \times \mathbf{P} \to \mathbf{V}$ is a partial function assigning a value to a certain property of a node or an edge.
Moreover, we assume that for each object o ∈ V ∪ E, there exists a finite number of properties p ∈ P such that prop(o, p) is defined.
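To make the constraints of Definition 4 concrete, the following Python sketch encodes a property graph and checks the conditions stated in the definition. The class and field names are our own, and the assignment of labels to $n_1$ and $n_2$ is an assumption made for illustration; only the identifiers, labels, property names and dates are taken from the running text.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class PropertyGraph:
    nodes: Set[str]                 # V
    edges: Set[str]                 # E
    src: Dict[str, str]             # src : E -> V
    tgt: Dict[str, str]             # tgt : E -> V
    lab: Dict[str, Set[str]] = field(default_factory=dict)         # lab
    prop: Dict[Tuple[str, str], str] = field(default_factory=dict)  # prop (partial)

    def check(self) -> None:
        """Raise ValueError if a condition of Definition 4 is violated."""
        if self.nodes & self.edges:
            raise ValueError("node and edge identifiers must be disjoint")
        for e in self.edges:
            if self.src.get(e) not in self.nodes or self.tgt.get(e) not in self.nodes:
                raise ValueError(f"edge {e} needs a source and target in V")
            if len(self.lab.get(e, set())) != 1:
                raise ValueError(f"edge {e} must have exactly one label")
        for (obj, _p) in self.prop:
            if obj not in self.nodes | self.edges:
                raise ValueError(f"property set on unknown object {obj}")


# The presidency example; which node carries which label is assumed here.
g = PropertyGraph(
    nodes={"n1", "n2"},
    edges={"e1"},
    src={"e1": "n1"},
    tgt={"e1": "n2"},
    lab={"n1": {"human"}, "n2": {"public office"}, "e1": {"position held"}},
    prop={("e1", "start date"): "2014-03-11", ("e1", "end date"): "2018-03-11"},
)
g.check()
```

Note that property values such as "2014-03-11" are plain values here and never members of the node set, which is the restriction discussed in the next subsection.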

Limitations of existing models
While all of the described data models have great expressive power, they are sometimes cumbersome to use when representing real-world datasets that contain higher-arity relations. To illustrate this, we will use the Wikidata [49,34,23] knowledge graph. Consider the two Wikidata statements shown in Figure 5. Both statements claim that Michelle Bachelet was a president of Chile, and both are associated with nested qualifiers that provide additional information: in this case a start date, an end date, who replaced her, and whom she was replaced by. There are two statements for two distinct presidencies. Also the ids for objects (for example, Q320 and P39) are shown; any positional element can have an id and be viewed as a node in the knowledge graph. As aforementioned, representing statements like this in RDF graphs requires reification to decompose n-ary relations into binary relations [21]. Figure 6 shows a graph where $e_1$ and $e_2$ are nodes representing two distinct n-ary relationships (an extended version of Figure 2). For greater readability, we use human-readable nodes and labels, where in practice, the node Sebastián Piñera will rather be given as the identifier Q306, and the edge type "replaces" will rather be given as "P155".
Since property graphs allow labels and property-value pairs to be associated with both nodes and edges, reification can be avoided in our example. For instance, the statements of Figure 5 can be represented as the property graph in Figure 7. Though more concise than reification, labels, properties and values are considered to be simple strings, which are disjoint from nodes; for example, Ricardo Lagos is neither a node nor [...] only represent one of the statements (without reification), as we can only have one distinct node per edge; if we add the qualifiers for both statements, then we would not know which start date pairs with which end date, for example.

Regarding RDF datasets, we could model both statements by creating two named graphs, each with a copy of the statement that Michelle Bachelet has been President of Chile, thereafter defining the start date, end date, replaced by and replaces annotations in another graph using the graph name. The resulting quads could thus be as follows if we define the latter information in the default graph (for example):

This is quite a concise way to model the aforementioned Wikidata statements, wherein we effectively use graph names to assign each edge a unique id that serves as a graph node elsewhere. Indeed, the data model we propose follows a similar idea. However, RDF datasets were defined in the context of managing several (named) graphs, where using them to define edge ids gives rise to several complications; for example, SPARQL does not support evaluating path queries that span different named graphs.

Data model underlying MillenniumDB
In this section, we present the graph data model upon which MillenniumDB is based, called domain graphs, and discuss how it generalizes existing graph data models such as RDF and property graphs. We also show its utility in concisely modeling real-world knowledge graphs that contain higher-arity relations, such as Wikidata [49].

Domain Graphs
The structure of knowledge graphs is captured in MillenniumDB via domain graphs, which follow the natural idea of assigning ids to edges in order to capture higher-arity relations within graphs [21,25,26,5]. Formally, assume a universe Obj of objects (ids, strings, numbers, IRIs, etc.). We define domain graphs as follows:

Definition 5. A domain graph $G = (O, \gamma)$ consists of a finite set of objects $O \subseteq \mathbf{Obj}$ and a partial mapping $\gamma : O \to O \times O \times O$.

Intuitively, O is the set of database objects and γ models edges between objects. If γ(e) = ($n_1$, t, $n_2$), this states that the edge ($n_1$, t, $n_2$) has id e, type t, and links the source node $n_1$ to the target node $n_2$. We can analogously define our model as a relation:

DOMAINGRAPH(source, type, target, eid)

where eid (edge id) is a primary key of the relation.
The domain graph model of MillenniumDB already subsumes the RDF graph model [12]. Recall that an RDF graph is a set of triples of the form (a, b, c). To show how RDF is modeled in domain graphs, consider again the RDF triple from [...]

The id of the edge itself is not needed in the RDF data model, but it can be used for modeling RDF-star (RDF*) graphs [18,19]. For example, to represent the RDF* graph from Figure 8, we can extend the function γ with two additional statements:

Here we use two new edges, $e_1$ and $e_2$, which have the edge e as their starting node.
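As a minimal sketch of the model (not the storage layout used by MillenniumDB), a domain graph can be held as a Python dictionary from edge ids to (source, type, target) triples, which directly mirrors the mapping γ and the relation DOMAINGRAPH. The class name, method names and the concrete edge contents below are assumptions for illustration, based on the presidency example and its start and end dates.

```python
class DomainGraph:
    """Toy domain graph: gamma maps an edge id to (source, type, target)."""

    def __init__(self):
        self.objects = set()  # O
        self.gamma = {}       # partial mapping: eid -> (source, type, target)

    def add_edge(self, eid, source, etype, target):
        # eid is the primary key of DOMAINGRAPH, so it cannot be reused.
        if eid in self.gamma:
            raise ValueError(f"edge id {eid} already defined")
        self.objects.update((eid, source, etype, target))
        self.gamma[eid] = (source, etype, target)

    def rows(self):
        """Tuples of the relation DOMAINGRAPH(source, type, target, eid)."""
        return sorted((s, t, o, e) for e, (s, t, o) in self.gamma.items())


g = DomainGraph()
# A plain RDF-style triple; its id "e" is simply unused by RDF.
g.add_edge("e", "Michelle Bachelet", "position held", "President of Chile")
# RDF*-style statements about the edge: the edge id appears as a source node.
g.add_edge("e1", "e", "start date", "2014-03-11")
g.add_edge("e2", "e", "end date", "2018-03-11")
```

Because edge ids are ordinary members of the object set, the same identifier can be used as a key in one tuple and as a source (or target) in another, with no reification vocabulary.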
For stricter backwards compatibility with legacy property graphs (where desired), MillenniumDB implements a simple extension of the domain graph model, called property domain graphs, which allows for external annotation, i.e., adding labels and property-value pairs to nodes and edges without creating new nodes and edges. Formally, if $\mathbf{L}$ is a set of labels, $\mathbf{P}$ a set of properties, and $\mathbf{V}$ a set of values, we define a property domain graph as follows:

Definition 6. A property domain graph is defined as a tuple $G = (O, \gamma, \mathrm{lab}, \mathrm{prop})$, where:
• $\mathrm{lab} : O \to 2^{\mathbf{L}}$ is a function assigning a finite set of labels to an object; and
• $\mathrm{prop} : O \times \mathbf{P} \to \mathbf{V}$ is a partial function assigning a value to a certain property of an object.

Moreover, we assume that for each object $o \in O$, there exists a finite number of properties $p \in \mathbf{P}$ such that $\mathrm{prop}(o, p)$ is defined.
Domain graphs (without properties) can directly capture property graphs: for example, the property-value pair (gender, "female") on node $n_1$ can be represented by an edge γ($e_3$) = ($n_1$, gender, female), the property-value pair (order, "2") on an edge $e_2$ becomes γ($e_4$) = ($e_2$, order, 2), the label on $n_1$ becomes γ($e_5$) = ($n_1$, label, human), the label on $e_1$ becomes the type of the edge γ($e_1$) = ($n_1$, father, $n_2$), etc. However, this can generate "incompatibilities" between the legacy property graph and the resulting domain graph; for example, strings like "male", labels like human, etc., now become nodes in the graph, generating new paths through them that may affect query results. Property domain graphs thus offer an extra layer of flexibility, and interoperability with legacy property graphs, where needed for a given use-case.

To illustrate how property domain graphs work, consider the property graph (as introduced in Definition 4) from Figure 9. To model this information via property domain graphs, we use the domain graph part to capture the graph structure of our model, while property domain graphs also permit annotating that graph structure with labels and property-value pairs. The property graph in Figure 9 can be represented with the following property domain graph G = (O, γ, lab, prop), where the graph structure is as follows:

and the annotations of the graph structure are as follows:

prop(n1, gender) = "female"
prop(n1, children) = "3"
prop(n1, first name) = "Michelle"
prop(n2, gender) = "male"
prop(n2, children) = "2"
prop(n2, first name) = "Alberto"
prop(n2, last name) = "Bachelet"
prop(n2, death) = "12 March 1974"
prop(e2, order) = "2"

The relational representation of a property domain graph then adds two new relations alongside DOMAINGRAPH:

LABELS(object, label), PROPERTIES(object, property, value)

where (object, property) is a primary key of the second relation, with the first relation allowing multiple labels per object.
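The three relations can be sketched as follows in Python, purely to illustrate the key constraints and the effect of external annotation. The class and method names are hypothetical, the object set is derived from the edges only (a simplification), and the sample data reuses a few of the values mentioned above.

```python
class PropertyDomainGraph:
    """Toy version of DOMAINGRAPH, LABELS and PROPERTIES."""

    def __init__(self):
        self.domain_graph = {}  # eid -> (source, type, target); eid is the key
        self.labels = set()     # (object, label); several labels per object
        self.properties = {}    # (object, property) -> value; key is the pair

    def add_edge(self, eid, source, etype, target):
        if eid in self.domain_graph:
            raise ValueError(f"edge id {eid} already defined")
        self.domain_graph[eid] = (source, etype, target)

    def add_label(self, obj, label):
        self.labels.add((obj, label))

    def set_property(self, obj, prop, value):
        key = (obj, prop)
        if key in self.properties and self.properties[key] != value:
            raise ValueError(f"{key} already has a different value")
        self.properties[key] = value

    def objects(self):
        """Objects reachable through the graph structure (simplified O)."""
        objs = set()
        for eid, triple in self.domain_graph.items():
            objs.add(eid)
            objs.update(triple)
        return objs


g = PropertyDomainGraph()
g.add_edge("e1", "n1", "father", "n2")
g.add_label("n1", "human")
g.set_property("n1", "gender", "female")
g.set_property("n1", "first name", "Michelle")
g.set_property("n2", "first name", "Alberto")
```

In this sketch the value "female" and the label human live only in the annotation relations, so they do not become nodes and cannot introduce new paths, in contrast to the pure domain graph encoding.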

Domain graphs compared with other graph data models
Why did we choose (property) domain graphs as the model of MillenniumDB? As discussed in the previous section, it can be used to model both directed labeled graphs (like RDF) as well as property graphs. It also has a natural relational expression, which facilitates its implementation in a query engine. But it is also heavily inspired by the needs of real-world knowledge graphs like Wikidata [49,34,23]. To illustrate its versatility, consider again the Wikidata statements shown in Figure 5. Note that edges that originate in another edge are drawn with a dotted line.
As discussed in Section 2, neither RDF nor RDF* can represent these statements without resorting to reification, while property graphs cannot take nodes as values for properties. The domain graph model allows us to capture higher-arity relations more directly. In Figure 10 we present one possible representation of the statements from Figure 5. We only show edge ids as needed (all edges have ids). We do not use the "property part" of our data model for external annotation, considering that the elements of Wikidata statements shown can form nodes in the graph itself.
Domain graphs are similar to named graphs in RDF datasets. Both domain graphs and RDF datasets can be represented as quads. However, the edge ids of domain graphs identify each quad, which, as we will discuss in Section 5, necessitates fewer index permutations. RDF datasets were proposed to represent multiple RDF graphs for publishing and querying. SPARQL thus does not support querying paths that span different named graphs; to support path queries over singleton named graphs, all edges would need to be duplicated (virtually or physically) into a single graph [21]. Named graphs could be supported in domain graphs using a reserved term graph, and edges of the form γ($e_3$) = ($e_1$, graph, $g_1$), γ($e_4$) = ($e_2$, graph, $g_1$); optionally, named domain graphs could be considered in the future to support multiple domain graphs.

The idea of assigning ids to edges/triples for similar purposes as described here is a natural one, and not new to this work. Hernandez et al. [21] explored using singleton named graphs in order to represent Wikidata qualifiers, placing one triple in each named graph, such that the name acts as an id for the triple. In parallel with our work, a data model analogous to domain graphs has been independently proposed for use in Amazon Neptune, which the authors call 1G [26]. Their proposal does not discuss a formal definition for the model, nor a query language, storage and indexing, implementation, etc., but the reasoning and justification that they put forward for the model is similar to ours. Similar such models have been generalized as multilayer graphs [5], where the appearance of edge ids within the graph induces different layers of reference. Our work proposes a novel query language, storage and indexing schemes, and a query planner, and ultimately a fully-fledged graph database engine, built specifically for this model. Furthermore, with property domain graphs, we support annotation external to the graph, which we believe to be a useful extension that enables better compatibility with property graphs.

[...] (all features except External annotation can be supported in all models with reserved vocabulary). Reserved terms can add indirection to modeling (e.g., reification [21]), and can clutter the data, necessitating more tuples or higher-arity tuples to store, leading to more joins and/or index permutations. The features are then defined as follows, considering directed (labeled) edges:
• Edge type/label: assign a type or label to an edge.
• Node label: assign labels to nodes.
• Edge annotation: assign property-value pairs to an edge.
• Node annotation: assign property-value pairs to a node.
• External annotation: nodes/edges can be annotated without adding new nodes or edges.
• Edge as node: an edge can be referenced as a node (this allows edges to be connected to nodes of the graph).
• Edge as nodes: a single unique edge can be referenced as multiple nodes.
• Nested edge nodes: an edge involving an edge node can itself be referenced as a node, and so on, recursively.
• Graph as node: a graph can be referenced as a node.
Some unsupported features in Table 1 are more benign than others; for example, Node label requires a reserved term (e.g., rdf:type), but no extra tuples; on the other hand, Edge as node requires reification, using at least one extra tuple, and also a reserved term.
Wikidata requires Edge as nodes as per Figure 5, where values like Ricardo Lagos are themselves nodes. Only RDF datasets, domain graphs and property domain graphs can model such examples without reserved terms; however, the use of RDF datasets requires co-opting graph names, which are typically used to manage multiple graphs, to rather serve as edge ids. Comparing RDF datasets and domain graphs, the latter sacrifices the "Graph as node" feature without reserved vocabulary to reduce indexing permutations (discussed in Section 5). Property domain graphs further support external annotation, and better compatibility with legacy property graphs.

Query language
Per our goal of supporting multiple graph models, MillenniumDB aims to support a number of graph query languages. However, no existing query language would take full advantage of the property domain graph model defined in the previous section. We have thus implemented a base query language, called DGQL, which closely resembles Cypher [15], but is designed for the property domain graph model, and adds features of other query languages, such as SPARQL, that are commonly used for querying knowledge graphs [10,23]. Herein we provide a guided tour of the syntax of DGQL. A full formal specification of the language can be found in the appendix of this paper.
To introduce the features of the query language, in Figure 11 we present (a snippet of) a bibliographical knowledge graph representing data about publications, authors, institutions, etc. The knowledge graph is represented as a property domain graph, where, for authorship relations, we use properties on the edge to indicate the author order, but directly link the edges (via their ids) to the organization node with which the author was affiliated for that particular paper (something not directly possible in property graphs). We further use abstract node and edge ids ($n_1, \ldots, n_{15}, e_1, \ldots, e_{21}$) for brevity, though these may be instantiated with application ids; for example, in Wikidata, the node $n_{15}$ denoting the U.S. might rather have the id Q30.
We will use this knowledge graph as a running example in order to illustrate the MillenniumDB query language in the context of a bibliographical use-case, where we wish to analyze citations, find possible collaborators, etc.

Domain Graph Queries
A DGQL query takes the following high-level form:

Querying objects. The most basic query will return all the objects (or more precisely, their ids) in our property domain graph. In MillenniumDB we can achieve this via the following query:

MATCH (?x) RETURN ?x
Over the knowledge graph of Figure 11, this would return a table of all node and edge ids: $n_1, \ldots, n_{15}, e_1, \ldots, e_{21}$. Of course, one usually wants to select objects with a certain label, or a certain value in a specific property, as illustrated in the following example.

EXAMPLE 4.1. The following DGQL query returns articles published in 1967 from Figure 11:

This returns the ids of nodes with label article and value 1967 for the property year, along with their value for the property name, i.e., we return two results as follows:

"Pr.Lang.for Aut."

If, for example, $n_3$ did not have a name, we would still return $n_3$ as a result, leaving the corresponding value for ?x.name blank.
If we wish to specify a range, we can rather use the WHERE clause, which allows us to specify conditions on the results returned.

EXAMPLE 4.2. If we want to find articles published before 1990, we can use the following query:

MATCH (?x :article) WHERE ?x.year < 1990 RETURN ?x, ?x.name

This returns the same solutions over Figure 11 as in Example 4.1. If we were to replace "<" with "<=", we would receive a third result for $n_5$ and "Add.Machines".
Querying edges. In order to query over edges, we can write the following query, which returns γ, i.e., the relation DOMAINGRAPH:

MATCH (?x)-[?e :?t]->(?y) RETURN *

The RETURN * operation projects all variables specified in the MATCH pattern, while the construct (?x)-[?e :?t]->(?y) specifies that we want to connect the object in ?x with an object in ?y, via an edge with type ?t and id ?e. This is akin to a query DOMAINGRAPH(?x, ?t, ?y, ?e) over the domain graph relation. Variable or constant edge types (e.g., ?t above) are prefixed by a colon.

EXAMPLE 4.3. Over the graph of Figure 11, the aforementioned query would return results of the following form, with 21 results in total (one for each edge):

This is akin to returning the DOMAINGRAPH relation.
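The correspondence between an edge pattern and a query over DOMAINGRAPH can be sketched with a small matcher in Python. This is an illustrative nested scan, not how MillenniumDB evaluates patterns (the engine uses indexes, described later); the function name and the three toy rows are hypothetical and are not taken from Figure 11.

```python
def match_edge(rows, pattern):
    """Match one edge pattern against DOMAINGRAPH rows.

    rows:    iterable of (source, type, target, eid) tuples
    pattern: 4-tuple in the same order; terms starting with "?" are variables
    Returns a list of bindings (dicts from variable name to value).
    """
    results = []
    for row in rows:
        binding = {}
        ok = True
        for term, value in zip(pattern, row):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            results.append(binding)
    return results


# Hypothetical toy rows: (source, type, target, eid)
rows = [
    ("a1", "cites", "a2", "e1"),
    ("a2", "cites", "a3", "e2"),
    ("a1", "author", "p1", "e3"),
]

all_edges = match_edge(rows, ("?x", "?t", "?y", "?e"))       # like RETURN *
cites_only = match_edge(rows, ("?x", "cites", "?y", "?e"))   # constant type
```

With all four positions as variables, every row yields one binding, which corresponds to returning the whole relation; fixing the type restricts the result to edges of that type.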
We can also restrict which edges are matched, as shown in the following example.
EXAMPLE 4.4. The following query in DGQL will return the ids and names of articles that cite an article of the same year:

MATCH (?x)-[:cites]->(?y) WHERE ?x.year == ?y.year RETURN ?x, ?x.name

Here we choose to omit the edge id variable as we do not need it (e.g., in the WHERE or RETURN clause). Over the knowledge graph of the running example, this returns a single result:

"Pr.Lang.for Aut."

In the next example, we illustrate two features together: the ability to return and specify conditions on edge properties, and the ability to query known objects.

As per the previous example, the WHERE clause may use Boolean combinations. We recall that the running example uses abstract node ids for brevity. In practice, the node id $n_{11}$ could rather be an id such as Q17457, which identifies Donald Knuth on Wikidata.

Path queries.
A key feature of graph databases is their ability to explore paths of arbitrary length. DGQL supports two-way regular path queries (2RPQs), which specify regular expressions over edge types, including concatenation (/), disjunction (|), inverses (^), optional (?), Kleene star (*) and Kleene plus (+). We use =[]=> (rather than -[]->) to signal a path query in DGQL.

EXAMPLE 4.6. If we wish to find all of the citations of the article named "Add.Machines", and their respective citations, and so on transitively, we can use the regular expression :cites+ in the following way, further returning the name and year of the articles where available:

Notice that a shortest path to each node is returned, so an additional path to e.g., $n_4$ using cites is not returned.
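The behaviour described for :cites+ (one shortest path per reachable node) can be illustrated with a breadth-first search in Python. This is a sketch for a single edge type under Kleene plus only, not the general 2RPQ evaluation implemented in MillenniumDB; the function name and the toy edges are hypothetical.

```python
from collections import deque


def shortest_paths_plus(edges, start, etype):
    """One shortest path per node reachable from start via etype+.

    edges: iterable of (source, type, target, eid) tuples.
    Returns a dict: reached node -> [start, eid, node, eid, node, ...].
    """
    adjacency = {}
    for source, t, target, eid in edges:
        if t == etype:
            adjacency.setdefault(source, []).append((eid, target))

    paths = {}
    queue = deque([(start, [start])])
    while queue:
        node, path = queue.popleft()
        for eid, nxt in adjacency.get(node, []):
            # BFS reaches each node first via a shortest path; the first
            # path found is kept and later alternatives are ignored.
            if nxt not in paths:
                paths[nxt] = path + [eid, nxt]
                queue.append((nxt, paths[nxt]))
    return paths


# Hypothetical citations: a cites b and c directly, and b also cites c.
edges = [
    ("a", "cites", "b", "e1"),
    ("a", "cites", "c", "e2"),
    ("b", "cites", "c", "e3"),
]
paths = shortest_paths_plus(edges, "a", "cites")
```

In this toy graph the node c is reported once, via the direct edge e2, and the longer path through b is not returned. Since the expression is a Kleene plus, the start node itself appears in the result only if it lies on a cycle.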
The final example for paths illustrates operators nested inside a Kleene star.

A path that cycles back to Donald Knuth ($n_{11}$) is included. If we wished to filter such results, we could use WHERE to require the inequality ?y != $n_{11}$. Also there are two possible shortest paths for the $n_{11}$ result (via $n_4$ or via $n_5$), where the first such path to be found is returned.
If we wished to return all shortest paths, we could use the DGQL keyword ALL before the path variable. For instance, in the previous example, we can write [ALL ?p ...], which returns a second result for $n_{11}$ indicating the other shortest path.
Unlike Cypher, we can return paths matching 2RPQs, not just Kleene star. Unlike SPARQL, we can return paths, not just pairs of nodes. No manipulation of path variables, apart from outputting the result, is currently supported in MillenniumDB, but a full path algebra will be supported in future versions.

Basic graph patterns. Basic graph patterns [3] lie at the core of many graph query languages, including DGQL. Such graphs follow the same structure as the data model, but allow variables in any position. They can be seen as expressing natural (multi)joins over sets of atomic edge patterns. In DGQL, they are given in the MATCH clause. Basic graph patterns are evaluated under homomorphism-based semantics [3], which allows multiple variables in a result to map to the same element of the data. If we evaluate this query over the running example, we get the following results:

Given the homomorphism-based semantics, results are returned that map both variables to the same author. If we wished to filter such results, we could stipulate the desired inequalities with WHERE ?x != ?y, which would filter the first, second, fifth and eighth result.
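Homomorphism-based evaluation of a basic graph pattern can be sketched as a naive join over edge patterns in Python. This nested-loop evaluation is an assumption made for illustration and is not the worst-case-optimal join strategy used by MillenniumDB; the function names and the toy authorship edges are hypothetical and edge ids are omitted for brevity.

```python
def _bind(binding, term, value):
    """Bind a variable (or compare a constant) against a value."""
    if term.startswith("?"):
        return binding.setdefault(term, value) == value
    return term == value


def evaluate_bgp(edges, patterns):
    """Evaluate a list of (source, type, target) patterns as a multi-join.

    Homomorphism-based semantics: distinct variables may bind to the
    same object; only repeated occurrences of one variable must agree.
    """
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for solution in solutions:
            for edge in edges:
                candidate = dict(solution)
                if all(_bind(candidate, term, value)
                       for term, value in zip(pattern, edge)):
                    extended.append(candidate)
        solutions = extended
    return solutions


# Hypothetical data: ann and bob wrote paper1; ann alone wrote paper2.
edges = [
    ("ann", "author", "paper1"),
    ("bob", "author", "paper1"),
    ("ann", "author", "paper2"),
]
coauthors = evaluate_bgp(edges, [("?x", "author", "?z"), ("?y", "author", "?z")])
distinct = [s for s in coauthors if s["?x"] != s["?y"]]  # like WHERE ?x != ?y
```

On this toy data the join yields five solutions, three of which map ?x and ?y to the same person; the inequality filter leaves the two genuine co-author pairs.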
The previous example could equivalently be expressed as a path of the form ?x=[:author/^:author]=>?y. However, with basic graph patterns, we can also capture branches and cycles, as illustrated in the following example.

The DGQL query language allows us to take full advantage of domain graphs by allowing joins between edges, types, etc., as illustrated by the following example.

EXAMPLE 4.12. The following query looks for articles with an affiliation that is current, i.e., where an author is still staff at the indicated organization:

Results for $n_9$ and $n_{10}$ are still returned though they are not currently staff at any organization; the corresponding variables are left blank. Nested optional patterns are also supported. However, optional patterns must form well-designed patterns [31].
Limits and ordering.Some additional operators that MillenniumDB supports are LIMIT and ORDER BY.These allow us to limit the number of output mappings, and sort the obtained results, as illustrated by the following example.
The result returned is as follows:

?x      ?x.name
$n_5$   "Add.Machines"

Ordering is always applied before limiting results.
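The order in which the two operators are applied matters, as the following Python sketch shows. The helper name, the column names and the three rows are hypothetical and only serve to contrast the two evaluation orders.

```python
def order_then_limit(rows, key, limit, descending=False):
    """Sort the full result first, then keep the first `limit` rows."""
    ordered = sorted(rows, key=lambda row: row[key], reverse=descending)
    return ordered[:limit]


# Hypothetical solution mappings with a year column.
rows = [
    {"?x": "x1", "year": 1967},
    {"?x": "x2", "year": 1990},
    {"?x": "x3", "year": 1967},
]

# Ordering before limiting: the most recent article overall.
top = order_then_limit(rows, key="year", limit=1, descending=True)

# Limiting before ordering (not the semantics described above) would
# instead sort only the first row that happened to be produced.
wrong = sorted(rows[:1], key=lambda row: row["year"], reverse=True)
```

Sorting first guarantees that the limit selects from the complete ordered result rather than from an arbitrary prefix.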

Formal definitions for DGQL
For readers interested in a formal specification, we provide the full definition of DGQL, together with the associated semantics, in the appendix to this paper.Specifically, in Appendix A we provide a grammar for DGQL queries, and in Appendix B we define (an equivalent) abstract syntax of DGQL and formal semantics of the language.
Every query language considered in Table 2 supports the notion of a basic graph pattern (BGP), which, in its most general form, is a graph pattern structured like the data, but allowing variables to replace constants.In most cases, the result of a basic graph pattern is a relation (or table) consisting of results, and in some cases it is possible to construct/return a graph (like in G-CORE and SPARQL).
Considering that a graph pattern extracts a table from a graph (as seen in the examples of Section 4.1), relational graph patterns (RGPs) allow the use of relational-based operators to combine the results of one or more graph patterns into a single relation. Full support of this feature in Table 2 indicates that a language provides join, optional, union and negation of graph patterns. Partial support indicates that a language supports some of these operators, usually join and optional graph patterns, as is the case for DGQL (we plan to extend this in the future to support more relational operators).

Querying edges (QE) is a particular feature of DGQL, allowing for querying relationships involving edges. Notably, DGQL allows an id to be extracted as an edge in one part of the query, and then used as a node in another part. Other query languages provide partial support for querying edges, as they are restricted to query the labels and properties of the edges, require reserved vocabulary (reification), or have other restrictions (e.g., using named graphs in SPARQL over which paths cannot be resolved).
The regular path queries (RPQs) feature refers to matching paths based on (2-way) regular expressions, with concatenation, disjunction, inverse, optional and Kleene star. Partial support indicates that a language offers a restricted group of such operators, such as in the case of Cypher, which supports only Kleene star on top of a single edge type, and not over a subexpression, thus supporting an expression such as cites+, but not a more complex expression, such as (author/^author)+.
We use the term navigational graph patterns (NGPs) to represent the combination of basic graph patterns and regular path queries. These queries are akin to conjunctive (2-way) regular path queries. NGPs are supported by DGQL, SPARQL and G-CORE.
Finally, a query language with full path recovery allows not only searching for paths, but also returning such paths as objects that can be manipulated (with the nodes and edges in a path). This is a particular feature of G-CORE as it supports path construction operations, and the data model permits storing paths. In Cypher, the resulting paths can be assigned to a variable, so the elements of each path can be accessed by using ad-hoc functions, although with reduced facilities. SPARQL does not support this feature as the output of a path expression is only the start and end nodes of each path. In GSQL and Gremlin, the result of a path query is a set of objects, so the resulting paths must be processed by using a programming language. Currently, DGQL partially supports this feature by returning a path as a string; however, MillenniumDB has been designed to support path manipulation in the future.
Table 2 focuses on core features for querying graphs [3], and thus omits features (e.g., borrowed from SQL) that are supported by some of the languages, and that are potentially very useful in practice, such as aggregations, solution modifiers, federation, etc. Such features can be layered atop the features mentioned.

System architecture
In this section, we describe the internals of the MillenniumDB engine, which have been designed to efficiently support the domain graph model. The overall architecture of the system is presented in Figure 12, and will be explained in the following.
MillenniumDB is founded on tried and tested relational techniques: it stores the (property) domain graph model as several relations indexed in B+ trees, loading parts into main memory as needed using a fixed-size buffer. It also uses algorithmic techniques recently suggested in the theoretical literature for evaluating queries [48,8] (techniques not typically implemented in graph database systems) for supporting the domain graph model in practice. Specifically, we combine three different techniques that are new to the architecture of graph database systems when used in conjunction. First, the data model is encoded as basic relations, indexed following different attribute orders, wherein data objects (e.g., nodes, strings) are represented by ids. Second, we translate the evaluation of any query to several joins between basic relations, which we manage using worst-case optimal join algorithms [30]: an evaluation technique recently proposed for relational database systems. Last, we combine join algorithms with the evaluation of path queries by compiling the path pattern into an automaton and running the query on the fly. These techniques, together, are at the heart of how MillenniumDB optimizes queries over the domain graph model in practice.
In what follows, we explain how one can store (property) domain graphs and index them. We then outline the query evaluation process and the algorithmic techniques it uses, like the worst-case optimal query plan and the evaluation of path queries.
• Nodes, which are objects in the range of γ. They are divided into two subclasses: named nodes, which are objects in the domain graph for which an explicit name is available (e.g., Q320 in Wikidata), and anonymous nodes, which are internally generated objects without an explicit name available to the user (similar to blank nodes in RDF [22]).
• Edges, which are objects in the domain and range of γ, and are always anonymous, internally generated objects.
• Values, which are data objects like strings, integers, etc. These values are classified in two subclasses: inlined values, which are values that fit into the 7 bytes of the identifier after the mask (e.g., 7-byte strings, integers, etc.), and external values, which are values longer than 7 bytes (e.g., long strings).
All records stored in MillenniumDB are composed of these identifiers. We will explain later how long strings for external values are handled.
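The identifier layout can be pictured with a short Python sketch. It assumes an 8-byte identifier whose top byte is the class mask and whose lower 7 bytes are the payload, which is our reading of the description above; the concrete mask values and all function names are hypothetical and are not taken from the MillenniumDB code base.

```python
# Hypothetical class masks; the real values are not given in the text.
MASK_INLINE_STRING = 0x02
MASK_EXTERNAL_STRING = 0x03

PAYLOAD_BITS = 56                      # 7 bytes of payload after the mask
PAYLOAD_MAX = (1 << PAYLOAD_BITS) - 1

def make_id(mask: int, payload: int) -> int:
    """Pack a one-byte class mask and a 7-byte payload into one 8-byte id."""
    if not 0 <= mask <= 0xFF:
        raise ValueError("mask must fit in one byte")
    if not 0 <= payload <= PAYLOAD_MAX:
        raise ValueError("payload must fit in seven bytes")
    return (mask << PAYLOAD_BITS) | payload

def split_id(identifier: int):
    """Return (mask, payload) for an identifier built by make_id."""
    return identifier >> PAYLOAD_BITS, identifier & PAYLOAD_MAX

def encode_string(text: str, external_offset: int) -> int:
    """Inline strings of at most 7 bytes; otherwise point at external storage.

    `external_offset` stands for wherever the long string was written.
    """
    raw = text.encode("utf-8")
    if len(raw) <= 7:
        return make_id(MASK_INLINE_STRING, int.from_bytes(raw, "big"))
    return make_id(MASK_EXTERNAL_STRING, external_offset)

def decode_inline_string(identifier: int) -> str:
    """Recover an inlined string (assumes it has no leading NUL bytes)."""
    mask, payload = split_id(identifier)
    if mask != MASK_INLINE_STRING:
        raise ValueError("not an inlined string")
    return payload.to_bytes(7, "big").lstrip(b"\x00").decode("utf-8")
```

With this layout, a short name such as Q320 is decoded directly from the identifier, while a longer string needs one additional lookup in external storage.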
To store property domain graphs, MillenniumDB deploys B+ trees [32]. For this purpose, we build a B+ tree template for fixed-size records, which store all classes of identifiers. To store a property domain graph G = (O, γ, lab, prop), we simply store and index in B+ trees the four components defining it:
• OBJECTS(id) stores the identifiers of all the objects in the database (i.e., O).
• DOMAINGRAPH(source,type,target,eid) contains all information on edges in the graph (i.e., γ), where eid is an edge identifier, and source, type, and target can be ids of any class (i.e., node, edge, or value). By default, four permutations of the attributes are indexed in order to aid query evaluation. These are: source-target-type-eid, target-type-source-eid, type-source-target-eid and type-target-source-eid.
• LABELS(object,label) stores object labels (i.e., lab). The value of object can be any identifier, and the values of label are stored as ids. Both permutations are indexed.
• PROPERTIES(object,property,value) stores the property-value pairs associated with each object (i.e., prop). The object column can contain any id, and property and value are value ids. Aside from indexing the primary key, an additional permutation is added to search objects by property-value pairs.
All the B+ trees are created through a bulk-import phase, which loads multiple tuples of sorted data, rather than inserting records one by one. In order to enable fast lookups by edge identifier, we use the fact that this attribute is the key for the relation. Therefore, we also store a table called EDGETABLE, which contains triples of the form (source, type, target), such that the position in the table equals the identifier of the object e such that γ(e) = (source, type, target). This implies that edge identifiers must be assigned consecutive ids starting from zero internally by MillenniumDB (they are not specified by the user). In total, we use ten B+ trees for storing the data.
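As an illustration of how a permuted index and the EDGETABLE complement each other, consider the following Python sketch. A sorted list searched with `bisect` stands in for a B+ tree, strings stand in for ids, and the edges and function names are invented; none of this reflects the engine's actual on-disk structures.

```python
from bisect import bisect_left

# Edges as (source, type, target). The list position doubles as the edge id,
# mirroring the EDGETABLE idea. The data is invented for illustration.
EDGE_TABLE = [
    ("Q1", "P31", "Q2"),   # eid 0
    ("Q3", "P31", "Q2"),   # eid 1
    ("Q1", "P279", "Q4"),  # eid 2
]

def build_index(order):
    """Return all edges permuted into `order` and sorted (bulk-load style)."""
    rows = []
    for eid, (source, etype, target) in enumerate(EDGE_TABLE):
        record = {"source": source, "type": etype, "target": target, "eid": eid}
        rows.append(tuple(record[attr] for attr in order))
    return sorted(rows)

def prefix_scan(index, prefix):
    """Yield every row of a sorted index that starts with `prefix`."""
    pos = bisect_left(index, prefix)
    while pos < len(index) and index[pos][:len(prefix)] == prefix:
        yield index[pos]
        pos += 1

# One of the four permutations listed for DOMAINGRAPH.
TYPE_TARGET = build_index(("type", "target", "source", "eid"))

# All edges of type P31 that point to Q2, via a range scan on the permutation.
matches = list(prefix_scan(TYPE_TARGET, ("P31", "Q2")))

# Lookup by edge id is positional, so no search is needed.
edge_one = EDGE_TABLE[1]
```

The permutation answers "which edges have this type and target?" with a range scan, while the positional table answers "what does edge 1 connect?" in constant time.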
To transform external strings and values (longer than 7 bytes) to database object ids and values, we have a single binary file called OBJECTFILE, which contains all such strings concatenated together. The internal id of an external value is then equal to the position where it is written in the OBJECTFILE, thus allowing efficient lookups of a value via its id. The identifiers are generated upon loading, and an additional hash table is kept to map a string to its identifier; we use this to ensure that no value is inserted twice, and to transform explicit values given in a query to their internal ids. Only strings are currently supported, but the implementation interface allows for adding support for different value types in a relatively simple manner.
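The following Python sketch illustrates the OBJECTFILE idea under simplifying assumptions: an in-memory `bytearray` stands in for the binary file, and each string is terminated by a NUL byte so that its end can be found (the text does not say how string boundaries are recorded, so this delimiter is our assumption). The class and method names are hypothetical.

```python
class ObjectFile:
    """Append-only store in which a string's id is its byte offset."""

    def __init__(self):
        self._data = bytearray()  # stands in for the file on disk
        self._ids = {}            # hash table: string -> offset

    def get_or_add(self, text: str) -> int:
        """Return the id of `text`, appending it only if it is new."""
        if text in self._ids:
            return self._ids[text]
        offset = len(self._data)
        self._data += text.encode("utf-8") + b"\x00"  # assumed terminator
        self._ids[text] = offset
        return offset

    def lookup(self, offset: int) -> str:
        """Read the string that starts at `offset`."""
        end = self._data.index(b"\x00", offset)
        return bytes(self._data[offset:end]).decode("utf-8")


store = ObjectFile()
first_id = store.get_or_add("a fairly long string")
second_id = store.get_or_add("another long value")
```

Because the id is the offset, decoding an id requires no index; the hash table is only needed in the other direction, from string to id.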
All of the stored relations are accessed through linear iterators which provide access to one tuple at a time. All of the data is stored on pages of fixed (but parametrized) size (currently 4kB). The data from disk is loaded into a shared main memory buffer, whose size can be specified upon initializing the MillenniumDB server. The buffer uses the standard clock page replacement policy [32]. Additionally, for improved performance, upon initializing the server, it can be specified that the OBJECTFILE be loaded into main memory in order to quickly convert internal identifiers to string and integer values that do not fit into 7 bytes.
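For readers unfamiliar with the clock policy, the following is a textbook-style Python sketch of a fixed-size buffer with clock replacement. It is not MillenniumDB's buffer manager: page pinning and dirty-page handling are omitted, and the class name and the `read_page` callback are hypothetical.

```python
class ClockBuffer:
    """Fixed-capacity page cache using the clock (second-chance) policy."""

    def __init__(self, capacity, read_page):
        self.capacity = capacity
        self.read_page = read_page  # callable: page_id -> page contents
        self.frames = []            # each frame is [page_id, data, ref_bit]
        self.where = {}             # page_id -> frame index
        self.hand = 0               # position of the clock hand
        self.misses = 0

    def get(self, page_id):
        idx = self.where.get(page_id)
        if idx is not None:              # hit: mark the frame as recently used
            self.frames[idx][2] = True
            return self.frames[idx][1]
        self.misses += 1
        data = self.read_page(page_id)
        if len(self.frames) < self.capacity:   # a free frame is available
            self.where[page_id] = len(self.frames)
            self.frames.append([page_id, data, True])
            return data
        # Sweep: clear reference bits until a frame with a cleared bit is found.
        while self.frames[self.hand][2]:
            self.frames[self.hand][2] = False
            self.hand = (self.hand + 1) % self.capacity
        victim_id = self.frames[self.hand][0]
        del self.where[victim_id]
        self.frames[self.hand] = [page_id, data, True]
        self.where[page_id] = self.hand
        self.hand = (self.hand + 1) % self.capacity
        return data


buffer = ClockBuffer(2, lambda page_id: f"page-{page_id}".encode())
buffer.get(1)
buffer.get(2)
buffer.get(1)   # served from the buffer
buffer.get(3)   # forces an eviction
```

The policy approximates least-recently-used replacement with one bit per frame, which keeps bookkeeping on the hot path to a single flag update.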
Evaluating a query. In MillenniumDB, the execution pipeline follows the standard database template where the string of the query is parsed and translated into a logical plan, which is then analyzed and converted into a physical plan, and finally evaluated, as illustrated by the Query Processor component of Figure 12.
A key part is in how the patterns and filters of a DGQL query (see Section 4.1) are evaluated. Specifically, patterns and filters are grouped together into a list of relations that can be edges, labels, properties, or path queries, forming a large multi-way join query. In essence, evaluating these joins is analogous to selecting an appropriate join plan for the relations representing the different elements. This also goes hand in hand with selecting the appropriate join algorithm for each of the joins. Given that edges, labels, and properties are all indexed, this will most commonly be index nested-loop join. Paths, on the other hand, are not directly indexed. For this reason, they are pushed to the end of the join plan and joined via nested-loop with the rest of the multi-way join.7
MillenniumDB supports different mechanisms for evaluating the multi-way join formed by the pattern and filter of a DGQL query.
• A worst-case optimal query plan as described in [24] is used whenever possible.
This approach implements a modified leapfrog algorithm [48] in order to minimize the number of intermediate results that are generated.
• The classical relational optimizer, which is based on cost estimation, and tries to order base relations in such a way as to minimize the amount of (intermediate) results.We currently support two modes of execution here: (i) Selinger-style join plans [35] which use dynamic programming to determine the optimal order of relations.
(ii) In the presence of a large number of relations, a greedy planner [16] is used which simply determines the cheapest relation to use in each step.
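The greedy mode in (ii) can be pictured with a minimal Python sketch. The cost model here is a deliberately crude assumption of ours (each already-bound variable divides the row estimate by ten), and the relation names and estimates are invented; the engine's actual cost estimation is not described at this level of detail.

```python
def greedy_join_order(relations):
    """Pick the cheapest remaining relation at every step.

    `relations` maps a name to (set_of_variables, estimated_rows).
    """
    remaining = dict(relations)
    bound = set()   # variables fixed by the relations chosen so far
    order = []

    def cost(name):
        variables, rows = remaining[name]
        shared = len(variables & bound)
        return rows / (10 ** shared)   # toy selectivity assumption

    while remaining:
        best = min(sorted(remaining), key=cost)  # sorted() makes ties stable
        order.append(best)
        bound |= remaining.pop(best)[0]
    return order


plan = greedy_join_order({
    "edge": ({"?x", "?y"}, 1000),
    "label": ({"?y"}, 50),
    "prop": ({"?x", "?n"}, 5000),
})
```

In this example the small label relation is chosen first, which binds ?y and makes the edge relation cheap to probe next. Unlike dynamic programming, the greedy choice never revisits an earlier decision, which is why it scales to many relations.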
Two particular points of interest are the worst-case optimal query planner, and the way that paths are evaluated. Both of these deploy state-of-the-art research ideas that are usually not implemented in practical graph database systems (though some prototypes exist [24,29,6]). We provide some additional details on these next.
Worst-case optimal query plan. Evaluating multiple joins in a worst-case optimal way is done using a modified leapfrog algorithm [24]. While a classical join plan does a nested for-loop over relations, leapfrog performs a nested for-loop over variables [48]. Specifically, the algorithm first selects a variable order for the query, say (?x, ?y, ?z). It then intersects all relations where the first variable ?x appears, and over each solution for ?x returned, it intersects all relations where ?y appears (replacing ?x in its current solution), and so on to ?z, until all variables are processed and the final solutions are generated. We refer the reader to [48] and [24] for a detailed explanation. Two critical aspects for supporting this approach are indexes and variable ordering, explained next.
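The variable-at-a-time loop described above can be sketched in Python as follows. This is a simplified generic join, not the modified leapfrog algorithm itself: plain set intersection over in-memory tuples replaces the seek-based intersection over sorted B+ tree iterators, so it shows the control flow but not the performance guarantees. It assumes every variable occurs in at least one atom and that no atom repeats a variable; all names are hypothetical.

```python
def generic_join(atoms, var_order):
    """Evaluate a join one variable at a time.

    `atoms` is a list of (variables, tuples) pairs, where `variables` is a
    tuple of variable names and `tuples` is a set of equally long tuples.
    """
    def candidates(atom, var, binding):
        # Values `var` may take in this atom, given the variables bound so far.
        variables, tuples = atom
        pos = variables.index(var)
        return {
            t[pos] for t in tuples
            if all(t[i] == binding[v]
                   for i, v in enumerate(variables) if v in binding)
        }

    def extend(depth, binding):
        if depth == len(var_order):
            yield dict(binding)
            return
        var = var_order[depth]
        relevant = [atom for atom in atoms if var in atom[0]]
        # Intersect all relations in which the current variable appears.
        values = set.intersection(
            *(candidates(atom, var, binding) for atom in relevant)
        )
        for value in sorted(values):
            binding[var] = value
            yield from extend(depth + 1, binding)
            del binding[var]

    yield from extend(0, {})


# Triangle query over a single invented edge relation.
edges = {(1, 2), (2, 3), (3, 1), (1, 4)}
triangle = [(("x", "y"), edges), (("y", "z"), edges), (("z", "x"), edges)]
solutions = list(generic_join(triangle, ["x", "y", "z"]))
```

Note that the dangling edge (1, 4) is discarded as soon as ?y is intersected across the two atoms that mention it, before any value for ?z is considered; a pairwise join plan would instead carry it into an intermediate result.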
To support the leapfrog algorithm over traditional relational indexes such as B+ trees, we should index all relations in all possible orders of their attributes in order to ensure efficient intersections, which greatly increases disk storage [24]. In MillenniumDB, we include four orders for DOMAINGRAPH, and all orders for LABELS and PROPERTIES. With these orders we can cover the most common join-types that appear in practice [10] by a worst-case optimal query plan. We use the classical relational optimizer if the plan needs an unsupported order or one of the relations uses a path query.
The leapfrog algorithm further requires choosing a variable ordering, which is crucial for its performance. The heuristic we deploy for selecting the variable ordering mixes a greedy approach, and the ideas of the Graham-Yu-Özsoyoglu (GYO) reduction [51]. More precisely, we first order the variables based on the minimal cost of the relations they appear in and resolve ties by selecting the variable that appears in more distinct relations. The variables "connected" to the first one chosen are then processed in the same manner (where connected means appearing in the same relation) until the process cannot continue. The isolated variables are then treated last.
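One possible reading of this heuristic is sketched below in Python. The description leaves some details open (for example, how costs are estimated and how remaining ties are broken), so the cost values, the alphabetical tie-break and the function name are our assumptions rather than the engine's behaviour.

```python
def choose_variable_order(relations):
    """Order variables by cheapest relation, growing a connected set.

    `relations` is a list of (set_of_variables, estimated_cost) pairs.
    """
    variables = set().union(*(vs for vs, _ in relations))
    if not variables:
        return []

    def rank(var):
        costs = [cost for vs, cost in relations if var in vs]
        # Cheapest relation first; ties go to the variable that occurs in
        # more relations; the name is a final, arbitrary tie-break.
        return (min(costs), -len(costs), var)

    order = [min(variables, key=rank)]
    remaining = variables - set(order)
    while True:
        chosen = set(order)
        # "Connected" means sharing a relation with an already chosen variable.
        connected = [
            v for v in remaining
            if any(v in vs and vs & chosen for vs, _ in relations)
        ]
        if not connected:
            break
        best = min(connected, key=rank)
        order.append(best)
        remaining.remove(best)
    # Variables not connected to the chosen ones are treated last.
    order.extend(sorted(remaining, key=rank))
    return order


ordering = choose_variable_order([
    ({"?x", "?y"}, 1000),
    ({"?y", "?z"}, 50),
    ({"?z"}, 20),
    ({"?w"}, 200),
])
```

In the example, ?z starts the order because it appears in the cheapest relation; ?y and then ?x follow because each is connected to what has been chosen, and ?w comes last even though its relation is cheaper than the one containing ?x.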

Evaluating path queries.
For evaluating a path query (2RPQ), the path pattern is compiled into an automaton. Then a "virtual" cross-product of this automaton and the graph is constructed on-the-fly, and navigated via breadth-first search (BFS), as commonly suggested in the theoretical literature [28,7,8] (our experiments in Section 6 will also test a depth-first search (DFS) variant). Our assumption is that each path pattern will have at least one of the endpoints assigned before evaluation. This can be done either explicitly in the pattern, or via the remainder of the query. For instance, a path pattern (Q1)=[P31*]=>(?x) has the starting point of our search assigned to Q1. On the other hand, (?x)=[P31*]=>(?y :person) does not have any of the endpoints assigned; however, the (?y :person) allows us to instantiate ?y with any node with the label :person.
Intuitively, from a starting node (tagged with the initial state of the automaton), all edges with the type specified by the outgoing transitions from this state are followed. The process is repeated until reaching an end state of the automaton, upon which a result can be returned. This allows a fully pipelined evaluation of path queries, while only requiring at most a fixed amount of memory (the neighbors of the node on the top of the BFS queue). Additionally, the BFS algorithm also allows us to return a single shortest path between each pair of endpoints (see Section 4 for an example). Returning a single shortest path comes almost for free, given that it can be reconstructed using the set of visited nodes as used for bookkeeping in the BFS algorithm. The algorithm can also be extended to return all shortest paths (as supported by DGQL) by keeping a list of predecessors that reach the node via a path of shortest length.
The implemented algorithm only requires two permutations of the DOMAINGRAPH relation: one for retrieving all of a node's successors via an edge of a specified type; and another for retrieving all such predecessors of a given node.
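The product construction can be illustrated with a compact Python sketch. It makes several simplifying assumptions: the automaton is supplied by hand rather than compiled from a path pattern, only forward edges are followed (inverse steps would use the predecessor permutation), a dictionary stands in for the successor index, and paths are reported as node sequences only. The function name and the data are hypothetical.

```python
from collections import deque

def rpq_bfs(edges, start_node, transitions, initial, finals):
    """BFS over (node, automaton state) pairs, built on the fly.

    `edges` holds (source, type, target) triples; `transitions` maps a state
    to a list of (edge_type, next_state). Yields (end_node, shortest_path).
    """
    successors = {}   # stand-in for the successor permutation
    for source, etype, target in edges:
        successors.setdefault((source, etype), []).append(target)

    start = (start_node, initial)
    parent = {start: None}   # visited set that also records predecessors
    queue = deque([start])
    reported = set()
    while queue:
        node, state = queue.popleft()
        if state in finals and node not in reported:
            reported.add(node)
            path, cursor = [], (node, state)
            while cursor is not None:   # walk the predecessors back
                path.append(cursor[0])
                cursor = parent[cursor]
            yield node, path[::-1]
        for etype, next_state in transitions.get(state, []):
            for target in successors.get((node, etype), []):
                pair = (target, next_state)
                if pair not in parent:
                    parent[pair] = (node, state)
                    queue.append(pair)


graph = [
    ("Q1", "P31", "Q2"),
    ("Q2", "P31", "Q3"),
    ("Q3", "P31", "Q1"),
    ("Q2", "P279", "Q4"),
]
# Hand-written automaton for P31*: one state that is both initial and final.
results = dict(rpq_bfs(graph, "Q1", {0: [("P31", 0)]}, initial=0, finals={0}))
```

Because the function is a generator, each answer is produced as soon as BFS reaches it, and the cycle back to Q1 terminates because the pair (Q1, 0) has already been visited. The first time a node is reached in a final state, the recorded predecessors give a shortest witnessing path.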

Benchmarking
In this section, we provide an experimental evaluation of the core graph querying features of MillenniumDB addressing two key questions: (Q1) Which join and path algorithms provide the best performance over domain graphs? (Q2) How does MillenniumDB's performance compare with existing graph database engines?
We base our experiments on the Wikidata knowledge graph [49], which is one of the largest and most diverse real-world knowledge graphs that is publicly available, and also provides a public log of real-world queries posted by Wikidata users that we can use for experiments [27,10]. The experiments focus on two fundamental query features: (i) basic graph patterns (BGPs); and (ii) path queries. Regarding (Q1), we compare the performance of different join and path algorithms within MillenniumDB. Regarding (Q2), we also provide a side by side comparison with several popular persistent graph database engines that support BGPs and at least the Kleene star feature for paths. We publish the data, queries, scripts, and configuration files for each engine online, together with the scripts used to load the data and run the experiments [42].
Internal baselines. The base of our comparison is the MillenniumDB implementation available at [41]. For comparing the performance of different join and path algorithms in MillenniumDB (per Q1), we include internal baselines, where we test: (i) MillenniumDB LF, which is the default version implementing the leapfrog triejoin algorithm; (ii) MillenniumDB GR, which implements the greedy algorithm for selecting the join order; and (iii) MillenniumDB SL, implementing the Selinger join planner. Similarly, for path queries, we test (a) MillenniumDB BFS, the default version of the engine; and (b) MillenniumDB DFS, which evaluates path queries using the depth-first traversal.
Other engines. We also compare the performance of MillenniumDB with five persistent graph query engines (per Q2). First, we include three popular RDF engines: Jena TDB version 4.1.0 [40], Blazegraph (BlazeG for short) version 2.1.6 [47], and Virtuoso version 7.2.6 [13]. We further include a property graph engine: Neo4j community edition 4.3.5 [50].8 Finally, we also compare with Jena Leapfrog (Jena LF, for short), a version of Jena TDB implementing a leapfrog-style algorithm [24], in order to compare with an external graph database using a worst-case optimal algorithm.
The machine. All experiments described were run on a single commodity server with an Intel® Xeon® Silver 4110 CPU, and 128GB of DDR4/2666MHz RAM, running the Linux Debian 10 operating system with the kernel version 5.10. The hard disk used to store the data was a SEAGATE ST14000NM001G with 14TB of storage.
The data. The base for our experiments is the Wikidata dataset. In particular, we used the truthy dump version 20210623-truthy-BETA [14], keeping only triples in which (i) the subject position is a Wikidata entity, and (ii) the predicate is a direct property. We call this dataset Wikidata Truthy. The size of the dataset after this process was 1,257,169,959 triples. The simplification of the dataset is done to facilitate comparison across multiple engines, specifically to keep data loading times across all engines manageable while keeping the nodes and edges necessary for testing the performance of BGPs and property paths. The size of the Wikidata Truthy dataset, when loaded into the respective systems, is summarized in Table 3. Default indices were used on Jena TDB, Blazegraph and Virtuoso. Jena LF stores three additional permutations of the stored triples to efficiently support the leapfrog algorithm for any join query, thus using more space. Neo4j by default creates an index for edge types (as of version 4.3.5). To speed up searches for particular entities and properties, we also created an index linking a Wikidata identifier (e.g., Q510) to its internal id in Neo4j. We also tried to index literal values in Neo4j, but the process failed (the literals are still stored). MillenniumDB uses extra disk space because of the additional indices needed to support worst-case optimal join over domain graphs (similar to the case of Jena LF).
How we ran the queries. We detail the query sets used for the experiments in their respective subsections. To simulate a realistic database load, we do not split queries into cold/hot run segments. Rather, we run them in succession, one after another, after a cold start of each system (and after cleaning the OS cache). This simulates the fact that query performance can vary significantly based on the state of the system buffer, or even on the state of the hard drive, or the state of the OS's virtual memory. For each system, queries were run in the same order. We record the execution time of each individual query, which includes iterating over all results. We set a limit of 100,000 distinct results for each query, again in order to enable comparability as some engines showed instability when returning larger results.
Memory usage. Blazegraph, Jena and Virtuoso were assigned 64GB of RAM, as is recommended. Neo4j was run with default settings,9 while MillenniumDB had access to 32GB for main-memory buffer, and it uses an additional 10GB for in-memory dictionaries. Since the systems tested are buffer-based (i.e., they reserve a fixed amount of scrap space, the buffer, in main memory for their operation, and do not exceed this memory, except perhaps modulo a small amount used for internal operations), and since they tend to use the buffer available, their maximum memory usage corresponds to these settings. Thus, in the rest of this section we focus on comparing runtimes.
Handling timeouts. We defined a timeout of 10 minutes per query for each system. Apart from that, we note that most systems had to be restarted upon a timeout as they often showed instability, particularly while evaluating path queries. This was done without cleaning the OS cache in order to preserve some of the virtual memory mapping that the OS built up to that point. In comparison, MillenniumDB managed to return a non-trivial amount of query results on each query, and did not need to be restarted, thus handling timeouts gracefully.

Basic Graph Patterns
We focus first on basic graph pattern queries. To test different query execution strategies of MillenniumDB, we use two benchmarks: Real-world BGPs and Complex BGPs, which are described next.

Real-world BGPs
The Wikidata SPARQL query log contains millions of queries [27], but many are trivial to evaluate. We thus generate our benchmark from more challenging cases, i.e., a smaller log of queries that timed out on the Wikidata public endpoint [27]. From these queries we extracted their BGPs, removing duplicates (modulo isomorphism on query variables). We distinguish queries consisting of a single triple pattern (Single) from those containing more than one triple pattern (Multiple). The former set tests the triple matching capabilities of the systems, whereas the latter set tests join performance. Single contains 399 queries, whereas Multiple has 436 queries.
Real-world Single. Table 4 (top) summarizes the query times on this set, whereas Figure 13 (left) shows boxplots with more detailed statistics on the distributions of runtimes. Since these queries do not require joins, we show one variant for MillenniumDB. MillenniumDB is the fastest overall (median of 0.05 s), followed by Blazegraph (median of 0.09 s). In terms of average times and higher percentiles, MillenniumDB more clearly outperforms other engines, being able to enumerate up to 100,000 results (the limit) more quickly due to decoding internal ids more quickly. In MillenniumDB, values such as P12 or Q10 that fit within 7 bytes are inlined, and do not need to be dictionary decoded. The remaining dictionary fits entirely in available memory (∼8 GB of RAM). In the other systems, dictionary decoding generates random accesses to the disk. The four SPARQL engines tested must store IDs as IRIs within the RDF model, which include relatively long prefixes. However, since RDF datasets typically have few prefixes repeated often, we could support full IRIs within MillenniumDB with minimal overhead by encoding a prefix id for the top k prefixes in ⌈log₂(k)⌉ bits within the object identifier, keeping the small mapping from prefix id to string in memory. The Wikidata query service lists 32 prefixes, which would require 5 bits that would fit "for free" in the class byte (essentially considering each prefix to be a class).

Complex BGPs. This is a benchmark used to test the performance of worst-case optimal joins [24]. Here, 17 different complex join patterns were selected, and 50 different queries generated for each pattern, resulting in a total of 850 queries. Figure 13 (right), and Table 4 (bottom), show the resulting query times. In this case, the difference between the join algorithms of MillenniumDB is more clear. The worst-case-optimal version (MillenniumDB LF) is not only considerably more stable than the other two versions, but also twice as fast in the median. We can also observe that MillenniumDB GR wins out over MillenniumDB SL on average (but not the median case). When comparing with other engines, the next-best competitor after MillenniumDB LF is Jena LF, showing the benefits of worst-case optimal joins. Virtuoso follows not far behind, while MillenniumDB GR, Jena, Blazegraph and Neo4j are considerably slower. Overall, MillenniumDB LF offers the best performance for every statistic shown in the plot.

Path Queries
To test the performance of path queries, we extracted 2RPQ expressions from a log of queries that timed out on the Wikidata endpoint [27]. The original log has 2110 queries. After removing queries that do not use direct properties (which are absent in the Wikidata Truthy dataset), we ended up with 1683 queries. These were run in succession, each restricted to return at most 100,000 results. In the case of SPARQL engines, we added the DISTINCT keyword to remove duplicates caused by the rewriting of fixed-length path queries to unions of BGPs that are then evaluated under bag semantics. To make the comparison fair, the DISTINCT keyword was also added in MillenniumDB queries.
Each system was started after cleaning the system cache, and with a timeout of 10 minutes. Since these are originally SPARQL queries, not all of them were supported by Neo4j given the restricted regular-expression syntax it supports. MillenniumDB and Neo4j were the only systems able to handle timeouts without being restarted.10 In this comparison we do not include Jena LF since it uses the same execution strategy as Jena for property paths. Likewise, for MillenniumDB, we introduce two internal baselines for breadth-first search (BFS) and depth-first search (DFS). The experimental results for these path queries are summarized in Table 5 and Figure 14.
In terms of our internal comparison, we can see that the DFS algorithm slightly outperforms BFS. The reason for keeping BFS as the default algorithm is twofold: (i) it significantly outperforms DFS when paths are also returned; and (ii) it supports returning all shortest paths between any pair of nodes. To illustrate point (i), we ran our experiments again, but now also returning a single path witnessing each query answer. In this case, the average for BFS is 5.9 sec, and median is 0.086 sec. On the other hand, when paths are returned, DFS takes 7.9 sec on average, and 0.1 sec median time.

Figure 14: Boxplots of query times on property paths
Compared with other engines, MillenniumDB is generally the fastest, and has the most stable performance. Its average is near a second, i.e., five times faster than the next best contender (Virtuoso). Its median, below 0.1 seconds, is half the next one (Jena's). Even after removing the queries that timed out on the other systems, they are considerably slower than MillenniumDB. In particular, if we only consider the queries that run successfully on Virtuoso (i.e., excluding the 59 queries that timed out or gave an error), we get an average time of 0.85 seconds and a median time of 0.086 seconds on MillenniumDB: less than half the times of Virtuoso with these queries excluded. The boxplots further show the stability of MillenniumDB: the medians of other engines are above the third quartile of MillenniumDB. Their third quartile is 5-10 times higher than MillenniumDB's, and higher than its topmost whisker.
To further test robustness, we also ran all of the queries without limiting the output size on MillenniumDB. In this test, the engine timed out in only 15 queries, each returning between 800 thousand and 44 million results before timing out. When running queries to completion, MillenniumDB BFS averaged 13.4 seconds per query (8 seconds excluding timeouts), with a median of 0.1 seconds (both with and without timeouts).

Wikidata Complete
To show the scalability of MillenniumDB, and to properly leverage its domain graph model, we ran experiments with a full version of Wikidata. We call this dataset Wikidata Complete, and base it off the Wikidata JSON dump 11 version 20201102-all.json, which is preprocessed and mapped to our data model. In Wikidata Complete, we model qualifiers xxx We use properties to store the language value of each string in Wikidata, and also to model elements of complex data values (e.g., for coordinates we would have objects with properties latitude and longitude, and similarly for amounts, date/time, limits, etc.). Each object representing a complex data value also has a label specifying its data type (e.g., coord for geographical coordinates). All qualifiers were loaded. The only elements excluded from the full Wikidata data dump were sitelinks and references. This full version of Wikidata resulted in a knowledge graph with roughly 300 million objects, participating in 4.3 billion edges. The total size on disk of this data was 827GB in MillenniumDB, i.e., more than four times larger than Wikidata Truthy. More details about this dataset can be found in the online material accompanying this paper [42].
We ran the same queries from the benchmarks (Single, Multiple and Complex BGPs, as well as Paths). The number of outputs on the two versions of the data, while not the same, was within the same order of magnitude averaged over all the queries. The results are presented in Table 6. As we can observe, MillenniumDB shows no deterioration in performance when a larger database is considered for similar queries. This is mostly due to the fact that the buffer only loads the necessary pages into the main memory, and will probably require a rather similar effort in both cases. We also note that, again, no queries resulted in a timeout over the larger dataset.

Discussion
Regarding (Q1), i.e., which join and path algorithms provide the best performance in this setting, we can conclude for join algorithms that the worst-case optimal join algorithm consistently outperforms the greedy and Selinger variants, being particularly notable in the case of more complex graph patterns with many joins (wherein Jena LF, also worst-case optimal, was the next best competitor). Worst-case optimal joins use more space for indexing, but provide superior query runtimes. Regarding path algorithms, we see less difference between BFS and DFS: DFS is slightly faster for returning pairs of nodes connected by paths, while BFS is faster for returning paths.
Regarding (Q2), i.e., how existing graph database systems compare with MillenniumDB, we found that MillenniumDB, when equipped with the best join and path algorithms, consistently outperforms other competitors in all query sets tested.

Conclusions and looking ahead
This paper presents MillenniumDB, an open-source graph database system with persistent storage implementing the novel (property) domain graph model. Domain graphs adopt the natural idea of adding edge ids to directed labeled edges in order to concisely model higher-arity relations in graphs, as needed in Wikidata, without the need for reserved vocabulary or reification. They can naturally represent popular graph models, such as RDF and property graphs, and allow for combining the features of both models in a novel way. While the idea of using edge ids as a hook for modeling higher-arity relations in graphs is far from new (see, e.g., [21,25,26]), it is an idea that is garnering increased attention as a more flexible and concise alternative to reification. Our work proposes a formal data model that incorporates edge ids, a query language that can take advantage of them, and a fully-fledged graph database engine that supports them by design. We also propose to optionally allow (external) annotations on top of the graph structure, thus facilitating better compatibility with property graphs, whereby labels and property-values can be added to graph objects without adding new nodes and edges to the graph itself.
We have also proposed a new query language with a syntax inspired by Cypher, but that additionally enables users to take full advantage of the domain graph model by (optionally) referencing edge ids in their queries, and performing joins on any element of the domain graph. We further combine useful features present in both Cypher and SPARQL, in order to provide additional expressivity, such as returning the shortest path witnessing a result for a path query (as captured by a 2RPQ expression).
In the implementation of MillenniumDB, we combine both tried-and-trusted techniques that have been successfully used in relational database pipelines for decades [32] (e.g., B+ trees, buffer managers, etc.), with promising state-of-the-art algorithms for computing worst-case optimal joins (leapfrog [48]) and evaluating path queries (guided by an automaton [28,7]). Our experiments over Wikidata, considering real-world queries and data at large-scale, show that this combination outperforms other persistent graph database engines that are commonly found in practice.
Limitations. Many of the current limitations of MillenniumDB relate to the fact that it is still under development. For example, at the moment, MillenniumDB only supports a bulk load of data, while support for (incremental) updates is currently under investigation and development. Currently only the core features of query languages such as Cypher and SPARQL are supported, and we are working on adding support for other features, including negation, value assignment, functions on datatypes, etc. MillenniumDB lacks some of the advanced features supported by other graph database systems, such as geographic, temporal and federated queries, keyword search features, etc. Finally, MillenniumDB does not yet support partitioning the graph over multiple machines in order to achieve horizontal scaling. We do not see such limitations as fundamental, but rather as features that can be added to the engine over time.

Future work. Looking to the future, we foresee extensions such as: returning entire graphs, supporting more complex path constraints, returning sets of paths, path algebra, just to name a few. Regarding more practical features, we aim to add support for full transactions, keyword search, a graph update language, existing graph query languages, and more besides. More importantly, given that MillenniumDB is published as an open source engine, we hope that the research community can view the MillenniumDB code base as a sandbox for incorporating their novel algorithms and ideas into a modern graph database, without the need to remake storage, indexing, access methods, or query parsers. Along these lines, we are currently working on adding an in-memory storage option to MillenniumDB using the ring [6]: a data structure based on the Burrows-Wheeler transform that supports worst-case optimal joins (over triples) in space similar to representing the graph itself. Initial tests show that the ring can store Wikidata Truthy in 50 GB of space and improve median query times by a factor of 3, with average query times remaining similar. We are working on extending the ring to support edge ids and thus work with domain graphs. We also wish to explore the deployment of MillenniumDB for key use-cases; for example, we plan to provide and host an alternative query service for Wikidata, which may help to prioritize the addition of novel features and optimizations as needed in practice.

Intuitively, the MATCH clause specifies the basic or navigational graph pattern which we will look for in our graph. The WHERE clause is used to filter results based on a selection, usually by restricting the values of some of the attributes of a matched object. The RETURN clause specifies which of the matched variables will be returned. The ORDER BY clause allows us to reorder the results based on the values of some output variables, while LIMIT cuts off the evaluation after a specific number of results have been found.
We define the formal syntax of the DGQL query language in Figure 16. Examples of DGQL queries following this syntax can be found in Section 4.1.

B Formal Definition of Domain Graph Queries
Queries in MillenniumDB are based on the abstract notion of a domain graph query, which generalizes the types of graph patterns used by modern graph query languages [3].This query abstraction provides modularity in terms of how the database is constructed, flexibility in terms of what concrete query syntax is supported, and allows for defining its semantics and studying its theoretical properties in a clean way.
This section provides the formal definition of the MillenniumDB query language. From now on, assume an infinite set Var of variables disjoint from the set of objects Obj.

B.1 Basic graph patterns
At the core of domain graph queries are basic graph patterns. A basic graph pattern is defined as a pair (V, ϕ) such that ϕ : (Obj ∪ Var) → (Obj ∪ Var) × (Obj ∪ Var) × (Obj ∪ Var) is a partial mapping with a finite domain and V ⊆ var(ϕ), where var(ϕ) is the set of variables occurring in the domain or in the range of ϕ. Thus, ϕ can be thought of as a domain graph that allows a variable in any position, together with a set V of output variables (hence the restriction that each variable in V occurs in ϕ).
The evaluation of a basic graph pattern returns a set of solution mappings. A solution mapping (or simply mapping) is a partial function µ : Var → Obj. The domain of a mapping µ, denoted by dom(µ), is the set of variables on which µ is defined. Given v ∈ Var and o ∈ Obj, we use µ(v) = o to denote that µ maps variable v to object o. Given a set V′ of variables, the term µ|V′ is used to denote the mapping obtained by restricting µ to V′. Finally, for the sake of presentation, we assume that µ(o) = o, for all o ∈ Obj.
The evaluation of a basic graph pattern B = (V, ϕ) over a domain graph G = (O, γ), denoted by ⟦B⟧_G, is defined as

⟦B⟧_G = {µ|V | dom(µ) = var(ϕ) and µ(ϕ(a)) = γ(µ(a)) for every a ∈ dom(ϕ)},

where µ is applied to a triple componentwise.

For example, consider the basic graph pattern (V, ϕ) where V = {v₂, v₄, v₆} and ϕ is given by the assignments depicted in Figure 17 (left). In Figure 17, we provide a graphical representation of this graph pattern, and the solution mappings obtained by evaluating the graph pattern over the property domain graph shown in Figure 10. The solution mappings are presented as a table with columns v₂, v₄, v₆ (i.e., the variables in V), and each row represents an individual mapping. In our definitions, different variables may map to the same object in a single solution. Thus, our notion of evaluation follows a homomorphism-based semantics, similar to query languages such as SPARQL [3].

Supporting labels and properties. Observe that the formalization thus far only allows access to elements of the function γ, and cannot reason about labels or properties. In essence, up to now we have only defined the semantics of queries over domain graphs. We now extend this definition to property domain graphs.
Following our approach of modelling a query similarly to a domain graph, we define a basic graph pattern with properties as a tuple (V, ϕ, Qlab, Qprop), where:
• a ∈ dom(Qlab) implies that there are x, y, z, w such that ϕ(x) = (y, z, w) and a ∈ {x, y, z, w};
• Qprop : (Obj ∪ Var) × P → V is a partial mapping;
• (a, k) ∈ dom(Qprop), for some k ∈ P, implies that there are x, y, z, w such that ϕ(x) = (y, z, w) and a ∈ {x, y, z, w}.

A basic graph pattern with properties extends basic graph patterns with a labelling function and a property checking function. Notice that the domain constraints on Qlab and Qprop serve to make sure that these are associated with some variable or object used in the core pattern. Given a property domain graph G = (O, γ, lab, prop), the semantics of a basic graph pattern with properties BP = (V, ϕ, Qlab, Qprop), denoted ⟦BP⟧_G, extends the evaluation of basic graph patterns to also support labels and properties with their values. This is akin to making the query pattern have the same structure as the property domain graph, allowing us to query labels and properties as described in Section 4.
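The homomorphism-based evaluation of basic graph patterns described in this section can be illustrated with a minimal Python sketch. It is not the MillenniumDB implementation (which relies on indexes and worst-case optimal joins); the function names, the convention that variables are strings starting with "?", and the toy graph are assumptions made only for illustration. Both ϕ and γ are represented as dictionaries from an edge term to a (source, type, target) triple.

```python
def is_var(term):
    # Assumption for this sketch: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")


def bind(mu, term, value):
    """Match one pattern term against one graph object, extending mu in place."""
    if is_var(term):
        if term in mu:
            return mu[term] == value
        mu[term] = value
        return True
    return term == value  # constants map to themselves: mu(o) = o


def evaluate_bgp(out_vars, phi, gamma):
    """Evaluate the basic graph pattern (out_vars, phi) over the graph gamma.

    phi may contain variables in any position; gamma contains only objects.
    Different variables may be bound to the same object (homomorphism).
    """
    solutions = [{}]
    for edge_term, triple in phi.items():
        pattern = (edge_term, *triple)
        extended = []
        for mu in solutions:
            for edge, (src, typ, tgt) in gamma.items():
                candidate = dict(mu)
                if all(bind(candidate, t, o)
                       for t, o in zip(pattern, (edge, src, typ, tgt))):
                    extended.append(candidate)
        solutions = extended
    # Project onto the output variables and drop duplicates (set semantics).
    projected = {tuple((v, mu[v]) for v in out_vars) for mu in solutions}
    return [dict(row) for row in sorted(projected)]


# Hypothetical toy domain graph: edge id -> (source, type, target).
# Edge e3 uses the edge id e1 as its source, as domain graphs allow.
toy_gamma = {
    "e1": ("Michelle Bachelet", "position held", "President of Chile"),
    "e2": ("Sebastián Piñera", "position held", "President of Chile"),
    "e3": ("e1", "replaced by", "Sebastián Piñera"),
}

# A pattern that joins on an edge id: who held a position, and who replaced them?
toy_phi = {
    "?e": ("?x", "position held", "?p"),
    "?f": ("?e", "replaced by", "?y"),
}
```

Calling evaluate_bgp(["?x", "?y"], toy_phi, toy_gamma) joins the two pattern edges through the edge variable ?e, which is the kind of join on edge ids that domain graphs are designed to support. The nested-loop strategy is chosen for readability only.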

B.2 Navigational graph patterns
A characteristic feature of graph query languages is the ability to match paths of arbitrary length that satisfy certain criteria.We call basic graph patterns enhanced with this feature navigational graph patterns, and we define them next.
A popular way to express criteria that paths should match is through regular expressions on their labels, also known as two-way regular path queries (2RPQs). More precisely, a 2RPQ expression r is built from the empty expression ε, objects o ∈ Obj (traversed forwards or backwards), concatenation r₁/r₂, alternation r₁ + r₂, and Kleene star r*. The semantics of a 2RPQ expression r is defined in terms of its evaluation on a (property) domain graph G, denoted by ⟦r⟧_G, which returns a set of pairs of nodes in the graph that are connected by paths satisfying r. The evaluation is defined inductively for G = (O, γ), o ∈ Obj and 2RPQ expressions r, r₁, r₂; for the Kleene star, it uses the powers r¹ = r and rⁿ⁺¹ = r/rⁿ for every n ≥ 1. Other 2RPQ expressions widely used in practice can be defined by combining the previous operators. In particular, r? = ε + r and r⁺ = r/r*.

A path pattern is a tuple (a₁, r, a₂) such that a₁, a₂ ∈ Obj ∪ Var and r is a 2RPQ expression. As for the case of basic graph patterns, given a path pattern p, we use the term var(p) to denote the set of variables occurring in p. Moreover, the evaluation of p = (a₁, r, a₂) over a property domain graph G, denoted by ⟦p⟧_G, is defined as: ⟦p⟧_G = {µ | dom(µ) = var(p) and (µ(a₁), µ(a₂)) ∈ ⟦r⟧_G}.
For example, the expression (Michelle Bachelet, (replaced by)⁺, v) is a path pattern that returns all the Presidents of Chile after Michelle Bachelet. Given a set ψ of path patterns, var(ψ) also denotes the set of variables occurring in ψ, and the evaluation of ψ over a property domain graph G, denoted by ⟦ψ⟧_G, collects the mappings over var(ψ) that satisfy every path pattern in ψ.

A navigational graph pattern is a triple (V, ϕ, ψ) where (V′, ϕ) is a basic graph pattern for some V′ ⊆ V, ψ is a set of path patterns, and V ⊆ var(ϕ) ∪ var(ψ). The result of a navigational graph pattern N = (V, ϕ, ψ), denoted by ⟦N⟧_G, is a set of mappings µ projected onto the set V of output variables, where µ satisfies the structural restrictions imposed by ϕ and the path constraints imposed by ψ. Notice that multiple 2RPQ expressions can link the same pair of nodes. This is similar to the existential semantics of path queries, as specified in the SPARQL standard [17].
Given a domain graph G = (O, γ), we define paths over the directed labeled graph that forms the range of γ; in other words, we do not allow for matching paths that emanate from an edge object (except when it appears as a node). Such a feature could be considered in the future. We may also consider additional criteria on nodes or edges in the matching paths, etc.
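To make the evaluation of path patterns concrete, the following Python sketch computes the set of node pairs matched by a 2RPQ expression and then the mappings of a path pattern (a₁, r, a₂). It is a naive fixpoint computation, not the automaton-guided algorithm used by the engine. The tuple encoding of expressions ("eps", "label", "inv", "concat", "alt", "star"), the choice of evaluating ε as the identity over all objects occurring in γ, and the toy data are assumptions of this sketch.

```python
def is_var(term):
    # Assumption for this sketch: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")


def bind(mu, term, value):
    """Match a pattern term against an object, extending mu in place."""
    if is_var(term):
        return mu.setdefault(term, value) == value
    return term == value


def objects_of(gamma):
    """All objects occurring in gamma (edge ids and triple components)."""
    objs = set(gamma)
    for triple in gamma.values():
        objs.update(triple)
    return objs


def compose(left, right):
    """Relational composition of two sets of pairs."""
    return {(x, z) for (x, y1) in left for (y2, z) in right if y1 == y2}


def eval_2rpq(expr, gamma):
    """Return the set of (start, end) pairs connected by a path matching expr."""
    op = expr[0]
    if op == "eps":
        return {(o, o) for o in objects_of(gamma)}
    if op == "label":   # follow an edge of the given type forwards
        return {(s, t) for (s, typ, t) in gamma.values() if typ == expr[1]}
    if op == "inv":     # follow an edge of the given type backwards
        return {(t, s) for (s, typ, t) in gamma.values() if typ == expr[1]}
    if op == "concat":
        return compose(eval_2rpq(expr[1], gamma), eval_2rpq(expr[2], gamma))
    if op == "alt":
        return eval_2rpq(expr[1], gamma) | eval_2rpq(expr[2], gamma)
    if op == "star":    # reflexive-transitive closure, computed as a fixpoint
        base = eval_2rpq(expr[1], gamma)
        closure = eval_2rpq(("eps",), gamma)
        while True:
            bigger = closure | compose(closure, base)
            if bigger == closure:
                return closure
            closure = bigger
    raise ValueError(f"unknown operator: {op}")


def eval_path_pattern(a1, expr, a2, gamma):
    """Mappings mu over the variables of (a1, expr, a2) with (mu(a1), mu(a2)) matched."""
    results = []
    for start, end in sorted(eval_2rpq(expr, gamma)):
        mu = {}
        if bind(mu, a1, start) and bind(mu, a2, end):
            results.append(mu)
    return results


# Hypothetical toy graph and the expression (replaced by)+ = r / r*.
toy_gamma = {
    "e1": ("Michelle Bachelet", "replaced by", "Sebastián Piñera"),
    "e2": ("Sebastián Piñera", "replaced by", "Gabriel Boric"),
}
replaced_by = ("label", "replaced by")
replaced_by_plus = ("concat", replaced_by, ("star", replaced_by))
```

With this encoding, eval_path_pattern("Michelle Bachelet", replaced_by_plus, "?v", toy_gamma) mirrors the path pattern (Michelle Bachelet, (replaced by)⁺, v) from the text. Only the existence of a matching path is checked, in line with the existential semantics mentioned above.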

B.3 Relational graph patterns
As previously discussed (and seen in the example of Figure 17), graph patterns return relations (tables) as solutions. Thus we can (and many practical graph query languages do) use a relational-style algebra to transform and/or combine one or more sets of solution mappings into a final result.
Towards defining this algebra, we need the following terminology. Two mappings µ₁ and µ₂ are compatible, denoted by µ₁ ∼ µ₂, if µ₁(v) = µ₂(v) for all variables v which are in both dom(µ₁) and dom(µ₂). If µ₁ ∼ µ₂, then we write µ₁ ∪ µ₂ for the mapping obtained by extending µ₁ according to µ₂ on all the variables in dom(µ₂) \ dom(µ₁). Given two sets of mappings Ω₁ and Ω₂, the join (⋈), anti-join (▷) and left outer join (⟕) between Ω₁ and Ω₂ are defined respectively as follows:

Ω₁ ⋈ Ω₂ = {µ₁ ∪ µ₂ | µ₁ ∈ Ω₁, µ₂ ∈ Ω₂ and µ₁ ∼ µ₂}
Ω₁ ▷ Ω₂ = {µ₁ ∈ Ω₁ | there is no µ₂ ∈ Ω₂ such that µ₁ ∼ µ₂}
Ω₁ ⟕ Ω₂ = (Ω₁ ⋈ Ω₂) ∪ (Ω₁ ▷ Ω₂)

With this terminology, a relational graph pattern is recursively defined as follows:
• If N is a navigational graph pattern, then N is also a relational graph pattern;
• If R₁ and R₂ are relational graph patterns, then (R₁ AND R₂) and (R₁ OPT R₂) are relational graph patterns.

The evaluation of a relational graph pattern R over a property domain graph G, denoted by ⟦R⟧_G, is recursively defined as follows:
• if R is a navigational graph pattern N, then ⟦R⟧_G = ⟦N⟧_G;
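The operators on sets of mappings used in this section can be sketched in a few lines of Python. The sketch assumes the standard definitions from the SPARQL algebra (join of compatible mappings, anti-join keeping mappings with no compatible partner, and left outer join as the union of the two); mappings are represented as dictionaries, and all names and data are illustrative assumptions rather than part of MillenniumDB.

```python
def compatible(mu1, mu2):
    """mu1 ~ mu2: the mappings agree on every variable they share."""
    return all(mu2[v] == value for v, value in mu1.items() if v in mu2)


def join(omega1, omega2):
    """All unions mu1 ∪ mu2 of compatible mappings."""
    return [{**mu1, **mu2}
            for mu1 in omega1 for mu2 in omega2
            if compatible(mu1, mu2)]


def anti_join(omega1, omega2):
    """Mappings of omega1 that are compatible with no mapping of omega2."""
    return [mu1 for mu1 in omega1
            if not any(compatible(mu1, mu2) for mu2 in omega2)]


def left_outer_join(omega1, omega2):
    """Extend mappings of omega1 where possible, keep the rest unchanged."""
    return join(omega1, omega2) + anti_join(omega1, omega2)


# Hypothetical solutions of a mandatory pattern (author ?v of paper ?w) ...
authorship = [
    {"?v": "author1", "?w": "paper1"},
    {"?v": "author2", "?w": "paper2"},
]
# ... and of an optional pattern (organisation ?z where ?v is currently staff).
staff = [{"?v": "author1", "?z": "university1"}]

with_optional = left_outer_join(authorship, staff)
```

Here with_optional keeps author2 even though no organisation is known for them, leaving ?z unbound; this is the behaviour that optional graph patterns rely on. Lists are used instead of sets only to keep the output order predictable.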

B.4 Selection conditions
In addition to matching a graph pattern against a property domain graph, we would like to filter the solutions by imposing selection conditions over the resulting objects (i.e., nodes and edges). More precisely, a selection condition is defined recursively as follows: (a

B.5 Solution modifiers
We consider an initial set of solution modifiers that allow for applying a final transformation on the solutions generated by a graph pattern. These include: RETURN, which defines a set of elements (variables and properties) to be returned; ORDER BY, which orders the solutions according to a sort criterion; and LIMIT, which returns the first n mappings in a sequence of solutions (with n specified in the clause). Notice here that the solution mappings are not defined by the RETURN solution modifier, but rather by the relational graph pattern, and by selection conditions.
Let S be the set of strings, v ∈ Var and k ∈ K. A return mapping is a function τ : S → Obj ∪ V. A return element is either a variable v or an expression v.k. Assume that there is a simple way to transform a return element into a string in S. Given a sequence of return mappings S and an integer n, the function limit(S, n) returns the first n elements of S when n > 0, and returns S otherwise.
An order modifier is a tuple (e, β) where e is a return element and β is either asc or desc. Given a sequence of return mappings S and an order modifier o = (e, β), we say that S satisfies o, denoted S ⊨ o, if: (i) β is asc and S satisfies an ascending order with respect to e; or (ii) β is desc and S satisfies a descending order with respect to e. Moreover, given a sequence of order modifiers O = (o₁, …, oₙ), we say that S satisfies O, denoted S ⊨ O, if: (i) S ⊨ o₁ when n = 1; or (ii) S ⊨ o₁ and, for every sub-sequence of selection mappings S′ ⊆ S such that τᵢ(e₁) = τⱼ(e₁) (with o₁ = (e₁, β₁)) for any pair of selection mappings τᵢ, τⱼ ∈ S′, it holds that S′ ⊨ (o₂, …, oₙ).
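The behaviour of limit and of a sequence of order modifiers can be illustrated with a short Python sketch. Return mappings are modelled as dictionaries from strings (the string form of a return element) to values; the function names and the sample rows are assumptions made for illustration only. The ordering is obtained by applying stable sorts from the last modifier to the first, which yields a sequence that satisfies every modifier in the lexicographic sense described above.

```python
def limit(sequence, n):
    """Return the first n elements when n > 0, and the whole sequence otherwise."""
    return list(sequence[:n]) if n > 0 else list(sequence)


def order_by(sequence, modifiers):
    """Order return mappings by a list of (return_element, "asc" | "desc") pairs.

    Python's sort is stable, so sorting by the least significant modifier first
    and the most significant one last gives a lexicographic ordering.
    """
    result = list(sequence)
    for element, direction in reversed(modifiers):
        result.sort(key=lambda tau: tau[element],
                    reverse=(direction == "desc"))
    return result


# Hypothetical return mappings for papers ?y with a year property.
rows = [
    {"?y": "paper A", "?y.year": 1968},
    {"?y": "paper C", "?y.year": 1997},
    {"?y": "paper B", "?y.year": 1997},
]

ordered = order_by(rows, [("?y.year", "desc"), ("?y", "asc")])
most_recent = limit(ordered, 1)
```

In this example the two papers from 1997 tie on the first modifier, so the second modifier (ascending on ?y) decides their relative order; limit(ordered, 0) would return all rows, matching the convention that n = 0 means no cut-off.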

B.6 Graph Queries
A graph query Q is defined as a tuple (R, C, E, O, n), where R is a relational graph pattern, C is a selection condition, E is a sequence of return elements, O = (o₁, …, oₘ) is a sequence of order modifiers, and n is a non-negative integer. We assume that R is the unique mandatory component. Given a variable v ∈ dom(R), the remaining components have the following expressions by default: C is v = v, E is v, O is (v, asc) and n = 0.
The evaluation of Q over G is defined as limit(S, n) where S = return(Ω, E)_G, S ⊨ O, and Ω = {µ|V | µ ∈ ⟦R⟧_G ∧ µ ⊨_G C}. We will assume that every graph query Q = (R, C, E, O, n) satisfies the following two conditions: (i) for every sub-pattern R′ = (R₁ OPT R₂) of R and for every variable v occurring in R, if v occurs both inside R₂ and outside R′, then it also occurs in R₁; (ii) Var(C) ⊆ Var(R). Then, we say that Q is a well-designed graph query.
We finish this section noting that the semantics of a declarative query expression:

Figure 1: Information on presidency of Chile.

Figure 2: Reified triples representing the duration of a presidency.

Figure 3: RDF* triples with another triple as the subject.

An RDF* graph is a finite set of RDF* triples.

Figure 4: Property graph representing the information about the presidency of Chile.

Figure 5: Wikidata statement group for Michelle Bachelet.

Figure 6: Directed labeled graph reifying the statements of Figure 5.

Figure 7: Property graph representing statements of Figure 5.

Figure 8: RDF* for one of the statements of Figure 5.

Figure 11: A property domain graph describing venues, papers, universities, authors, locations, and their relations.
MATCH Pattern WHERE Filters RETURN Variables

Data Intelligence Volume x, Number x

When evaluated over a property domain graph, such a query will return a multiset of mappings binding Variables to database objects (or values) that satisfy the Pattern specified in the MATCH clause and the Filters specified in the WHERE clause.

EXAMPLE 4.15. In order to find the most recent paper by Donald Knuth, we can use the following DGQL query:

MATCH (?x { name : "D.Knuth" })-[:author]->(?y) ORDER BY DESC ?y.year RETURN ?x, ?x.name LIMIT 1

Table 2: Query features supported by graph query languages (BGP = basic graph patterns, RGP = relational graph patterns, QE = querying edges, RPQ = regular path queries, NGP = navigational graph patterns, FPR = full path recovery). The symbol ∼ is used to indicate partial support of a feature.
Query language | BGP | RGP | QE | RPQ | NGP | FPR

Figure 12: MillenniumDB Architecture

Storage and indexing. Let us start by explaining the Disk and Storage Manager part of the MillenniumDB architecture from Figure 12. The main components of the domain graph data model are objects. Objects are represented internally as 8-byte identifiers. To optimize query execution, identifiers are divided into classes, and the first byte of the identifier specifies the class it belongs to. The main classes in a property domain graph G = (O, γ, lab, prop) are:

Figure 15: General structure of queries in MillenniumDB

Figure 17: Graphical representation of a basic graph pattern (left), and the tabular representation of the solution mappings (right) obtained by evaluating the basic graph pattern over the property domain graph shown in Figure 10.
k₂ and v₁.k₁ = v are selection conditions; and (b) if C₁, C₂ are selection conditions, then (¬C₁), (C₁ ∧ C₂), (C₁ ∨ C₂) are selection conditions. Given a property domain graph G = (O, γ, lab, prop), a mapping µ, and a selection condition C, we say that µ satisfies C under G, denoted by µ ⊨_G C, if one of the following statements holds:
the output of the graph query (R, C, E, O, n) on an input graph.

Table 1
The following query extends that of Example 4.6 by binding paths witnessing each result to a variable ?p:

This returns a string representation of a shortest path for each result, as follows:

We can also combine different features to capture more complex paths.

EXAMPLE 4.8. The following query looks for publications of staff and students of U.S. institutions, further including the direct citations of these publications:

Suppose we wish to find, for example, instances of self-citation of staff at U.S. institutions; we could write this as follows:

Considering that the graph patterns considered previously allow for extracting tables from graphs, a way to enrich a graph query language is to support relational operators over these tables [3]; this gives rise to the notion of relational graph patterns. Thus far we have seen the ability to project results with SELECT, and apply selections over results with WHERE. We have also spoken about how basic and navigational graph patterns can be interpreted as natural joins in the relational algebra.

Navigational graph patterns. If we further allow path queries within basic graph patterns, we arrive at navigational graph patterns [3].

EXAMPLE 4.13.

DGQL also supports optional graph patterns, which behave akin to left outer joins in the relational algebra, i.e., they allow for extending solutions with data that may or may not be available; in case the data of the optional pattern are not available, the solution is still returned and the optional data are left blank.

EXAMPLE 4.14. Assume we want to find the authors who have published articles in the Journal of the ACM, their affiliation in those articles, and, if available, the organization at which they are currently staff. The following DGQL query achieves this:

MATCH (?v)-[?e :author]->(?w), (?w)-[:venue]->(?x { name = "J.ACM" }), (?e)-[:org]->(?y)

Table 3: Wikidata Truthy sizes when loaded into each engine. The base dataset consists of roughly 1.25 billion triples.

Table 4 (middle) and Figure 13 (middle) show the results for this set. Comparing different join execution strategies within MillenniumDB, we can see the superiority of the leapfrog triejoin variant (particularly on average, i.e., for more complex queries). The Selinger variant of MillenniumDB outperforms the greedy algorithm for join selection, but only marginally. Compared with existing graph engines, MillenniumDB clearly outperforms other systems on this query set. Its medians are an order of magnitude faster than those of Blazegraph, the next best contender. The difference is less sharp for averages, but MillenniumDB LF still takes 60% of the time of Virtuoso, the next best contender.

Table 4: Summary of runtimes (in seconds) for BGPs

Table 5: Summary of runtimes (in seconds) for path queries

Table 6: Average and median runtimes, in seconds, for MillenniumDB on the complete version of Wikidata