The Model Graph
Graphs and Schema
In discrete mathematics a graph is a structure defining a set of objects in which some pairs of the objects are in some sense “related”. A property graph is a graph where the objects can contain arbitrary key/value pairs. Property graphs provide an excellent method for storing highly interconnected data since the inter-connectivity of the component objects is recorded directly in the data structure.
Objects stored in a Graph Database are called nodes. Nodes act like rows in an RDBMS, except they are not tied to any specific schema. While this allows any node to store arbitrary data it can cause problems when indexing that data.
Neo4j solves this problem by allowing a lightweight schema on the graph whereby certain object types will be identified as having a given key. This key can then be used to build an index for that type. In the same way that a key in an RDBMS stops the need for a table scan, so a key in a graph database stops the need for the entire graph to be scanned. Indexes can be used for initial lookup, and then high speed graph traversals can be used to get to the requested data set.
The schema is lightweight insofar as arbitrary properties can still be applied to a node. The schema simply allows certain properties to be added to an index if present. With Neo4j Enterprise it is also possible to mandate that a property exists, and ensure the property is unique, as well as asserting a property exist on a relationship.
Models
Mycelium, Tech Marionette’s flagship product, makes extensive use of the flexibility of property graphs. This provides some unique challenges due to the way it needs to model the world.
Mycelium provides a toolkit which lets you define the model of your world at runtime. Mycelium knows nothing about that model other than a set of constraints that the model must follow. Among other things the model defines the records that can exist in the database, and how those records can be linked.
The lightweight schema provided by Neo4j lacks many of the semantics needed by the model. For example, if a node contains optional, unindexed properties it is impossible to unambiguously define that within the schema. The graph defines all possible properties that haven’t been explicitly defined in the schema as valid, optional, unindexed properties. Similarly, while it’s possible to constrain relationship properties, there is no facility to constrain relationship paths.
Mycelium solves this by providing its own schema in the form of a Model, stored in the graph.
Exploded Graphs
Property graphs are a logical representation of a larger pure graph that allow us to visualise portions of the graph as a single object. A node N
in a property graph can have n
properties, each with its own key/value pair. Our node could therefore be defined as:
(:N {key1: value1, ... , keyn: valuen})
It is possible to explode this into a pure graph such that:
(n:N)-[:HAS]->(p1:P)-[:KEY]->(key1)
(p1)-[:VALUE]->(value1)
...
(n)-[:HAS]->(pn:P)-[:KEY]->(keyn)
(pn)-[:VALUE]->(valuen)
Since we can collapse the property graph at any point it is also possible to define our original node as:
(n:N)-[:HAS]->(p1:P {key: key1, value: value1})
...
(n)-[:HAS]->(pn:P {key: keyn, value: valuen})
Here we are still using a property graph, but the properties on the original nodes are now nodes in their own right. Unlike the single node version, this representation allows us to easily store metadata on the properties. Alongside the key
and value
properties we can store extra information, such as type data.
Model Graph
By using an expanded definition for records a complex model tree can be build in the graph that allows for precise definition of what is and isn’t allowed in that model. Relationships between those record definitions explicitly define the possible relationships the records themselves can have. Not only does this give us better control over how the data must look, it allows us to define how the data can look. For example, we can say that there exists potential for two records to be linked via a relationship without mandating that that relationship must exist.
Within Mycelium a model is defined as one or more schema, each with one or more record definitions, and any possible relationships between those records. Given the inherent connectedness of most real world models, the relationships also define distinct networks to limit traversal to the aspects we’re actually interested in.
Without networks, it is easy to accidentally return the entire graph in a query. For example, a report that shows all people linked to an application would simply return everyone in an organisation, since it would be possible to trace connections to everyone in the organisation through unrelated relationships such as reporting lines.
The model also provides the facility to locate nodes representing users, and define their access to sections of the model through direct relationships in the model. Mycelium can check the user has access to checking they have a direct connection to that portion of the graph.
Routing and Data Shadows
The Model graph allows us to perform two important functions with Mycelium. Given a start and end point it is possible to identify potential paths between those points in the Model by following predefined routes. This removes the need to understand the complexities of the models and its connections, focusing instead on the data required. This is especially important with sparse data sets where the data itself would not allow the complete traversal of a route due to missing entries.
These data shadows, that is missing data where data is expected, become readily apparent with the Model Graph, allowing us to find areas with low densities of records, low relative densities of links, and records with incomplete data.
This becomes especially interesting when data becomes partially isolated from the main data set due to missing or incomplete information. Finding small areas of a graph that are completely isolated is a relatively trivial problem, however, due to the highly interconnected nature of most data sets it may be possible to traverse the graph via alternative routes to partially isolated data. This gives a false impression that the data is properly integrated and connected into the graph. When comparing the routes against the model graph it is possible to check if relationships defined there are mirrored in the dataset. Areas where relationships are defined in the model, but are sparse or missing entirely in the dataset are a good indication of data quality issues. Using the defined data model these quality checks can be performed automatically.
Indexes
Mycelium allows for an arbitrary model of the world to be defined at runtime. This poses a problem when it comes to indexing data as you do not know the shape of that data up front.
Our first solution to this problem was to mirror the exploded format used by the model when writing records. Thus, any record R
had N
property nodes associated with it.
(r:R)-[:HAS]->(p1:P {key: key1, value: value1})
...
(R)-[:HAS]->(pN:P {key: keyN, value: valueN})
From here it is entirely possible to place an index on P.key
and P.value
and store entirely arbitrary data under an index. For an unindexed graph with all the properties stored on a node you might run the following:
MATCH (n:N)
WHERE pn = value
RETURN n
Performance on this query will be slow, with speed decreasing as the graph grows in size. For the exploded model the query becomes more complex:
MATCH (n:N)-[:HAS]->(p:P)
WHERE p.key = pn AND p.value = value
RETURN n
However, with this model query times are well within the same order of magnitude as a graph with an index on N.pn
.
This model comes with a few downsides. Write times are longer. For any given node with n
properties the single node model requires 1 node and n
properties. The expanded model requires n+1
nodes, 2n
properties and n
relationships.
Also, queries become more complex.
MATCH (n:N)
WHERE pn > 100 AND pn+1 < 100
RETURN n
Becomes:
MATCH (n:N)-[:HAS]->(p1:P), (n)-[:HAS]->(p2:P)
WHERE
p1.key = pn AND p1.value > 100 AND
p2.key = pn+1 AND p2.value < 100
RETURN n
While this still outperforms an unindexed single node model, it’s an order of magnitude slower than achievable by putting indexes on that single node. The complexity of the queries grows with the complexity of the clauses making queries hard to reason about.
Finally, since all properties are the same basic type, it is not possible to have an unindexed property. By definition both the property name and its value must go into the index. Since this index is shared we also can’t specify uniqueness constraints as this would require uniqueness across the entire dataset, not just the record type.
Dynamic Indexes
Since Mycelium controls the creation and updating of the Model Graph it is also possible for to create and drop indexes as the model changes. This allows for the possibility of dynamic indexes. That is, indexes that are created as needed at runtime.
The simplest way of applying dynamic indexing is to have a Record
node type and simply create an index for each distinct property name. Not only does this potentially mean thousands of indexes on a single node type, it also still precludes uniqueness constraints as property names may be shared.
A more elegant solution is to provide a new node type for each record type. This allows the use of built in constraints. Additionally, it limits the number of indexes in play on creation and querying to the handful related to the node, regardless of the total number of indexes maintained by the database.
Dynamic Queries
Mycelium uses is own custom query and reporting language, Orphic, to handle querying the stored data. Orphic allows queries to be defined in a manner that a business analyst can understand and then transpiles that query into Cypher which can be run over the graph. There is enough information in the Orphic query to build the required Cypher query with no knowledge of the model being used at compile time.
The use of Orphic allows us to abstract away the model graph, and prevents the user from accessing portions of the graph they do not have permission to access. As improvements to indexing and other aspects of Neo4j are released the use of Orphic allows us to shield the user from any changes in the underlying data structures used by Mycelium.
Putting it all together
Mycelium uses a property graph with an overlaid custom schema. This schema itself uses an exploded property graph to define a model of the world. It is created at runtime and can be manipulated dynamically during the lifetime of a Mycelium install.
The custom schema is used to generate a Neo4j schema, and to validate data as it is loaded. When a record is written the schema for that section of the graph is loaded and the data checked against it, allowing us to reject or sanitise bad data. When a record is read it is used to help located routes through the model without having to know these explicitly. With all access it allows users to be easily constrained to the portion of the graph they are allowed to manipulate.
Our query language, Orphic, abstracts details of the graph implementation from the user letting them focus on the data in a manner that feels natural, whilst still leveraging the interconnectedness of the underlying graph.
Recent Comments