Mapping Biolink Model to a Neo4j property graph
This section describes how a Neo4j database is mapped to the Biolink Model.
Although specific to Neo4j, these recommendations should hold for any Labeled Property Graph (LPG) model, e.g. a Python networkx graph (specifically a MultiDiGraph).
For mapping to RDF graphs refer to Mapping to RDF.
Nodes
All nodes in the Neo4j database should be mapped to NamedThing.
Biolink Model defines a typology of nodes, all of which inherit from NamedThing.
Nodes in Neo4j (and property graphs in general) may have node properties.
The NamedThing class defines core properties for a node, plus additional (optional) ones.
Core properties for a node:
In the core Biolink Model, each node must have an ID which is a CURIE [informative]. When mapping nodes to RDF, the node ID MUST correspond to the IRI, using the JSON-LD context [normative].
Note: this is distinct from the internal autogenerated id field in Neo4j.
The name field SHOULD correspond to a concise display label for the entity. For example asthma or Wnt signaling pathway. If the node is an ontology class then name will correspond to the rdfs:label of that class.
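The correspondence between a node's CURIE and its IRI can be sketched in a few lines of Python. This is only an illustration under stated assumptions: PREFIX_MAP is a hand-written stand-in for the prefix section of a JSON-LD context, and the function name expand_curie is hypothetical rather than part of any Biolink tooling.

```python
# Hand-written stand-in for the prefix section of a JSON-LD context.
# The entries are illustrative assumptions, not copied from the Biolink context.
PREFIX_MAP = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
}


def expand_curie(curie: str, prefix_map: dict = PREFIX_MAP) -> str:
    """Expand a CURIE such as 'GO:0016055' into a full IRI."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in prefix_map:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefix_map[prefix] + local_id


# A node carries the CURIE as its id and a display label as its name.
node = {"id": "GO:0016055", "name": "Wnt signaling pathway"}
iri = expand_curie(node["id"])
```

Keeping the CURIE in the node's id property and expanding it only at export time leaves the stored identifiers short, while the RDF view remains derivable from the prefix map.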
Any Neo4j instance MAY provide as many additional properties as required. These SHOULD come from a registered list of properties for that node type.
For example, the Genotype class provides a property has_zygosity, which is specific to that class (and its sub-classes, if any).
Note: While the CURIE for a property is biolink:name, that does not necessarily mean the property name has to be biolink:name in Neo4j. Instead, the prefix part of the property can be omitted such that the property name is just name.
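As a minimal sketch of the prefix convention in the note above, the following Python snippet renames prefixed property keys to their local names before they are written to Neo4j. The helper names local_property_name and to_neo4j_properties are hypothetical, and the input record is made up for illustration.

```python
BIOLINK_PREFIX = "biolink:"


def local_property_name(curie: str) -> str:
    """Drop the 'biolink:' prefix so 'biolink:name' is stored as 'name'."""
    if curie.startswith(BIOLINK_PREFIX):
        return curie[len(BIOLINK_PREFIX):]
    return curie


def to_neo4j_properties(record: dict) -> dict:
    """Rename every key of a node record to its Neo4j property name."""
    return {local_property_name(key): value for key, value in record.items()}


# Illustrative record whose keys are property CURIEs.
record = {
    "biolink:id": "GO:0016055",
    "biolink:name": "Wnt signaling pathway",
}
props = to_neo4j_properties(record)
```

Keys that do not carry the biolink: prefix are passed through unchanged, so local properties are not affected.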
Use of Neo4j labels
Nodes in Neo4j can be tagged with label(s) indicating a grouping to which the node belongs. The category field in the model MUST map to a Neo4j label. The Biolink Model class name in CamelCase MUST be used for the category field.
Additionally, the Neo4j implementation MAY have additional categories which are super-classes of a specific category.
For example, if a node representing a particular type of neuron has category Cell, then the Neo4j graph may also tag the node with AnatomicalEntity as label, in addition to Cell.
Consequently, any number of additional local labels MAY also be used.
In addition to Neo4j labels, subclass_of edges may be used to connect a node to an ontology class node.
Implementation Note: Cypher queries that use labels are optimized for speed, under the assumption that an index has already been generated in Neo4j for said label(s).
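The labelling scheme described in this section, a category plus optional super-class labels, can be sketched as follows. The PARENT map is a toy, abbreviated hierarchy assumed for illustration (the real Biolink Model has more intermediate classes), and labels_for and create_statement are hypothetical helper names. The snippet only builds a Cypher string; it does not connect to a database.

```python
# Toy, abbreviated parent map used only for illustration.
# The real Biolink Model hierarchy contains more intermediate classes.
PARENT = {
    "Cell": "AnatomicalEntity",
    "AnatomicalEntity": "NamedThing",
    "NamedThing": None,
}


def labels_for(category: str, parent: dict = PARENT) -> list:
    """Return the category followed by its super-classes, most specific first."""
    labels = []
    current = category
    while current is not None:
        if current not in parent:
            raise KeyError(f"unknown category: {current!r}")
        labels.append(current)
        current = parent[current]
    return labels


def create_statement(category: str) -> str:
    """Build a parameterised Cypher CREATE statement carrying all labels."""
    label_part = ":".join(labels_for(category))
    return f"CREATE (n:{label_part} {{id: $id, name: $name}})"


statement = create_statement("Cell")
```

Tagging the node with every super-class label means a query on AnatomicalEntity also finds nodes whose most specific category is Cell, at the cost of storing redundant labels.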
Terminology note: The term label is overloaded. In RDF it usually denotes the name of an entity (rdfs:label). For this reason we use category as the property name in Biolink Model.
Edges
Each edge in the Neo4j graph should have an edge label or relationship type that is a sub-property of related_to.
For example, two protein nodes may be related via physically_interacts_with relationship type.
Note: Always use snake_case to represent edge labels.
The set of edge labels is deliberately kept minimal. This is partly for practical reasons. Neo4j has no easy way to automatically use sub-property relationship types in Cypher queries. For example, if we have a deep hierarchy of interaction relationships including specific physical interactions such as ‘phosphorylates’, then queries for any kind of interaction must be expanded to include all sub-property relationship types.
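To make the cost of a deep relationship hierarchy concrete, the sketch below expands one relationship type into all of its sub-property types and builds the Cypher pattern that a query would then need. The SUB_PROPERTIES map is a hypothetical hierarchy that merely mirrors the example in the text, and both function names are assumptions.

```python
# Hypothetical sub-property hierarchy, keyed by parent relationship type.
# It mirrors the example in the text and is not taken from the model itself.
SUB_PROPERTIES = {
    "related_to": ["interacts_with"],
    "interacts_with": ["physically_interacts_with"],
    "physically_interacts_with": ["phosphorylates"],
}


def expand_relationship_type(rel_type: str, sub_properties: dict = SUB_PROPERTIES) -> list:
    """Return rel_type followed by all of its descendants (depth-first)."""
    expanded = []
    stack = [rel_type]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue
        expanded.append(current)
        # reversed() keeps children in their listed order once popped.
        stack.extend(reversed(sub_properties.get(current, [])))
    return expanded


def match_query(rel_type: str) -> str:
    """Build a Cypher MATCH that accepts rel_type or any of its sub-properties."""
    types = "|".join(expand_relationship_type(rel_type))
    return f"MATCH (a)-[r:{types}]->(b) RETURN a.id, type(r), b.id"


query = match_query("interacts_with")
```

With only three levels the pattern is already three types long; every additional sub-property lengthens every such query, which is the practical argument for a small set of edge labels.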
More precise relationship types are allowed through the use of the relation property.
Edge properties
Neo4j uses a property graph model, where any number of properties can be attached to an edge. Some properties may be generic, while some may only pertain to particular kinds of relationship type.
Edges SHOULD have a relation property which encodes the most specific relationship type for the relationship. This MAY correspond to the edge label, or it MAY be more specific. The relation property MUST be encoded as a CURIE or IRI.
Edges can also have generic properties that change the meaning of the edge itself. For example, the generic property negated logically negates the assertion defined by the edge.
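The edge conventions above (a snake_case edge label, a relation encoded as a CURIE or IRI, and an optional negated flag) can be checked with a small validator. This is a sketch under assumptions: the regular expressions are deliberately loose shape checks, validate_edge is a hypothetical helper, and the EX: identifiers are made-up placeholders.

```python
import re

# Loose shape check that accepts both CURIEs (prefix:local_id) and IRIs.
# An assumption for illustration, not a full CURIE or IRI grammar.
CURIE_OR_IRI = re.compile(r"^[A-Za-z_][\w.\-]*:\S+$")
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def validate_edge(edge: dict) -> list:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    if not SNAKE_CASE.match(edge.get("edge_label", "")):
        problems.append("edge_label must be snake_case")
    relation = edge.get("relation")
    if relation is None:
        problems.append("relation is missing (edges SHOULD have one)")
    elif not CURIE_OR_IRI.match(relation):
        problems.append("relation must be a CURIE or IRI")
    if "negated" in edge and not isinstance(edge["negated"], bool):
        problems.append("negated must be a boolean")
    return problems


# Placeholder identifiers; EX: is a made-up prefix.
good_edge = {
    "subject": "EX:protein1",
    "edge_label": "physically_interacts_with",
    "object": "EX:protein2",
    "relation": "EX:0000001",
    "negated": False,
}
bad_edge = {
    "edge_label": "PhysicallyInteractsWith",
    "relation": "no curie here",
    "negated": "yes",
}
good_problems = validate_edge(good_edge)
bad_problems = validate_edge(bad_edge)
```

A missing relation is reported as a finding rather than raised as an error, reflecting that the text states it as SHOULD, not MUST.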
Association types
Biolink Model includes a hierarchy of Association types.
Note: This is distinct from the relation hierarchy, although in some cases they parallel one another.
For example, the relation hierarchy has a generic relation part_of. This can be used in different contexts. For example, connecting two anatomical entities, or connecting a pathway to a sub-pathway.
Formally, the Association hierarchy is a classification of edge objects. Any edge SHOULD be mappable to an association_type, using the subject’s category, edge’s relation and object’s category.
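One way to picture this mapping is a rule table keyed by subject category, relation and object category. The sketch below is an assumption-laden illustration: the RULES table is hypothetical, the association names are chosen to resemble Biolink classes and should be checked against the model, relations are written as plain names rather than CURIEs for brevity, and None is used as a wildcard.

```python
# Hypothetical rule table: (subject category, relation, object category) -> association type.
# None acts as a wildcard. Association names should be checked against the model.
RULES = [
    (("Gene", None, "Disease"), "GeneToDiseaseAssociation"),
    (
        ("AnatomicalEntity", "part_of", "AnatomicalEntity"),
        "AnatomicalEntityToAnatomicalEntityPartOfAssociation",
    ),
]


def association_type(subject_category: str, relation: str, object_category: str) -> str:
    """Return the first matching association type, or the generic 'Association'."""
    triple = (subject_category, relation, object_category)
    for pattern, name in RULES:
        if all(p is None or p == value for p, value in zip(pattern, triple)):
            return name
    return "Association"


anatomy = association_type("AnatomicalEntity", "part_of", "AnatomicalEntity")
pathway = association_type("Pathway", "part_of", "Pathway")
```

The same part_of relation yields a specific association type between two anatomical entities and only the generic Association between two pathways, which shows how the association hierarchy is distinct from the relation hierarchy.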
Different association types may have different properties associated with them.
The core properties are:
Note: Three of these properties are built in, so they need not correspond to edge properties (although they may, for the sake of verbosity).
The edge_label is a snake_case human-readable high-level grouping relationship.
In contrast, relation is a CURIE from a more refined relationship ontology like RO or SIO.