KGX Specification#
The KGX format is a serialization of Biolink Model compliant knowledge graphs. This document outlines the structure and organization of this format, detailing the required fields and their significance, along with examples in various formats. KGX supports multiple serialization formats, including JSON, TSV, JSON Lines, and RDF Turtle. Thus KGX is both a format specification and a toolkit for serializing data conformant to Biolink Model in a variety of formats.
Quick Note on Validation#
KGX uses the official Biolink Model JSON Schema for validation.
The Biolink Model JSON Schema defines a root class called KnowledgeGraph
with two key properties:
nodes
: A list of entities (instances ofNamedThing
or its descendants)edges
: A list of associations (instances ofAssociation
or its descendants)
KGX JSON Lines format is simply Biolink Model-compliant JSON objects, one per line. There is no separate “KGX schema”—we validate directly against the Biolink Model schema. For a detailed explanation of how KGX validation works, see: KGX and Biolink Model JSON Schema Validation.
There are some notable initial design decisions for KGX that influence the behavior of the KGX toolkit:
KGX is a serialization format for Biolink Model compliant knowledge graphs
KGX is a flat file format that can be processed, subset, and exchanged easily
Each node or edge is represented with all properties that describe it
KGX prefers that all properties are valid Biolink Model properties, however it is designed to be lenient and allow non-Biolink Model properties in an effort to be more inclusive of existing knowledge graphs and allow Biolink to evolve without breaking existing knowledge graphs.
KGX is not a knowledge graph, but a serialization format for knowledge graphs
KGX is not a knowledge graph model, but a serialization format for knowledge graph models. It follows the Biolink Model.
Introduction#
The KGX format is a serialization of Biolink Model compliant knowledge graphs. This specification defines how this format is structured and organized, describing the required fields and their significance, with examples in various formats.
KGX Format#
The KGX format represents Biolink Model compliant knowledge graphs as flat files that can be processed, subset, and exchanged easily. Each node or edge is represented with all properties that describe it.
Node Record Elements#
We refer to each serialization of a node as a Node record, with the following elements:
Base-level Required Elements:
id
: CURIE that uniquely identifies the node in the graphcategory
: Multivalued list with values from the Biolink NamedThing hierarchy
Common Optional Properties:
name
(string): Human-readable name of the entitydescription
(string): Human-readable description of the entityprovided_by
(array of strings): Sources that created or assembled this node (NOTE: for nodes only, not edges)xref
(array of strings): Database cross-references as CURIEs (e.g., [“ENSEMBL:ENSG00000123456”])synonym
(array of strings): Alternative names for the entityiri
(string): IRI form of the identifiersame_as
(array of strings): Entities that are semantically equivalenturl
(string): Full URL for an external resource about the node (when CURIE is not sufficient)
Domain-Specific Properties:
For Genes/Proteins:
in_taxon
(array of strings): Taxonomic classification CURIEs (e.g., [“NCBITaxon:9606”] for human)in_taxon_label
(string): Human-readable taxon name (e.g., “Homo sapiens”)symbol
(string): Gene symbol
For Publications:
publications
(array of strings): Referenced publications as CURIEs (e.g., [“PMID:12345678”])Note: While
publications
is typically an edge property, it can be used on nodes representing publication entities themselves
Additional Notes:
Non-Biolink Model properties are allowed and won’t violate the specification - this was an intentional design decision to be more inclusive of existing knowledge graphs and allow Biolink to evolve without breaking existing knowledge graphs.
Edge Record Elements#
Each serialization of an edge (Edge record) includes:
Base-level Required Elements:
subject
: ID of the source nodepredicate
: Relationship type from Biolink related_to hierarchyobject
: ID of the target nodeknowledge_level
: Level of knowledge representation according to Biolink Model (see valid values below)agent_type
: Type of autonomous agent that generated the edge according to Biolink Model (see valid values below)
Edge Provenance:
primary_knowledge_source
: The most upstream source of knowledge (e.g., “infores:mondo”)aggregator_knowledge_source
: Intermediate resources through which knowledge passed (e.g., [“infores:biolink-api”])publications
: List of publication CURIEs supporting the edge (e.g., [“PMID:12345678”])
Optional Elements:
Biolink Model properties:
category
,publications
, etc.Note: Non-Biolink Model properties are allowed and won’t violate the specification - this was an intentional design decision to be more inclusive of existing knowledge graphs and allow Biolink to evolve without breaking existing knowledge graphs.
Valid Values for Required Edge Properties:
KnowledgeLevelEnum - Indicates the level of knowledge representation:
knowledge_assertion
: A statement of purported fact based on assessment of direct evidencelogical_entailment
: A conclusion that follows logically from established factsprediction
: A statement of possible fact based on probabilistic reasoningstatistical_association
: Reports concepts/variables as statistically associated in a datasetobservation
: Reports (and possibly quantifies) a phenomenon observed to occurnot_provided
: Knowledge level cannot be determined
AgentTypeEnum - Indicates the type of autonomous agent that generated the edge:
manual_agent
: Human-driven agentautomated_agent
: Fully automated processdata_analysis_pipeline
: Structured data processing pipelinecomputational_model
: Model-based computationtext_mining_agent
: Text mining/NLP processesimage_processing_agent
: Image analysis processesmanual_validation_of_automated_agent
: Human validation of automated resultsnot_provided
: Agent type information not available
Important Notes on Deprecated Properties:
relation
(DEPRECATED for edges): Therelation
slot is deprecated in the Biolink Model. While KGX still supports it for backwards compatibility, new implementations should usepredicate
instead. Thepredicate
property represents high-level semantic relationships from the Biolink Model hierarchy, whilerelation
was originally intended for more specific ontological relations (e.g., from the Relations Ontology).provided_by
vs knowledge source properties:provided_by
is a node-only property representing sources that created or assembled the nodeFor edges, use
primary_knowledge_source
andaggregator_knowledge_source
insteadThis distinction is critical for proper knowledge provenance tracking
When using KGX as a serialization framework (e.g. the “Transform” operations), note that KGX will try to add required properties with default values when not provided by the user. It will also assign Biolink categories to nodes if not provided by the user. This is done to ensure that the resulting knowledge graph is Biolink Model compliant.
Format Serializations#
KGX supports multiple serialization formats for knowledge graphs. KGX also has a very lightweight schema that imports the Biolink Model and makes two key adjustments to Biolink’s class hierarchy: it adds an is_a relationship between “biolink:NamedThing” and “kgx:Node” and an is_a relationship between “biolink:Association” and “kgx:Edge”
For more information and examples of the KGX overlay schema, please see: KGX Schema Generation. For convenience, this is the base KGX schema:
imports:
- linkml:types
- https://w3id.org/biolink/biolink-model
classes:
KnowledgeGraph:
description: A knowledge graph represented in KGX format
slots:
- nodes
- edges
Node:
description: A node in a KGX graph, superclass for NamedThing
slots:
- id
- name
- description
- category
- xref
- provided by
# ... other node slots ...
Edge:
description: An edge in a KGX graph, superclass for Association
slots:
- id
- subject
- predicate
- object
- relation
- category
- provided by
- knowledge source
# ... other edge slots ...
slots:
nodes:
range: Node
multivalued: true
inlined: true
edges:
range: Edge
multivalued: true
inlined: true
KGX format as JSON#
{
"nodes" : [
{
"id": "HGNC:11603",
"name": "TBX4",
"category": ["biolink:Gene"],
"provided_by": ["infores:hgnc"]
},
{
"id": "MONDO:0005002",
"name": "chronic obstructive pulmonary disease",
"category": ["biolink:Disease"],
"provided_by": ["infores:mondo"]
}
],
"edges" : [
{
"id": "urn:uuid:5b06e86f-d768-4cd9-ac27-abe31e95ab1e",
"subject": "HGNC:11603",
"predicate": "biolink:contributes_to",
"object": "MONDO:0005002",
"category": ["biolink:GeneToDiseaseAssociation"],
"primary_knowledge_source": ["infores:mondo"],
"publications": ["PMID:26634245", "PMID:26634244"]
}
]
}
KGX format as TSV#
KGX TSV format uses two files - one for nodes and another for edges.
nodes.tsv
id category name provided_by
HGNC:11603 biolink:NamedThing|biolink:BiologicalEntity|biolink:Gene TBX4 infores:gwascatalog
MONDO:0005002 biolink:NamedThing|biolink:BiologicalEntity|biolink:DiseaseOrPhenotypicFeature|biolink:Disease chronic obstructive pulmonary disease infores:gwascatalog
edges.tsv
id subject predicate object relation primary_knowledge_source category publications
urn:uuid:5b06e86f-d768-4cd9-ac27-abe31e95ab1e HGNC:11603 biolink:contributes_to MONDO:0005002 RO:0003304 infores:gwascatalog biolink:GeneToDiseaseAssociation PMID:26634245|PMID:26634244
Notes:
Multi-valued fields use pipe (
|
) as delimiterTSV can be sparse when some nodes have specialized properties
Column ordering can be inconsistent for non-core properties
KGX format as JSON Lines#
The JSON Lines format provides a simple and efficient way to represent KGX data where each line contains a single JSON object representing either a node or an edge. This format combines the advantages of JSON (flexible schema, native support for lists and nested objects) with the streaming capabilities of line-oriented formats.
File Structure#
A KGX JSON Lines bundle typically consists of two files:
{filename}_nodes.jsonl
: Contains one node per line, each as a complete JSON object{filename}_edges.jsonl
: Contains one edge per line, each as a complete JSON object
Together, these two files represent a complete KnowledgeGraph
as defined in the Biolink Model JSON Schema.
While the JSON format combines both nodes and edges in a single file under the KnowledgeGraph
structure,
the JSON Lines format splits them into separate files for efficient streaming and processing. Both approaches
validate against the same Biolink Model schema.
Node Record Format#
Required Properties#
id
(string): A CURIE that uniquely identifies the node in the graphcategory
(array of strings): List of Biolink categories for the node, from the NamedThing hierarchy
Common Optional Properties#
name
(string): Human-readable name of the entitydescription
(string): Human-readable description of the entityprovided_by
(array of strings): List of sources that provided this nodexref
(array of strings): List of database cross-references as CURIEssynonym
(array of strings): List of alternative names for the entity
Edge Record Format#
Required Properties#
subject
(string): CURIE of the source nodepredicate
(string): Biolink predicate representing the relationship typeobject
(string): CURIE of the target nodeknowledge_level
(string): Level of knowledge representation (observation, assertion, concept, statement) according to Biolink Modelagent_type
(string): Autonomous agents for edges (informational, computational, biochemical, biological) according to Biolink Model
Common Optional Properties#
id
(string): Unique identifier for the edge, often a UUIDrelation
(string): Relation CURIE from a formal relation ontology (e.g., RO)category
(array of strings): List of Biolink association categoriesknowledge_source
(array of strings): Sources of knowledge (deprecated:provided_by
)primary_knowledge_source
(array of strings): Primary knowledge sourcesaggregator_knowledge_source
(array of strings): Knowledge aggregator sourcespublications
(array of strings): List of publication CURIEs supporting the edge
Examples#
Node Example (nodes.jsonl):
Each line in a nodes.jsonl file represents a complete node record. Here are examples of different node types:
Example 1: Gene node with organism taxon
{
"id": "HGNC:11603",
"name": "TBX4",
"symbol": "TBX4",
"category": ["biolink:Gene"],
"in_taxon": ["NCBITaxon:9606"],
"in_taxon_label": "Homo sapiens",
"xref": ["ENSEMBL:ENSG00000121075", "NCBIGENE:9496"],
"provided_by": ["infores:hgnc"]
}
Example 2: Disease node
{
"id": "MONDO:0005002",
"name": "chronic obstructive pulmonary disease",
"category": ["biolink:Disease"],
"synonym": ["COPD", "chronic obstructive airway disease"],
"xref": ["DOID:3083", "UMLS:C0024117"],
"provided_by": ["infores:mondo"]
}
Example 3: Clinical trial node with URL
{
"id": "CLINICALTRIALS:NCT01015638",
"name": "Study of Drug X for Treatment Y",
"category": ["biolink:ClinicalTrial"],
"url": "https://clinicaltrials.gov/ct2/show/NCT01015638",
"provided_by": ["infores:clinicaltrials"]
}
Example 4: Publication node
{
"id": "PMID:12345678",
"name": "Genetic basis of disease X",
"category": ["biolink:Publication"],
"url": "https://pubmed.ncbi.nlm.nih.gov/12345678/",
"provided_by": ["infores:pubmed"]
}
In the actual jsonlines file, each record would be on a single line without comments and formatting:
{"id":"HGNC:11603","name":"TBX4","symbol":"TBX4","category":["biolink:Gene"],"in_taxon":["NCBITaxon:9606"],"in_taxon_label":"Homo sapiens","xref":["ENSEMBL:ENSG00000121075","NCBIGENE:9496"],"provided_by":["infores:hgnc"]}
{"id":"MONDO:0005002","name":"chronic obstructive pulmonary disease","category":["biolink:Disease"],"synonym":["COPD","chronic obstructive airway disease"],"xref":["DOID:3083","UMLS:C0024117"],"provided_by":["infores:mondo"]}
{"id":"CLINICALTRIALS:NCT01015638","name":"Study of Drug X for Treatment Y","category":["biolink:ClinicalTrial"],"url":"https://clinicaltrials.gov/ct2/show/NCT01015638","provided_by":["infores:clinicaltrials"]}
{"id":"PMID:12345678","name":"Genetic basis of disease X","category":["biolink:Publication"],"url":"https://pubmed.ncbi.nlm.nih.gov/12345678/","provided_by":["infores:pubmed"]}
Edge Example (edges.jsonl):
Each line in a jsonlines file represents a complete edge record. Here are examples of different edge types:
{
"id": "a8575c4e-61a6-428a-bf09-fcb3e8d1644d",
"subject": "HGNC:11603",
"object": "MONDO:0005002",
"predicate": "biolink:related_to",
"knowledge_level": "knowledge_assertion",
"agent_type": "computational_model"
}
{
"id": "urn:uuid:5b06e86f-d768-4cd9-ac27-abe31e95ab1e",
"subject": "HGNC:11603",
"predicate": "biolink:contributes_to",
"object": "MONDO:0005002",
"category": [
"biolink:GeneToDiseaseAssociation"
],
"primary_knowledge_source": [
"infores:gwas-catalog"
],
"publications": [
"PMID:26634245",
"PMID:26634244"
],
"knowledge_level": "observation",
"agent_type": "manual_agent"
}
{
"id": "c7d632b4-6708-4296-9cfe-44bc586d32c8",
"subject": "CHEBI:15365",
"predicate": "biolink:affects",
"object": "GO:0006915",
"category": [
"biolink:ChemicalToProcessAssociation"
],
"primary_knowledge_source": [
"infores:monarchinitiative"
],
"aggregator_knowledge_source": [
"infores:biolink-api"
],
"publications": [
"PMID:12345678"
],
"knowledge_level": "prediction",
"agent_type": "text_mining_agent"
}
In the actual jsonlines file, each record would be on a single line without comments and formatting:
{"id":"a8575c4e-61a6-428a-bf09-fcb3e8d1644d","subject":"HGNC:11603","object":"MONDO:0005002","predicate":"biolink:related_to","knowledge_level":"knowledge_assertion","agent_type":"computational_model"}
{"id":"urn:uuid:5b06e86f-d768-4cd9-ac27-abe31e95ab1e","subject":"HGNC:11603","predicate":"biolink:contributes_to","object":"MONDO:0005002","category":["biolink:GeneToDiseaseAssociation"],"primary_knowledge_source":["infores:gwas-catalog"],"publications":["PMID:26634245","PMID:26634244"],"knowledge_level":"observation","agent_type":"manual_agent"}
nodes.jsonl
{"id":"HGNC:11603","name":"TBX4","category":["biolink:Gene"]}
{"id":"MONDO:0005002","name":"chronic obstructive pulmonary disease","category":["biolink:Disease"]}
{"id":"CHEBI:15365","name":"acetaminophen","category":["biolink:SmallMolecule","biolink:ChemicalEntity"]}
edges.jsonl
{"id":"urn:uuid:5b06e86f-d768-4cd9-ac27-abe31e95ab1e","subject":"HGNC:11603","predicate":"biolink:contributes_to","object":"MONDO:0005002","relation":"RO:0003304","category":["biolink:GeneToDiseaseAssociation"],"primary_knowledge_source":["infores:gwas-catalog"],"publications":["PMID:26634245","PMID:26634244"],"knowledge_level":"observation","agent_type":"biological"}
Usage Notes#
All field values should follow the KGX specification and Biolink Model requirements
Arrays should be represented as JSON arrays (not pipe-delimited strings)
For large KGs, JSON Lines offers better streaming performance than monolithic JSON
KGX format as RDF Turtle#
@prefix OBO: <http://purl.obolibrary.org/obo/> .
@prefix biolink: <https://w3id.org/biolink/vocab/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<urn:uuid:5b06e86f-d768-4cd9-ac27-abe31e95ab1e> rdf:object OBO:MONDO_0005002 ;
rdf:predicate biolink:contributes_to ;
rdf:subject <http://identifiers.org/hgnc/11603> ;
biolink:category biolink:GeneToDiseaseAssociation ;
biolink:provided_by <https://archive.monarchinitiative.org/201806/gwascatalog> ;
biolink:publications <http://www.ncbi.nlm.nih.gov/pubmed/26634244>,
<http://www.ncbi.nlm.nih.gov/pubmed/26634245> ;
biolink:relation OBO:RO_0003304 .
<http://identifiers.org/hgnc/11603> rdfs:label "TBX4"^^xsd:string ;
biolink:category biolink:Gene ;
biolink:contributes_to OBO:MONDO_0005002 ;
biolink:provided_by <https://archive.monarchinitiative.org/201806/gwascatalog> .
OBO:MONDO_0005002 rdfs:label "chronic obstructive pulmonary disease"^^xsd:string ;
biolink:category biolink:Disease ;
biolink:primary_knowledge_source <https://archive.monarchinitiative.org/201806/gwascatalog> .