# KGX and Biolink Model JSON Schema Validation ## TL;DR - For the Skeptics **KGX is simply JSON Schema validation of JSON Lines data conformant to the Biolink Model JSON Schema.** That's it. Nothing more, nothing less. KGX JSON Lines format is just Biolink Model-compliant JSON objects, one per line. --- ## Understanding KGX Validation ### The Simple Truth KGX doesn't invent a new schema. KGX uses the **official Biolink Model JSON Schema** to validate knowledge graph data. When you serialize data in KGX JSON Lines format, you are creating JSON objects that validate against: **[Biolink Model JSON Schema](https://w3id.org/biolink/biolink-model/biolink-model.json)** ### How It Works 1. **Biolink Model** defines the schema for nodes (NamedThing) and edges (Association) 2. **KGX** serializes this data as: - **JSON**: Standard JSON with `nodes` and `edges` arrays - **JSON Lines**: One JSON object per line (streaming-friendly) - **TSV**: Tabular format with pipe-delimited multi-valued fields - **RDF**: Semantic web format 3. **Validation**: Each node and edge object conforms to the Biolink Model JSON Schema ### The Schema Hierarchy ``` Biolink Model JSON Schema (https://w3id.org/biolink/biolink-model/biolink-model.json) ↓ Defines: KnowledgeGraph (root class) ↓ KnowledgeGraph has two properties: - nodes: array of NamedThing instances - edges: array of Association instances ↓ KGX validates against these definitions ↓ Your JSON Lines data: One NamedThing or Association per line (split across {filename}_nodes.jsonl and {filename}_edges.jsonl) ``` --- ## Practical Examples ### What You Write (KGX JSON Lines) **nodes.jsonl:** ```json {"id":"HGNC:11603","name":"TBX4","category":["biolink:Gene"],"in_taxon":["NCBITaxon:9606"]} ``` **edges.jsonl:** ```json {"subject":"HGNC:11603","predicate":"biolink:contributes_to","object":"MONDO:0005002","knowledge_level":"observation","agent_type":"manual_agent"} ``` ### What Validates It The [Biolink Model JSON Schema](https://w3id.org/biolink/biolink-model/biolink-model.json) defines: - **Gene** (inherits from NamedThing) - Required: `id`, `category` - Properties: `name`, `symbol`, `in_taxon`, `xref`, etc. - **Association** (base class for all edges) - Required: `subject`, `predicate`, `object`, `knowledge_level`, `agent_type` - Properties: `publications`, `primary_knowledge_source`, `category`, etc. ### Verification You can validate your KGX JSON Lines data yourself using any JSON Schema validator: ```bash # Validate a node against Biolink Model schema cat nodes.jsonl | head -1 | \ ajv validate -s https://w3id.org/biolink/biolink-model/biolink-model.json -d - ``` --- ## Why JSON Lines? JSON Lines (`.jsonl`) is simply newline-delimited JSON. Each line is a complete, valid JSON object. **Advantages:** - **Streaming**: Process one record at a time (memory-efficient for large KGs) - **Parallel processing**: Split files and process chunks independently - **Append-friendly**: Add new records without rewriting the entire file - **Debugging**: Inspect individual records easily - **Standard format**: Widely supported by data tools (pandas, spark, etc.) **Still JSON Schema Compliant:** Each line is a JSON object that validates against the Biolink Model schema. The newline delimiter doesn't change the schema validation—it just changes how we store multiple objects. **JSON Lines as a Bundle:** When working with KGX JSON Lines format, you typically work with a **bundle** of two files: - `{filename}_nodes.jsonl`: Contains the `nodes` array from `KnowledgeGraph`, with one node per line - `{filename}_edges.jsonl`: Contains the `edges` array from `KnowledgeGraph`, with one edge per line This bundle represents the same `KnowledgeGraph` structure defined in the Biolink Model JSON Schema, but split into separate files for efficient processing. --- ## The Schema Relationship ### Biolink Model JSON Schema Structure The Biolink Model JSON Schema defines a root `KnowledgeGraph` class: ```json { "$defs": { "KnowledgeGraph": { "type": "object", "description": "A knowledge graph represented in KGX format", "properties": { "nodes": { "description": "A list of entities that can be a subject or object of an association", "type": ["array", "null"], "items": { "anyOf": [{ "$ref": "#/$defs/NamedThing" }, ...] } }, "edges": { "description": "A list of associations between two entities", "type": ["array", "null"], "items": { "anyOf": [{ "$ref": "#/$defs/Association" }, ...] } } } }, "NamedThing": { "type": "object", "properties": { "id": { "type": "string", "description": "CURIE identifier" }, "category": { "type": "array", "items": { "type": "string" }, "description": "Biolink categories" }, "name": { "type": "string" }, ... }, "required": ["id", "category"] }, "Gene": { "allOf": [ { "$ref": "#/definitions/NamedThing" }, { "properties": { "in_taxon": { "type": "array", "items": { "type": "string" } }, "symbol": { "type": "string" } } } ] }, "Association": { "type": "object", "properties": { "subject": { "type": "string" }, "predicate": { "type": "string" }, "object": { "type": "string" }, "knowledge_level": { "type": "string", "enum": ["knowledge_assertion", "logical_entailment", ...] }, "agent_type": { "type": "string", "enum": ["manual_agent", "automated_agent", ...] } }, "required": ["subject", "predicate", "object", "knowledge_level", "agent_type"] } } } ``` ### Your KGX Data Validates Against This Every node in your `nodes.jsonl` file must validate against the appropriate `NamedThing` subclass (Gene, Disease, ChemicalEntity, etc.). Every edge in your `edges.jsonl` file must validate against the appropriate `Association` subclass. --- ## Addressing Common Concerns ### "Why not just use the Biolink Model directly?" **You are.** KGX is a toolkit that: - Serializes Biolink Model-compliant data into various formats (JSON, TSV, RDF) - Validates data against the Biolink Model schema - Transforms between formats while maintaining Biolink compliance - Provides utilities for working with Biolink-compliant knowledge graphs ### "Is KGX adding extra requirements?" **No.** KGX follows the Biolink Model requirements exactly. If a property is required in Biolink Model, it's required in KGX. If it's optional in Biolink Model, it's optional in KGX. KGX is intentionally lenient—it allows non-Biolink properties to support knowledge graph evolution and real-world data, but all Biolink properties follow the official specification. ### "What about those agent_type values in the old docs?" **Fixed.** The documentation now correctly reflects the current Biolink Model `AgentTypeEnum` and `KnowledgeLevelEnum` values. See the [updated KGX format documentation](kgx_format.md). ### "How do I know my data is valid?" Three ways: 1. **Use KGX toolkit**: `kgx validate` command checks Biolink compliance 2. **JSON Schema validator**: Validate directly against `https://w3id.org/biolink/biolink-model/biolink-model.json` 3. **LinkML validator**: Use the LinkML tools to validate against the Biolink Model YAML schema --- ## Resources ### Official Biolink Model Resources - **JSON Schema**: [https://w3id.org/biolink/biolink-model/biolink-model.json](https://w3id.org/biolink/biolink-model/biolink-model.json) - **YAML Schema**: [https://w3id.org/biolink/biolink-model.yaml](https://w3id.org/biolink/biolink-model.yaml) - **Documentation**: [https://biolink.github.io/biolink-model/](https://biolink.github.io/biolink-model/) - **GitHub**: [https://github.com/biolink/biolink-model](https://github.com/biolink/biolink-model) ### KGX Resources - **KGX Format Specification**: [kgx_format.md](kgx_format.md) - **KGX Schema Generation**: [kgx_schema_generation.md](kgx_schema_generation.md) - **GitHub**: [https://github.com/biolink/kgx](https://github.com/biolink/kgx) --- ## Example Validation Workflow ### Step 1: Create KGX JSON Lines Data **nodes.jsonl** ```json {"id":"HGNC:11603","name":"TBX4","symbol":"TBX4","category":["biolink:Gene"],"in_taxon":["NCBITaxon:9606"],"in_taxon_label":"Homo sapiens"} {"id":"MONDO:0005002","name":"chronic obstructive pulmonary disease","category":["biolink:Disease"]} ``` **edges.jsonl** ```json {"id":"uuid:123","subject":"HGNC:11603","predicate":"biolink:contributes_to","object":"MONDO:0005002","knowledge_level":"knowledge_assertion","agent_type":"manual_agent","primary_knowledge_source":["infores:hgnc"],"publications":["PMID:12345678"]} ``` ### Step 2: Validate Using KGX ```bash # Validate the data kgx validate --input-format jsonl nodes.jsonl edges.jsonl # Transform and validate in one step kgx transform --input-format jsonl --output-format json \ --input-file nodes.jsonl --input-file edges.jsonl \ --output-file output.json ``` ### Step 3: Verify Against Biolink Schema (Optional) ```bash # Using ajv (Another JSON Schema Validator) npm install -g ajv-cli # Validate individual records cat nodes.jsonl | while read line; do echo "$line" | ajv validate \ -s https://w3id.org/biolink/biolink-model/biolink-model.json \ -d - done ``` --- ## Conclusion **KGX = Biolink Model JSON Schema + Practical Serialization Formats** - Uses official Biolink Model JSON Schema - Provides multiple serialization formats (JSON, JSON Lines, TSV, RDF) - Validates data against Biolink Model requirements - No additional schema overhead - Standard JSON Schema validation applies **Bottom line**: If your JSON Lines data validates against the Biolink Model JSON Schema, it's valid KGX. If it doesn't, it isn't. Simple as that. For detailed property requirements and examples, see the [KGX Format Specification](kgx_format.md).