KGX Validator#

The KGX Validator is a tool that validates graph data for compliance with the Biolink Model. It checks both nodes and edges for required properties, property value types, identifier syntax, and valid category and predicate names.

Key Validation Features#

The validator checks the following aspects of your knowledge graph:

Node Validation#

  • Required Properties: Verifies that all required node properties (as defined by Biolink Model) are present

  • Property Types: Ensures node properties have the expected value types (string, URI/CURIE, double, list)

  • Property Values: Validates that node identifiers are properly formed CURIEs with prefixes defined in the Biolink Model JSON-LD context

  • Categories: Validates that node categories exist in the Biolink Model and are properly formatted in CamelCase
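
The node checks listed above can be illustrated with a small standalone sketch. This is not the KGX implementation: `check_node` is a hypothetical helper, and the required-property, prefix, and category sets are tiny illustrative stand-ins for what KGX would read from the Biolink Model.

```python
import re

# Illustrative stand-ins only, NOT the real Biolink Model contents.
REQUIRED_NODE_PROPERTIES = {"id", "category"}
KNOWN_PREFIXES = {"NCBIGene", "MONDO", "HGNC"}
KNOWN_CATEGORIES = {"biolink:Gene", "biolink:Disease"}

# A CURIE is "prefix:local_id"; this pattern is a simplification.
CURIE_PATTERN = re.compile(r"^(?P<prefix>[A-Za-z_][\w.\-]*):(?P<local>\S+)$")
# Categories are assumed to look like "biolink:CamelCase".
CAMEL_CASE_PATTERN = re.compile(r"^biolink:[A-Z][A-Za-z0-9]*$")


def check_node(node_id, data):
    """Return a list of human-readable problems found for one node."""
    problems = []

    # Required properties
    for prop in sorted(REQUIRED_NODE_PROPERTIES):
        if prop not in data:
            problems.append(f"Required node property '{prop}' is missing")

    # Identifier must be a CURIE with a known prefix
    match = CURIE_PATTERN.match(node_id)
    if match is None:
        problems.append(f"Identifier '{node_id}' is not a CURIE")
    elif match.group("prefix") not in KNOWN_PREFIXES:
        problems.append(f"Prefix '{match.group('prefix')}' is not recognized")

    # Property type: category is expected to be a list
    categories = data.get("category", [])
    if not isinstance(categories, list):
        problems.append("Property 'category' is expected to be a list")
        categories = []

    # Category format, then membership in the (stand-in) model
    for category in categories:
        if not CAMEL_CASE_PATTERN.match(str(category)):
            problems.append(f"Category '{category}' is not in CamelCase form")
        elif category not in KNOWN_CATEGORIES:
            problems.append(f"Category '{category}' is not in the model")

    return problems


print(check_node("NCBIGene:1234", {"id": "NCBIGene:1234", "category": ["gene"]}))
```

Checking the format before model membership lets the sketch tell a misspelled category apart from one that is well formed but unknown.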

Edge Validation#

  • Required Properties: Verifies that all required edge properties are present

  • Property Types: Ensures edge properties have the expected value types

  • Property Values: Validates that edge subject and object values are properly formed CURIEs with prefixes defined in the Biolink Model JSON-LD context

  • Predicates: Validates that edge predicates exist in the Biolink Model and follow proper snake_case formatting
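
As a rough illustration of the edge checks above, the following standalone sketch tests required properties, CURIE syntax for subject and object, and snake_case predicates. It is not KGX code: `check_edge` is a hypothetical helper, and the property and predicate sets are illustrative stand-ins rather than the actual Biolink Model contents.

```python
import re

# Illustrative stand-ins only, NOT the real Biolink Model contents.
REQUIRED_EDGE_PROPERTIES = {"subject", "predicate", "object"}
KNOWN_PREDICATES = {"biolink:related_to", "biolink:interacts_with"}

# Simplified CURIE pattern: "prefix:local_id"
CURIE_PATTERN = re.compile(r"^[A-Za-z_][\w.\-]*:\S+$")
# Predicates are assumed to look like "biolink:snake_case"
SNAKE_CASE_PATTERN = re.compile(r"^biolink:[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def check_edge(data):
    """Return a list of human-readable problems found for one edge."""
    problems = []

    # Required properties
    for prop in sorted(REQUIRED_EDGE_PROPERTIES):
        if prop not in data:
            problems.append(f"Required edge property '{prop}' is missing")

    # Subject and object must be CURIEs
    for endpoint in ("subject", "object"):
        value = data.get(endpoint)
        if value is not None and not CURIE_PATTERN.match(str(value)):
            problems.append(f"Edge {endpoint} '{value}' is not a CURIE")

    # Predicate format, then membership in the (stand-in) model
    predicate = data.get("predicate")
    if predicate is not None:
        if not SNAKE_CASE_PATTERN.match(str(predicate)):
            problems.append(f"Predicate '{predicate}' is not in snake_case form")
        elif predicate not in KNOWN_PREDICATES:
            problems.append(f"Predicate '{predicate}' is not in the model")

    return problems


edge = {"subject": "HGNC:1", "predicate": "biolink:relatedTo", "object": "MONDO:2"}
print(check_edge(edge))
```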

Error Tracking#

The validator captures and reports the following types of errors:

  • Missing required properties

  • Invalid property types

  • Invalid property values

  • Invalid categories and predicates

  • Missing or malformed CURIEs

Usage Examples#

Basic Validation#

from kgx.validator import Validator
from kgx.transformer import Transformer

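# Load a graph
# Note: in recent KGX releases, transform() takes a dict of input arguments
# and the loaded graph is exposed on transformer.store
input_args = {'filename': ['input.json'], 'format': 'json'}
transformer = Transformer()
transformer.transform(input_args)
graph = transformer.store.graph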

# Validate the graph
validator = Validator()
validator.validate(graph)

# Get validation errors
errors = validator.get_errors()

Streaming Validation for Large Graphs#

For very large graphs, use the streaming mode to minimize memory usage:

kgx validate --input-format tsv --stream input_nodes.tsv input_edges.tsv

Command Line Interface#

KGX provides a command-line interface for validation:

kgx validate --input-format tsv \
             --biolink-release 3.1.0 \
             --error-log validation_errors.json \
             input_nodes.tsv input_edges.tsv

Understanding Validation Messages#

Validation messages are grouped into three severity levels:

  • INFO: Informational messages about suggested improvements

  • WARNING: “Should” level recommendations for better compliance

  • ERROR: Violations of “Must” level requirements of the model specification

Example validation error output:

{
  "ERROR": {
    "INVALID_CATEGORY": {
      "Category 'gene' is not in CamelCase form": ["NCBIGene:1234"]
    },
    "MISSING_NODE_PROPERTY": {
      "Required node property 'category' is missing": ["MONDO:0005737"]
    }
  }
}
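
A report in the shape shown above (level, then error type, then message, then a list of affected identifiers) is straightforward to post-process. The sketch below uses only the standard library; `summarize_report` is a hypothetical helper, not part of KGX, and it assumes the nesting matches the example output.

```python
import json


def summarize_report(report, level="ERROR"):
    """Count distinct affected identifiers per error type at one level."""
    summary = {}
    for error_type, messages in report.get(level, {}).items():
        affected = set()
        for identifiers in messages.values():
            affected.update(identifiers)
        summary[error_type] = len(affected)
    return summary


# The example report from above, parsed from JSON text
sample = json.loads("""
{
  "ERROR": {
    "INVALID_CATEGORY": {
      "Category 'gene' is not in CamelCase form": ["NCBIGene:1234"]
    },
    "MISSING_NODE_PROPERTY": {
      "Required node property 'category' is missing": ["MONDO:0005737"]
    }
  }
}
""")

print(summarize_report(sample))
# prints {'INVALID_CATEGORY': 1, 'MISSING_NODE_PROPERTY': 1}
```

Passing `level="WARNING"` or `level="INFO"` gives the same summary for the other levels, and a level absent from the report yields an empty dict.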

Benefits of Validation#

  • Ensures consistency across knowledge graphs

  • Identifies data quality issues early

  • Facilitates data integration from multiple sources

  • Improves interoperability between systems

  • Provides guidance on Biolink Model compliance

By using the KGX Validator, you can ensure your knowledge graphs adhere to the Biolink Model specifications, making them more interoperable and useful for downstream applications.