Writing a Reference Ingest Guide (RIG)

A Reference Ingest Guide (RIG) is a structured document that describes the scope, rationale, and modeling approach for ingesting content from an external source into a data repository compliant with the Biolink Model. This guide walks you through creating a RIG using the provided schema and template.

Overview

A RIG consists of four main sections:

1. Source Information - Details about the data source
2. Ingest Information - What content is included/excluded and why
3. Target Information - How the data is modeled in the output graph
4. Provenance Information - Who contributed and relevant artifacts

Getting Started

Start with the template file rig_template.yaml and fill in each section according to your data source. The template provides the complete structure, with comments indicating which fields are required and which are optional.

Section 1: Source Information

Document key details about your data source:

source_info:
  infores_id: "infores:your-source-id"  # Required: InfoRes identifier
  description: "Brief description of the source"  # Optional but recommended
  citations:  # Optional: Publications about the source
    - "PMID:12345678"
    - "https://doi.org/10.1000/example"
  terms_of_use: "CC-BY 4.0"  # Required: License or terms
  data_access_locations:  # Optional: Where to access the data
    - "https://example.com/download"
  data_provision_mechanisms:  # Optional: How data is distributed
    - file_download
    - api_endpoint
  data_formats:  # Optional: Data formats available
    - json
    - csv
  data_versioning_and_releases: "Monthly releases with semantic versioning"  # Optional
  additional_notes: "Any other relevant information"  # Optional
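
The required/optional comments above can be turned into a quick pre-submission check. The following is a minimal sketch, not a validator shipped with the schema: the function name check_source_info is hypothetical, the required-field list is copied from the comments in the example, and the "infores:" prefix test is an assumption based on the sample identifier. The input is the source_info mapping as it would look after the YAML has been parsed.

```python
# Hypothetical helper: the field list mirrors the "Required" comments in the
# example above; it is not an official validator for the RIG schema.
REQUIRED_SOURCE_FIELDS = ("infores_id", "terms_of_use")


def check_source_info(source_info: dict) -> list:
    """Return human-readable problems found in a parsed source_info mapping."""
    problems = []
    for field in REQUIRED_SOURCE_FIELDS:
        if not source_info.get(field):
            problems.append(f"missing required field: {field}")
    infores_id = source_info.get("infores_id", "")
    # Assumption: identifiers carry the "infores:" prefix, as in the example.
    if infores_id and not infores_id.startswith("infores:"):
        problems.append(f"infores_id lacks 'infores:' prefix: {infores_id}")
    return problems


example = {
    "infores_id": "infores:your-source-id",
    "terms_of_use": "CC-BY 4.0",
    "data_formats": ["json", "csv"],
}
print(check_source_info(example))  # []
print(check_source_info({"infores_id": "your-source-id"}))
```

The second call reports both the missing terms_of_use field and the unprefixed identifier, so a single pass surfaces every problem rather than stopping at the first.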

Section 2: Ingest Information

Describe what content you're ingesting and why:

ingest_info:
  ingest_categories:  # Optional: Type of source
    - primary_knowledge_provider
  utility: "Why this data is valuable for your use case"  # Required
  scope: "High-level description of what's included/excluded"  # Optional but recommended

  relevant_files:  # Required: Source files being processed
    - file_name: "data.json"
      location: "https://example.com/data.json"
      description: "Main dataset containing..."

  included_content:  # Optional: What records are included
    - file_name: "data.json"
      included_records: "All gene-disease associations with evidence scores > 0.5"
      fields_used: "gene_id, disease_id, evidence_score, publication_refs"

  filtered_content:  # Optional: What's excluded and why
    - file_name: "data.json"
      filtered_records: "Associations with evidence scores <= 0.5"
      rationale: "Low confidence associations excluded to maintain data quality"

  future_considerations:  # Optional: Future content to consider
    - category: edge_content
      consideration: "Include pathway information when available"
      relevant_files: "pathway_data.json"
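
In the example above, every file_name under included_content and filtered_content also appears under relevant_files. Whether the schema enforces that is not stated here, so the sketch below treats it as an assumed convention and simply reports file names that break it. The function name unlisted_files and the file scores.csv are hypothetical, introduced only for illustration.

```python
def unlisted_files(ingest_info: dict) -> set:
    """Return file names used in included/filtered content but absent from relevant_files."""
    listed = {entry["file_name"] for entry in ingest_info.get("relevant_files", [])}
    referenced = set()
    for section in ("included_content", "filtered_content"):
        for entry in ingest_info.get(section, []):
            referenced.add(entry["file_name"])
    return referenced - listed


ingest_info = {
    "utility": "Why this data is valuable for your use case",
    "relevant_files": [
        {"file_name": "data.json", "location": "https://example.com/data.json"},
    ],
    "included_content": [
        {"file_name": "data.json", "fields_used": "gene_id, disease_id"},
    ],
    # scores.csv is a hypothetical file that was never declared above.
    "filtered_content": [
        {"file_name": "scores.csv", "rationale": "Low confidence associations"},
    ],
}
print(unlisted_files(ingest_info))  # {'scores.csv'}
```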

Section 3: Target Information

Describe the output graph structure:

target_info:
  infores_id: "infores:[source-abbreviation]"  # Optional: Target resource identifier

  edge_type_info:  # Required: Types of edges created
    - subject_categories:
        - "biolink:Gene"
      predicate: "biolink:associated_with"
      object_categories:
        - "biolink:Disease"
      knowledge_level:
        - knowledge_assertion
      agent_type:
        - manual_agent
      edge_properties:
        - "biolink:evidence_count"
        - "biolink:publications"
      ui_explanation: "Gene-disease associations curated from literature with evidence scores"

  node_type_info:  # Required: Types of nodes created
    - node_category: "biolink:Gene"
      source_identifier_types: "NCBIGene"
      node_properties:
        - "biolink:name"
        - "biolink:synonym"
    - node_category: "biolink:Disease"
      source_identifier_types: "MONDO"
      node_properties:
        - "biolink:name"
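
The example above describes a node type for each category that its edge type uses (biolink:Gene and biolink:Disease). As a rough consistency check, assuming that pairing is the intended convention rather than a rule the schema enforces, the sketch below lists edge categories that have no matching node_type_info entry. The function name categories_without_node_info is hypothetical.

```python
def categories_without_node_info(target_info: dict) -> set:
    """Return subject/object categories used by edges but not described as node types."""
    edge_categories = set()
    for edge in target_info.get("edge_type_info", []):
        edge_categories.update(edge.get("subject_categories", []))
        edge_categories.update(edge.get("object_categories", []))
    node_categories = {
        node["node_category"] for node in target_info.get("node_type_info", [])
    }
    return edge_categories - node_categories


target_info = {
    "edge_type_info": [
        {
            "subject_categories": ["biolink:Gene"],
            "predicate": "biolink:associated_with",
            "object_categories": ["biolink:Disease"],
        }
    ],
    # The Disease node type is deliberately left out to trigger the check.
    "node_type_info": [
        {"node_category": "biolink:Gene", "source_identifier_types": "NCBIGene"},
    ],
}
print(categories_without_node_info(target_info))  # {'biolink:Disease'}
```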

Section 4: Provenance Information

Document contributors and related artifacts:

provenance_info:  # Optional but recommended
  contributions:
    - "Jane Doe - code author"
    - "John Smith - domain expertise"
  artifacts:
    - "GitHub issue: https://github.com/NCATSTranslator/translator-ingests/issues/123"
    - "Ingest survey: https://docs.google.com/document/xyz"

Example Structure

ReferenceIngestGuide:
  name: "[source-name] RIG"
  source_info: { ... }
  ingest_info: { ... }
  target_info: { ... }
  provenance_info: { ... }
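
Before filling in details, it can help to confirm that a draft has the four sections in place. The sketch below is illustrative only: it assumes the draft has already been parsed into a Python dict with the ReferenceIngestGuide top-level key shown above, and the function name missing_sections is hypothetical. It only reports absent sections; it does not decide which ones the schema treats as mandatory.

```python
EXPECTED_SECTIONS = ("source_info", "ingest_info", "target_info", "provenance_info")


def missing_sections(document: dict) -> list:
    """Return the expected section names that are absent from a parsed RIG."""
    rig = document.get("ReferenceIngestGuide", {})
    return [name for name in EXPECTED_SECTIONS if name not in rig]


draft = {
    "ReferenceIngestGuide": {
        "name": "[source-name] RIG",
        "source_info": {},
        "ingest_info": {},
        "target_info": {},
    }
}
print(missing_sections(draft))  # ['provenance_info']
```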

The complete template in rig_template.yaml lists every available field along with its data type. Start from it and fill in the sections relevant to your data source.