
Reference Ingest Guide Schema: A schema for describing the scope, rationale, and modeling approach for ingesting content from an external source to a data repository compliant with the Biolink Model.

Classes

Class Description
EdgeType A structure for describing each type of edge (metaedge) created in the target knowledge graph by this ingest, including what types of edge properties it holds, and a brief explanation of why this modeling pattern was deemed appropriate to represent the source data (for display to end users in a UI system).
FilteredContent A structure for describing the types of records from each relevant file/endpoint/table not included in the ingest, and the rationale for any filtering rules or exclusion criteria. Only list a file if some but not all records it contains are included in the ingest - to document what subset was excluded, and why.
FutureContentConsiderations A structure for collecting notes about content additions or changes to consider in future iterations of this ingest. Create separate objects for each distinct consideration, and indicate if it relates to content that will be captured as Edges, Node Properties, or Edge Properties in the target knowledge graph.
FutureModelingConsiderations A structure for collecting discrete considerations about modeling changes to consider in future iterations of this ingest. Create and categorize separate objects for each distinct consideration.
IncludedContent A structure for describing the types of records from relevant files/endpoints/tables that are included in this ingest, and optionally a list of fields from these records that are part of the ingest or used to inform it.
IngestInformation A container for capturing information about the rationale and scope of an ingest, including what source content was included and excluded from the ingest, and what additional content might be considered in future iterations.
NodeType A structured object describing each type of node created in the target knowledge graph by this ingest.
ProvenanceInformation A container holding information about the provenance of the ingest, including who contributed and how, and links to external provenance artifacts.
ReferenceIngestGuide A container that holds attributes for the discrete sections of information comprising a Reference Ingest Guide.
RelevantFiles A structure for describing each source file (or API endpoint, database, or table) that contains content in scope for the ingest. Source files whose content is not retrieved in this ingest need not be listed or described.
SourceInformation A container for capturing information about the source of the ingest.
TargetInformation A structure for capturing information about the target dataset / knowledge graph output by the ingest, including what types of edges and nodes were produced, modeling rationale, and what modeling changes might be considered in future iterations.
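
To make the container classes above more concrete, here is a minimal Python sketch of how a Reference Ingest Guide might nest its sections. The grouping of slots under each section is an assumption inferred from the class and slot descriptions, not something this page states, and the guide name, file name, and example.org URL are hypothetical placeholders.

```python
# Assumed top-level sections of a ReferenceIngestGuide, inferred from the
# SourceInformation, IngestInformation, TargetInformation, and
# ProvenanceInformation container descriptions.
EXPECTED_SECTIONS = ("source_info", "ingest_info", "target_info", "provenance_info")

guide = {
    "name": "Example RIG",  # hypothetical name
    "source_info": {
        "infores_id": "infores:ctd",
        "terms_of_use": "https://ctdbase.org/about/legal.jsp",
    },
    "ingest_info": {
        "relevant_files": [
            {
                "file_name": "example_file.tsv",            # hypothetical
                "location": "https://example.org/downloads",  # hypothetical
            }
        ],
    },
    "target_info": {
        "node_type_info": [{"node_category": "biolink:Gene"}],
    },
    "provenance_info": {
        "contributions": ["Jane Doe: data modeling"],  # hypothetical person
    },
}


def missing_sections(candidate: dict) -> list:
    """Return the assumed top-level sections that are absent from a guide."""
    return [section for section in EXPECTED_SECTIONS if section not in candidate]


print(missing_sections(guide))
```

A check like `missing_sections` is only a convenience for drafting; whether each section is actually required is determined by the schema itself, not by this sketch.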

Slots

Property Description
additional_notes None
--- ---
agent_type The agent type (or types) relevant to this type of edge. Multivalued only if instances of this type of edge can have different agent types in the data.
--- ---
artifacts Links to and descriptions of external artifacts related to the provenance of the ingest, such as Github issues, surveys of prior ingests of the source, etc.
--- ---
biolink_qualifier_1 If relevant, a Biolink qualifier property that defines this edge type. e.g. "biolink:subject_form_or_variant_qualifier".
--- ---
biolink_qualifier_1_value_type The type of the value taken by the first qualifier type. Can be reported as the name of a Biolink class, enumeration, or data type, as appropriate. e.g. "biolink:Disease", "biolink:GeneOrGeneProductOrChemicalEntityAspectEnum", "biolink:string".
--- ---
biolink_qualifier_2 If relevant, a second Biolink qualifier property that defines this edge type. e.g. "biolink:subject_aspect_qualifier".
--- ---
biolink_qualifier_2_value_type The type of the value taken by the second qualifier type. Can be reported as the name of a Biolink class, enumeration, or data type, as appropriate. e.g. "biolink:Disease", "biolink:GeneOrGeneProductOrChemicalEntityAspectEnum", "biolink:string".
--- ---
biolink_qualifier_3 If relevant, a third Biolink qualifier property that defines this edge type. e.g. "biolink:object_direction_qualifier".
--- ---
biolink_qualifier_3_value_type The type of the value taken by the third qualifier type. Can be reported as the name of a Biolink class, enumeration, or data type, as appropriate. e.g. "biolink:Disease", "biolink:GeneOrGeneProductOrChemicalEntityAspectEnum", "biolink:string".
--- ---
category The category of content described by a given consideration, based on how the content will be represented in the target graph (e.g. "Edges", "Node Properties", "Edge Properties").
--- ---
citations One or more citations to publications describing the source. May be an identifier (e.g. a PMID or DOI), a URL to the published document, or a free-text citation.
--- ---
consideration A description of what additional content should be considered and why.
--- ---
contributions The name of a person making a contribution, and the type of contribution made. e.g. "code author", "code support", "data modeling", "domain expertise".
--- ---
data_access_locations Where the source data that is being ingested can be accessed. Provide one or more URLs, along with optional descriptions of what each URL provides.
--- ---
data_formats The format(s) in which the data is serialized for retrieval and use.
--- ---
data_provision_mechanisms How the source distributes their data (file download, API endpoints, database dump).
--- ---
data_versioning_and_releases A description of how releases are versioned and managed by the source (e.g. general approach, frequency, other important considerations). May also include links to web pages describing such information.
--- ---
description A brief description of the source, including its purpose, scope, and any relevant background information.
--- ---
edge_properties A list of one or more Biolink edge properties used in instances of this edge type in the data.
--- ---
edge_type_info A description of each type of edge (metaedge) created in the target knowledge graph by this ingest, including what types of edge properties it holds, and a brief explanation of why this modeling pattern was deemed appropriate to represent the source data (to be displayed for end users in a UI system).
--- ---
fields_used Optional list of the specific source fields that are part of or inform the ingest.
--- ---
file_name The name of the relevant file (or endpoint, or table).
--- ---
filtered_content A description of what types of records from each relevant file are not included in the ingest, and the rationale for any filtering rules or exclusion criteria. Only list a file if some but not all records it contains are included in the ingest - to document what subset was excluded, and why.
--- ---
filtered_records A description of what types of records were excluded from the ingest, in terms of filtering rules or exclusion criteria.
--- ---
future_considerations Notes about content additions or changes to consider in future iterations of this ingest. Separately consider content that will be represented as Edges vs Node Properties vs Edge Properties in the target knowledge graph.
--- ---
included_content A description of what types of records from relevant files/endpoints/tables above are included in this ingest, and optionally a list of fields from these records that are part of the ingest or used to inform it.
--- ---
included_records A description of the types of records that are included in the ingest.
--- ---
infores_id The infores identifier of the source from which content is being ingested, e.g. "infores:ctd".
--- ---
ingest_categories A term or terms indicating the type of source being ingested, from the perspective of the ingesting system (e.g. primary knowledge provider, supporting data provider, ontology/terminology provider).
--- ---
ingest_info Information about the rationale and scope of an ingest, including what source content was included and excluded from the ingest, and what additional content might be considered in future iterations.
--- ---
knowledge_level The knowledge level (or levels) relevant to this type of edge. Multivalued only if instances of this type of edge can have different knowledge levels in the data.
--- ---
location The URL of a web page or ftp site where the indicated file (or endpoint or table) was accessed.
--- ---
name A human-readable name for the Reference Ingest Guide (RIG).
--- ---
node_category The high-level Biolink category of nodes as assumed or assigned by ingestors. e.g. "biolink:Gene".
--- ---
node_properties A list of one or more Biolink node properties used in instances of this node type in the data.
--- ---
node_type_info A description of each type of node created in the target knowledge graph by this ingest, in terms of the high-level Biolink categor(ies) of nodes as assumed or assigned by ingestors. Note however that downstream normalization of node identifiers may result in new/different categories ultimately being assigned in the final graph.
--- ---
object_categories The Biolink category of the object node of this edge type. e.g. "biolink:Disease".
--- ---
predicate The Biolink predicate that defines this type of edge (e.g. "biolink:treats").
--- ---
provenance_info Information about the provenance of the ingest, including who contributed and how, and links to external provenance-related artifacts (e.g. Github tickets, ingest surveys, etc.).
--- ---
qualified_predicate The Biolink predicate used in a "qualified" reading of a statement expressed by an Edge (i.e. one that considers the semantics added by subject and/or object qualifiers).
--- ---
rationale The rationale for excluding the indicated content (why this subset of records was filtered out).
--- ---
relevant_files A description of each source file (or API endpoint, database, or table) that contains content in scope for the ingest. Source files whose content is not retrieved in this ingest need not be listed or described.
--- ---
scope A short, high-level narrative describing the types of knowledge from the source that are included and excluded in this ingest.
--- ---
source_identifier_types The type of identifier(s) used for this category of entity by the source system. Report as a prefix for an identifier system where appropriate/possible (preferably a prefix as cataloged in the Biolink prefix registry), e.g. "MESH", "CTD", "ECTO". May instead be a free-text description if needed, e.g. "The source uses entity names but does not assign identifiers".
--- ---
source_info Information about the source from which content is ingested.
--- ---
subject_categories The Biolink category of the subject node of this edge type. e.g. "biolink:SmallMolecule".
--- ---
target_info Information about the dataset / knowledge graph output by the ingest, including what types of edges and nodes were produced, modeling rationale, and what modeling changes might be considered in future iterations.
--- ---
terms_of_use Information about conditions for use of the ingested source. May be the name of a community license (e.g. CC-BY 4.0), a link to a "terms of use" or license information web page (e.g. https://ctdbase.org/about/legal.jsp), or a free-text summary of key terms of use.
--- ---
ui_explanation A brief explanation of why this modeling pattern was deemed appropriate to represent the source data (for display to end users in a UI system).
--- ---
utility Brief description of why the source was ingested, and the utility of the data it provides for target system use cases.
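
The numbered qualifier slots above pair each `biolink_qualifier_N` with a matching `biolink_qualifier_N_value_type`. The following Python sketch shows one way to read those pairs off an EdgeType record. The helper name `collect_qualifiers` and the specific combination of values are illustrative assumptions built from the examples in the slot descriptions, not a record taken from a real ingest.

```python
from typing import List, Optional, Tuple

# Illustrative EdgeType record; the values reuse examples quoted in the
# slot descriptions, but their combination here is hypothetical.
edge_type = {
    "subject_categories": ["biolink:SmallMolecule"],
    "predicate": "biolink:treats",
    "object_categories": ["biolink:Disease"],
    "biolink_qualifier_1": "biolink:subject_aspect_qualifier",
    "biolink_qualifier_1_value_type": "biolink:GeneOrGeneProductOrChemicalEntityAspectEnum",
}


def collect_qualifiers(record: dict) -> List[Tuple[str, Optional[str]]]:
    """Pair each populated biolink_qualifier_N slot with its value type."""
    pairs = []
    for n in (1, 2, 3):
        qualifier = record.get(f"biolink_qualifier_{n}")
        if qualifier:
            value_type = record.get(f"biolink_qualifier_{n}_value_type")
            pairs.append((qualifier, value_type))
    return pairs


for qualifier, value_type in collect_qualifiers(edge_type):
    print(f"{qualifier} takes values of type {value_type}")
```

Unpopulated qualifier slots are simply skipped, which matches the "if relevant" wording in the slot descriptions.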

Enumerations

Enumeration Description
AgentTypeEnum Agent types relevant to edges of a particular type.
ContentCategoryEnum Categories of content for future considerations.
DataFormatEnum Formats in which data is serialized.
IngestCategoryEnum The type of source being ingested, from the perspective of the ingesting system.
KnowledgeLevelEnum Knowledge levels relevant to edges of a particular type.
ModelingCategoryEnum Categories of future modeling considerations (what type of modeling the consideration is about).
ProvisionMechanismEnum Ways in which data can be made accessible for retrieval.

Subsets

Subset Description

Unni DR, Moxon SAT, Bada M, Brush M, Bruskiewich R, Caufield JH, Clemons PA, Dancik V, Dumontier M, Fecho K, Glusman G, Hadlock JJ, Harris NL, Joshi A, Putman T, Qin G, Ramsey SA, Shefchek KA, Solbrig H, Soman K, Thessen AE, Haendel MA, Bizon C, Mungall CJ, The Biomedical Data Translator Consortium (2022). Biolink Model: A universal schema for knowledge graphs in clinical, biomedical, and translational science. Clin Transl Sci. Wiley; 2022 Jun 6; https://onlinelibrary.wiley.com/doi/10.1111/cts.13302