Reference Ingest Guide Schema: A schema for describing the scope, rationale, and modeling approach for ingesting content from an external resource to a data repository compliant with the Biolink Model.
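
As a rough illustration of how the sections named in this schema might fit together, the sketch below arranges a few slot names from the tables that follow into a nested Python dict. The nesting of slots under each section, the RIG name, and the file name are assumptions made for illustration; the schema itself is the authority on which slots belong to which class, and on whether any section is mandatory.

```python
# Top-level sections of a RIG, taken from the slot names in this schema.
TOP_LEVEL_SECTIONS = ("source_info", "ingest_info", "target_info", "provenance_info")

# Hypothetical, heavily abbreviated RIG. The placement of nested slots
# (e.g. infores_id under source_info) is an assumption for illustration.
rig = {
    "name": "Example RIG (hypothetical)",
    "source_info": {"infores_id": "infores:ctd"},
    "ingest_info": {"relevant_files": [{"file_name": "example_records.tsv"}]},
    "target_info": {"node_type_info": [{"node_category": "biolink:Gene"}]},
}


def missing_sections(guide: dict) -> list[str]:
    """Return the top-level sections that are absent from the guide."""
    return [section for section in TOP_LEVEL_SECTIONS if section not in guide]


# Reports which sections have not been drafted yet.
missing = missing_sections(rig)
```

The helper only reports absent sections; it does not validate their contents.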

Classes

Class Description
EdgeType A structure for describing each type of edge (metaedge) created in the target knowledge graph by this ingest, including what types of edge properties it holds, and a brief explanation of why this modeling pattern was deemed appropriate to represent the source data (for display to end users in a UI system). Note that an edge type with multiple subject_category and object_category values does not mean that the full cross-product must be instantiated in the data. e.g. a KG with the edge type sub_cat: [Gene, Protein, SmallMolecule], predicate: affects, obj_cat: [Disease, Phenotype, Symptom] might include Protein-affects-Disease and Gene-affects-Symptom edges, but no Protein-affects-Symptom edges.
FilteredContent A structure for describing the types of records from each relevant file/endpoint/table not included in the ingest, and the rationale for any filtering rules or exclusion criteria. Only list a file if some but not all records it contains are included in the ingest - to document what subset was excluded, and why.
FutureContentConsiderations A structure for collecting notes about content additions or changes to consider in future iterations of this ingest. Create separate objects for each distinct consideration, and indicate if it relates to content that will be captured as Edges, Node Properties, or Edge Properties in the target knowledge graph.
FutureModelingConsiderations A structure for collecting discrete considerations about modeling changes to consider in future iterations of this ingest. Create and categorize separate objects for each distinct consideration.
IncludedContent A structure for describing the types of records from relevant files/endpoints/tables that are included in this ingest, and optionally a list of fields from these records that are part of the ingest or used to inform it.
IngestInformation A container for capturing information about the rationale and scope of an ingest, including what source content was included and excluded from the ingest, and what additional content might be considered in future iterations.
NodeType A structured object describing each type of node created in the target knowledge graph by this ingest.
ProvenanceInformation A container holding information about the provenance of the ingest, including who contributed and how, and links to external provenance artifacts.
Qualifier A qualifier property + value tuple that specifies a type of qualifier and value that may be applied to a Statement. Qualified predicates are considered qualifiers and captured here as well. Values can be specified to come from a proper range (e.g. Biolink class, data type, or enum), come from an enumerated list, use a specific id prefix, and/or conform informally to a free-text description.
ReferenceIngestGuide A container that holds attributes for the discrete sections of information comprising a Reference Ingest Guide.
RelevantFiles A structure for describing each source file (or API endpoint, database, or table) that contains content in scope for the ingest. Source files whose content is not retrieved in this ingest need not be listed or described.
SourceInformation A container for capturing information about the source of the ingest.
SupportingDataSourceInformation A container for information about upstream sources of data that are used by an ingested source, to derive the knowledge that we ingest. This info is not relevant for typical ingest of an external knowledge source, and applies mainly for describing "ingest" of data-derived KPs like ICEES, COHD, various Multiomics KPs, etc.
TargetInformation A structure for capturing information about the target dataset / knowledge graph output by the ingest, including what types of edges and nodes were produced, modeling rationale, and what modeling changes might be considered in future iterations.
TermsOfUseInformation A structured representation describing terms for re-use of data from an information resource.
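
The EdgeType note about cross-products can be sketched in a few lines of Python. The class below is a hypothetical, simplified stand-in for an EdgeType entry (not a class provided by the schema), and the category labels are the informal ones from the example in the EdgeType description rather than verified Biolink identifiers.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class EdgeTypeSketch:
    """Hypothetical stand-in for an EdgeType entry in a RIG."""
    subject_categories: tuple[str, ...]
    predicate: str
    object_categories: tuple[str, ...]

    def allows(self, subject_category: str, predicate: str, object_category: str) -> bool:
        """True if a concrete edge falls within this declared edge type."""
        return (
            predicate == self.predicate
            and subject_category in self.subject_categories
            and object_category in self.object_categories
        )


edge_type = EdgeTypeSketch(
    subject_categories=("Gene", "Protein", "SmallMolecule"),
    predicate="affects",
    object_categories=("Disease", "Phenotype", "Symptom"),
)

# Category combinations actually present in the (made-up) data.
observed = {
    ("Protein", "affects", "Disease"),
    ("Gene", "affects", "Symptom"),
}

# Every observed combination must fit the declared edge type ...
all_allowed = all(edge_type.allows(*edge) for edge in observed)

# ... but the declared cross-product may contain combinations with no data.
declared = {
    (s, edge_type.predicate, o)
    for s, o in product(edge_type.subject_categories, edge_type.object_categories)
}
uninstantiated = declared - observed
```

Here the check runs in one direction only: observed edges must fit the declaration, while declared combinations such as Protein-affects-Symptom are free to be absent.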

Slots

Property Description
additional_notes None
agent_type The agent type (or types) relevant to this type of edge. Multivalued only if instances of this type of edge can have different agent types in the data.
artifacts Links to and descriptions of external artifacts related to the provenance of the ingest, such as Github issues, surveys of prior ingests of the source, etc.
category The category of content described by a given consideration, based on how the content will be represented in the target graph (e.g. "Edges", "Node Properties", "Edge Properties").
citations One or more citations to publications describing the source. May be an identifier (e.g. a PMID or DOI), a URL to the published document, or a free-text citation.
consideration A description of what additional content should be considered and why.
contributions The name of a person making a contribution, and the type of contribution made. e.g. "code author", "code support", "data modeling", "domain expertise".
data_access_locations Where the source data that is being ingested can be accessed. Provide one or more URLs, along with optional descriptions of what each URL provides.
data_formats The format(s) in which the data is serialized for retrieval and use.
data_provision_mechanisms How the source distributes their data (file download, API endpoints, database dump).
data_versioning_and_releases A description of how releases are versioned and managed by the source (e.g. general approach, frequency, other important considerations). May also include links to web pages describing such information.
description A brief description of the source, including its purpose, scope, and any relevant background information.
edge_properties A list of one or more Biolink edge properties used in instances of this edge type in the data.
edge_type_info A description of each type of edge (metaedge) created in the target knowledge graph by this ingest, including what types of edge properties it holds, and a brief explanation of why this modeling pattern was deemed appropriate to represent the source data (to be displayed for end users in a UI system).
fields_used Optional list of the specific source fields that are part of or inform the ingest.
file_name The name of the relevant file (or endpoint, or table).
filtered_content A description of what types of records from each relevant file are not included in the ingest, and the rationale for any filtering rules or exclusion criteria. Only list a file if some but not all records it contains are included in the ingest - to document what subset was excluded, and why.
filtered_records A description of what types of records were excluded from the ingest, in terms of filtering rules or exclusion criteria.
future_considerations Notes about content additions or changes to consider in future iterations of this ingest. Separately consider content that will be represented as Edges vs Node Properties vs Edge Properties in the target knowledge graph.
included_content A description of what types of records from relevant files/endpoints/tables above are included in this ingest, and optionally a list of fields from these records that are part of the ingest or used to inform it.
included_records A description of the types of records that are included in the ingest.
infores_id The infores identifier of the source from which content is being ingested, e.g. "infores:ctd".
ingest_categories A term or terms indicating the type of source being ingested, from the perspective of the ingesting system (e.g. primary knowledge provider, supporting data provider, ontology/terminology provider).
ingest_info Information about the rationale and scope of an ingest, including what source content was included and excluded from the ingest, and what additional content might be considered in future iterations.
knowledge_level The knowledge level (or levels) relevant to this type of edge. Multivalued only if instances of this type of edge can have different knowledge levels in the data.
license_name The name of an established license used by the source (e.g. "CC BY 4.0")
license_url The url of an established license (e.g. "https://creativecommons.org/licenses/by/4.0/")
location The URL of a web page or ftp site where the indicated file (or endpoint or table) was accessed.
name A human readable name for the RIG.
node_category The high-level Biolink category of nodes as assumed or assigned by ingestors. e.g. "biolink:Gene". Note that downstream normalization of node identifiers may result in new/different categories ultimately being assigned in the final graph.
node_properties A list of one or more Biolink node properties used in instances of this node type in the data.
node_type_info A description of each type of node created in the target knowledge graph by this ingest, in terms of the high-level Biolink categor(ies) of nodes as assumed or assigned by ingestors. Note however that downstream normalization of node identifiers may result in new/different categories ultimately being assigned in the final graph.
object_categories The Biolink category of the object node of this edge type. e.g. "biolink:Disease". If two edge types differ only in their object category, but use the same predicate, subject_category, edge properties, and general provenance, they can be described together in a single EdgeType object that captures the alternative object categories. e.g. if a source provides Gene-associated_with-Disease and Gene-associated_with-PhenotypicFeature edge types, these can be described in a single EdgeType object with two object categories (Disease and PhenotypicFeature).
predicate The Biolink predicate that defines this type of edge (e.g. "biolink:treats").
property The Biolink qualifier slot that defines the kind of Qualifier specified, e.g. "biolink:subject_aspect_qualifier", "biolink:qualified_predicate".
provenance_info Information about the provenance of the ingest, including who contributed and how, and links to external provenance-related artifacts (e.g. Github tickets, ingest surveys, etc.)
qualifiers If relevant, report any qualifiers applied to the edge type, as a Qualifier object that contains a qualifier_property and qualifier_range pair. e.g. the property "biolink:subject_aspect_qualifier", and range "biolink:GeneOrGeneProductOrChemicalEntityAspectEnum".
rationale The rationale for excluding the indicated content (why this subset of records was filtered out).
relevant_files A description of each source file (or API endpoint, database, or table) that contains data used to create the ingested knowledge. Source files whose data is not used to create knowledge need not be listed or described.
scope A short, high-level narrative describing the types of knowledge from the source that are included in and excluded from this ingest.
source_identifier_types The type of identifier(s) used for this category of entity by the source system. Report as a prefix for an identifier system where appropriate/possible (preferably a prefix as cataloged in the Biolink prefix map here: https://github.com/biolink/biolink-model/blob/master/project/prefixmap/biolink-model-prefix-map.json). e.g. "MESH", "CTD", "ECTO". If prefix for a public system/database is not in the prefix map, you may make a PR to add it. If the identifiers used are bespoke, or no identifiers are used, the value can be a free text description. e.g. "The source uses entity names but does not assign identifiers".
source_info Information about the source from which content is ingested.
subject_categories The Biolink category of the subject node of this edge type. e.g. "biolink:SmallMolecule". If two edge types differ only in their subject category, but use the same predicate, object_category, edge properties, and general provenance, they can be described together in a single EdgeType object that captures the alternative subject categories. e.g. if a source provides SmallMolecule-treats-Disease and MolecularMixture-treats-Disease edge types, these can be described in a single EdgeType object with two subject categories (SmallMolecule and MolecularMixture).
supporting_data_source_info Information about upstream sources of data that are used by an ingested source, to derive the knowledge that we ingest.
target_info Information about the dataset / knowledge graph output by the ingest, including what types of edges and nodes were produced, modeling rationale, and what modeling changes might be considered in future iterations.
terms_of_use_description A free text description of the terms of use for a source. (e.g. "Source only indicates 'all rights reserved' in their documentation")
terms_of_use_info Information about conditions for use of the ingested source. May include the name of a community license (e.g. CC-BY 4.0), a link to a "terms of use" or license information web page (e.g. https://ctdbase.org/about/legal.jsp), and/or a free-text summary of key terms of use.
terms_of_use_url The url of a document or web page where a source describes its terms of use, and/or references a community license that it adopts. (e.g. "https://ctdbase.org/about/legal.jsp")
ui_explanation A brief explanation of why this modeling pattern was deemed appropriate to represent the source data (for display to end users in a UI system).
utility Brief description of why the source was ingested, and the utility of the data it provides for target system use cases.
value_description A free text description of the types of value allowed for the qualifier.
value_enumeration A set of one or more specific values for the qualifier in an Edge type (e.g. ["biolink:causes"] as the only value for the "biolink:qualified_predicate" qualifier property, ["activity_or_abundance", "activity", "abundance"] as the values for the "biolink:object_aspect_qualifier" property).
value_id_prefixes One or more id prefixes from which the qualifier value must come. e.g. "HP" if the qualifier must be a Human Phenotype Ontology term.
value_range The Biolink class(es) or type(s) that specify the kind of value the qualifier property takes, reported as the name of a Biolink class, enumeration, or data type, as appropriate. e.g. "biolink:Disease", "biolink:GeneOrGeneProductOrChemicalEntityAspectEnum", "biolink:string".
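
The value_enumeration and value_id_prefixes slots describe constraints that are simple enough to check mechanically. The following is a minimal sketch, assuming a hypothetical QualifierSketch class (not part of the schema) and assuming that qualifier values constrained by prefix are written as "PREFIX:local_id" identifiers. Checking value_range would require the Biolink Model itself and is deliberately left out; "example:phenotype_qualifier" is a made-up property name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QualifierSketch:
    """Hypothetical, simplified stand-in for a Qualifier entry."""
    property: str
    value_enumeration: Optional[list[str]] = None
    value_id_prefixes: Optional[list[str]] = None
    value_description: Optional[str] = None  # free text, not machine-checked

    def accepts(self, value: str) -> bool:
        """Check a value against whichever constraints are declared."""
        if self.value_enumeration is not None and value not in self.value_enumeration:
            return False
        if self.value_id_prefixes is not None:
            # Assumes "PREFIX:local_id" form; both parts must be present.
            prefix, sep, local_id = value.partition(":")
            if not sep or not local_id or prefix not in self.value_id_prefixes:
                return False
        return True


# Enumerated values, as in the value_enumeration example above.
aspect = QualifierSketch(
    property="biolink:object_aspect_qualifier",
    value_enumeration=["activity_or_abundance", "activity", "abundance"],
)

# Prefix-constrained values, as in the value_id_prefixes example above.
phenotype = QualifierSketch(
    property="example:phenotype_qualifier",  # hypothetical property name
    value_id_prefixes=["HP"],
)
```

With these definitions, aspect.accepts("activity") and phenotype.accepts("HP:0001250") both pass, while a value outside the enumeration or with a different prefix is rejected.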

Enumerations

Enumeration Description
AgentTypeEnum Agent types relevant to edges of a particular type.
ContentCategoryEnum Categories of content for future considerations.
DataFormatEnum Formats in which data is serialized.
IngestCategoryEnum The type of source being ingested, from the perspective of the ingesting system.
KnowledgeLevelEnum Knowledge levels relevant to edges of a particular type.
ModelingCategoryEnum Categories of future modeling considerations (what type of modeling the consideration is about).
ProvisionMechanismEnum Ways in which data can be made accessible for retrieval.

Subsets

Subset Description

Unni DR, Moxon SAT, Bada M, Brush M, Bruskiewich R, Caufield JH, Clemons PA, Dancik V, Dumontier M, Fecho K, Glusman G, Hadlock JJ, Harris NL, Joshi A, Putman T, Qin G, Ramsey SA, Shefchek KA, Solbrig H, Soman K, Thessen AE, Haendel MA, Bizon C, Mungall CJ, The Biomedical Data Translator Consortium (2022). Biolink Model: A universal schema for knowledge graphs in clinical, biomedical, and translational science. Clin Transl Sci. Wiley; 2022 Jun 6; https://onlinelibrary.wiley.com/doi/10.1111/cts.13302