CTD Resource Ingest Guide

Source Information

InfoRes ID: infores:ctd

Description: CTD is a robust, publicly available database that aims to advance understanding of how environmental exposures affect human health. It provides knowledge, manually curated from the literature, about chemicals and their relationships to other biological entities: chemical to gene/protein interactions, plus chemical to disease and gene to disease relationships. These data are integrated with functional and pathway data to aid in the development of hypotheses about the mechanisms underlying environmentally influenced diseases. CTD also generates novel inferences by further analyzing the knowledge it curates, based on statistically significant connections with intermediate concepts (e.g. Chemical X associated with Disease Y based on shared associations with a common set of genes).

Citations: - Davis AP, Wiegers TC, Johnson RJ, Sciaky D, Wiegers J, Mattingly CJ. Comparative Toxicogenomics Database (CTD): update 2023. Nucleic Acids Res. 2022 Sep 28.

Data Access Locations: - CTD Bulk Downloads - http://ctdbase.org/downloads/ (this page includes file sizes and simple data dictionaries for each download) - CTD Catalog - https://ctdbase.org/reports/ (a simple list of files, reports the number of rows in each file)

Data Provision Mechanisms: file_download

Data Formats: tsv, csv, obo, xml

Data Versioning and Releases: No consistent cadence for releases, but on average there are 1-2 releases each month. Versioning is based on the month and year of the release. - Releases page / change log: https://ctdbase.org/about/changes/ - Latest status page: https://ctdbase.org/about/dataStatus.go

Ingest Information

Ingest Categories: primary_knowledge_provider

Utility: CTD is a rich source of manually curated chemical associations to other biological entities which are an important type of edge for Translator query and reasoning use cases, including treatment predictions, chemical-gene regulation predictions, and pathfinder queries. It is one of the few sources that focus on non-drug chemicals, e.g. environmental stressors, and how these are related to diseases, biological processes, and genes.

Scope: This initial ingest of CTD covers curated Chemical to Disease associations that report therapeutic and marker/mechanism relationships, and inferred statistical associations generated by CTD. Additional types of Chemical associations will be added in future updates to the ingest.

Relevant Files

File Name	Location	Description
CTD_chemicals_diseases.tsv.gz	http://ctdbase.org/downloads/	Manually curated and computationally inferred associations between chemicals and diseases
CTD_exposure_events.tsv.gz	http://ctdbase.org/downloads/	Descriptions of statistical studies of how exposure to chemicals affects a particular population, with some records providing outcomes

Included Content

File Name	Included Records	Fields Used
CTD_chemicals_diseases.tsv.gz	Curated therapeutic and marker/mechanism associations (records where the "DirectEvidence" column is populated with type "T" or "M"), as well as inferred associations (records lacking a value in the DirectEvidence column)	ChemicalName, ChemicalID, CasRN, DiseaseName, DiseaseID, DirectEvidence, InferenceGeneSymbol, InferenceScore, OmimIDs, PubMedIDs
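
The inclusion rule in the table above can be sketched in Python as follows. This is a minimal illustration, not the ingest's actual code. It assumes the file is tab-separated with '#'-prefixed comment lines and that columns appear in the order listed under "Fields Used". The names `classify_record` and `read_included_records` are hypothetical, and the sample rows are made up. The spelled-out DirectEvidence values are accepted alongside "T" and "M" in case the download does not use the short codes.

```python
import csv
import io

# Column order taken from the "Fields Used" list above (assumed to match the file).
FIELDS = [
    "ChemicalName", "ChemicalID", "CasRN", "DiseaseName", "DiseaseID",
    "DirectEvidence", "InferenceGeneSymbol", "InferenceScore",
    "OmimIDs", "PubMedIDs",
]

# Assumption: curated rows carry one of these DirectEvidence values.
CURATED_VALUES = {"T", "M", "therapeutic", "marker/mechanism"}


def classify_record(row):
    """Return 'curated', 'inferred', or None (not included) for one row."""
    evidence = (row.get("DirectEvidence") or "").strip()
    if evidence in CURATED_VALUES:
        return "curated"
    if not evidence:
        return "inferred"
    return None


def read_included_records(handle):
    """Yield (kind, row) pairs, skipping '#' comment lines and excluded rows."""
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(
        data_lines, fieldnames=FIELDS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        kind = classify_record(row)
        if kind is not None:
            yield kind, row


# Made-up rows for illustration only; not real CTD data.
SAMPLE = (
    "# Fields:\n"
    "ChemA\tC000001\t\tDiseaseA\tMESH:D000001\tT\t\t\t\t111|222\n"
    "ChemA\tC000001\t\tDiseaseB\tMESH:D000002\t\tGENE1\t4.2\t\t333\n"
)
kinds = [kind for kind, _ in read_included_records(io.StringIO(SAMPLE))]
```

For the real download, the handle could come from `gzip.open(path, "rt", encoding="utf-8")` in place of the in-memory sample.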

Filtered Content

File Name	Filtered Records	Rationale
CTD_chemicals_diseases.tsv.gz	None	Currently taking all records with no publication count or inference score cutoffs - but these may be added in future iterations

Future Content Considerations

edge_content: Consider adding some threshold / cutoff to remove lower quality/confidence inferences (e.g. based on shared gene count, publication count, or inference score). At present we include even inferences based on a single shared gene or publication, which are not really meaningful. - Relevant files: CTD_chemicals_diseases.tsv.gz

edge_content: Consider ingesting additional chemical-disease edges reporting statistical correlations from environmental exposure studies from CTD_exposure_events.tsv.gz. This is a unique/novel source for this kind of knowledge, but there is not a lot of data here, and utility is not yet clear. - Relevant files: CTD_exposure_events.tsv.gz

edge_content: While the current ingest includes only Chemical-Disease Associations, future iterations will include additional types of associations between Chemicals and GO Terms, Molecular Phenotypes, Genes, etc. See the Ingest Survey table linked below for more details.

node_property_content: MolePro ingested chemical properties in its previous ingests, and we will likely bring these in at some point.

edge_property_content: Consider an edge property that reports the list of shared genes supporting C-D inferred associations. - Relevant files: CTD_chemicals_diseases.tsv.gz
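
As a rough illustration of the cutoff idea raised in the first edge_content item above, the sketch below filters aggregated inferred associations by shared gene count, publication count, and inference score. The `InferredAssociation` class, the `passes_cutoffs` function, and the default threshold values are all hypothetical. No cutoffs have been chosen for this ingest.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InferredAssociation:
    """Hypothetical summary of one inferred chemical-disease pair."""
    chemical_id: str
    disease_id: str
    inference_score: float
    shared_genes: List[str] = field(default_factory=list)
    publications: List[str] = field(default_factory=list)


def passes_cutoffs(assoc, min_genes=2, min_pubs=2, min_score=0.0):
    """Return True if the association meets every (placeholder) threshold."""
    return (
        len(set(assoc.shared_genes)) >= min_genes
        and len(set(assoc.publications)) >= min_pubs
        and assoc.inference_score >= min_score
    )


# Made-up examples: one pair with two shared genes, one with a single gene.
strong = InferredAssociation("C000001", "MESH:D000001", 5.0, ["G1", "G2"], ["1", "2"])
weak = InferredAssociation("C000001", "MESH:D000002", 5.0, ["G1"], ["1"])
kept = [a for a in (strong, weak) if passes_cutoffs(a)]
```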

Additional Notes: none

Target Information

Target InfoRes ID: infores:translator-ctd-kgx

Edge Types

Subject Categories	Object Categories	Knowledge Level	Agent Type	UI Explanation
biolink:ChemicalEntity	biolink:DiseaseOrPhenotypicFeature	knowledge_assertion	manual_agent	CTD Chemical-Disease records with a "T" (therapeutic) DirectEvidence code indicate the chemical to be a "potential" treatment in virtue of its clinical use or study - which maps best to the Biolink predicate 'treats_or_applied_or_studied_to_treat'.
biolink:ChemicalEntity	biolink:DiseaseOrPhenotypicFeature	knowledge_assertion	manual_agent	CTD Chemical-Disease records with a DirectEvidence code of "M" (marker/mechanism) indicate that the chemical is manually flagged as a marker or contributing factor for a condition. This implies that at minimum there is correlation between the presence of the chemical and condition, for which we use the Biolink 'correlated_with' predicate.
biolink:ChemicalEntity	biolink:DiseaseOrPhenotypicFeature	statistical_association	data_analysis_pipeline	CTD Chemical-Disease records with an inference score have a statistically significant number of shared gene associations that suggest a biological relationship may exist. The statistical basis of this general inferred relationship is best reported using the Biolink 'associated_with' predicate.
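
The three rows of the table above can be read as a lookup from DirectEvidence value to predicate, knowledge level, and agent type. The sketch below expresses that lookup in Python. The `EdgeType` tuple and the `edge_type_for` function are hypothetical names, and the keys assume the short "T" / "M" codes used in this guide, with an empty value standing for inferred records.

```python
from typing import NamedTuple, Optional


class EdgeType(NamedTuple):
    predicate: str
    knowledge_level: str
    agent_type: str


# Keys are DirectEvidence values; "" stands for inferred (no direct evidence).
EDGE_TYPES = {
    "T": EdgeType(
        "biolink:treats_or_applied_or_studied_to_treat",
        "knowledge_assertion",
        "manual_agent",
    ),
    "M": EdgeType(
        "biolink:correlated_with",
        "knowledge_assertion",
        "manual_agent",
    ),
    "": EdgeType(
        "biolink:associated_with",
        "statistical_association",
        "data_analysis_pipeline",
    ),
}


def edge_type_for(direct_evidence: Optional[str]) -> Optional[EdgeType]:
    """Look up the edge type for a DirectEvidence value; None if unmapped."""
    return EDGE_TYPES.get((direct_evidence or "").strip())
```

Returning None for unmapped values leaves the decision about unexpected codes to the caller.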

Node Types

Node Category	Source Identifier Types	Additional Notes
biolink:ChemicalEntity	MeSH	Majority are Biolink SmallMolecules
biolink:DiseaseOrPhenotypicFeature	MeSH

Future Modeling Considerations

edge_properties: Revisit use of 'has_confidence_score' edge property if/when we refactor this part of the Biolink Model.

predicates: Revisit 'correlated_with' and 'treats_or_applied_or_studied_to_treat' predicates if/when we refactor modeling or conventions here.

Additional Notes: The CTD_chemicals_diseases.tsv data includes one row per curated 'T' or 'M' association with publication reference(s), plus one row per shared gene association with publication reference(s) and inference scores. Separate edges will be created for each type of association reported between a chemical and a given disease, according to the mappings described above. All "shared gene" rows in the source data file for a given C-D pair will be aggregated into a single 'associated_with' edge that reports the inference score as an edge property (and possibly the list of shared genes). This means that for a given C-D pair in the CTD file, there may be 1, 2, or 3 separate edges created in the Translator graph.
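
The aggregation of "shared gene" rows described above might look roughly like the following sketch. It is an illustration under stated assumptions, not the ingest's implementation. It assumes PubMedIDs are pipe-delimited and that the InferenceScore is the same on every row for a given pair (the maximum is kept if they differ). The function name `aggregate_inferred` and the output keys are hypothetical, and the sample rows are made up.

```python
def aggregate_inferred(rows):
    """Collapse inferred rows into one edge per (ChemicalID, DiseaseID) pair."""
    edges = {}
    for row in rows:
        if (row.get("DirectEvidence") or "").strip():
            continue  # curated 'T'/'M' rows become their own edges elsewhere
        key = (row["ChemicalID"], row["DiseaseID"])
        edge = edges.setdefault(key, {
            "subject": row["ChemicalID"],
            "predicate": "biolink:associated_with",
            "object": row["DiseaseID"],
            "inference_score": None,
            "shared_genes": [],
            "publications": [],
        })
        gene = (row.get("InferenceGeneSymbol") or "").strip()
        if gene and gene not in edge["shared_genes"]:
            edge["shared_genes"].append(gene)
        for pmid in (row.get("PubMedIDs") or "").split("|"):
            if pmid and pmid not in edge["publications"]:
                edge["publications"].append(pmid)
        score_text = (row.get("InferenceScore") or "").strip()
        if score_text:
            score = float(score_text)
            if edge["inference_score"] is None or score > edge["inference_score"]:
                edge["inference_score"] = score
    return list(edges.values())


# Made-up rows for one chemical-disease pair: one curated row, two shared-gene rows.
SAMPLE_ROWS = [
    {"ChemicalID": "C000001", "DiseaseID": "MESH:D000001", "DirectEvidence": "T",
     "InferenceGeneSymbol": "", "InferenceScore": "", "PubMedIDs": "9"},
    {"ChemicalID": "C000001", "DiseaseID": "MESH:D000001", "DirectEvidence": "",
     "InferenceGeneSymbol": "G1", "InferenceScore": "4.2", "PubMedIDs": "1|2"},
    {"ChemicalID": "C000001", "DiseaseID": "MESH:D000001", "DirectEvidence": "",
     "InferenceGeneSymbol": "G2", "InferenceScore": "4.2", "PubMedIDs": "2|3"},
]
inferred_edges = aggregate_inferred(SAMPLE_ROWS)
```

In this sample the two shared-gene rows collapse into one 'associated_with' edge, and the curated 'T' row would yield a second, separate edge for the same pair.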

Provenance Information

Contributors: - Kevin Schaper: code author - Evan Morris: code support - Sierra Moxon: code support - Vlado Dancik: code support, domain expertise - Matthew Brush: data modeling, domain expertise

Artifacts: - Ingest Survey: https://docs.google.com/spreadsheets/d/1R9z-vywupNrD_3ywuOt_sntcTrNlGmhiUWDXUdkPVpM/edit?gid=0#gid=0 - Ingest Ticket: https://github.com/NCATSTranslator/Data-Ingest-Coordination-Working-Group/issues/23