CTD Resource Ingest Guide
Source Information
InfoRes ID: infores:ctd
Description: CTD is a robust, publicly available database that aims to advance understanding of how environmental exposures affect human health. It provides knowledge, manually curated from the literature, about chemicals and their relationships to other biological entities: chemical to gene/protein interactions, plus chemical to disease and gene to disease relationships. These data are integrated with functional and pathway data to aid in the development of hypotheses about the mechanisms underlying environmentally influenced diseases. CTD also generates novel inferences by further analyzing the knowledge it curates, based on statistically significant connections with an intermediate concept (e.g., Chemical X is associated with Disease Y based on shared associations with a common set of genes).
Citations: - Davis AP, Wiegers TC, Johnson RJ, Sciaky D, Wiegers J, Mattingly CJ. Comparative Toxicogenomics Database (CTD): update 2023. Nucleic Acids Res. 2022 Sep 28.
Data Access Locations: - CTD Bulk Downloads - http://ctdbase.org/downloads/ (this page includes file sizes and simple data dictionaries for each download) - CTD Catalog - https://ctdbase.org/reports/ (a simple list of files, reports the number of rows in each file)
Data Provision Mechanisms: file_download
Data Formats: tsv, csv, obo, xml
Data Versioning and Releases: There is no consistent release cadence, but on average there are 1-2 releases each month. Versioning is based on the month and year of the release. Releases page / change log: https://ctdbase.org/about/changes/ Latest status page: https://ctdbase.org/about/dataStatus.go
Ingest Information
Ingest Categories: primary_knowledge_provider
Utility: CTD is a rich source of manually curated chemical associations to other biological entities which are an important type of edge for Translator query and reasoning use cases, including treatment predictions, chemical-gene regulation predictions, and pathfinder queries. It is one of the few sources that focus on non-drug chemicals, e.g. environmental stressors, and how these are related to diseases, biological processes, and genes.
Scope: This initial ingest of CTD covers curated Chemical to Disease associations that report therapeutic and marker/mechanism relationships, and inferred statistical associations generated by CTD. Additional types of Chemical associations will be added in future updates to the ingest.
Relevant Files
File Name | Location | Description |
---|---|---|
CTD_chemicals_diseases.tsv.gz | http://ctdbase.org/downloads/ | Manually curated and computationally inferred associations between chemicals and diseases |
CTD_exposure_events.tsv.gz | http://ctdbase.org/downloads/ | Descriptions of statistical studies of how exposure to chemicals affects a particular population, with some records providing outcomes |
Included Content
File Name | Included Records | Fields Used |
---|---|---|
CTD_chemicals_diseases.tsv.gz | Curated therapeutic and marker/mechanism associations (records where the "DirectEvidence" column is populated with type "T" or "M"), as well as inferred associations (records lacking a value in the DirectEvidence column) | ChemicalName, ChemicalID, CasRN, DiseaseName, DiseaseID, DirectEvidence, InferenceGeneSymbol, InferenceScore, OmimIDs, PubMedIDs |
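The record selection described in the table above could be sketched roughly as follows. This is an illustrative sketch, not the ingest's actual code: the function names `classify_record` and `read_records` are hypothetical, the column order is assumed to match the "Fields Used" list, comment lines are assumed to start with `#`, and the sample rows are invented placeholders. If the download spells the evidence types out rather than using the one-letter codes "T" and "M", the lookup would need adjusting.

```python
import csv
import io

# Column order taken from the "Fields Used" cell; assumed to match the file.
FIELDS = [
    "ChemicalName", "ChemicalID", "CasRN", "DiseaseName", "DiseaseID",
    "DirectEvidence", "InferenceGeneSymbol", "InferenceScore",
    "OmimIDs", "PubMedIDs",
]

# Assumption: evidence codes appear as "T" / "M", as described in this guide.
EVIDENCE_KINDS = {"T": "therapeutic", "M": "marker_mechanism"}


def classify_record(row):
    """Return the record kind, or None for an unrecognized evidence value."""
    evidence = (row.get("DirectEvidence") or "").strip()
    if evidence == "":
        return "inferred"  # no DirectEvidence value means an inferred record
    return EVIDENCE_KINDS.get(evidence)


def read_records(handle):
    """Yield (kind, row) pairs from a tab-separated text handle."""
    # Assumption: lines starting with '#' are comments or header text.
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(
        data_lines, fieldnames=FIELDS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        kind = classify_record(row)
        if kind is not None:
            yield kind, row


# Invented placeholder rows, for illustration only.
SAMPLE = (
    "# comment line\n"
    "Chem A\tC1\t\tDisease X\tMESH:D1\tT\t\t\t\t111|222\n"
    "Chem A\tC1\t\tDisease X\tMESH:D1\t\tGENE1\t4.5\t\t333\n"
    "Chem B\tC2\t\tDisease Y\tMESH:D2\tM\t\t\t\t444\n"
)

for kind, row in read_records(io.StringIO(SAMPLE)):
    print(kind, row["ChemicalID"], row["DiseaseID"])
```

For the real file, the same function could be given a text handle from `gzip.open(path, "rt")`.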
Filtered Content
File Name | Filtered Records | Rationale |
---|---|---|
CTD_chemicals_diseases.tsv.gz | None | All records are currently taken, with no publication count or inference score cutoffs; these may be added in future iterations |
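If cutoffs are added later, they might look something like the sketch below. Everything here is hypothetical: the function name `passes_cutoffs`, its parameters, and the threshold values in the usage lines are illustrative and are not settings chosen by this ingest. It also assumes that PubMedIDs is a pipe-delimited list. Leaving both cutoffs as `None` reproduces the current behavior of keeping every record.

```python
def passes_cutoffs(row, min_score=None, min_pubs=None):
    """Return True if a row meets the optional cutoffs.

    A cutoff of None is disabled, which matches the current ingest
    behavior of keeping all records.
    """
    if min_score is not None:
        try:
            score = float(row.get("InferenceScore") or "")
        except ValueError:
            return False  # no usable score, so the row cannot meet the cutoff
        if score < min_score:
            return False
    if min_pubs is not None:
        # Assumption: PubMedIDs is a pipe-delimited list of identifiers.
        pubs = [p for p in (row.get("PubMedIDs") or "").split("|") if p]
        if len(pubs) < min_pubs:
            return False
    return True


# Placeholder row and thresholds, for illustration only.
row = {"InferenceScore": "2.1", "PubMedIDs": "111|222"}
print(passes_cutoffs(row))                  # no cutoffs configured
print(passes_cutoffs(row, min_score=5.0))   # hypothetical score cutoff
print(passes_cutoffs(row, min_pubs=2))      # hypothetical publication cutoff
```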
Future Content Considerations
edge_content: Consider adding a threshold / cutoff to remove lower quality/confidence inferences (e.g. based on shared gene count, publication count, or inference score). At present we include even inferences based on a single shared gene/pub, which are not really meaningful. - Relevant files: CTD_chemicals_diseases.tsv.gz
edge_content: Consider ingesting additional chemical-disease edges reporting statistical correlations from environmental exposure studies from CTD_exposure_events.tsv.gz. This is a unique/novel source for this kind of knowledge, but there is not a lot of data here, and utility is not yet clear. - Relevant files: CTD_exposure_events.tsv.gz
edge_content: While the current ingest includes only Chemical-Disease Associations, future iterations will include additional types of associations between Chemicals and GO Terms, Molecular Phenotypes, Genes, etc. See the Ingest Survey table linked below for more details.
node_property_content: MolePro included chemical properties in its previous ingests, which we will likely bring in at some point.
edge_property_content: Consider an edge property that reports the list of shared genes supporting C-D inferred associations. - Relevant files: CTD_chemicals_diseases.tsv.gz
Additional Notes: none
Target Information
Target InfoRes ID: infores:translator-ctd-kgx
Edge Types
Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
---|---|---|---|---|---|
biolink:ChemicalEntity | biolink:treats_or_applied_or_studied_to_treat | biolink:DiseaseOrPhenotypicFeature | knowledge_assertion | manual_agent | CTD Chemical-Disease records with a "T" (therapeutic) DirectEvidence code indicate the chemical to be a "potential" treatment in virtue of its clinical use or study, which maps best to the Biolink predicate 'treats_or_applied_or_studied_to_treat'. |
biolink:ChemicalEntity | biolink:correlated_with | biolink:DiseaseOrPhenotypicFeature | knowledge_assertion | manual_agent | CTD Chemical-Disease records with a DirectEvidence code of "M" (marker/mechanism) indicate that the chemical is manually flagged as a marker or contributing factor for a condition. This implies that at minimum there is a correlation between the presence of the chemical and the condition, for which we use the Biolink 'correlated_with' predicate. |
biolink:ChemicalEntity | biolink:associated_with | biolink:DiseaseOrPhenotypicFeature | statistical_association | data_analysis_pipeline | CTD Chemical-Disease records with an inference score have a statistically significant number of shared gene associations, which suggests that a biological relationship may exist. The statistical basis of this general inferred relationship is best reported using the Biolink 'associated_with' predicate. |
Node Types
Node Category | Source Identifier Types | Additional Notes |
---|---|---|
biolink:ChemicalEntity | MeSH | Majority are Biolink SmallMolecules |
biolink:DiseaseOrPhenotypicFeature | MeSH | |
Future Modeling Considerations
edge_properties: Revisit use of 'has_confidence_score' edge property if/when we refactor this part of the Biolink Model.
predicates: Revisit 'correlated_with' and 'treats_or_applied_or_studied_to_treat' predicates if/when we refactor modeling or conventions here.
Additional Notes: The CTD_chemicals_diseases.tsv.gz data include one row per curated 'T' or 'M' association with pub reference(s), plus one row per shared gene association with pub reference(s) and an inference score. Separate edges will be created for each type of association reported between a chemical and a given disease, according to the mappings described above. All "shared gene" rows in the source data file for a given C-D pair will be aggregated into a single 'associated_with' edge that carries the inference score as an edge property (and possibly the list of shared genes). This means that for a given C-D pair in the CTD file, there may be 1, 2, or 3 separate edges created in the Translator graph.
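The per-pair edge creation and aggregation described in the note above could be sketched as follows. This is a simplified illustration under stated assumptions rather than the ingest's actual code: `build_edges` and `split_ids` are hypothetical helpers, the output dictionary keys are illustrative rather than the exact KGX property names (apart from 'has_confidence_score', which is mentioned above), PubMedIDs is assumed to be pipe-delimited, and keeping the maximum score per pair is a simple choice made for this sketch. The input rows are invented placeholders.

```python
# Predicate, knowledge level, and agent type per DirectEvidence code,
# following the Edge Types mappings described in this guide.
CURATED = {
    "T": "biolink:treats_or_applied_or_studied_to_treat",
    "M": "biolink:correlated_with",
}
INFERRED_PREDICATE = "biolink:associated_with"


def split_ids(value):
    """Split a pipe-delimited cell into non-empty values (assumed format)."""
    return [v for v in (value or "").split("|") if v]


def build_edges(rows):
    """Create at most one edge per (chemical, predicate, disease) triple."""
    edges = {}
    for row in rows:
        evidence = (row.get("DirectEvidence") or "").strip()
        if evidence in CURATED:
            predicate = CURATED[evidence]
            level, agent = "knowledge_assertion", "manual_agent"
        elif evidence == "":
            predicate = INFERRED_PREDICATE
            level, agent = "statistical_association", "data_analysis_pipeline"
        else:
            continue  # unrecognized evidence value; skipped in this sketch
        key = (row["ChemicalID"], predicate, row["DiseaseID"])
        edge = edges.setdefault(key, {
            "subject": row["ChemicalID"],
            "predicate": predicate,
            "object": row["DiseaseID"],
            "knowledge_level": level,
            "agent_type": agent,
            "publications": set(),
        })
        edge["publications"].update(split_ids(row.get("PubMedIDs")))
        if predicate == INFERRED_PREDICATE:
            # Aggregate all shared-gene rows for the pair into one edge.
            gene = (row.get("InferenceGeneSymbol") or "").strip()
            if gene:
                edge.setdefault("shared_genes", []).append(gene)
            score_text = (row.get("InferenceScore") or "").strip()
            if score_text:
                score = float(score_text)
                previous = edge.get("has_confidence_score", score)
                edge["has_confidence_score"] = max(previous, score)
    return list(edges.values())


# Invented placeholder rows for one C-D pair, for illustration only.
rows = [
    {"ChemicalID": "C1", "DiseaseID": "D1", "DirectEvidence": "T", "PubMedIDs": "1|2"},
    {"ChemicalID": "C1", "DiseaseID": "D1", "DirectEvidence": "M", "PubMedIDs": "3"},
    {"ChemicalID": "C1", "DiseaseID": "D1", "DirectEvidence": "",
     "InferenceGeneSymbol": "G1", "InferenceScore": "4.5", "PubMedIDs": "4"},
    {"ChemicalID": "C1", "DiseaseID": "D1", "DirectEvidence": "",
     "InferenceGeneSymbol": "G2", "InferenceScore": "4.5", "PubMedIDs": "5"},
]
for edge in build_edges(rows):
    print(edge["predicate"], sorted(edge["publications"]))
```

With these placeholder rows the pair yields three edges, one per association type, and both shared-gene rows collapse into the single 'associated_with' edge.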
Provenance Information
Contributors: - Kevin Schaper: code author - Evan Morris: code support - Sierra Moxon: code support - Vlado Dancik: code support, domain expertise - Matthew Brush: data modeling, domain expertise
Artifacts: - Ingest Survey: https://docs.google.com/spreadsheets/d/1R9z-vywupNrD_3ywuOt_sntcTrNlGmhiUWDXUdkPVpM/edit?gid=0#gid=0 - Ingest Ticket: https://github.com/NCATSTranslator/Data-Ingest-Coordination-Working-Group/issues/23