Jensen Lab DISEASES Database Reference Ingest Guide
Source Information
InfoRes ID: infores:diseases
Description: The DISEASES database is a web resource that integrates knowledge on gene-disease associations. It generates de novo associations through automated text mining, and aggregates associations from external sources of manually curated knowledge and GWAS-based study results. The associations are assigned a confidence score to facilitate comparisons across data types and sources.
Citations:
- https://doi.org/10.1093/database/baac019
- https://www.sciencedirect.com/science/article/pii/S1046202314003831
Terms of Use: CC BY 4.0
Data Access Locations: - https://diseases.jensenlab.org/Downloads
Data Provision Mechanisms: file_download
Data Formats: tsv
Data Versioning and Releases: Updated weekly according to the resource's About page and the paper, probably every weekend (paper: "The corpus in DISEASES is updated every weekend"). The website offers only the latest version for download, and does not give a version or creation date for it. Older, versioned releases are archived at https://figshare.com/authors/Lars_Juhl_Jensen/96428
Ingest Information
Ingest Categories: primary_knowledge_provider, aggregation_provider
Utility: DISEASES contains gene-disease associations from unique sources, including its own text-mining pipeline and external human-curated resources that are hard to access or parse (MedlinePlus, AmyCo). These associations could be used in MVP1 (may treat disease X) or Pathfinder queries.
Scope: This ingest covers text-mined co-occurrence associations and manually curated associations from MedlinePlus and AmyCo. Content aggregated from UniProt is not ingested. Experiment-based associations from TIGA data are not ingested either; we will find a direct source of GWAS-based associations instead (TIGA and/or another resource).
Relevant Files
File Name | Location | Description |
---|---|---|
human_disease_textmining_filtered.tsv | https://diseases.jensenlab.org/Downloads | Text mined associations, filtered to contain only the non-redundant associations that are shown within the web interface when querying for a gene |
human_disease_knowledge_filtered.tsv | https://diseases.jensenlab.org/Downloads | Curated associations, filtered to contain only the non-redundant associations that are shown within the web interface when querying for a gene |
Included Content
File Name | Included Records | Fields Used |
---|---|---|
human_disease_textmining_filtered.tsv | All G-D association records generated by their text-mining tool | gene_id, disease_id, z_score, confidence_score, url |
human_disease_knowledge_filtered.tsv | G-D association records aggregated from MedlinePlus and AmyCo sources | gene_id, disease_id, source_db, confidence_score |
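The field selection in the table above can be sketched as follows. This is a minimal illustration, not the ingest's actual parser. It assumes the downloads are headerless, tab-separated files, and the column orders in `TEXTMINING_COLUMNS` and `KNOWLEDGE_COLUMNS` are assumptions that should be checked against the downloaded files. `read_records` is a hypothetical helper, and the sample row is made up.

```python
import csv
import io

# Assumed column orders for the headerless TSV files (verify before use).
TEXTMINING_COLUMNS = ["gene_id", "gene_name", "disease_id", "disease_name",
                      "z_score", "confidence_score", "url"]
KNOWLEDGE_COLUMNS = ["gene_id", "gene_name", "disease_id", "disease_name",
                     "source_db", "evidence_type", "confidence_score"]


def read_records(handle, columns, fields_used):
    """Yield one dict per row, keeping only the fields the ingest uses."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(columns):
            continue  # skip rows that do not match the assumed layout
        record = dict(zip(columns, row))
        yield {name: record[name] for name in fields_used}


# Usage with an in-memory sample row (all values are placeholders).
sample = io.StringIO(
    "ENSP00000000001\tGENE1\tDOID:0000001\texample disease\t5.1\t2.5\thttps://example.org\n"
)
rows = list(read_records(
    sample, TEXTMINING_COLUMNS,
    ["gene_id", "disease_id", "z_score", "confidence_score", "url"],
))
```

For the real files, an open file object would replace the `io.StringIO` sample, and `KNOWLEDGE_COLUMNS` would be passed for the curated file.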
Filtered Content
File Name | Filtered Records | Rationale |
---|---|---|
human_disease_textmining_filtered.tsv | Records with no ENSP ID in gene ID column or no DOID in disease ID column | Need node IDs that are in NodeNorm's scope. Other values are non-ID strings or IDs that wouldn't be resolved by NodeNorm (AmyCo). |
human_disease_textmining_filtered.tsv | Records that had NodeNorm mapping failure on gene or disease ID | Need node IDs that NodeNorm successfully maps to entities. |
human_disease_knowledge_filtered.tsv | G-D association records aggregated from UniProt | Questionable quality and completeness of UniProt data in DISEASES - best to get this content directly from UniProt. |
human_disease_knowledge_filtered.tsv | Complete duplicates | Only need 1 copy of each unique record |
human_disease_knowledge_filtered.tsv | Records with no ENSP ID in gene ID column or no DOID in disease ID column | Need node IDs that are in NodeNorm's scope. Other values are non-ID strings or IDs that wouldn't be resolved by NodeNorm (AmyCo). |
human_disease_knowledge_filtered.tsv | Records that had NodeNorm mapping failure on gene or disease ID | Need node IDs that NodeNorm successfully maps to entities. |
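The filters listed for the curated file can be sketched as below, under stated assumptions. Records are plain dicts with the field names from the Included Content table. UniProt-derived rows are recognized by a case-insensitive match on "uniprot" in `source_db`, which is an assumption about how the source labels them. The NodeNorm mapping check is not shown, since it requires a call to the normalization service. `is_in_scope` and `filter_knowledge` are hypothetical names, and the sample records are made up.

```python
def is_in_scope(record):
    """True if the gene ID is an ENSP ID and the disease ID is a DOID."""
    return (record["gene_id"].startswith("ENSP")
            and record["disease_id"].startswith("DOID:"))


def filter_knowledge(records):
    """Drop UniProt-derived rows, out-of-scope IDs, and complete duplicates."""
    seen = set()
    kept = []
    for record in records:
        # Assumed labelling: any source_db value containing "uniprot".
        if "uniprot" in record["source_db"].lower():
            continue
        if not is_in_scope(record):
            continue
        key = tuple(sorted(record.items()))  # identifies a complete duplicate
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept


# Placeholder records covering each filter.
records = [
    {"gene_id": "ENSP00000000001", "disease_id": "DOID:0000001",
     "source_db": "MedlinePlus", "confidence_score": "5"},
    {"gene_id": "ENSP00000000001", "disease_id": "DOID:0000001",
     "source_db": "MedlinePlus", "confidence_score": "5"},   # duplicate
    {"gene_id": "ENSP00000000002", "disease_id": "DOID:0000002",
     "source_db": "UniProtKB-KW", "confidence_score": "4"},  # UniProt-derived
    {"gene_id": "ENSP00000000003", "disease_id": "AmyCo:1",
     "source_db": "AmyCo", "confidence_score": "4"},         # no DOID
    {"gene_id": "GENE4", "disease_id": "DOID:0000003",
     "source_db": "AmyCo", "confidence_score": "4"},         # no ENSP ID
]
kept = filter_knowledge(records)
```

The text-mining file would only need the `is_in_scope` check from this sketch, since the UniProt and duplicate filters are listed for the curated file only.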
Future Content Considerations
edge_content: Consider filtering out lower-scoring text-mined associations if we can define a threshold/cutoff - Relevant files: human_disease_textmining_filtered.tsv
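If such a cutoff is defined, the filter itself could be as simple as the sketch below. No threshold has been chosen. `min_confidence` is a hypothetical parameter, the value used in the example is arbitrary, and the records are placeholders that use the `confidence_score` field name from the Included Content table.

```python
def above_threshold(records, min_confidence):
    """Keep records whose confidence_score is at least min_confidence."""
    kept = []
    for record in records:
        try:
            score = float(record["confidence_score"])
        except ValueError:
            continue  # skip rows whose score is not numeric
        if score >= min_confidence:
            kept.append(record)
    return kept


# Placeholder records; the cutoff of 2.0 is arbitrary, not a recommendation.
records = [
    {"gene_id": "ENSP00000000001", "disease_id": "DOID:0000001", "confidence_score": "3.2"},
    {"gene_id": "ENSP00000000002", "disease_id": "DOID:0000002", "confidence_score": "1.1"},
    {"gene_id": "ENSP00000000003", "disease_id": "DOID:0000003", "confidence_score": "n/a"},
]
strong = above_threshold(records, min_confidence=2.0)
```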
Target Information
Target InfoRes ID: infores:translator-jensen-diseases-kgx
Edge Types
Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
---|---|---|---|---|---|
biolink:Gene, biolink:Protein | biolink:occurs_together_in_literature_with | biolink:Disease | statistical_association | text_mining_agent | The DISEASES text-mining method generates associations based on statistically significant co-occurrence of gene and disease concepts in the literature, which is consistent with the definition of the Biolink occurs_together_in_literature_with predicate. |
biolink:Gene, biolink:Protein | biolink:associated_with | biolink:Disease | knowledge_assertion | manual_agent | DISEASES does not report the types of gene-disease relationships that it aggregates from curated sources, so the Biolink associated_with predicate is the most precise predicate we are able to use here. |
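As a rough illustration of how the two edge types above differ, the sketch below maps a source record to an edge dict. It is not the ingest's actual transform. The key names (`subject`, `object`, `predicate`, `knowledge_level`, `agent_type`) are illustrative, `make_edge` and `EDGE_TEMPLATES` are hypothetical, and prefixing the ENSP ID with `ENSEMBL:` is an assumption based on the identifier type this guide lists under Node Types.

```python
# Predicate and provenance values taken from the Edge Types table.
EDGE_TEMPLATES = {
    "textmining": {
        "predicate": "biolink:occurs_together_in_literature_with",
        "knowledge_level": "statistical_association",
        "agent_type": "text_mining_agent",
    },
    "knowledge": {
        "predicate": "biolink:associated_with",
        "knowledge_level": "knowledge_assertion",
        "agent_type": "manual_agent",
    },
}


def make_edge(record, file_kind):
    """Build an edge dict for a record from the given file kind."""
    edge = {
        "subject": "ENSEMBL:" + record["gene_id"],  # assumed CURIE prefix
        "object": record["disease_id"],
    }
    edge.update(EDGE_TEMPLATES[file_kind])
    return edge


# Placeholder record from the curated file.
edge = make_edge(
    {"gene_id": "ENSP00000000001", "disease_id": "DOID:0000001"},
    "knowledge",
)
```

Keeping the per-file values in one lookup table makes it harder for the predicate, knowledge level, and agent type of an edge to drift apart.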
Node Types
Node Category | Source Identifier Types | Additional Notes |
---|---|---|
biolink:Gene | ENSEMBL | Source uses the ENSP (protein) identifiers from Ensembl |
biolink:Protein | ENSEMBL | Source uses the ENSP (protein) identifiers from Ensembl |
biolink:Disease | DOID | |
Future Modeling Considerations
predicates: Revisit use of associated_with predicate for curated edges after we refactor the associated_with and/or gene-disease-relationship branches of the Biolink predicate hierarchy (if we reserve this predicate for statistically-based relationships, we may need to use related_to)
edge_properties: Revisit modeling of confidence score/levels and z-score if/when we refactor these parts of the Biolink Model
Provenance Information
Contributors:
- Colleen Xu - code author, data modeling
- Andrew Su - code support, domain expertise
- Matthew Brush - data modeling, domain expertise
Artifacts: - GitHub Ticket: https://github.com/NCATSTranslator/Data-Ingest-Coordination-Working-Group/issues/13