Jensen Lab DISEASES Database Reference Ingest Guide

Source Information

InfoRes ID: infores:diseases

Description: The DISEASES database is a web resource that integrates knowledge on gene-disease associations. It generates de novo associations through automated text mining, and aggregates associations from external sources of manually curated knowledge and GWAS-based study results. The associations are assigned a confidence score to facilitate comparisons across data types and sources.

Citations:
- https://doi.org/10.1093/database/baac019
- https://www.sciencedirect.com/science/article/pii/S1046202314003831

Terms of Use: CC BY 4.0

Data Access Locations:
- https://diseases.jensenlab.org/Downloads

Data Provision Mechanisms: file_download

Data Formats: tsv

Data Versioning and Releases: Updated weekly according to the resource's About page and paper; the paper specifies that "the corpus in DISEASES is updated every weekend". The website offers only the latest version for download, and that download carries no version or creation date. Older, versioned releases are archived at https://figshare.com/authors/Lars_Juhl_Jensen/96428

Ingest Information

Ingest Categories: primary_knowledge_provider, aggregation_provider

Utility: DISEASES contains gene-disease associations from unique sources, including its own text-mining pipeline and external human-curated resources that are hard to access or parse (MedlinePlus, AmyCo). These associations could be used in MVP1 (may treat disease X) or Pathfinder queries.

Scope: This ingest covers text-mined co-occurrence associations and manually curated associations from MedlinePlus and AmyCo. Content aggregated from UniProt is not ingested. Experiment-based associations from TIGA data are not ingested either; we will find a direct source of GWAS-based associations (TIGA and/or something else).

Relevant Files

| File Name | Location | Description |
| --- | --- | --- |
| human_disease_textmining_filtered.tsv | https://diseases.jensenlab.org/Downloads | Text mined associations, filtered to contain only the non-redundant associations that are shown within the web interface when querying for a gene |
| human_disease_knowledge_filtered.tsv | https://diseases.jensenlab.org/Downloads | Curated associations, filtered to contain only the non-redundant associations that are shown within the web interface when querying for a gene |

Included Content

| File Name | Included Records | Fields Used |
| --- | --- | --- |
| human_disease_textmining_filtered.tsv | All G-D association records generated by their text-mining tool | gene_id, disease_id, z_score, confidence_score, url |
| human_disease_knowledge_filtered.tsv | G-D association records aggregated from MedlinePlus and AmyCo sources | gene_id, disease_id, source_db, confidence_score |

Filtered Content

| File Name | Filtered Records | Rationale |
| --- | --- | --- |
| human_disease_textmining_filtered.tsv | Records with no ENSP ID in gene ID column or no DOID in disease ID column | Need node IDs that are in NodeNorm's scope. Other values are non-ID strings or IDs that wouldn't be resolved by NodeNorm (AmyCo). |
| human_disease_textmining_filtered.tsv | Records that had NodeNorm mapping failure on gene or disease ID | Need node IDs that NodeNorm successfully maps to entities. |
| human_disease_knowledge_filtered.tsv | G-D association records aggregated from UniProt | Questionable quality and completeness of UniProt data in DISEASES; best to get this content directly from UniProt. |
| human_disease_knowledge_filtered.tsv | Complete duplicates | Only need 1 copy of each unique record. |
| human_disease_knowledge_filtered.tsv | Records with no ENSP ID in gene ID column or no DOID in disease ID column | Need node IDs that are in NodeNorm's scope. Other values are non-ID strings or IDs that wouldn't be resolved by NodeNorm (AmyCo). |
| human_disease_knowledge_filtered.tsv | Records that had NodeNorm mapping failure on gene or disease ID | Need node IDs that NodeNorm successfully maps to entities. |
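
The pre-NodeNorm filters listed above can be sketched roughly as follows. This is a minimal illustration, not the ingest's actual code: it assumes each record has already been parsed into a dict keyed by the field labels used in this guide (gene_id, disease_id, source_db), that ENSP IDs start with "ENSP" and DOIDs with "DOID:", and that the source_db values are spelled "MedlinePlus" and "AmyCo". The function name filter_rows is hypothetical. The NodeNorm mapping-failure filter is not shown, since it depends on a call to the normalization service.

```python
from typing import Iterable, Iterator, Optional, Set


def filter_rows(
    rows: Iterable[dict],
    allowed_sources: Optional[Set[str]] = None,
) -> Iterator[dict]:
    """Yield rows that pass the ID-prefix, source, and duplicate filters.

    Field names and ID prefixes are assumptions made for this sketch.
    """
    seen = set()
    for row in rows:
        # Curated file only: keep records from the allowed sources,
        # which drops UniProt-derived records.
        if allowed_sources is not None and row.get("source_db") not in allowed_sources:
            continue
        # Keep only ENSP gene/protein IDs and DOID disease IDs.
        if not row["gene_id"].startswith("ENSP"):
            continue
        if not row["disease_id"].startswith("DOID:"):
            continue
        # Drop complete duplicates (every field identical).
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        yield row
```

For the curated file this would be called as filter_rows(rows, allowed_sources={"MedlinePlus", "AmyCo"}); for the text-mined file, allowed_sources can be left unset so that only the ID-prefix and duplicate checks apply.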

Future Content Considerations

edge_content: Consider filtering out lower-scoring text-mined associations if we can define a score threshold/cutoff. Relevant files: human_disease_textmining_filtered.tsv
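
One way to inform the choice of a cutoff is to count how many text-mined records would survive each candidate value. The sketch below assumes the scores have already been read from the file as floats; the function name surviving_counts and any cutoff values passed to it are hypothetical, and no threshold has been chosen for this ingest.

```python
from typing import Dict, Iterable, List


def surviving_counts(scores: Iterable[float], cutoffs: List[float]) -> Dict[float, int]:
    """Return, for each candidate cutoff, how many scores are >= that cutoff."""
    scores = list(scores)  # allow a generator to be scanned once per cutoff
    return {cutoff: sum(1 for s in scores if s >= cutoff) for cutoff in cutoffs}
```

Running this over the confidence_score (or z_score) column for a handful of candidate cutoffs shows how much content each choice would remove before any decision is made.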

Target Information

Target InfoRes ID: infores:translator-jensen-diseases-kgx

Edge Types

| Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
| --- | --- | --- | --- | --- | --- |
| biolink:Gene, biolink:Protein | biolink:occurs_together_in_literature_with | biolink:Disease | statistical_association | text_mining_agent | The DISEASES text-mining method generates associations based on statistically significant co-occurrence of gene and disease concepts in the literature, which is consistent with the definition of the Biolink occurs_together_in_literature_with predicate. |
| biolink:Gene, biolink:Protein | biolink:associated_with | biolink:Disease | knowledge_assertion | manual_agent | DISEASES does not report the types of gene-disease relationships that it aggregates from curated sources, so the Biolink associated_with predicate is the most precise predicate we are able to use here. |

Node Types

| Node Category | Source Identifier Types | Additional Notes |
| --- | --- | --- |
| biolink:Gene | ENSEMBL | Source uses the ENSP (protein) identifiers from Ensembl |
| biolink:Protein | ENSEMBL | Source uses the ENSP (protein) identifiers from Ensembl |
| biolink:Disease | DOID | |
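
Taken together, the edge and node types above suggest an edge record along the following lines. This is only an illustrative sketch: the dict layout, the names TEXT_MINING_EDGE, CURATED_EDGE, and make_edge, and the assumption that gene IDs arrive as bare ENSP strings while disease IDs already carry the "DOID:" prefix are not taken from the ingest code. The predicate, knowledge level, and agent type values are copied from the Edge Types table.

```python
# Edge-level values copied from the Edge Types table.
TEXT_MINING_EDGE = {
    "predicate": "biolink:occurs_together_in_literature_with",
    "knowledge_level": "statistical_association",
    "agent_type": "text_mining_agent",
}
CURATED_EDGE = {
    "predicate": "biolink:associated_with",
    "knowledge_level": "knowledge_assertion",
    "agent_type": "manual_agent",
}


def make_edge(row: dict, template: dict) -> dict:
    """Build one gene-disease edge from a parsed row and an edge template."""
    return {
        # Assumption: gene_id is a bare ENSP string that needs the ENSEMBL prefix.
        "subject": f"ENSEMBL:{row['gene_id']}",
        # Assumption: disease_id is already a DOID CURIE.
        "object": row["disease_id"],
        **template,
    }
```

A text-mined row would be passed with TEXT_MINING_EDGE and a MedlinePlus or AmyCo row with CURATED_EDGE; scores and other edge properties are left out here because their modeling is still open.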

Future Modeling Considerations

predicates: Revisit use of associated_with predicate for curated edges after we refactor the associated_with and/or gene-disease-relationship branches of the Biolink predicate hierarchy (if we reserve this predicate for statistically-based relationships, we may need to use related_to)

edge_properties: Revisit modeling of confidence score/levels and z-score if/when we refactor these parts of the Biolink Model

Provenance Information

Contributors:
- Colleen Xu - code author, data modeling
- Andrew Su - code support, domain expertise
- Matthew Brush - data modeling, domain expertise

Artifacts:
- GitHub Ticket: https://github.com/NCATSTranslator/Data-Ingest-Coordination-Working-Group/issues/13