EBI Gene2Phenotype Reference Ingest Guide

Source Information

InfoRes ID: infores:ebi-gene2phenotype

Description: EBI's Gene2Phenotype dataset contains high-quality gene-disease associations curated by UK disease domain experts and consultant clinical geneticists. It integrates data on genes, their variants, and related disorders. It is constructed by experts reviewing published literature, and it is primarily an inclusion list to allow targeted filtering of genome-wide data for diagnostic purposes. Each entry associates a gene with a disease, including a confidence level, allelic requirement and molecular mechanism.

Citations:
- https://doi.org/10.1186/s13073-024-01398-1
- https://www.nature.com/articles/s41467-019-10016-3

Data Access Locations:
- Latest data is provided at https://www.ebi.ac.uk/gene2phenotype/download (downloads created on-the-fly)
- Archived static releases are provided on the FTP site at https://ftp.ebi.ac.uk/pub/databases/gene2phenotype/G2P_data_downloads/

Data Provision Mechanisms: file_download

Data Formats: csv

Data Versioning and Releases: Releases are cut and archived roughly every 1-2 months.
- On-the-fly downloads: the creation and download dates are the same, so either can be used for versioning. Note that the date in the filename may differ from the date in your timezone.
- Static releases: the creation date (shown on the FTP site and in folder/file names) can be used for versioning.
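As a rough illustration of using the filename date as a version string, the sketch below pulls the date out of a `G2P_all_[date].csv` filename. The guide does not specify the date format, so the ISO `YYYY-MM-DD` pattern here is an assumption, and `version_from_filename` is a hypothetical helper name, not part of any existing ingest code.

```python
import re
from datetime import date

# ASSUMPTION: the [date] part of the filename is ISO formatted (YYYY-MM-DD).
# Adjust the pattern if the real files use a different layout.
FILENAME_PATTERN = re.compile(r"^G2P_all_(\d{4}-\d{2}-\d{2})\.csv$")


def version_from_filename(filename):
    """Return the date embedded in the filename, or None if it is absent or invalid."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    try:
        # Round-trip through date to reject impossible dates such as month 13.
        return date.fromisoformat(match.group(1)).isoformat()
    except ValueError:
        return None
```

Because the filename date may differ from the local date, it is probably safer to take the version from the filename than from the local clock at download time.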

Ingest Information

Ingest Categories: primary_knowledge_provider

Utility: EBI G2P associations are useful as edges in support paths for MVP1 ('what may treat disease X'), and in Pathfinder queries.

Relevant Files

| File Name | Location | Description |
| --- | --- | --- |
| G2P_all_[date].csv | https://www.ebi.ac.uk/gene2phenotype/api/panel/all/download/ | Associations from all panels (disease categories) |

Included Content

| File Name | Included Records | Fields Used |
| --- | --- | --- |
| G2P_all_[date].csv | Records where 'confidence' value is 'definitive', 'strong', or 'moderate' | g2p id, hgnc id, disease mim, disease MONDO, allelic requirement, confidence, molecular mechanism, publications, date of last review |

Filtered Content

| File Name | Filtered Records | Rationale |
| --- | --- | --- |
| G2P_all_[date].csv | Records where 'confidence' value is 'limited', 'disputed', or 'refuted' | Evidence level not sufficient for inclusion |
| G2P_all_[date].csv | Records with no values in both 'disease mim' and 'disease MONDO' columns | No IDs to use for disease nodes |
| G2P_all_[date].csv | Records with NodeNorm mapping failures for the node IDs | Failed normalization means that the node would not be connected to other data/nodes in Translator graphs |
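The first two filtering rules can be sketched in a few lines of Python. This is a minimal illustration, not the actual ingest code: the column names are taken from the field names listed in this guide, `keep_record` and `filter_records` are hypothetical helpers, the sample rows are made up, and the NodeNorm check is omitted because it depends on an external service.

```python
import csv
import io

# Confidence values that this guide lists as included.
INCLUDED_CONFIDENCE = {"definitive", "strong", "moderate"}


def keep_record(row):
    """Apply the confidence and disease-ID filters to one CSV row (a dict)."""
    confidence = (row.get("confidence") or "").strip().lower()
    if confidence not in INCLUDED_CONFIDENCE:
        return False
    has_mim = bool((row.get("disease mim") or "").strip())
    has_mondo = bool((row.get("disease MONDO") or "").strip())
    # A record needs at least one disease identifier to build a disease node.
    return has_mim or has_mondo


def filter_records(csv_text):
    """Parse CSV text and return only the rows that pass keep_record."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if keep_record(row)]


# Made-up rows for illustration only; these are not real G2P records.
SAMPLE = (
    "g2p id,hgnc id,disease mim,disease MONDO,confidence\n"
    "G2P00001,1,100000,,definitive\n"
    "G2P00002,2,,,strong\n"
    "G2P00003,3,100001,MONDO:0000001,limited\n"
    "G2P00004,4,,MONDO:0000002,moderate\n"
)

kept_ids = [row["g2p id"] for row in filter_records(SAMPLE)]
print(kept_ids)  # ['G2P00001', 'G2P00004']
```

The second sample row is dropped for lacking any disease ID and the third for its 'limited' confidence, mirroring the first two rows of the table above.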

Future Content Considerations

edge_content: Revisit exclusion of 'disputed' and/or 'refuted' records once Translator can model/handle negation better

edge_property_content: Additional edge-level information that could be included in future iterations:
- 'confidence' level values, once the modeling of confidence in Biolink is improved/refactored
- Variant information ('variant consequence' and 'variant types' columns), whose values map to SO terms
- Rich evidence and provenance metadata provided by the source (e.g. the types of experiments/methods used to determine the molecular mechanism, and supporting publications)

Target Information

Target InfoRes ID: infores:translator-ebi-gene2phenotype-kgx

Edge Types

| Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
| --- | --- | --- | --- | --- | --- |
| biolink:Gene | | biolink:Disease | knowledge_assertion | manual_agent | EBI G2P curators manually determined through the evaluation of different types of evidence that variants of this gene of the indicated form (e.g. loss of function, gain of function, dominant negative) play a causal role in this disease. |

Node Types

| Node Category | Source Identifier Types | Additional Notes |
| --- | --- | --- |
| biolink:Gene | HGNC | |
| biolink:Disease | OMIM, orphanet, MONDO | 'disease mim' column is the source of OMIM and orphanet IDs. MONDO IDs from the 'disease MONDO' column are only used if the row doesn't have a value in the 'disease mim' column |
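The disease identifier precedence described in the notes above could look roughly like this. It is a sketch under stated assumptions: `pick_disease_id` is a hypothetical helper, rows are assumed to be dicts keyed by the column names used in this guide, and values are returned as found in the file without any prefix rewriting.

```python
def pick_disease_id(row):
    """Prefer the 'disease mim' value; fall back to 'disease MONDO' only if it is empty."""
    mim = (row.get("disease mim") or "").strip()
    if mim:
        # Per this guide, this column carries both OMIM and orphanet IDs.
        return mim
    mondo = (row.get("disease MONDO") or "").strip()
    # None signals that the record has no usable disease ID and should be filtered out.
    return mondo or None
```

Returning `None` for rows with neither value lets the caller apply the "no IDs to use for disease nodes" filter in the same pass.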

Future Modeling Considerations

qualifiers: May want to revisit how we handle the 'molecular mechanism' and 'variant types' columns versus the biolink-model qualifier options

qualifiers: Revisit modeling of allelic_requirement (it currently uses a regex pattern that matches HP ID syntax, rather than an enumerated list of permissible values)

Provenance Information

Contributors:
- Colleen Xu: code author, data modeling
- Andrew Su: domain expertise
- Sierra Moxon: domain expertise
- Matthew Brush: data modeling, domain expertise

Artifacts:
- GitHub ticket on confidence 'limited' value: https://github.com/biolink/biolink-model/issues/1581
- PR on biolink allelic_requirement: https://github.com/biolink/biolink-model/pull/1576