EBI Gene2Phenotype Reference Ingest Guide
Source Information
InfoRes ID: infores:ebi-gene2phenotype
Description: EBI's Gene2Phenotype dataset contains high-quality gene-disease associations curated by UK disease domain experts and consultant clinical geneticists. It integrates data on genes, their variants, and related disorders. It is constructed by experts reviewing published literature, and it is primarily an inclusion list to allow targeted filtering of genome-wide data for diagnostic purposes. Each entry associates a gene with a disease, including a confidence level, allelic requirement and molecular mechanism.
Citations:
- https://doi.org/10.1186/s13073-024-01398-1
- https://www.nature.com/articles/s41467-019-10016-3
Data Access Locations:
- Latest data is provided at https://www.ebi.ac.uk/gene2phenotype/download (downloads created on-the-fly)
- Archived static releases are provided on the FTP site at https://ftp.ebi.ac.uk/pub/databases/gene2phenotype/G2P_data_downloads/
Data Provision Mechanisms: file_download
Data Formats: csv
Data Versioning and Releases: Releases are cut and archived roughly every 1-2 months.
- On-the-fly downloads: the creation date and the download date are the same, so either can be used for versioning. Note that the date in the filename may differ from the date in your timezone.
- Static releases: the creation date (shown on the FTP site and in folder/file names) can be used for versioning.
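Because the filename date of an on-the-fly download can differ from the local calendar date, one option is to record the download moment in a single fixed timezone. The following is a minimal sketch, not the ingest's actual code: the function name `download_version`, the choice of UTC, and the `YYYY-MM-DD` output format are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def download_version(moment: Optional[datetime] = None) -> str:
    """Return a version string for an on-the-fly download.

    Hypothetical helper: UTC and the YYYY-MM-DD format are assumptions,
    chosen only so that the version does not depend on the local timezone.
    """
    if moment is None:
        moment = datetime.now(timezone.utc)
    # Convert any timezone-aware datetime to UTC before formatting.
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%d")


# A download made late in the evening at UTC-5 falls on the next day in UTC.
late_evening = datetime(2025, 1, 1, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
version = download_version(late_evening)
```

The example shows how a single download can carry two different calendar dates depending on the timezone used, which is why a fixed reference timezone may be helpful when comparing versions across machines.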
Ingest Information
Ingest Categories: primary_knowledge_provider
Utility: EBI G2P associations are useful as edges in support paths for MVP1 ('what may treat disease X'), and in Pathfinder queries.
Relevant Files
File Name | Location | Description |
---|---|---|
G2P_all_[date].csv | https://www.ebi.ac.uk/gene2phenotype/api/panel/all/download/ | Associations from all panels (disease categories) |
Included Content
File Name | Included Records | Fields Used |
---|---|---|
G2P_all_[date].csv | Records where 'confidence' value is 'definitive', 'strong', or 'moderate' | g2p id, hgnc id, disease mim, disease MONDO, allelic requirement, confidence, molecular mechanism, publications, date of last review |
Filtered Content
File Name | Filtered Records | Rationale |
---|---|---|
G2P_all_[date].csv | Records where 'confidence' value is 'limited', 'disputed', or 'refuted' | Evidence level not sufficient for inclusion |
G2P_all_[date].csv | Records where both the 'disease mim' and 'disease MONDO' columns are empty | No IDs to use for disease nodes |
G2P_all_[date].csv | Records with NodeNorm mapping failures for the node IDs | Failed normalization means that the node would not be connected to other data/nodes in Translator graphs |
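The confidence and disease-ID rules in the tables above can be sketched as a simple row filter. This is an illustrative sketch under stated assumptions, not the ingest's implementation: it assumes the CSV header names match the field names quoted in this guide ('g2p id', 'confidence', 'disease mim', 'disease MONDO'), the function names are hypothetical, and the sample rows are invented. The NodeNorm mapping filter is not covered, since it requires a call to an external service.

```python
import csv
import io

# Confidence values listed under Included Content.
INCLUDED_CONFIDENCE = {"definitive", "strong", "moderate"}


def keep_record(row: dict) -> bool:
    """Return True if a row passes the confidence and disease-ID filters."""
    confidence = (row.get("confidence") or "").strip().lower()
    if confidence not in INCLUDED_CONFIDENCE:
        return False
    # A row needs a value in at least one of the two disease ID columns.
    has_mim = bool((row.get("disease mim") or "").strip())
    has_mondo = bool((row.get("disease MONDO") or "").strip())
    return has_mim or has_mondo


def filter_records(csv_text: str) -> list:
    """Parse CSV text and keep only the rows that pass keep_record."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if keep_record(row)]


# Invented sample rows, for illustration only (column names are assumed).
SAMPLE = """g2p id,confidence,disease mim,disease MONDO
G2P00001,definitive,123456,
G2P00002,limited,234567,MONDO:0000001
G2P00003,strong,,
G2P00004,moderate,,MONDO:0000002
"""

kept = filter_records(SAMPLE)
kept_ids = [row["g2p id"] for row in kept]
```

In this sample, the second row is dropped for its 'limited' confidence and the third for having no disease ID in either column.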
Future Content Considerations
edge_content: Revisit exclusion of 'disputed' and/or 'refuted' records once Translator can model/handle negation better
edge_property_content: Additional edge-level information could be included in future iterations: 'confidence' level values, once modeling of confidence in Biolink is improved/refactored; variant information from the 'variant consequence' and 'variant types' columns, whose values map to SO terms; and the rich evidence and provenance metadata provided by the source (e.g. the types of experiments/methods used to determine the molecular mechanism, and supporting publications).
Target Information
Target InfoRes ID: infores:translator-ebi-gene2phenotype-kgx
Edge Types
Subject Categories | Predicate | Object Categories | Knowledge Level | Agent Type | UI Explanation |
---|---|---|---|---|---|
biolink:Gene | biolink:Disease | knowledge_assertion | manual_agent | EBI G2P curators manually determined through the evaluation of different types of evidence that variants of this gene of the indicated form (e.g. loss of function, gain of function, dominant negative) play a causal role in this disease. |
Node Types
Node Category | Source Identifier Types | Additional Notes |
---|---|---|
biolink:Gene | HGNC | |
biolink:Disease | OMIM, orphanet, MONDO | 'disease mim' column is source of OMIM and orphanet IDs. MONDO IDs from 'disease MONDO' column are only used if row doesn't have a value in 'disease mim' column |
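The precedence rule for disease identifiers described in the table above can be sketched as follows. This is a minimal illustration, assuming the column names match those quoted in this guide; the function name `pick_disease_id` is hypothetical, and the raw column value is returned unchanged because the guide does not specify how IDs are prefixed or normalized.

```python
from typing import Optional


def pick_disease_id(row: dict) -> Optional[str]:
    """Prefer the 'disease mim' value; fall back to 'disease MONDO'.

    Returns None when both columns are empty (such rows are filtered out).
    """
    mim = (row.get("disease mim") or "").strip()
    if mim:
        return mim
    mondo = (row.get("disease MONDO") or "").strip()
    return mondo or None


# Invented values, for illustration only.
both = pick_disease_id({"disease mim": "123456", "disease MONDO": "MONDO:0000002"})
mondo_only = pick_disease_id({"disease mim": "", "disease MONDO": "MONDO:0000002"})
neither = pick_disease_id({"disease mim": "", "disease MONDO": ""})
```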
Future Modeling Considerations
qualifiers: May want to revisit how we handle the 'molecular mechanism' and 'variant types' columns versus the biolink-model qualifier options
qualifiers: Revisit modeling of allelic_requirement (it currently uses a regex pattern to match HP ID syntax, rather than an enumerated list of permissible values)
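To illustrate the trade-off noted for allelic_requirement, the sketch below checks only the syntax of an HP identifier. The pattern is an assumption (the prefix `HP:` followed by seven digits, the usual shape of HPO identifiers) and is not copied from the Biolink model; the helper name `is_hp_id` is hypothetical.

```python
import re

# Assumed pattern: "HP:" followed by exactly seven digits.
HP_ID_PATTERN = re.compile(r"HP:\d{7}")


def is_hp_id(value: str) -> bool:
    """Return True if the value is syntactically shaped like an HP ID."""
    return HP_ID_PATTERN.fullmatch(value.strip()) is not None


well_formed = is_hp_id("HP:0000006")
free_text = is_hp_id("autosomal dominant")
too_short = is_hp_id("HP:123")
```

A syntax check like this accepts any well-formed HP ID, including terms unrelated to inheritance, which is the gap an enumerated list of permissible values would close.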
Provenance Information
Contributors:
- Colleen Xu: code author, data modeling
- Andrew Su: domain expertise
- Sierra Moxon: domain expertise
- Matthew Brush: data modeling, domain expertise
Artifacts:
- GitHub ticket on confidence 'limited' value: https://github.com/biolink/biolink-model/issues/1581
- PR on biolink allelic_requirement: https://github.com/biolink/biolink-model/pull/1576