Source#

A Source can be implemented for any file or store, local or remote, that can contain a graph. A Source is responsible for reading nodes and edges from the graph.

A Source must subclass the kgx.source.source.Source class and must implement the following methods:

  • parse

  • read_nodes

  • read_edges

parse method

  • Responsible for parsing a graph from a file/store

  • Must return a generator that yields node and edge records from the graph

read_nodes method

  • Responsible for reading nodes from the file/store

  • Must return a generator that yields node records

  • Each node record must be a 2-tuple (node_id, node_data) where,

    • node_id is the node CURIE

    • node_data is a dictionary that represents the node properties

read_edges method

  • Responsible for reading edges from the file/store

  • Must return a generator that yields edge records

  • Each edge record must be a 4-tuple (subject_id, object_id, edge_key, edge_data) where,

    • subject_id is the subject node CURIE

    • object_id is the object node CURIE

    • edge_key is the unique key for the edge

    • edge_data is a dictionary that represents the edge properties
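The contract above can be sketched as follows. To stay runnable without KGX installed, this example uses a hypothetical stand-in class (`InMemorySource`, not part of KGX) instead of subclassing kgx.source.source.Source; a real implementation would subclass that class and accept its `owner` argument. The input dictionaries and the choice to yield nodes before edges are assumptions made for illustration.

```python
from typing import Any, Dict, Generator, List, Tuple

NodeRecord = Tuple[str, Dict[str, Any]]
EdgeRecord = Tuple[str, str, str, Dict[str, Any]]


class InMemorySource:
    """Hypothetical stand-in for a Source subclass (not part of KGX)."""

    def __init__(self, nodes: List[Dict[str, Any]], edges: List[Dict[str, Any]]):
        self._nodes = nodes
        self._edges = edges

    def read_nodes(self) -> Generator[NodeRecord, None, None]:
        # Each node record is a 2-tuple: (node_id, node_data)
        for node in self._nodes:
            yield node["id"], node

    def read_edges(self) -> Generator[EdgeRecord, None, None]:
        # Each edge record is a 4-tuple:
        # (subject_id, object_id, edge_key, edge_data)
        for edge in self._edges:
            yield edge["subject"], edge["object"], edge["id"], edge

    def parse(self) -> Generator[tuple, None, None]:
        # Assumed ordering: all node records first, then all edge records
        yield from self.read_nodes()
        yield from self.read_edges()


source = InMemorySource(
    nodes=[{"id": "HGNC:11603", "category": ["biolink:Gene"]}],
    edges=[
        {
            "id": "e1",
            "subject": "HGNC:11603",
            "object": "MONDO:0005002",
            "predicate": "biolink:related_to",
        }
    ],
)
records = list(source.parse())
```

Because every method returns a generator, a consumer can process records one at a time without loading the whole graph into memory.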

kgx.source.source#

Base class for all Sources in KGX.

class kgx.source.source.Source(owner)[source]#

Bases: object

A Source is responsible for reading data as records from a store where the store is a file or a database.

check_edge_filter(edge: Dict) bool[source]#

Check if an edge passes defined edge filters.

Parameters:

edge (Dict) – An edge

Returns:

Whether the given edge has passed all defined edge filters

Return type:

bool

check_node_filter(node: Dict) bool[source]#

Check if a node passes defined node filters.

Parameters:

node (Dict) – A node

Returns:

Whether the given node has passed all defined node filters

Return type:

bool
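As a rough illustration of what a filter check could look like, the sketch below treats filters as a mapping from property name to a set of allowed values, and passes a record only if every filtered property has at least one allowed value. These matching rules and the helper name `passes_filters` are assumptions for illustration; the exact semantics KGX applies are not described here.

```python
from typing import Any, Dict, Set


def passes_filters(record: Dict[str, Any], filters: Dict[str, Set[str]]) -> bool:
    """Return True if the record matches every filter (assumed semantics)."""
    for key, allowed in filters.items():
        if key not in record:
            return False
        value = record[key]
        # Normalize scalar and list-valued properties to a set of values
        values = set(value) if isinstance(value, (list, set, tuple)) else {value}
        if not values & allowed:
            return False
    return True


node_filters = {"category": {"biolink:Gene"}}
gene = {"id": "HGNC:11603", "category": ["biolink:Gene"]}
disease = {"id": "MONDO:0005002", "category": ["biolink:Disease"]}
```

With these inputs, `passes_filters(gene, node_filters)` is true and `passes_filters(disease, node_filters)` is false; an empty filter mapping lets every record through.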

clear_graph_metadata()[source]#

Clears a Source graph’s internal graph_metadata. The value of such graph metadata is (now) generally a Callable function. This operation can be used in the code when the metadata is no longer needed, but may cause peculiar Python object persistence problems downstream.

get_infores_catalog() Dict[str, str][source]#

Return the InfoRes Context of the source

set_edge_filter(key: str, value: set) None[source]#

Set an edge filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching edges from the underlying store.

Note

When defining the ‘subject_category’ or ‘object_category’ filter, the value should be of type set. This method also sets the ‘category’ node filter, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for edge filter

  • value (Union[str, set]) – The value for the edge filter. Can be either a string or a set.

set_edge_filters(filters: Dict) None[source]#

Set edge filters.

Parameters:

filters (Dict) – Edge filters

set_edge_provenance(edge_data)[source]#

Set a specific edge provenance value.

set_node_filter(key: str, value: Union[str, set]) None[source]#

Set a node filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘category’ filter, the value should be of type set. This method also sets the ‘subject_category’ and ‘object_category’ edge filters, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for node filter

  • value (Union[str, set]) – The value for the node filter. Can be either a string or a set.
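The coupling described in the note can be sketched with a small stand-in class. `FilterState` is hypothetical and is not the KGX implementation; raising `TypeError` for a non-set 'category' value is likewise an assumption made here to reflect the "should be of type set" requirement.

```python
from typing import Dict, Set, Union

FilterValue = Union[str, Set[str]]


class FilterState:
    """Hypothetical holder for node and edge filters (not the KGX class)."""

    def __init__(self) -> None:
        self.node_filters: Dict[str, FilterValue] = {}
        self.edge_filters: Dict[str, FilterValue] = {}

    def set_node_filter(self, key: str, value: FilterValue) -> None:
        if key == "category":
            if not isinstance(value, set):
                # Assumed behavior: reject non-set values for 'category'
                raise TypeError("'category' filter value must be a set")
            # Keep edge endpoints consistent with the selected node categories
            self.edge_filters["subject_category"] = set(value)
            self.edge_filters["object_category"] = set(value)
        self.node_filters[key] = value


state = FilterState()
state.set_node_filter("category", {"biolink:Gene", "biolink:Disease"})
```

After the call, both the node filter and the two edge endpoint filters restrict the subgraph to the same categories, so no edge refers to a node that was filtered out.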

set_node_filters(filters: Dict) None[source]#

Set node filters.

Parameters:

filters (Dict) – Node filters

set_node_provenance(node_data)[source]#

Set a specific node provenance value.

set_prefix_map(m: Dict) None[source]#

Update default prefix map.

Parameters:

m (Dict) – A dictionary with prefix to IRI mappings
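To show what a prefix to IRI mapping is used for, the sketch below expands a CURIE into a full IRI. The helper `expand_curie` and the `EX` prefix with its example.org IRI are hypothetical and only illustrate the shape of the dictionary; they are not part of KGX or of its default prefix map.

```python
from typing import Dict


def expand_curie(curie: str, prefix_map: Dict[str, str]) -> str:
    """Expand 'PREFIX:local_id' using prefix_map; return it unchanged if unknown."""
    prefix, sep, local_id = curie.partition(":")
    if sep and prefix in prefix_map:
        return prefix_map[prefix] + local_id
    return curie


# Hypothetical mapping, for illustration only
prefix_map = {"EX": "https://example.org/ex/"}
expanded = expand_curie("EX:123", prefix_map)
unchanged = expand_curie("HGNC:11603", prefix_map)
```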

set_provenance_map(kwargs)[source]#

Set up a provenance (Knowledge Source to InfoRes) map

validate_edge(edge: Dict) Optional[Dict][source]#

Given an edge as a dictionary, check for required properties. This method will return the edge dictionary with default assumptions applied, if any.

Parameters:

edge (Dict) – An edge represented as a dict

Returns:

An edge represented as a dict, with default assumptions applied.

Return type:

Dict

validate_node(node: Dict) Optional[Dict][source]#

Given a node as a dictionary, check for required properties. This method will return the node dictionary with default assumptions applied, if any.

Parameters:

node (Dict) – A node represented as a dict

Returns:

A node represented as a dict, with default assumptions applied.

Return type:

Dict
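A minimal sketch of the "check required properties, then apply defaults" pattern follows. The function name, the choice of `id` as the only hard requirement, and the default category value are assumptions for illustration; the checks and defaults KGX actually applies may differ.

```python
from typing import Any, Dict, Optional

DEFAULT_CATEGORY = ["biolink:NamedThing"]  # assumed default, for illustration


def validate_node_sketch(node: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a copy of the node with defaults applied, or None if invalid."""
    if "id" not in node:
        # Assumed rule: a node without an identifier cannot be used
        return None
    validated = dict(node)  # shallow copy, so the caller's dict is not modified
    validated.setdefault("category", list(DEFAULT_CATEGORY))
    return validated


with_default = validate_node_sketch({"id": "HGNC:11603"})
rejected = validate_node_sketch({"name": "TBX4"})
```

Returning `None` for an unusable record matches the `Optional[Dict]` return annotation in the signature above.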

kgx.source.graph_source#

GraphSource is responsible for reading from an instance of kgx.graph.base_graph.BaseGraph and must use only the methods exposed by BaseGraph to access the graph.

class kgx.source.graph_source.GraphSource(owner)[source]#

Bases: Source

GraphSource is responsible for reading data as records from an in-memory graph representation.

The underlying store must be an instance of kgx.graph.base_graph.BaseGraph

check_edge_filter(edge: Dict) bool#

Check if an edge passes defined edge filters.

Parameters:

edge (Dict) – An edge

Returns:

Whether the given edge has passed all defined edge filters

Return type:

bool

check_node_filter(node: Dict) bool#

Check if a node passes defined node filters.

Parameters:

node (Dict) – A node

Returns:

Whether the given node has passed all defined node filters

Return type:

bool

clear_graph_metadata()#

Clears a Source graph’s internal graph_metadata. The value of such graph metadata is (now) generally a Callable function. This operation can be used in the code when the metadata is no longer needed, but may cause peculiar Python object persistence problems downstream.

get_infores_catalog() Dict[str, str]#

Return the InfoRes Context of the source

parse(graph: BaseGraph, **kwargs: Any) Generator[source]#

This method reads from a graph and yields records.

Parameters:
  • graph (BaseGraph) – The graph to read from

  • kwargs (Any) – Any additional arguments

Returns:

A generator for node and edge records read from the graph

Return type:

Generator

read_edges() Generator[source]#

Read edges as records from the graph.

Returns:

A generator for edges

Return type:

Generator

read_nodes() Generator[source]#

Read nodes as records from the graph.

Returns:

A generator for nodes

Return type:

Generator

set_edge_filter(key: str, value: set) None#

Set an edge filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching edges from the underlying store.

Note

When defining the ‘subject_category’ or ‘object_category’ filter, the value should be of type set. This method also sets the ‘category’ node filter, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for edge filter

  • value (Union[str, set]) – The value for the edge filter. Can be either a string or a set.

set_edge_filters(filters: Dict) None#

Set edge filters.

Parameters:

filters (Dict) – Edge filters

set_edge_provenance(edge_data)#

Set a specific edge provenance value.

set_node_filter(key: str, value: Union[str, set]) None#

Set a node filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘category’ filter, the value should be of type set. This method also sets the ‘subject_category’ and ‘object_category’ edge filters, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for node filter

  • value (Union[str, set]) – The value for the node filter. Can be either a string or a set.

set_node_filters(filters: Dict) None#

Set node filters.

Parameters:

filters (Dict) – Node filters

set_node_provenance(node_data)#

Set a specific node provenance value.

set_prefix_map(m: Dict) None#

Update default prefix map.

Parameters:

m (Dict) – A dictionary with prefix to IRI mappings

set_provenance_map(kwargs)#

Set up a provenance (Knowledge Source to InfoRes) map

validate_edge(edge: Dict) Optional[Dict]#

Given an edge as a dictionary, check for required properties. This method will return the edge dictionary with default assumptions applied, if any.

Parameters:

edge (Dict) – An edge represented as a dict

Returns:

An edge represented as a dict, with default assumptions applied.

Return type:

Dict

validate_node(node: Dict) Optional[Dict]#

Given a node as a dictionary, check for required properties. This method will return the node dictionary with default assumptions applied, if any.

Parameters:

node (Dict) – A node represented as a dict

Returns:

A node represented as a dict, with default assumptions applied.

Return type:

Dict

kgx.source.tsv_source#

TsvSource is responsible for reading KGX-formatted CSV or TSV files using Pandas: every flat file is treated as a Pandas DataFrame, from which data are read in chunks.

KGX expects two separate files - one for nodes and another for edges.
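The idea of reading a flat file in chunks can be illustrated with the standard library alone. This is not how TsvSource is implemented (it uses Pandas, as stated above); the helper `read_in_chunks`, the chunk size, and the in-memory sample data are assumptions chosen so the sketch runs without extra dependencies.

```python
import csv
import io
from itertools import islice
from typing import Dict, Iterator, List, TextIO


def read_in_chunks(handle: TextIO, chunk_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most chunk_size rows from a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t")
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# A small nodes file held in memory; a real nodes file would be opened from disk
nodes_tsv = io.StringIO(
    "id\tname\tcategory\n"
    "HGNC:11603\tTBX4\tbiolink:Gene\n"
    "MONDO:0005002\tchronic obstructive pulmonary disease\tbiolink:Disease\n"
    "CHEBI:15365\tacetaminophen\tbiolink:SmallMolecule|biolink:ChemicalEntity\n"
)
chunks = list(read_in_chunks(nodes_tsv, chunk_size=2))
```

Only one chunk of rows is held in memory at a time, which is the property that makes chunked reading suitable for large node and edge files.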

class kgx.source.tsv_source.TsvSource(owner)[source]#

Bases: Source

TsvSource is responsible for reading data as records from a TSV/CSV.

check_edge_filter(edge: Dict) bool#

Check if an edge passes defined edge filters.

Parameters:

edge (Dict) – An edge

Returns:

Whether the given edge has passed all defined edge filters

Return type:

bool

check_node_filter(node: Dict) bool#

Check if a node passes defined node filters.

Parameters:

node (Dict) – A node

Returns:

Whether the given node has passed all defined node filters

Return type:

bool

clear_graph_metadata()#

Clears a Source graph’s internal graph_metadata. The value of such graph metadata is (now) generally a Callable function. This operation can be used in the code when the metadata is no longer needed, but may cause peculiar Python object persistence problems downstream.

get_infores_catalog() Dict[str, str]#

Return the InfoRes Context of the source

parse(filename: str, format: str, compression: Optional[str] = None, **kwargs: Any) Generator[source]#

This method reads from a TSV/CSV and yields records.

Parameters:
  • filename (str) – The filename to parse

  • format (str) – The format (tsv, csv)

  • compression (Optional[str]) – The compression type (tar, tar.gz)

  • kwargs (Any) – Any additional arguments

Returns:

A generator for node and edge records

Return type:

Generator

read_edge(edge: Dict) Optional[Tuple][source]#

Load an edge into an instance of BaseGraph.

Parameters:

edge (Dict) – An edge

Returns:

A tuple that contains subject id, object id, edge key, and edge data

Return type:

Optional[Tuple]

read_edges(df: DataFrame) Generator[source]#

Load edges from pandas.DataFrame into an instance of BaseGraph.

Parameters:

df (pandas.DataFrame) – Dataframe containing records that represent edges

Returns:

A generator for edge records

Return type:

Generator

read_node(node: Dict) Optional[Tuple[str, Dict]][source]#

Prepare a node.

Parameters:

node (Dict) – A node

Returns:

A tuple that contains node id and node data

Return type:

Optional[Tuple[str, Dict]]

read_nodes(df: DataFrame) Generator[source]#

Read records from pandas.DataFrame and yield records.

Parameters:

df (pandas.DataFrame) – Dataframe containing records that represent nodes

Returns:

A generator for node records

Return type:

Generator

set_edge_filter(key: str, value: set) None#

Set an edge filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching edges from the underlying store.

Note

When defining the ‘subject_category’ or ‘object_category’ filter, the value should be of type set. This method also sets the ‘category’ node filter, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for edge filter

  • value (Union[str, set]) – The value for the edge filter. Can be either a string or a set.

set_edge_filters(filters: Dict) None#

Set edge filters.

Parameters:

filters (Dict) – Edge filters

set_edge_provenance(edge_data)#

Set a specific edge provenance value.

set_node_filter(key: str, value: Union[str, set]) None#

Set a node filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘category’ filter, the value should be of type set. This method also sets the ‘subject_category’ and ‘object_category’ edge filters, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for node filter

  • value (Union[str, set]) – The value for the node filter. Can be either a string or a set.

set_node_filters(filters: Dict) None#

Set node filters.

Parameters:

filters (Dict) – Node filters

set_node_provenance(node_data)#

Set a specific node provenance value.

set_prefix_map(m: Dict) None[source]#

Add or override default prefix to IRI map.

Parameters:

m (Dict) – Prefix to IRI map

set_provenance_map(kwargs)#

Set up a provenance (Knowledge Source to InfoRes) map

set_reverse_prefix_map(m: Dict) None[source]#

Add or override default IRI to prefix map.

Parameters:

m (Dict) – IRI to prefix map

validate_edge(edge: Dict) Optional[Dict]#

Given an edge as a dictionary, check for required properties. This method will return the edge dictionary with default assumptions applied, if any.

Parameters:

edge (Dict) – An edge represented as a dict

Returns:

An edge represented as a dict, with default assumptions applied.

Return type:

Dict

validate_node(node: Dict) Optional[Dict]#

Given a node as a dictionary, check for required properties. This method will return the node dictionary with default assumptions applied, if any.

Parameters:

node (Dict) – A node represented as a dict

Returns:

A node represented as a dict, with default assumptions applied.

Return type:

Dict

kgx.source.json_source#

JsonSource is responsible for reading data from a KGX-formatted JSON file using the ijson library, which allows for streaming data from the file.

class kgx.source.json_source.JsonSource(owner)[source]#

Bases: TsvSource

JsonSource is responsible for reading data as records from a JSON.

check_edge_filter(edge: Dict) bool#

Check if an edge passes defined edge filters.

Parameters:

edge (Dict) – An edge

Returns:

Whether the given edge has passed all defined edge filters

Return type:

bool

check_node_filter(node: Dict) bool#

Check if a node passes defined node filters.

Parameters:

node (Dict) – A node

Returns:

Whether the given node has passed all defined node filters

Return type:

bool

clear_graph_metadata()#

Clears a Source graph’s internal graph_metadata. The value of such graph metadata is (now) generally a Callable function. This operation can be used in the code when the metadata is no longer needed, but may cause peculiar Python object persistence problems downstream.

get_infores_catalog() Dict[str, str]#

Return the InfoRes Context of the source

parse(filename: str, format: str = 'json', compression: Optional[str] = None, **kwargs: Any) Generator[source]#

This method reads from a JSON and yields records.

Parameters:
  • filename (str) – The filename to parse

  • format (str) – The format (json)

  • compression (Optional[str]) – The compression type (gz)

  • kwargs (Any) – Any additional arguments

Returns:

A generator for node and edge records read from the file

Return type:

Generator

read_edge(edge: Dict) Optional[Tuple]#

Load an edge into an instance of BaseGraph.

Parameters:

edge (Dict) – An edge

Returns:

A tuple that contains subject id, object id, edge key, and edge data

Return type:

Optional[Tuple]

read_edges(filename: str) Generator[source]#

Read edge records from a JSON.

Parameters:

filename (str) – The filename to read from

Returns:

A generator for edge records

Return type:

Generator

read_node(node: Dict) Optional[Tuple[str, Dict]]#

Prepare a node.

Parameters:

node (Dict) – A node

Returns:

A tuple that contains node id and node data

Return type:

Optional[Tuple[str, Dict]]

read_nodes(filename: str) Generator[source]#

Read node records from a JSON.

Parameters:

filename (str) – The filename to read from

Returns:

A generator for node records

Return type:

Generator

set_edge_filter(key: str, value: set) None#

Set an edge filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching edges from the underlying store.

Note

When defining the ‘subject_category’ or ‘object_category’ filter, the value should be of type set. This method also sets the ‘category’ node filter, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for edge filter

  • value (Union[str, set]) – The value for the edge filter. Can be either a string or a set.

set_edge_filters(filters: Dict) None#

Set edge filters.

Parameters:

filters (Dict) – Edge filters

set_edge_provenance(edge_data)#

Set a specific edge provenance value.

set_node_filter(key: str, value: Union[str, set]) None#

Set a node filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘category’ filter, the value should be of type set. This method also sets the ‘subject_category’ and ‘object_category’ edge filters, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for node filter

  • value (Union[str, set]) – The value for the node filter. Can be either a string or a set.

set_node_filters(filters: Dict) None#

Set node filters.

Parameters:

filters (Dict) – Node filters

set_node_provenance(node_data)#

Set a specific node provenance value.

set_prefix_map(m: Dict) None#

Add or override default prefix to IRI map.

Parameters:

m (Dict) – Prefix to IRI map

set_provenance_map(kwargs)#

Set up a provenance (Knowledge Source to InfoRes) map

set_reverse_prefix_map(m: Dict) None#

Add or override default IRI to prefix map.

Parameters:

m (Dict) – IRI to prefix map

validate_edge(edge: Dict) Optional[Dict]#

Given an edge as a dictionary, check for required properties. This method will return the edge dictionary with default assumptions applied, if any.

Parameters:

edge (Dict) – An edge represented as a dict

Returns:

An edge represented as a dict, with default assumptions applied.

Return type:

Dict

validate_node(node: Dict) Optional[Dict]#

Given a node as a dictionary, check for required properties. This method will return the node dictionary with default assumptions applied, if any.

Parameters:

node (Dict) – A node represented as a dict

Returns:

A node represented as a dict, with default assumptions applied.

Return type:

Dict

kgx.source.jsonl_source#

JsonlSource is responsible for reading data from KGX-formatted JSON Lines files using the jsonlines library.

KGX expects two separate JSON Lines files - one for nodes and another for edges.

KGX JSON Lines Format Specification#

The JSON Lines format provides an efficient way to represent KGX data where each line contains a single JSON object representing either a node or an edge. This format is ideal for streaming large graphs and combines the advantages of JSON with line-oriented processing.

File Structure#

  • {filename}_nodes.jsonl: Contains one node per line, each as a complete JSON object

  • {filename}_edges.jsonl: Contains one edge per line, each as a complete JSON object

Node Record Format#

Required Properties#

  • id (string): A CURIE that uniquely identifies the node in the graph

  • category (array of strings): List of Biolink categories for the node, from the NamedThing hierarchy

Common Optional Properties#

  • name (string): Human-readable name of the entity

  • description (string): Human-readable description of the entity

  • provided_by (array of strings): List of sources that provided this node

  • xref (array of strings): List of database cross-references as CURIEs

  • synonym (array of strings): List of alternative names for the entity

Edge Record Format#

Required Properties#

  • subject (string): CURIE of the source node

  • predicate (string): Biolink predicate representing the relationship type

  • object (string): CURIE of the target node

  • knowledge_level (string): Level of knowledge representation (observation, assertion, concept, statement) according to Biolink Model

  • agent_type (string): Autonomous agents for edges (informational, computational, biochemical, biological) according to Biolink Model

Common Optional Properties#

  • id (string): Unique identifier for the edge, often a UUID

  • relation (string): Relation CURIE from a formal relation ontology (e.g., RO)

  • category (array of strings): List of Biolink association categories

  • knowledge_source (array of strings): Sources of knowledge (deprecated: provided_by)

  • primary_knowledge_source (array of strings): Primary knowledge sources

  • aggregator_knowledge_source (array of strings): Knowledge aggregator sources

  • publications (array of strings): List of publication CURIEs supporting the edge

Examples#

Node Example (nodes.jsonl):

Each line in a nodes.jsonl file represents a complete node record. Here are examples of different node types:

{
  "id": "HGNC:11603",
  "name": "TBX4",
  "category": ["biolink:Gene"]
},
{
  "id": "MONDO:0005002",
  "name": "chronic obstructive pulmonary disease",
  "category": ["biolink:Disease"]
},
{
  "id": "CHEBI:15365",
  "name": "acetaminophen",
  "category": ["biolink:SmallMolecule", "biolink:ChemicalEntity"]
}

In the actual jsonlines file, each record would be on a single line without comments and formatting:

{"id":"HGNC:11603","name":"TBX4","category":["biolink:Gene"]}
{"id":"MONDO:0005002","name":"chronic obstructive pulmonary disease","category":["biolink:Disease"]}
{"id":"CHEBI:15365","name":"acetaminophen","category":["biolink:SmallMolecule","biolink:ChemicalEntity"]}

Edge Example (edges.jsonl):

Each line in a jsonlines file represents a complete edge record. Here are examples of different edge types:

{
  "id": "a8575c4e-61a6-428a-bf09-fcb3e8d1644d",
  "subject": "HGNC:11603",
  "object": "MONDO:0005002",
  "predicate": "biolink:related_to",
  "relation": "RO:0003304",
  "knowledge_level": "assertion",
  "agent_type": "computational"
},
{
  "id": "urn:uuid:5b06e86f-d768-4cd9-ac27-abe31e95ab1e",
  "subject": "HGNC:11603",
  "predicate": "biolink:contributes_to",
  "object": "MONDO:0005002",
  "relation": "RO:0003304",
  "category": ["biolink:GeneToDiseaseAssociation"],
  "primary_knowledge_source": ["infores:gwas-catalog"],
  "publications": ["PMID:26634245", "PMID:26634244"],
  "knowledge_level": "observation",
  "agent_type": "biological"
},
{
  "id": "c7d632b4-6708-4296-9cfe-44bc586d32c8",
  "subject": "CHEBI:15365",
  "predicate": "biolink:affects",
  "object": "GO:0006915",
  "relation": "RO:0002434",
  "category": ["biolink:ChemicalToProcessAssociation"],
  "primary_knowledge_source": ["infores:monarchinitiative"],
  "aggregator_knowledge_source": ["infores:biolink-api"],
  "publications": ["PMID:12345678"],
  "knowledge_level": "assertion",
  "agent_type": "computational"
}

In the actual jsonlines file, each record would be on a single line without comments and formatting:

{"id":"a8575c4e-61a6-428a-bf09-fcb3e8d1644d","subject":"HGNC:11603","object":"MONDO:0005002","predicate":"biolink:related_to","relation":"RO:0003304","knowledge_level":"assertion","agent_type":"computational"}
{"id":"urn:uuid:5b06e86f-d768-4cd9-ac27-abe31e95ab1e","subject":"HGNC:11603","predicate":"biolink:contributes_to","object":"MONDO:0005002","relation":"RO:0003304","category":["biolink:GeneToDiseaseAssociation"],"primary_knowledge_source":["infores:gwas-catalog"],"publications":["PMID:26634245","PMID:26634244"],"knowledge_level":"observation","agent_type":"biological"}
{"id":"c7d632b4-6708-4296-9cfe-44bc586d32c8","subject":"CHEBI:15365","predicate":"biolink:affects","object":"GO:0006915","relation":"RO:0002434","category":["biolink:ChemicalToProcessAssociation"],"primary_knowledge_source":["infores:monarchinitiative"],"aggregator_knowledge_source":["infores:biolink-api"],"publications":["PMID:12345678"],"knowledge_level":"assertion","agent_type":"computational"}

Reading JSON Lines with KGX#

When using KGX to read JSON Lines files, the library will:

  1. Parse each line as a complete JSON object

  2. Validate required fields are present

  3. Convert the data into the internal graph representation

  4. Handle arrays properly as native Python lists (unlike TSV where lists are often pipe-delimited strings)
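The first, second and fourth steps above can be sketched with the standard `json` module. This is an illustration only: the documented JsonlSource uses the jsonlines library, and the helper name `read_jsonl_nodes`, the decision to raise `ValueError` on a missing field, and the in-memory sample are assumptions made here.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO

REQUIRED_NODE_FIELDS = ("id", "category")


def read_jsonl_nodes(handle: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one node dictionary per non-empty line of a JSON Lines file."""
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # step 1: one complete JSON object per line
        missing = [field for field in REQUIRED_NODE_FIELDS if field not in record]
        if missing:
            # step 2 (assumed behavior): reject records lacking required fields
            raise ValueError(f"line {line_number}: missing {missing}")
        yield record


nodes_jsonl = io.StringIO(
    '{"id":"HGNC:11603","name":"TBX4","category":["biolink:Gene"]}\n'
    '{"id":"CHEBI:15365","name":"acetaminophen",'
    '"category":["biolink:SmallMolecule","biolink:ChemicalEntity"]}\n'
)
nodes = list(read_jsonl_nodes(nodes_jsonl))
```

Because each line is decoded on its own, the `category` arrays arrive as native Python lists and need no delimiter splitting.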

class kgx.source.jsonl_source.JsonlSource(owner)[source]#

Bases: JsonSource

JsonlSource is responsible for reading data as records from JSON Lines.

check_edge_filter(edge: Dict) bool#

Check if an edge passes defined edge filters.

Parameters:

edge (Dict) – An edge

Returns:

Whether the given edge has passed all defined edge filters

Return type:

bool

check_node_filter(node: Dict) bool#

Check if a node passes defined node filters.

Parameters:

node (Dict) – A node

Returns:

Whether the given node has passed all defined node filters

Return type:

bool

clear_graph_metadata()#

Clears a Source graph’s internal graph_metadata. The value of such graph metadata is (now) generally a Callable function. This operation can be used in the code when the metadata is no longer needed, but may cause peculiar Python object persistence problems downstream.

get_infores_catalog() Dict[str, str]#

Return the InfoRes Context of the source

parse(filename: str, format: str = 'jsonl', compression: Optional[str] = None, **kwargs: Any) Generator[source]#

This method reads from JSON Lines and yields records.

Parameters:
  • filename (str) – The filename to parse

  • format (str) – The format (jsonl)

  • compression (Optional[str]) – The compression type (gz)

  • kwargs (Any) – Any additional arguments

Returns:

A generator for records

Return type:

Generator

read_edge(edge: Dict) Optional[Tuple]#

Load an edge into an instance of BaseGraph.

Parameters:

edge (Dict) – An edge

Returns:

A tuple that contains subject id, object id, edge key, and edge data

Return type:

Optional[Tuple]

read_edges(filename: str) Generator#

Read edge records from a JSON.

Parameters:

filename (str) – The filename to read from

Returns:

A generator for edge records

Return type:

Generator

read_node(node: Dict) Optional[Tuple[str, Dict]]#

Prepare a node.

Parameters:

node (Dict) – A node

Returns:

A tuple that contains node id and node data

Return type:

Optional[Tuple[str, Dict]]

read_nodes(filename: str) Generator#

Read node records from a JSON.

Parameters:

filename (str) – The filename to read from

Returns:

A generator for node records

Return type:

Generator

set_edge_filter(key: str, value: set) None#

Set an edge filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching edges from the underlying store.

Note

When defining the ‘subject_category’ or ‘object_category’ filter, the value should be of type set. This method also sets the ‘category’ node filter, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for edge filter

  • value (Union[str, set]) – The value for the edge filter. Can be either a string or a set.

set_edge_filters(filters: Dict) None#

Set edge filters.

Parameters:

filters (Dict) – Edge filters

set_edge_provenance(edge_data)#

Set a specific edge provenance value.

set_node_filter(key: str, value: Union[str, set]) None#

Set a node filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘category’ filter, the value should be of type set. This method also sets the ‘subject_category’ and ‘object_category’ edge filters, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for node filter

  • value (Union[str, set]) – The value for the node filter. Can be either a string or a set.

set_node_filters(filters: Dict) None#

Set node filters.

Parameters:

filters (Dict) – Node filters

set_node_provenance(node_data)#

Set a specific node provenance value.

set_prefix_map(m: Dict) None#

Add or override default prefix to IRI map.

Parameters:

m (Dict) – Prefix to IRI map

set_provenance_map(kwargs)#

Set up a provenance (Knowledge Source to InfoRes) map

set_reverse_prefix_map(m: Dict) None#

Add or override default IRI to prefix map.

Parameters:

m (Dict) – IRI to prefix map

validate_edge(edge: Dict) Optional[Dict]#

Given an edge as a dictionary, check for required properties. This method will return the edge dictionary with default assumptions applied, if any.

Parameters:

edge (Dict) – An edge represented as a dict

Returns:

An edge represented as a dict, with default assumptions applied.

Return type:

Dict

validate_node(node: Dict) Optional[Dict]#

Given a node as a dictionary, check for required properties. This method will return the node dictionary with default assumptions applied, if any.

Parameters:

node (Dict) – A node represented as a dict

Returns:

A node represented as a dict, with default assumptions applied.

Return type:

Dict

kgx.source.trapi_source#

TrapiSource is responsible for reading data from a Translator Reasoner API formatted JSON.

class kgx.source.trapi_source.TrapiSource(owner)[source]#

Bases: JsonSource

TrapiSource is responsible for reading data as records from a TRAPI (Translator Reasoner API) compliant JSON.

This class handles the TRAPI 1.5.0 specification.

check_edge_filter(edge: Dict) bool#

Check if an edge passes defined edge filters.

Parameters:

edge (Dict) – An edge

Returns:

Whether the given edge has passed all defined edge filters

Return type:

bool

check_node_filter(node: Dict) bool#

Check if a node passes defined node filters.

Parameters:

node (Dict) – A node

Returns:

Whether the given node has passed all defined node filters

Return type:

bool

clear_graph_metadata()#

Clears a Source graph’s internal graph_metadata. The value of such graph metadata is (now) generally a Callable function. This operation can be used in the code when the metadata is no longer needed, but may cause peculiar Python object persistence problems downstream.

get_infores_catalog() Dict[str, str]#

Return the InfoRes Context of the source

load_edge(edge: Dict) Tuple[str, str, str, Dict][source]#

Load a TRAPI edge into KGX format

Parameters:

edge (Dict) – A TRAPI edge

Returns:

A tuple containing (subject_id, object_id, edge_id, edge_data) in KGX format

Return type:

Tuple[str, str, str, Dict]

load_node(node: Dict) Tuple[str, Dict][source]#

Load a TRAPI node into KGX format

Parameters:

node (Dict) – A TRAPI node

Returns:

A tuple containing (node_id, node_data) in KGX format

Return type:

Tuple[str, Dict]
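The kind of conversion performed by load_edge and load_node can be sketched as below. The helper names, the decision to pass the identifier separately, and the field mapping (TRAPI `categories` to KGX `category`, with attributes and sources ignored) are assumptions for illustration and do not reproduce the TrapiSource implementation.

```python
from typing import Any, Dict, Tuple


def trapi_node_to_kgx(node_id: str, node: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Build a KGX-style (node_id, node_data) record from a TRAPI node."""
    node_data = {
        "id": node_id,
        "name": node.get("name"),
        # Assumed mapping: TRAPI 'categories' becomes KGX 'category'
        "category": list(node.get("categories", [])),
    }
    return node_id, node_data


def trapi_edge_to_kgx(
    edge_id: str, edge: Dict[str, Any]
) -> Tuple[str, str, str, Dict[str, Any]]:
    """Build a KGX-style (subject_id, object_id, edge_id, edge_data) record."""
    edge_data = {
        "id": edge_id,
        "subject": edge["subject"],
        "predicate": edge["predicate"],
        "object": edge["object"],
    }
    return edge["subject"], edge["object"], edge_id, edge_data


# A tiny TRAPI-style knowledge graph: nodes and edges are keyed by identifier
knowledge_graph = {
    "nodes": {
        "HGNC:11603": {"name": "TBX4", "categories": ["biolink:Gene"]},
        "MONDO:0005002": {
            "name": "chronic obstructive pulmonary disease",
            "categories": ["biolink:Disease"],
        },
    },
    "edges": {
        "e1": {
            "subject": "HGNC:11603",
            "predicate": "biolink:related_to",
            "object": "MONDO:0005002",
        }
    },
}
node_records = [trapi_node_to_kgx(i, n) for i, n in knowledge_graph["nodes"].items()]
edge_records = [trapi_edge_to_kgx(i, e) for i, e in knowledge_graph["edges"].items()]
```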

parse(filename: str, format: str = 'json', compression: Optional[str] = None, **kwargs: Any) Generator[source]#

This method reads from a TRAPI JSON and yields KGX records.

Parameters:
  • filename (str) – The filename to parse

  • format (str) – The format (json or jsonl)

  • compression (Optional[str]) – The compression type (gz)

  • kwargs (Any) – Any additional arguments

Returns:

A generator for node and edge records

Return type:

Generator

read_edge(edge: Dict) Optional[Tuple]#

Load an edge into an instance of BaseGraph.

Parameters:

edge (Dict) – An edge

Returns:

A tuple that contains subject id, object id, edge key, and edge data

Return type:

Optional[Tuple]

read_edges(filename: str, compression: Optional[str] = None) Generator[source]#

Read edge records from a TRAPI JSON.

Parameters:
  • filename (str) – The filename to read from

  • compression (Optional[str]) – The compression type

Returns:

A generator for edge records

Return type:

Generator

read_edges_jsonl(filename: str, compression: Optional[str] = None) Generator[source]#

Read edge records from a TRAPI JSONL file.

Parameters:
  • filename (str) – The filename to read from

  • compression (Optional[str]) – The compression type

Returns:

A generator for edge records

Return type:

Generator

read_node(node: Dict) Optional[Tuple[str, Dict]]#

Prepare a node.

Parameters:

node (Dict) – A node

Returns:

A tuple that contains node id and node data

Return type:

Optional[Tuple[str, Dict]]

read_nodes(filename: str, compression: Optional[str] = None) Generator[source]#

Read node records from a TRAPI JSON.

Parameters:
  • filename (str) – The filename to read from

  • compression (Optional[str]) – The compression type

Returns:

A generator for node records

Return type:

Generator

read_nodes_jsonl(filename: str, compression: Optional[str] = None) Generator[source]#

Read node records from a TRAPI JSONL file.

Parameters:
  • filename (str) – The filename to read from

  • compression (Optional[str]) – The compression type

Returns:

A generator for node records

Return type:

Generator

set_edge_filter(key: str, value: set) None#

Set an edge filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘subject_category’ or ‘object_category’ filter, the value should be of type set. This method also sets the ‘category’ node filter, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for edge filter

  • value (Union[str, set]) – The value for the edge filter. Can be either a string or a set.

set_edge_filters(filters: Dict) None#

Set edge filters.

Parameters:

filters (Dict) – Edge filters

set_edge_provenance(edge_data)#

Set a specific edge provenance value.

set_node_filter(key: str, value: Union[str, set]) None#

Set a node filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘category’ filter, the value should be of type set. This method also sets the ‘subject_category’ and ‘object_category’ edge filters, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for node filter

  • value (Union[str, set]) – The value for the node filter. Can be either a string or a set.
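
The interaction described in the note (a 'category' node filter also setting the 'subject_category' and 'object_category' edge filters) can be sketched as follows. This is an illustrative stand-in, not the KGX implementation: the class FilterSketch and the helper as_set are hypothetical, and the matching rule (a node passes if it shares at least one value with each filter) is an assumption.

```python
from typing import Any, Dict, Iterable, Set, Union


def as_set(value: Union[str, Iterable[str]]) -> Set[str]:
    """Normalize a string or an iterable of strings to a set."""
    if isinstance(value, str):
        return {value}
    return set(value)


class FilterSketch:
    """Hypothetical stand-in for the filter handling of a Source."""

    def __init__(self) -> None:
        self.node_filters: Dict[str, Set[str]] = {}
        self.edge_filters: Dict[str, Set[str]] = {}

    def set_node_filter(self, key: str, value: Union[str, Set[str]]) -> None:
        self.node_filters[key] = as_set(value)
        if key == "category":
            # Mirror the category onto the edge filters so that edges
            # only connect nodes that pass the node filter.
            self.edge_filters["subject_category"] = as_set(value)
            self.edge_filters["object_category"] = as_set(value)

    def check_node_filter(self, node: Dict[str, Any]) -> bool:
        # Assumed rule: every filter key must be present on the node and
        # share at least one value with the allowed set.
        for key, allowed in self.node_filters.items():
            if key not in node:
                return False
            if not (as_set(node[key]) & allowed):
                return False
        return True


sketch = FilterSketch()
sketch.set_node_filter("category", {"biolink:Gene"})
print(sketch.edge_filters)
print(sketch.check_node_filter({"id": "HGNC:1", "category": ["biolink:Gene"]}))
```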

set_node_filters(filters: Dict) None#

Set node filters.

Parameters:

filters (Dict) – Node filters

set_node_provenance(node_data)#

Set a specific node provenance value.

set_prefix_map(m: Dict) None#

Add or override default prefix to IRI map.

Parameters:

m (Dict) – Prefix to IRI map

set_provenance_map(kwargs)#

Set up a provenance (Knowledge Source to InfoRes) map

set_reverse_prefix_map(m: Dict) None#

Add or override default IRI to prefix map.

Parameters:

m (Dict) – IRI to prefix map
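
To illustrate what the two maps are for, the sketch below expands a CURIE with a prefix to IRI map and contracts an IRI with the reverse map. The functions expand and contract and the example namespaces are assumptions made for illustration; they are not the helpers KGX uses internally.

```python
from typing import Dict, Optional

# Example prefix to IRI map (illustrative values).
prefix_map: Dict[str, str] = {
    "HGNC": "http://identifiers.org/hgnc/",
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
}

# The reverse map is the same mapping with keys and values swapped.
reverse_prefix_map: Dict[str, str] = {iri: prefix for prefix, iri in prefix_map.items()}


def expand(curie: str, prefix_map: Dict[str, str]) -> Optional[str]:
    """Turn a CURIE into an IRI, or return None if the prefix is unknown."""
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefix_map:
        return None
    return prefix_map[prefix] + reference


def contract(iri: str, reverse_prefix_map: Dict[str, str]) -> Optional[str]:
    """Turn an IRI into a CURIE, preferring the longest matching namespace."""
    for base in sorted(reverse_prefix_map, key=len, reverse=True):
        if iri.startswith(base):
            return f"{reverse_prefix_map[base]}:{iri[len(base):]}"
    return None


iri = expand("MONDO:0005002", prefix_map)
print(iri)
print(contract(iri, reverse_prefix_map))
```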

validate_edge(edge: Dict) Optional[Dict]#

Given an edge as a dictionary, check for required properties. This method will return the edge dictionary with default assumptions applied, if any.

Parameters:

edge (Dict) – An edge represented as a dict

Returns:

An edge represented as a dict, with default assumptions applied.

Return type:

Dict

validate_node(node: Dict) Optional[Dict]#

Given a node as a dictionary, check for required properties. This method will return the node dictionary with default assumptions applied, if any.

Parameters:

node (Dict) – A node represented as a dict

Returns:

A node represented as a dict, with default assumptions applied.

Return type:

Dict
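
The signature returns Optional[Dict], which suggests that a node failing validation yields None while a valid node is returned with defaults filled in. A minimal sketch of that contract is shown below. The function name validate_node_sketch is hypothetical, and both the required property ('id') and the default category ('biolink:NamedThing') are assumptions made for illustration rather than a statement of what KGX checks.

```python
from typing import Dict, Optional

# Assumed default, used only for this illustration.
DEFAULT_NODE_CATEGORY = "biolink:NamedThing"


def validate_node_sketch(node: Dict) -> Optional[Dict]:
    """Return a validated copy of the node, or None if it is unusable."""
    if not node.get("id"):
        # Assumption: a node without an identifier cannot be repaired.
        return None
    validated = dict(node)  # do not mutate the caller's dictionary
    if not validated.get("category"):
        # Apply a default assumption for a missing category.
        validated["category"] = [DEFAULT_NODE_CATEGORY]
    return validated


print(validate_node_sketch({"id": "HGNC:11603"}))
print(validate_node_sketch({"name": "no identifier"}))
```

Callers should therefore check for None before using the result.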

kgx.source.obograph_source#

ObographSource is responsible for reading data from OBOGraphs in JSON.

class kgx.source.obograph_source.ObographSource(owner)[source]#

Bases: JsonSource

ObographSource is responsible for reading data as records from an OBO Graph JSON.

check_edge_filter(edge: Dict) bool#

Check if an edge passes defined edge filters.

Parameters:

edge (Dict) – An edge

Returns:

Whether the given edge has passed all defined edge filters

Return type:

bool

check_node_filter(node: Dict) bool#

Check if a node passes defined node filters.

Parameters:

node (Dict) – A node

Returns:

Whether the given node has passed all defined node filters

Return type:

bool

clear_graph_metadata()#

Clears a Source graph’s internal graph_metadata. The value of such graph metadata is (now) generally a Callable function. This operation can be used in the code when the metadata is no longer needed, but may cause peculiar Python object persistence problems downstream.

get_category(curie: str, node: dict) Optional[str][source]#

Get category for a given CURIE.

Parameters:
  • curie (str) – Curie for node

  • node (dict) – Node data

Returns:

Category for the given node CURIE.

Return type:

Optional[str]

get_infores_catalog() Dict[str, str]#

Return the InfoRes Context of the source

parse(filename: str, format: str = 'json', compression: Optional[str] = None, **kwargs: Any) Generator[source]#

This method reads from JSON and yields records.

Parameters:
  • filename (str) – The filename to parse

  • format (str) – The format (json)

  • compression (Optional[str]) – The compression type (gz)

  • kwargs (Any) – Any additional arguments

Returns:

A generator for records

Return type:

Generator

parse_meta(node: str, meta: Dict) Dict[source]#

Parse ‘meta’ field of a node.

Parameters:
  • node (str) – Node identifier

  • meta (Dict) – meta dictionary for the node

Returns:

A dictionary that contains ‘description’, ‘subsets’, ‘synonyms’, ‘xrefs’, a ‘deprecated’ flag and/or ‘equivalent_nodes’.

Return type:

Dict
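
As an illustration of the kind of extraction parse_meta performs, the sketch below pulls a description, subsets, synonyms, xrefs and the deprecated flag out of an OBO Graph 'meta' object. The function parse_meta_sketch is hypothetical; the output key names follow the description above and may not match the keys KGX actually emits, and 'equivalent_nodes' is left out because its source field is not described here.

```python
from typing import Any, Dict


def parse_meta_sketch(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Extract a few well-known fields from an OBO Graph 'meta' object."""
    properties: Dict[str, Any] = {}

    # 'definition' is an object whose text lives under 'val'.
    definition = meta.get("definition")
    if definition and "val" in definition:
        properties["description"] = definition["val"]

    # 'subsets' is a plain list of subset identifiers.
    if meta.get("subsets"):
        properties["subsets"] = list(meta["subsets"])

    # 'synonyms' and 'xrefs' are lists of objects carrying a 'val'.
    synonyms = [s["val"] for s in meta.get("synonyms", []) if "val" in s]
    if synonyms:
        properties["synonyms"] = synonyms
    xrefs = [x["val"] for x in meta.get("xrefs", []) if "val" in x]
    if xrefs:
        properties["xrefs"] = xrefs

    if meta.get("deprecated"):
        properties["deprecated"] = True
    return properties


example_meta = {
    "definition": {"val": "A test term."},
    "subsets": ["http://example.org/subset#slim"],
    "synonyms": [{"pred": "hasExactSynonym", "val": "test synonym"}],
    "xrefs": [{"val": "EX:123"}],
    "deprecated": True,
}
print(parse_meta_sketch(example_meta))
```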

read_edge(edge: Dict) Optional[Tuple][source]#

Read and parse an edge record.

Parameters:

edge (Dict) – The edge record

Returns:

The processed edge

Return type:

Optional[Tuple]

read_edges(filename: str, compression: Optional[str] = None) Generator[source]#

Read edge records from a JSON.

Parameters:
  • filename (str) – The filename to read from

  • compression (Optional[str]) – The compression type

Returns:

A generator for edge records

Return type:

Generator

read_node(node: Dict) Optional[Tuple[str, Dict]][source]#

Read and parse a node record.

Parameters:

node (Dict) – The node record

Returns:

The processed node

Return type:

Optional[Tuple[str, Dict]]

read_nodes(filename: str, compression: Optional[str] = None) Generator[source]#

Read node records from a JSON.

Parameters:
  • filename (str) – The filename to read from

  • compression (Optional[str]) – The compression type

Returns:

A generator for node records

Return type:

Generator

set_edge_filter(key: str, value: set) None#

Set an edge filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘subject_category’ or ‘object_category’ filter, the value should be of type set. This method also sets the ‘category’ node filter, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for edge filter

  • value (Union[str, set]) – The value for the edge filter. Can be either a string or a set.

set_edge_filters(filters: Dict) None#

Set edge filters.

Parameters:

filters (Dict) – Edge filters

set_edge_provenance(edge_data)#

Set a specific edge provenance value.

set_node_filter(key: str, value: Union[str, set]) None#

Set a node filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘category’ filter, the value should be of type set. This method also sets the ‘subject_category’ and ‘object_category’ edge filters, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for node filter

  • value (Union[str, set]) – The value for the node filter. Can be either a string or a set.

set_node_filters(filters: Dict) None#

Set node filters.

Parameters:

filters (Dict) – Node filters

set_node_provenance(node_data)#

Set a specific node provenance value.

set_prefix_map(m: Dict) None#

Add or override default prefix to IRI map.

Parameters:

m (Dict) – Prefix to IRI map

set_provenance_map(kwargs)#

Set up a provenance (Knowledge Source to InfoRes) map

set_reverse_prefix_map(m: Dict) None#

Add or override default IRI to prefix map.

Parameters:

m (Dict) – IRI to prefix map

validate_edge(edge: Dict) Optional[Dict]#

Given an edge as a dictionary, check for required properties. This method will return the edge dictionary with default assumptions applied, if any.

Parameters:

edge (Dict) – An edge represented as a dict

Returns:

An edge represented as a dict, with default assumptions applied.

Return type:

Dict

validate_node(node: Dict) Optional[Dict]#

Given a node as a dictionary, check for required properties. This method will return the node dictionary with default assumptions applied, if any.

Parameters:

node (Dict) – A node represented as a dict

Returns:

A node represented as a dict, with default assumptions applied.

Return type:

Dict

kgx.source.sssom_source#

SssomSource is responsible for reading data from SSSOM-formatted files.

KGX Source for Simple Standard for Sharing Ontology Mappings (“SSSOM”)

class kgx.source.sssom_source.SssomSource(owner)[source]#

Bases: Source

SssomSource is responsible for reading data as records from an SSSOM file.

check_edge_filter(edge: Dict) bool#

Check if an edge passes defined edge filters.

Parameters:

edge (Dict) – An edge

Returns:

Whether the given edge has passed all defined edge filters

Return type:

bool

check_node_filter(node: Dict) bool#

Check if a node passes defined node filters.

Parameters:

node (Dict) – A node

Returns:

Whether the given node has passed all defined node filters

Return type:

bool

clear_graph_metadata()#

Clears a Source graph’s internal graph_metadata. The value of such graph metadata is (now) generally a Callable function. This operation can be used in the code when the metadata is no longer needed, but may cause peculiar Python object persistence problems downstream.

get_infores_catalog() Dict[str, str]#

Return the InfoRes Context of the source

load_edge(edge: Dict) Generator[source]#

Load an edge into an instance of BaseGraph

Parameters:

edge (Dict) – An edge

Returns:

A generator for node and edge records

Return type:

Generator

load_edges(df: DataFrame) Generator[source]#

Load edges from pandas.DataFrame into an instance of BaseGraph

Parameters:

df (pandas.DataFrame) – Dataframe containing records that represent edges

Returns:

A generator for edge records

Return type:

Generator

load_node(node_data: Dict) Optional[Tuple[str, Dict]][source]#

Load a node into an instance of BaseGraph

Parameters:

node_data (Dict) – A node

Returns:

A tuple that contains node id and node data

Return type:

Optional[Tuple[str, Dict]]

parse(filename: str, format: str, compression: Optional[str] = None, **kwargs: Any) Generator[source]#

Parse an SSSOM TSV file.

Parameters:
  • filename (str) – File to read from

  • format (str) – The input file format (tsv)

  • compression (Optional[str]) – The compression (gz)

  • kwargs (Dict) – Any additional arguments

Returns:

A generator for node and edge records

Return type:

Generator

parse_header(filename: str, compression: Optional[str] = None) None[source]#

Parse metadata from SSSOM headers.

Parameters:
  • filename (str) – Filename to parse

  • compression (Optional[str]) – Compression type
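
In an SSSOM TSV file the metadata block is carried in comment lines that start with '#', placed before the column header row. The sketch below separates those lines from the tabular body using only the standard library. The function split_sssom_tsv and the sample content are hypothetical; the header lines are returned as text (they form a YAML document) rather than parsed, and this is not how parse_header is implemented in KGX.

```python
import csv
import io
from typing import Dict, List, Tuple


def split_sssom_tsv(text: str) -> Tuple[List[str], List[Dict[str, str]]]:
    """Split SSSOM TSV text into metadata lines and mapping rows."""
    header_lines: List[str] = []
    body_lines: List[str] = []
    for line in text.splitlines():
        if line.startswith("#"):
            content = line[1:]
            # Drop the single space after '#', keeping YAML indentation.
            if content.startswith(" "):
                content = content[1:]
            header_lines.append(content)
        elif line.strip():
            body_lines.append(line)
    reader = csv.DictReader(io.StringIO("\n".join(body_lines)), delimiter="\t")
    return header_lines, list(reader)


SAMPLE = (
    "# mapping_set_id: https://example.org/mappings/demo\n"
    "# curie_map:\n"
    "#   HP: http://purl.obolibrary.org/obo/HP_\n"
    "subject_id\tpredicate_id\tobject_id\n"
    "HP:0000001\tskos:exactMatch\tMP:0000001\n"
)

header, rows = split_sssom_tsv(SAMPLE)
print(header)
print(rows)
```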

set_edge_filter(key: str, value: set) None#

Set an edge filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘subject_category’ or ‘object_category’ filter, the value should be of type set. This method also sets the ‘category’ node filter, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for edge filter

  • value (Union[str, set]) – The value for the edge filter. Can be either a string or a set.

set_edge_filters(filters: Dict) None#

Set edge filters.

Parameters:

filters (Dict) – Edge filters

set_edge_provenance(edge_data)#

Set a specific edge provenance value.

set_node_filter(key: str, value: Union[str, set]) None#

Set a node filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘category’ filter, the value should be of type set. This method also sets the ‘subject_category’ and ‘object_category’ edge filters, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for node filter

  • value (Union[str, set]) – The value for the node filter. Can be either a string or a set.

set_node_filters(filters: Dict) None#

Set node filters.

Parameters:

filters (Dict) – Node filters

set_node_provenance(node_data)#

Set a specific node provenance value.

set_prefix_map(m: Dict) None[source]#

Add or override default prefix to IRI map.

Parameters:

m (Dict) – Prefix to IRI map

set_provenance_map(kwargs)#

Set up a provenance (Knowledge Source to InfoRes) map

set_reverse_prefix_map(m: Dict) None[source]#

Add or override default IRI to prefix map.

Parameters:

m (Dict) – IRI to prefix map

validate_edge(edge: Dict) Optional[Dict]#

Given an edge as a dictionary, check for required properties. This method will return the edge dictionary with default assumptions applied, if any.

Parameters:

edge (Dict) – An edge represented as a dict

Returns:

An edge represented as a dict, with default assumptions applied.

Return type:

Dict

validate_node(node: Dict) Optional[Dict]#

Given a node as a dictionary, check for required properties. This method will return the node dictionary with default assumptions applied, if any.

Parameters:

node (Dict) – A node represented as a dict

Returns:

A node represented as a dict, with default assumptions applied.

Return type:

Dict

kgx.source.neo_source#

NeoSource is responsible for reading data from a local or remote Neo4j instance.

class kgx.source.neo_source.NeoSource(owner)[source]#

Bases: Source

NeoSource is responsible for reading data as records from a Neo4j instance.

check_edge_filter(edge: Dict) bool#

Check if an edge passes defined edge filters.

Parameters:

edge (Dict) – An edge

Returns:

Whether the given edge has passed all defined edge filters

Return type:

bool

check_node_filter(node: Dict) bool#

Check if a node passes defined node filters.

Parameters:

node (Dict) – A node

Returns:

Whether the given node has passed all defined node filters

Return type:

bool

clear_graph_metadata()#

Clears a Source graph’s internal graph_metadata. The value of such graph metadata is (now) generally a Callable function. This operation can be used in the code when the metadata is no longer needed, but may cause peculiar Python object persistence problems downstream.

count(is_directed: bool = True) int[source]#

Get the total count of records to be fetched from the Neo4j database.

Parameters:

is_directed (bool) – Whether edges are directed. True by default, since edges are directed in most cases.

Returns:

The total count of records

Return type:

int

static format_edge_filter(edge_filters: Dict, key: str, variable: Optional[str] = None, prefix: Optional[str] = None, op: Optional[str] = None) str[source]#

Get the value for edge filter as defined by key. This is used as a convenience method for generating cypher queries.

Parameters:
  • edge_filters (Dict) – All edge filters

  • key (str) – Name of the edge filter

  • variable (Optional[str]) – Variable binding for cypher query

  • prefix (Optional[str]) – Prefix for the cypher

  • op (Optional[str]) – The operator

Returns:

Value corresponding to the given edge filter key, formatted for CQL

Return type:

str

static format_node_filter(node_filters: Dict, key: str, variable: Optional[str] = None, prefix: Optional[str] = None, op: Optional[str] = None) str[source]#

Get the value for node filter as defined by key. This is used as a convenience method for generating cypher queries.

Parameters:
  • node_filters (Dict) – All node filters

  • key (str) – Name of the node filter

  • variable (Optional[str]) – Variable binding for cypher query

  • prefix (Optional[str]) – Prefix for the cypher

  • op (Optional[str]) – The operator

Returns:

Value corresponding to the given node filter key, formatted for CQL

Return type:

str

get_edges(skip: int = 0, limit: int = 0, is_directed: bool = True, **kwargs: Any) List[source]#

Get a page of edges from the Neo4j database.

Parameters:
  • skip (int) – Records to skip

  • limit (int) – Total number of records to query for

  • is_directed (bool) – Whether edges are directed (True by default, since edges are directed in most cases)

  • kwargs (Any) – Any additional arguments

Returns:

A list of 3-tuples

Return type:

List

get_infores_catalog() Dict[str, str]#

Return the InfoRes Context of the source

get_nodes(skip: int = 0, limit: int = 0, **kwargs: Any) List[source]#

Get a page of nodes from the Neo4j database.

Parameters:
  • skip (int) – Records to skip

  • limit (int) – Total number of records to query for

  • kwargs (Any) – Any additional arguments

Returns:

A list of nodes

Return type:

List

get_pages(query_function, start: int = 0, end: Optional[int] = None, page_size: int = 50000, **kwargs: Any) Iterator[source]#

Get pages of size page_size from Neo4j. Returns an iterator of pages, where the number of pages is (end - start)/page_size.

Parameters:
  • query_function (func) – The function to use to fetch records. Usually this is self.get_nodes or self.get_edges

  • start (int) – Start for pagination

  • end (Optional[int]) – End for pagination

  • page_size (int) – Size of each page (50000, by default)

  • kwargs (Dict) – Any additional arguments that might be relevant for query_function

Returns:

An iterator for a list of records from Neo4j. The size of the list is page_size

Return type:

Iterator
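
The paging behaviour can be sketched with a plain generator that repeatedly calls a query function with skip and limit arguments. This is an illustration only: get_pages_sketch and fake_get_nodes are hypothetical, end is treated as an exclusive record offset, and stopping on an empty or short page is an assumption about how an open-ended range terminates.

```python
from typing import Any, Callable, Iterator, List, Optional


def get_pages_sketch(
    query_function: Callable[..., List],
    start: int = 0,
    end: Optional[int] = None,
    page_size: int = 50000,
    **kwargs: Any,
) -> Iterator[List]:
    """Yield pages of records fetched through query_function."""
    skip = start
    while end is None or skip < end:
        # Never request records beyond 'end' when it is given.
        limit = page_size if end is None else min(page_size, end - skip)
        page = query_function(skip=skip, limit=limit, **kwargs)
        if not page:
            break
        yield page
        if len(page) < limit:
            # A short page means the store has no more records.
            break
        skip += limit


RECORDS = list(range(12))


def fake_get_nodes(skip: int = 0, limit: int = 0, **kwargs: Any) -> List[int]:
    """Hypothetical stand-in for a query such as self.get_nodes."""
    return RECORDS[skip:skip + limit]


pages = list(get_pages_sketch(fake_get_nodes, page_size=5))
print([len(page) for page in pages])
```

Yielding one page at a time keeps at most page_size records in memory, which is the reason for paging a large query in the first place.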

load_edge(edge_record: List) Tuple[source]#

Load an edge into an instance of BaseGraph

Parameters:

edge_record (List) – A 4-tuple edge record

Returns:

A tuple with subject ID, object ID, edge key, and edge data

Return type:

Tuple

load_edges(edges: List) None[source]#

Load edges into an instance of BaseGraph

Parameters:

edges (List) – A list of edge records

load_node(node_data: Dict) Optional[Tuple][source]#

Load node into an instance of BaseGraph

Parameters:

node_data (Dict) – A node

Returns:

A tuple with node ID and node data

Return type:

Tuple

load_nodes(nodes: List) Generator[source]#

Load nodes into an instance of BaseGraph

Parameters:

nodes (List) – A list of nodes

parse(uri: str, username: str, password: str, node_filters: Optional[Dict] = None, edge_filters: Optional[Dict] = None, start: int = 0, end: Optional[int] = None, is_directed: bool = True, page_size: int = 50000, **kwargs: Any) Generator[source]#

This method reads from a Neo4j instance and yields records.

Parameters:
  • uri (str) – The URI for the Neo4j instance. For example, http://localhost:7474

  • username (str) – The username

  • password (str) – The password

  • node_filters (Dict) – Node filters

  • edge_filters (Dict) – Edge filters

  • start (int) – Number of records to skip before streaming

  • end (int) – Total number of records to fetch

  • is_directed (bool) – Whether or not the edges should be treated as directed

  • page_size (int) – The size of each page/batch fetched from Neo4j (50000)

  • kwargs (Any) – Any additional arguments

Returns:

A generator for records

Return type:

Generator

set_edge_filter(key: str, value: set) None#

Set an edge filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘subject_category’ or ‘object_category’ filter, the value should be of type set. This method also sets the ‘category’ node filter, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for edge filter

  • value (Union[str, set]) – The value for the edge filter. Can be either a string or a set.

set_edge_filters(filters: Dict) None#

Set edge filters.

Parameters:

filters (Dict) – Edge filters

set_edge_provenance(edge_data)#

Set a specific edge provenance value.

set_node_filter(key: str, value: Union[str, set]) None#

Set a node filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘category’ filter, the value should be of type set. This method also sets the ‘subject_category’ and ‘object_category’ edge filters, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for node filter

  • value (Union[str, set]) – The value for the node filter. Can be either a string or a set.

set_node_filters(filters: Dict) None#

Set node filters.

Parameters:

filters (Dict) – Node filters

set_node_provenance(node_data)#

Set a specific node provenance value.

set_prefix_map(m: Dict) None#

Update default prefix map.

Parameters:

m (Dict) – A dictionary with prefix to IRI mappings

set_provenance_map(kwargs)#

Set up a provenance (Knowledge Source to InfoRes) map

validate_edge(edge: Dict) Optional[Dict]#

Given an edge as a dictionary, check for required properties. This method will return the edge dictionary with default assumptions applied, if any.

Parameters:

edge (Dict) – An edge represented as a dict

Returns:

An edge represented as a dict, with default assumptions applied.

Return type:

Dict

validate_node(node: Dict) Optional[Dict]#

Given a node as a dictionary, check for required properties. This method will return the node dictionary with default assumptions applied, if any.

Parameters:

node (Dict) – A node represented as a dict

Returns:

A node represented as a dict, with default assumptions applied.

Return type:

Dict

kgx.source.rdf_source#

RdfSource is responsible for reading data from RDF N-Triples.

This source makes use of a custom kgx.parsers.ntriples_parser.CustomNTriplesParser for parsing N-Triples, which extends rdflib.plugins.parsers.ntriples.W3CNTriplesParser.

To ensure proper parsing of N-Triples and a relatively low memory footprint, it is recommended that the N-Triples be sorted based on the subject IRIs.

sort -k 1,2 -t ' ' data.nt > data_sorted.nt

class kgx.source.rdf_source.RdfSource(owner)[source]#

Bases: Source

RdfSource is responsible for reading data as records from RDF.

Note

Currently only RDF N-Triples are supported.

add_edge(subject_iri: URIRef, object_iri: URIRef, predicate_iri: URIRef, data: Optional[Dict[Any, Any]] = None) Dict[source]#

Add an edge to cache.

Parameters:
  • subject_iri (rdflib.URIRef) – Subject IRI for the subject in a triple

  • object_iri (rdflib.URIRef) – Object IRI for the object in a triple

  • predicate_iri (rdflib.URIRef) – Predicate IRI for the predicate in a triple

  • data (Optional[Dict[Any, Any]]) – Additional edge properties

Returns:

The edge data

Return type:

Dict

add_node(iri: URIRef, data: Optional[Dict] = None) Dict[source]#

Add a node to cache.

Parameters:
  • iri (rdflib.URIRef) – IRI of a node

  • data (Optional[Dict]) – Additional node properties

Returns:

The node data

Return type:

Dict

add_node_attribute(iri: Union[URIRef, str], key: str, value: Union[str, List]) None[source]#

Add an attribute to a node in cache, while taking into account whether the attribute should be multi-valued.

The key may be an rdflib.URIRef or a URI string that maps onto a property name as defined in rdf_utils.property_mapping.

Parameters:
  • iri (Union[rdflib.URIRef, str]) – The IRI of a node in the rdflib.Graph

  • key (str) – The name of the attribute. Can be an rdflib.URIRef or a URI string

  • value (Union[str, List]) – The value of the attribute

Returns:

The node data

Return type:

Dict

check_edge_filter(edge: Dict) bool#

Check if an edge passes defined edge filters.

Parameters:

edge (Dict) – An edge

Returns:

Whether the given edge has passed all defined edge filters

Return type:

bool

check_node_filter(node: Dict) bool#

Check if a node passes defined node filters.

Parameters:

node (Dict) – A node

Returns:

Whether the given node has passed all defined node filters

Return type:

bool

clear_graph_metadata()#

Clears a Source graph’s internal graph_metadata. The value of such graph metadata is (now) generally a Callable function. This operation can be used in the code when the metadata is no longer needed, but may cause peculiar Python object persistence problems downstream.

dereify(n: str, node: Dict) None[source]#

Dereify a node to create a corresponding edge.

Parameters:
  • n (str) – Node identifier

  • node (Dict) – Node data
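
Dereification turns a node that represents a statement (a reified association) into an ordinary edge. A minimal sketch of the idea follows. The function dereify_sketch is hypothetical, as are the assumptions that the reified node carries 'subject', 'predicate' and 'object' properties and that the edge key is built by joining them; the real method returns None and works on the source's internal caches.

```python
from typing import Dict, Optional, Tuple


def dereify_sketch(
    node_id: str, node: Dict
) -> Optional[Tuple[str, str, str, Dict]]:
    """Build an edge record from a reified statement node."""
    required = ("subject", "predicate", "object")
    if any(key not in node for key in required):
        # Not a reified statement, so there is nothing to convert.
        return None
    edge_data = dict(node)
    # Keep the identifier of the reified node on the edge.
    edge_data.setdefault("id", node_id)
    # Hypothetical key scheme, used only for this illustration.
    edge_key = f"{node['subject']}-{node['predicate']}-{node['object']}"
    return node["subject"], node["object"], edge_key, edge_data


association = {
    "subject": "HGNC:11603",
    "predicate": "biolink:related_to",
    "object": "MONDO:0005002",
    "publications": ["PMID:12345"],
}
print(dereify_sketch("urn:uuid:assoc-1", association))
```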

get_biolink_element(predicate: Any) Optional[Element][source]#

Returns a Biolink Model element for a given predicate.

Parameters:

predicate (Any) – The CURIE of a predicate

Returns:

The corresponding Biolink Model element

Return type:

Optional[Element]

get_infores_catalog() Dict[str, str]#

Return the InfoRes Context of the source

parse(filename: str, format: str = 'nt', compression: Optional[str] = None, **kwargs: Any) Generator[source]#

This method reads from RDF N-Triples and yields records.

Note

To ensure proper parsing of N-Triples and a relatively low memory footprint, it is recommended that the N-Triples be sorted based on the subject IRIs.

`sort -k 1,2 -t ' ' data.nt > data_sorted.nt`

Parameters:
  • filename (str) – The filename to parse

  • format (str) – The format (nt)

  • compression (Optional[str]) – The compression type (gz)

  • kwargs (Any) – Any additional arguments

Returns:

A generator for records

Return type:

Generator
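
One way to see why the sorting recommendation in the note above helps: when all triples for a subject are adjacent, a reader can collect them, emit the finished record, and discard it before moving on. The sketch below demonstrates the effect with itertools.groupby on raw N-Triples lines. The function group_by_subject and the example triples are hypothetical, and this is not a description of how CustomNTriplesParser works internally.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def subject_of(line: str) -> str:
    """Return the first whitespace-delimited term of an N-Triples line."""
    return line.split(" ", 1)[0]


def group_by_subject(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Group consecutive lines that share the same subject."""
    non_empty = (line for line in lines if line.strip())
    for subject, group in groupby(non_empty, key=subject_of):
        yield subject, list(group)


SORTED = [
    '<http://example.org/a> <http://example.org/p> <http://example.org/b> .',
    '<http://example.org/a> <http://www.w3.org/2000/01/rdf-schema#label> "A" .',
    '<http://example.org/b> <http://example.org/p> <http://example.org/c> .',
]
UNSORTED = [SORTED[0], SORTED[2], SORTED[1]]

# Sorted input: each subject is seen exactly once.
print([subject for subject, _ in group_by_subject(SORTED)])
# Unsorted input: subject 'a' is split across two groups, so a reader
# would have to keep earlier records in memory to merge them.
print(len(list(group_by_subject(UNSORTED))))
```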

process_predicate(p: Optional[Union[URIRef, str]]) Tuple[source]#

Process a predicate where the method checks if there is a mapping in Biolink Model.

Parameters:

p (Optional[Union[URIRef, str]]) – The predicate

Returns:

A tuple that contains the Biolink CURIE (if available), the Biolink slot_uri CURIE (if available), the CURIE form of p, the reference of p

Return type:

Tuple

set_edge_filter(key: str, value: set) None#

Set an edge filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘subject_category’ or ‘object_category’ filter, the value should be of type set. This method also sets the ‘category’ node filter, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for edge filter

  • value (Union[str, set]) – The value for the edge filter. Can be either a string or a set.

set_edge_filters(filters: Dict) None#

Set edge filters.

Parameters:

filters (Dict) – Edge filters

set_edge_provenance(edge_data)#

Set a specific edge provenance value.

set_node_filter(key: str, value: Union[str, set]) None#

Set a node filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘category’ filter, the value should be of type set. This method also sets the ‘subject_category’ and ‘object_category’ edge filters, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for node filter

  • value (Union[str, set]) – The value for the node filter. Can be either a string or a set.

set_node_filters(filters: Dict) None#

Set node filters.

Parameters:

filters (Dict) – Node filters

set_node_property_predicates(predicates) None[source]#

Set predicates that are to be treated as node properties.

Parameters:

predicates (Set) – Set of predicates

set_node_provenance(node_data)#

Set a specific node provenance value.

set_predicate_mapping(m: Dict) None[source]#

Set predicate mappings.

Use this method to update mappings for predicates that are not in Biolink Model.

Parameters:

m (Dict) – A dictionary where the keys are IRIs and values are their corresponding property names

set_prefix_map(m: Dict) None#

Update default prefix map.

Parameters:

m (Dict) – A dictionary with prefix to IRI mappings

set_provenance_map(kwargs)#

Set up a provenance (Knowledge Source to InfoRes) map

triple(s: URIRef, p: URIRef, o: URIRef) None[source]#

Parse a triple.

Parameters:
  • s (URIRef) – Subject

  • p (URIRef) – Predicate

  • o (URIRef) – Object

update_edge(subject_curie: str, object_curie: str, edge_key: str, data: Optional[Dict[Any, Any]]) Dict[source]#

Update an edge with properties.

Parameters:
  • subject_curie (str) – Subject CURIE

  • object_curie (str) – Object CURIE

  • edge_key (str) – Edge key

  • data (Optional[Dict[Any, Any]]) – Edge properties

Returns:

The edge data

Return type:

Dict

update_node(n: Union[URIRef, str], data: Optional[Dict] = None) Dict[source]#

Update a node with properties.

Parameters:
  • n (Union[URIRef, str]) – Node identifier

  • data (Optional[Dict]) – Node properties

Returns:

The node data

Return type:

Dict

validate_edge(edge: Dict) Optional[Dict]#

Given an edge as a dictionary, check for required properties. This method will return the edge dictionary with default assumptions applied, if any.

Parameters:

edge (Dict) – An edge represented as a dict

Returns:

An edge represented as a dict, with default assumptions applied.

Return type:

Dict

validate_node(node: Dict) Optional[Dict]#

Given a node as a dictionary, check for required properties. This method will return the node dictionary with default assumptions applied, if any.

Parameters:

node (Dict) – A node represented as a dict

Returns:

A node represented as a dict, with default assumptions applied.

Return type:

Dict

kgx.source.owl_source#

OwlSource is responsible for parsing an OWL ontology.

When parsing an OWL, this source also adds OwlStar annotations to certain OWL axioms.

class kgx.source.owl_source.OwlSource(owner)[source]#

Bases: RdfSource

OwlSource is responsible for parsing an OWL ontology.

Note

This is a simple parser that loads direct class-class relationships. For more formal OWL parsing, refer to Robot: http://robot.obolibrary.org/

add_edge(subject_iri: URIRef, object_iri: URIRef, predicate_iri: URIRef, data: Optional[Dict[Any, Any]] = None) Dict#

Add an edge to cache.

Parameters:
  • subject_iri (rdflib.URIRef) – Subject IRI for the subject in a triple

  • object_iri (rdflib.URIRef) – Object IRI for the object in a triple

  • predicate_iri (rdflib.URIRef) – Predicate IRI for the predicate in a triple

  • data (Optional[Dict[Any, Any]]) – Additional edge properties

Returns:

The edge data

Return type:

Dict

add_node(iri: URIRef, data: Optional[Dict] = None) Dict#

Add a node to cache.

Parameters:
  • iri (rdflib.URIRef) – IRI of a node

  • data (Optional[Dict]) – Additional node properties

Returns:

The node data

Return type:

Dict

add_node_attribute(iri: Union[URIRef, str], key: str, value: Union[str, List]) None#

Add an attribute to a node in cache, while taking into account whether the attribute should be multi-valued.

The key may be an rdflib.URIRef or a URI string that maps onto a property name as defined in rdf_utils.property_mapping.

Parameters:
  • iri (Union[rdflib.URIRef, str]) – The IRI of a node in the rdflib.Graph

  • key (str) – The name of the attribute. Can be an rdflib.URIRef or a URI string

  • value (Union[str, List]) – The value of the attribute

Returns:

The node data

Return type:

Dict

check_edge_filter(edge: Dict) bool#

Check if an edge passes defined edge filters.

Parameters:

edge (Dict) – An edge

Returns:

Whether the given edge has passed all defined edge filters

Return type:

bool

check_node_filter(node: Dict) bool#

Check if a node passes defined node filters.

Parameters:

node (Dict) – A node

Returns:

Whether the given node has passed all defined node filters

Return type:

bool

clear_graph_metadata()#

Clears a Source graph’s internal graph_metadata. The value of such graph metadata is (now) generally a Callable function. This operation can be used in the code when the metadata is no longer needed, but may cause peculiar Python object persistence problems downstream.

dereify(n: str, node: Dict) None#

Dereify a node to create a corresponding edge.

Parameters:
  • n (str) – Node identifier

  • node (Dict) – Node data
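
Dereification can be sketched as turning a node that describes a statement into an edge record of the 4-tuple form (subject_id, object_id, edge_key, edge_data). The example below is a standalone illustration: dereify_sketch, the required property names and the edge key format are all assumptions, not the KGX implementation.

```python
from typing import Any, Dict, Optional, Tuple

EdgeRecord = Tuple[str, str, str, Dict[str, Any]]


def dereify_sketch(n: str, node: Dict[str, Any]) -> Optional[EdgeRecord]:
    """Turn a reified statement node into an edge record, if it is complete."""
    required = ("subject", "predicate", "object")
    if not all(key in node for key in required):
        return None  # not a complete reified statement
    edge_data = dict(node)  # copy so the input node is left untouched
    edge_data.setdefault("id", n)  # keep the statement identifier (assumption)
    # Hypothetical edge key built from the triple; KGX may derive it differently.
    edge_key = f"{node['subject']}-{node['predicate']}-{node['object']}"
    return node["subject"], node["object"], edge_key, edge_data


statement = {
    "subject": "EX:1",
    "predicate": "biolink:related_to",
    "object": "EX:2",
}
print(dereify_sketch("EX:statement1", statement))
```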

get_biolink_element(predicate: Any) Optional[Element]#

Returns a Biolink Model element for a given predicate.

Parameters:

predicate (Any) – The CURIE of a predicate

Returns:

The corresponding Biolink Model element

Return type:

Optional[Element]

get_infores_catalog() Dict[str, str]#

Return the InfoRes Context of the source.

load_graph(rdfgraph: Graph, **kwargs: Any) None[source]#

Walk through the rdflib.Graph and load all triples into kgx.graph.base_graph.BaseGraph

Parameters:
  • rdfgraph (rdflib.Graph) – Graph containing nodes and edges

  • kwargs (Any) – Any additional arguments

parse(filename: str, format: str = 'owl', compression: Optional[str] = None, **kwargs: Any) Generator[source]#

This method reads from an OWL file and yields records.

Parameters:
  • filename (str) – The filename to parse

  • format (str) – The format (owl)

  • compression (Optional[str]) – The compression type (gz)

  • kwargs (Any) – Any additional arguments

Returns:

A generator for node and edge records read from the file

Return type:

Generator
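
Because parse yields node and edge records from a single generator, a consumer typically has to tell the two apart. The sketch below relies only on the record shapes documented for a Source (2-tuples for nodes, 4-tuples for edges). The example_records generator is a hypothetical stand-in for the output of parse, so the snippet runs without KGX or an OWL file.

```python
from typing import Any, Dict, Generator, Iterable, List, Tuple


def example_records() -> Generator[Tuple, None, None]:
    """Hypothetical stand-in for the records yielded by parse()."""
    yield ("EX:1", {"id": "EX:1", "name": "node one"})
    yield ("EX:2", {"id": "EX:2", "name": "node two"})
    yield (
        "EX:1",
        "EX:2",
        "EX:1-related_to-EX:2",
        {"subject": "EX:1", "predicate": "biolink:related_to", "object": "EX:2"},
    )


def split_records(records: Iterable[Tuple]) -> Tuple[List[Tuple], List[Tuple]]:
    """Separate node records (2-tuples) from edge records (4-tuples)."""
    nodes: List[Tuple] = []
    edges: List[Tuple] = []
    for record in records:
        if len(record) == 2:
            nodes.append(record)
        elif len(record) == 4:
            edges.append(record)
        else:
            raise ValueError(f"Unexpected record shape: {record!r}")
    return nodes, edges


nodes, edges = split_records(example_records())
print(len(nodes), "nodes;", len(edges), "edges")
```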

process_predicate(p: Optional[Union[URIRef, str]]) Tuple#

Process a predicate, checking whether it has a mapping in the Biolink Model.

Parameters:

p (Optional[Union[URIRef, str]]) – The predicate

Returns:

A tuple that contains the Biolink CURIE (if available), the Biolink slot_uri CURIE (if available), the CURIE form of p, and the reference of p.

Return type:

Tuple

set_edge_filter(key: str, value: set) None#

Set an edge filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching edges from the underlying store.

Note

When defining the ‘subject_category’ or ‘object_category’ filter, the value should be of type set. This method also sets the ‘category’ node filter, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for edge filter

  • value (Union[str, set]) – The value for the edge filter. Can be either a string or a set.

set_edge_filters(filters: Dict) None#

Set edge filters.

Parameters:

filters (Dict) – Edge filters

set_edge_provenance(edge_data)#

Set a specific edge provenance value.

set_node_filter(key: str, value: Union[str, set]) None#

Set a node filter, as defined by a key and value pair. These filters are used to filter (or reduce) the search space when fetching nodes from the underlying store.

Note

When defining the ‘category’ filter, the value should be of type set. This method also sets the ‘subject_category’ and ‘object_category’ edge filters, to get a consistent set of nodes in the subgraph.

Parameters:
  • key (str) – The key for node filter

  • value (Union[str, set]) – The value for the node filter. Can be either a string or a set.
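
The coupling described in the note, where setting the ‘category’ node filter also sets the ‘subject_category’ and ‘object_category’ edge filters, can be sketched with a small standalone class. FilterStore is a hypothetical name, and raising TypeError for a non-set category value is an assumption made for this illustration.

```python
from typing import Dict, Set, Union


class FilterStore:
    """Hypothetical holder for node and edge filters (not a KGX class)."""

    def __init__(self) -> None:
        self.node_filters: Dict[str, Union[str, Set[str]]] = {}
        self.edge_filters: Dict[str, Set[str]] = {}

    def set_node_filter(self, key: str, value: Union[str, Set[str]]) -> None:
        if key == "category":
            if not isinstance(value, set):
                raise TypeError("'category' filter value must be a set")
            self.node_filters.setdefault("category", set()).update(value)
            # Mirror the categories onto both ends of every edge so that the
            # filtered nodes and edges describe a consistent subgraph.
            for edge_key in ("subject_category", "object_category"):
                self.edge_filters.setdefault(edge_key, set()).update(value)
        else:
            self.node_filters[key] = value


store = FilterStore()
store.set_node_filter("category", {"biolink:Gene"})
print(store.node_filters)
print(store.edge_filters)
```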

set_node_filters(filters: Dict) None#

Set node filters.

Parameters:

filters (Dict) – Node filters

set_node_property_predicates(predicates) None#

Set predicates that are to be treated as node properties.

Parameters:

predicates (Set) – Set of predicates

set_node_provenance(node_data)#

Set a specific node provenance value.

set_predicate_mapping(m: Dict) None#

Set predicate mappings.

Use this method to update mappings for predicates that are not in Biolink Model.

Parameters:

m (Dict) – A dictionary where the keys are IRIs and values are their corresponding property names

set_prefix_map(m: Dict) None#

Update default prefix map.

Parameters:

m (Dict) – A dictionary with prefix to IRI mappings
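
A prefix map is what allows an IRI to be contracted to a CURIE. The sketch below shows one way to update a default map and then contract an IRI using the longest matching namespace. DEFAULT_PREFIX_MAP, updated_prefix_map and contract are hypothetical names, and the default entry is an illustrative assumption rather than the map KGX ships with.

```python
from typing import Dict, Optional

# Hypothetical default prefix map (an assumption for this sketch).
DEFAULT_PREFIX_MAP: Dict[str, str] = {"OBO": "http://purl.obolibrary.org/obo/"}


def updated_prefix_map(custom: Dict[str, str]) -> Dict[str, str]:
    """Return the defaults overlaid with user-supplied prefix mappings."""
    merged = dict(DEFAULT_PREFIX_MAP)
    merged.update(custom)
    return merged


def contract(iri: str, prefix_map: Dict[str, str]) -> Optional[str]:
    """Contract an IRI to a CURIE using the longest matching namespace."""
    best_prefix: Optional[str] = None
    best_base = ""
    for prefix, base in prefix_map.items():
        if iri.startswith(base) and len(base) > len(best_base):
            best_prefix, best_base = prefix, base
    if best_prefix is None:
        return None  # no namespace in the map matches this IRI
    return f"{best_prefix}:{iri[len(best_base):]}"


prefix_map = updated_prefix_map({"GO": "http://purl.obolibrary.org/obo/GO_"})
print(contract("http://purl.obolibrary.org/obo/GO_0008150", prefix_map))
```

Preferring the longest namespace means a more specific prefix supplied by the user wins over a broader default that also matches.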

set_provenance_map(kwargs)#

Set up a provenance (Knowledge Source to InfoRes) map.

triple(s: URIRef, p: URIRef, o: URIRef) None#

Parse a triple.

Parameters:
  • s (URIRef) – Subject

  • p (URIRef) – Predicate

  • o (URIRef) – Object

update_edge(subject_curie: str, object_curie: str, edge_key: str, data: Optional[Dict[Any, Any]]) Dict#

Update an edge with properties.

Parameters:
  • subject_curie (str) – Subject CURIE

  • object_curie (str) – Object CURIE

  • edge_key (str) – Edge key

  • data (Optional[Dict[Any, Any]]) – Edge properties

Returns:

The edge data

Return type:

Dict

update_node(n: Union[URIRef, str], data: Optional[Dict] = None) Dict#

Update a node with properties.

Parameters:
  • n (Union[URIRef, str]) – Node identifier

  • data (Optional[Dict]) – Node properties

Returns:

The node data

Return type:

Dict

validate_edge(edge: Dict) Optional[Dict]#

Given an edge as a dictionary, check for required properties. This method will return the edge dictionary with default assumptions applied, if any.

Parameters:

edge (Dict) – An edge represented as a dict

Returns:

An edge represented as a dict, with default assumptions applied.

Return type:

Dict

validate_node(node: Dict) Optional[Dict]#

Given a node as a dictionary, check for required properties. This method will return the node dictionary with default assumptions applied, if any.

Parameters:

node (Dict) – A node represented as a dict

Returns:

A node represented as a dict, with default assumptions applied.

Return type:

Dict
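
To make "check for required properties" and "default assumptions applied" concrete, here is a minimal standalone sketch for the node case. Which properties are required and which defaults are applied are assumptions here: the sketch treats ‘id’ as required and fills in a missing ‘category’ with a placeholder default. validate_node_sketch and DEFAULT_CATEGORY are hypothetical names.

```python
from typing import Any, Dict, Optional

# Assumed default category for nodes that do not declare one.
DEFAULT_CATEGORY = "biolink:NamedThing"


def validate_node_sketch(node: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a copy of the node with defaults applied, or None if invalid."""
    if "id" not in node:
        return None  # assumption: a node without an identifier is rejected
    checked = dict(node)  # copy so the caller's dict is not modified
    checked.setdefault("category", [DEFAULT_CATEGORY])
    return checked


print(validate_node_sketch({"id": "EX:1", "name": "node one"}))
print(validate_node_sketch({"name": "node without id"}))
```

Returning None for an invalid record is consistent with the Optional[Dict] annotation in the signature, and lets a caller skip that record rather than abort the whole parse.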