Clique Merge
The Clique Merge operation performs a series of operations on your target (input) graph:
Build cliques from nodes in the target graph
Elect a leader for each individual clique
Move all edges in a clique to the leader node
The main entry point is the kgx.graph_operations.clique_merge.clique_merge method, which takes an instance of kgx.graph.base_graph.BaseGraph.
Build cliques from nodes in the target graph
Given a target graph, create a clique graph where nodes in the same clique are connected via biolink:same_as edges.
In the target graph, you can define nodes that belong to the same clique as follows:
Having biolink:same_as edges between nodes (preferred and consistent with Biolink Model)
Having a same_as node property on a node that lists all equivalent nodes (deprecated)
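The two ways of declaring equivalence can be illustrated with a minimal, library-free sketch. It is not the KGX implementation (which returns a networkx.MultiDiGraph); it assumes a clique is simply the set of nodes reachable from one another through same_as links. The function name build_clique_sketch, the plain dict/tuple data layout, and the sample identifiers are all hypothetical.

```python
from collections import defaultdict

SAME_AS = "biolink:same_as"


def build_clique_sketch(nodes, edges):
    """Group node IDs into cliques (connected components over same_as links).

    nodes: dict of node ID -> property dict (may hold a 'same_as' list)
    edges: iterable of (subject, predicate, object) triples
    """
    adjacency = defaultdict(set)
    # Preferred style: explicit biolink:same_as edges between nodes.
    for subject, predicate, obj in edges:
        if predicate == SAME_AS:
            adjacency[subject].add(obj)
            adjacency[obj].add(subject)
    # Deprecated style: a 'same_as' node property listing equivalent IDs.
    for node_id, props in nodes.items():
        for equivalent in props.get("same_as", []):
            adjacency[node_id].add(equivalent)
            adjacency[equivalent].add(node_id)

    cliques, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first walk collects everything reachable from `start`.
        component, stack = set(), [start]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        cliques.append(sorted(component))
    return cliques


# Hypothetical sample data: one same_as edge, one same_as node property,
# and one unrelated edge that must not contribute to any clique.
nodes = {
    "HGNC:11603": {},
    "NCBIGene:6932": {},
    "ENSEMBL:ENSG00000081059": {"same_as": ["NCBIGene:6932"]},
    "MONDO:0005002": {},
}
edges = [
    ("HGNC:11603", "biolink:same_as", "NCBIGene:6932"),
    ("HGNC:11603", "biolink:related_to", "MONDO:0005002"),
]
cliques = build_clique_sketch(nodes, edges)
```

In this sketch the three gene identifiers end up in a single clique, while the node that has no same_as link is left out of the clique list entirely.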
Elect a leader for each individual clique
Once the clique graph is built, go through each clique and elect a representative node or leader node for that clique.
Elect a leader for each clique based on three election criteria, listed in the order in which they are checked:
Leader annotation: Elect the leader node for a clique based on the clique_leader annotation on the node
Prefix prioritization: Elect the leader node for a clique that has a prefix which is of the highest priority in the identifier prefixes list, as defined in the Biolink Model
Prefix prioritization fallback: Elect the leader node for a clique that has a prefix which is the first in an alphabetically sorted list of all ID prefixes within the clique
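The order of the three criteria can be sketched in plain Python. This is only an illustration under stated assumptions, not the KGX code: elect_leader_sketch, the strategy labels it returns, the dict-based node properties, and the sample prefix list are hypothetical, and the clique is assumed to be non-empty with CURIE-style identifiers of the form prefix:local_id.

```python
def elect_leader_sketch(clique, node_props, prefix_priority,
                        leader_annotation="clique_leader"):
    """Return (leader_id, strategy) for one clique, trying three criteria in order.

    clique: non-empty list of node IDs
    node_props: dict of node ID -> property dict
    prefix_priority: list of ID prefixes, highest priority first
    """
    # 1. Leader annotation: a node explicitly flagged as leader wins.
    for node_id in clique:
        if node_props.get(node_id, {}).get(leader_annotation):
            return node_id, "LEADER_ANNOTATION"  # hypothetical label

    # 2. Prefix prioritization: the first prefix in the priority list
    #    that matches any node in the clique decides the leader.
    for prefix in prefix_priority:
        for node_id in clique:
            if node_id.split(":", 1)[0] == prefix:
                return node_id, "PREFIX_PRIORITIZATION"  # hypothetical label

    # 3. Fallback: the node whose prefix sorts first alphabetically
    #    (the full ID breaks ties between nodes sharing a prefix).
    leader = min(clique, key=lambda node_id: (node_id.split(":", 1)[0], node_id))
    return leader, "ALPHABETICAL_SORT"  # hypothetical label


clique = ["NCBIGene:6932", "HGNC:11603", "ENSEMBL:ENSG00000081059"]

# An explicit annotation overrides any prefix ordering.
by_annotation = elect_leader_sketch(
    clique, {"NCBIGene:6932": {"clique_leader": True}}, ["HGNC"]
)
# Without an annotation, the (made-up) priority list is consulted.
by_prefix = elect_leader_sketch(clique, {}, ["HGNC", "NCBIGene"])
# With no usable priority list, the alphabetical fallback applies.
by_sort = elect_leader_sketch(clique, {}, [])
```

Each later criterion is consulted only when the earlier ones fail to produce a leader, which is why the annotated node wins in the first call even though its prefix is not in the priority list.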
Move all edges in a clique to the leader node
The last step is edge consolidation where all the edges from nodes in a clique are moved to the leader node.
The original subject and object nodes of an edge are tracked via the _original_subject and _original_object edge properties.
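A minimal sketch of the consolidation step, assuming edges are plain dicts and that a lookup from each non-leader clique member to its leader is already available. The names consolidate_edges_sketch and leader_of are hypothetical, and the sketch does not address how same_as edges inside a clique are treated.

```python
def consolidate_edges_sketch(edges, leader_of):
    """Rewire edges so clique members are replaced by their clique leader.

    edges: list of dicts with 'subject', 'predicate' and 'object' keys
    leader_of: dict mapping each non-leader clique member to its leader
    """
    consolidated = []
    for edge in edges:
        moved = dict(edge)  # copy so the input list is left untouched
        subject, obj = edge["subject"], edge["object"]
        if subject in leader_of:
            # Remember where the edge came from before moving it.
            moved["_original_subject"] = subject
            moved["subject"] = leader_of[subject]
        if obj in leader_of:
            moved["_original_object"] = obj
            moved["object"] = leader_of[obj]
        consolidated.append(moved)
    return consolidated


# Hypothetical data: NCBIGene:6932 belongs to a clique led by HGNC:11603.
leader_of = {"NCBIGene:6932": "HGNC:11603"}
edges = [
    {
        "subject": "NCBIGene:6932",
        "predicate": "biolink:related_to",
        "object": "MONDO:0005002",
    }
]
moved_edges = consolidate_edges_sketch(edges, leader_of)
```

Only the endpoint that was actually replaced gets a tracking property here, so an edge whose object was already outside any clique carries _original_subject but no _original_object.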
kgx.graph_operations.clique_merge
- kgx.graph_operations.clique_merge.build_cliques(target_graph: BaseGraph) -> MultiDiGraph
Builds a clique graph from same_as edges in target_graph.
- Parameters:
target_graph (kgx.graph.base_graph.BaseGraph) – An instance of BaseGraph that contains nodes and edges
- Returns:
The clique graph with only same_as edges
- Return type:
networkx.MultiDiGraph
- kgx.graph_operations.clique_merge.check_all_categories(categories) -> Tuple[List, List, List]
Check all categories in categories.
- Parameters:
categories (List) – A list of categories
- Returns:
Tuple[List, List, List] – A tuple consisting of valid biolink categories, invalid biolink categories, and invalid categories
Note: the sort_categories method will re-arrange the passed-in category list according to the distance of each list member from the top of its hierarchy. Each category’s hierarchy is made up of its ‘is_a’ and mixin ancestors.
- kgx.graph_operations.clique_merge.check_categories(categories: List, closure: List, category_mapping: Optional[Dict[str, str]] = None) -> Tuple[List, List, List]
Check whether the values in categories are valid biolink categories. Valid biolink categories are classes that descend from ‘NamedThing’. Mixins, while valid ancestors, are not valid categories.
- Parameters:
- Returns:
A tuple consisting of valid biolink categories, invalid biolink categories, and invalid categories
- Return type:
Tuple[List, List, List]
- kgx.graph_operations.clique_merge.clique_merge(target_graph: BaseGraph, leader_annotation: Optional[str] = None, prefix_prioritization_map: Optional[Dict[str, List[str]]] = None, category_mapping: Optional[Dict[str, str]] = None, strict: bool = True) -> Tuple[BaseGraph, MultiDiGraph]
- Parameters:
target_graph (kgx.graph.base_graph.BaseGraph) – The original graph
leader_annotation (str) – The field on a node that signifies that the node is the leader of a clique
prefix_prioritization_map (Optional[Dict[str, List[str]]]) – A map that gives a prefix priority for one or more categories
category_mapping (Optional[Dict[str, str]]) – Mapping for non-Biolink Model categories to Biolink Model categories
strict (bool) – Whether or not to merge nodes in a clique that have conflicting node categories
- Returns:
A tuple containing the updated target graph, and the clique graph
- Return type:
Tuple[kgx.graph.base_graph.BaseGraph, networkx.MultiDiGraph]
- kgx.graph_operations.clique_merge.consolidate_edges(target_graph: BaseGraph, clique_graph: MultiDiGraph, leader_annotation: str) -> BaseGraph
Move all edges from nodes in a clique to the clique leader.
Original subject and object of an edge are preserved via ORIGINAL_SUBJECT_PROPERTY and ORIGINAL_OBJECT_PROPERTY.
- Parameters:
target_graph (kgx.graph.base_graph.BaseGraph) – The original graph
clique_graph (networkx.MultiDiGraph) – The clique graph
leader_annotation (str) – The field on a node that signifies that the node is the leader of a clique
- Returns:
The target graph where all edges from nodes in a clique are moved to clique leader
- Return type:
kgx.graph.base_graph.BaseGraph
- kgx.graph_operations.clique_merge.elect_leader(target_graph: BaseGraph, clique_graph: MultiDiGraph, leader_annotation: str, prefix_prioritization_map: Optional[Dict[str, List[str]]], category_mapping: Optional[Dict[str, str]], strict: bool = True) -> BaseGraph
Elect leader for each clique in a graph.
- Parameters:
target_graph (kgx.graph.base_graph.BaseGraph) – The original graph
clique_graph (networkx.MultiDiGraph) – The clique graph
leader_annotation (str) – The field on a node that signifies that the node is the leader of a clique
prefix_prioritization_map (Optional[Dict[str, List[str]]]) – A map that gives a prefix priority for one or more categories
category_mapping (Optional[Dict[str, str]]) – Mapping for non-Biolink Model categories to Biolink Model categories
strict (bool) – Whether or not to merge nodes in a clique that have conflicting node categories
- Returns:
The updated target graph
- Return type:
kgx.graph.base_graph.BaseGraph
- kgx.graph_operations.clique_merge.get_category_from_equivalence(target_graph: BaseGraph, clique_graph: MultiDiGraph, node: str, attributes: Dict) -> List
Get category for a node based on its equivalent nodes in a graph.
- Parameters:
target_graph (kgx.graph.base_graph.BaseGraph) – The original graph
clique_graph (networkx.MultiDiGraph) – The clique graph
node (str) – Node identifier
attributes (Dict) – Node’s attributes
- Returns:
Category for the node
- Return type:
List
- kgx.graph_operations.clique_merge.get_clique_category(clique_graph: MultiDiGraph, clique: List) -> Tuple[str, List]
Given a clique, identify the category of the clique.
- kgx.graph_operations.clique_merge.get_leader_by_annotation(target_graph: BaseGraph, clique_graph: MultiDiGraph, clique: List, leader_annotation: str) -> Tuple[Optional[str], Optional[str]]
Get leader by searching for leader annotation property in any of the nodes in a given clique.
- Parameters:
target_graph (kgx.graph.base_graph.BaseGraph) – The original graph
clique_graph (networkx.MultiDiGraph) – The clique graph
clique (List) – A list of nodes from a clique
leader_annotation (str) – The field on a node that signifies that the node is the leader of a clique
- Returns:
A tuple containing the node that has been elected as the leader and the election strategy
- Return type:
Tuple[Optional[str], Optional[str]]
- kgx.graph_operations.clique_merge.get_leader_by_prefix_priority(target_graph: BaseGraph, clique_graph: MultiDiGraph, clique: List, prefix_priority_list: List) -> Tuple[Optional[str], Optional[str]]
Get leader from clique based on a given prefix priority.
- Parameters:
target_graph (kgx.graph.base_graph.BaseGraph) – The original graph
clique_graph (networkx.MultiDiGraph) – The clique graph
clique (List) – A list of nodes that correspond to a clique
prefix_priority_list (List) – A list of prefixes in descending priority
- Returns:
A tuple containing the node that has been elected as the leader and the election strategy
- Return type:
Tuple[Optional[str], Optional[str]]
- kgx.graph_operations.clique_merge.get_leader_by_sort(target_graph: BaseGraph, clique_graph: MultiDiGraph, clique: List) -> Tuple[Optional[str], Optional[str]]
Get leader from clique based on the first selection from an alphabetical sort of the node id prefixes.
- Parameters:
target_graph (kgx.graph.base_graph.BaseGraph) – The original graph
clique_graph (networkx.MultiDiGraph) – The clique graph
clique (List) – A list of nodes that correspond to a clique
- Returns:
A tuple containing the node that has been elected as the leader and the election strategy
- Return type:
Tuple[Optional[str], Optional[str]]
- kgx.graph_operations.clique_merge.sort_categories(categories: Union[List, Set, OrderedSet]) -> List
Sort a list of categories from most specific to the most generic.
- Parameters:
categories (Union[List, Set, OrderedSet]) – A list of categories
- Returns:
A sorted list of categories, ordered so that the first element in the returned list has the most parents in the class hierarchy.
- Return type:
List
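The ordering described for sort_categories can be illustrated with a small sketch over a hand-written, single-parent toy hierarchy. The TOY_PARENTS table is a simplified assumption made for this example and is not loaded from the Biolink Model; the real method also takes mixin ancestors into account, which this sketch ignores. The names depth and sort_categories_sketch are hypothetical.

```python
# Toy hierarchy: category -> parent (None marks the top of the hierarchy).
# Hand-written for illustration only; not read from the Biolink Model.
TOY_PARENTS = {
    "biolink:NamedThing": None,
    "biolink:BiologicalEntity": "biolink:NamedThing",
    "biolink:Gene": "biolink:BiologicalEntity",
}


def depth(category):
    """Count how many ancestors a category has in the toy hierarchy."""
    count = 0
    parent = TOY_PARENTS.get(category)
    while parent is not None:
        count += 1
        parent = TOY_PARENTS.get(parent)
    return count


def sort_categories_sketch(categories):
    """Order categories from most specific (most ancestors) to most generic."""
    return sorted(categories, key=depth, reverse=True)


ordered = sort_categories_sketch(
    ["biolink:NamedThing", "biolink:Gene", "biolink:BiologicalEntity"]
)
```

Sorting by ancestor count puts the most specific category first, matching the "most specific to the most generic" order the method describes.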
- kgx.graph_operations.clique_merge.update_node_categories(target_graph: BaseGraph, clique_graph: MultiDiGraph, clique: List, category_mapping: Optional[Dict[str, str]], strict: bool = True) -> List
For a given clique, get the category for each node in the clique and validate it against the Biolink Model, mapping to a Biolink Model category where needed.
For example, if a node has biolink:Gene as its category, then this method adds all of its ancestors.
- Parameters:
target_graph (kgx.graph.base_graph.BaseGraph) – The original graph
clique_graph (networkx.MultiDiGraph) – The clique graph
clique (List) – A list of nodes from a clique
category_mapping (Optional[Dict[str, str]]) – Mapping for non-Biolink Model categories to Biolink Model categories
strict (bool) – Whether or not to merge nodes in a clique that have conflicting node categories
- Returns:
The clique
- Return type:
List