Summarize Graph#
The Summarize Graph operation takes an instance of kgx.graph.base_graph.BaseGraph and generates summary statistics for the entire graph.
This operation generates the summary as YAML (or JSON) in a format that is compatible with the Knowledge Graph Hub dashboard.
The main entry point is the kgx.graph_operations.summarize_graph.generate_graph_stats method.
The tool also detects and logs anomalies in the graph (reported to stderr by default, but may be redirected to a file using the error_log parameter).
Note: To generate a summary statistics YAML that is consistent with Translator API (TRAPI) Release 1.1 standards, refer to Meta Knowledge Graph.
Streaming Data Processing Mode#
For very large graphs, the Graph Summary operation can process graph data using data streaming (command flag --stream=True), which significantly reduces the memory footprint required to process such graphs.
kgx.graph_operations.summarize_graph#
Classical KGX graph summary module.
- class kgx.graph_operations.summarize_graph.GraphSummary(name='', node_facet_properties: Optional[List] = None, edge_facet_properties: Optional[List] = None, progress_monitor: Optional[Callable[[GraphEntityType, List], None]] = None, error_log: Optional[str] = None, **kwargs)[source]#
Bases:
ErrorDetecting
Class for generating a “classical” knowledge graph summary.
The optional ‘progress_monitor’ for the validator should be a lightweight Callable which is injected into the class ‘inspector’ Callable, designed to intercept node and edge records streaming through the Validator (inside a Transformer.process() call). The first (GraphEntityType) argument of the Callable tags the record as a NODE or an EDGE. The second argument given to the Callable is the current record itself. This Callable is strictly meant to be procedural and should not mutate the record. The intent of this Callable is to provide a hook for KGX applications that want to passively monitor the graph data stream. As such, the Callable could simply tally up the number of times it is called with a NODE or an EDGE, then provide a suitable (quick!) report of that count back to the KGX application. The Callable (function/callable class) should not modify the record and should be of low complexity, so as not to introduce a large computational overhead to validation!
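The tallying monitor described above could look roughly like the following sketch. The GraphEntityType enum here is a local stand-in for the KGX type of the same name, so that the example runs on its own; the RecordTally class, its report() method, and the layout of the sample records are illustrative assumptions, not part of KGX.

```python
from collections import Counter
from enum import Enum
from typing import List


class GraphEntityType(Enum):
    # Stand-in for the KGX enum of the same name (assumption: NODE and EDGE members).
    NODE = "node"
    EDGE = "edge"


class RecordTally:
    """Hypothetical progress monitor: counts NODE and EDGE records it is shown."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def __call__(self, entity_type: GraphEntityType, record: List) -> None:
        # Only tally the call; the record itself is never modified.
        self.counts[entity_type] += 1

    def report(self) -> str:
        nodes = self.counts[GraphEntityType.NODE]
        edges = self.counts[GraphEntityType.EDGE]
        return f"nodes={nodes}, edges={edges}"


# Simulate a short record stream (record layouts are made up for illustration).
monitor = RecordTally()
monitor(GraphEntityType.NODE, ["HGNC:11603", {"name": "TBX4"}])
monitor(GraphEntityType.NODE, ["MONDO:0005002", {"name": "COPD"}])
monitor(GraphEntityType.EDGE, ["HGNC:11603", "MONDO:0005002", "e1", {}])
print(monitor.report())
```

A callable class is used rather than a bare function so that the counts live on the instance and can be read back after processing; with the real KGX types, such an object would presumably be passed as the progress_monitor argument.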
- class Category(category_curie: str, summary)[source]#
Bases:
object
Internal class for compiling statistics about a distinct category.
- analyse_node_category(summary, n, data)[source]#
Analyse metadata of a given graph node record of this category.
- Parameters:
summary (GraphSummary) – GraphSummary within which the Category is being analysed.
n (str) – Curie identifier of the node record (not used here).
data (Dict) – Complete data dictionary of node record fields.
- get_cid() int [source]#
- Returns:
Internal GraphSummary index id for tracking a Category.
- Return type:
int
- get_count_by_id_prefixes()[source]#
- Returns:
Count of nodes by id_prefixes for nodes which have this category.
- Return type:
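As a rough illustration of what counting nodes by id prefix involves, the sketch below tallies CURIE identifiers by the text before the first colon. The function name count_by_id_prefix and the "unknown" bucket for identifiers without a colon are assumptions made for this example; the actual KGX bookkeeping may differ.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_id_prefix(node_ids: Iterable[str]) -> Dict[str, int]:
    """Tally node identifiers by the prefix of each CURIE (hypothetical helper)."""
    counts: Counter = Counter()
    for node_id in node_ids:
        # "HGNC:11603" -> "HGNC"; identifiers without a colon go under "unknown".
        prefix = node_id.split(":", 1)[0] if ":" in node_id else "unknown"
        counts[prefix] += 1
    return dict(counts)


prefix_counts = count_by_id_prefix(["HGNC:11603", "HGNC:1100", "MONDO:0005002"])
print(prefix_counts)
```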
- add_node_stat(tag: str, value: Any)[source]#
Compile/add a node statistic for a given tag = value annotation of the node.
- Parameters:
tag (str) – Tag label for the annotation.
value (Any) – Value of the specific tag annotation.
- Returns:
- analyse_edge(u: str, v: str, k: str, data: Dict)[source]#
Analyse metadata of one graph edge record.
- analyse_node(n, data)[source]#
Analyse metadata of one graph node record.
- Parameters:
n (str) – Curie identifier of the node record (not used here).
data (Dict) – Complete data dictionary of node record fields.
- clear_errors()#
Clears the current error log list
- get_category(category_curie: str) Category [source]#
Counts the number of distinct (Biolink) categories encountered in the knowledge graph (not including those of ‘unknown’ category)
- get_errors(level: Optional[str] = None) Dict #
Get the index list of distinct error messages.
- Parameters:
level (str) – Optional filter (case insensitive) name of error message level (generally either “Error” or “Warning”)
- Returns:
A raw dictionary of entities indexed by [message_level][error_type][message], or only by [error_type][message] for a single message level if the optional level filter is given.
- Return type:
Dict
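The nested indexing and the case-insensitive level filter can be pictured with the following sketch. It works on a plain dictionary rather than a GraphSummary instance; the helper name errors_for_level, the error type names, the messages, and the choice of a list of entity identifiers as the leaf value are all assumptions for illustration.

```python
from typing import Dict, Optional

# Hypothetical sample shaped as [message_level][error_type][message] -> entities.
error_index: Dict[str, Dict] = {
    "ERROR": {
        "MISSING_NODE_PROPERTY": {"Required property 'id' is missing": ["node_1"]},
    },
    "WARNING": {
        "MISSING_CATEGORY": {"Node has no category": ["node_2", "node_3"]},
    },
}


def errors_for_level(errors: Dict[str, Dict], level: Optional[str] = None) -> Dict:
    """Return the whole index, or only the [error_type][message] part for one level."""
    if not level:
        return errors
    wanted = level.upper()
    for level_name, by_type in errors.items():
        # Compare level names case-insensitively, as the level filter is described.
        if level_name.upper() == wanted:
            return by_type
    return {}


warnings_only = errors_for_level(error_index, "warning")
print(warnings_only)
```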
- get_facet_counts(data: Dict, stats: Dict, x: str, y: str, facet_property: str) Dict [source]#
Facet on facet_property and record the count for stats[x][y][facet_property].
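One way to picture faceting is the sketch below, which increments a per-value counter stored under stats[x][y][facet_property]. This is not the KGX implementation: the function name facet_counts, the "unknown" bucket for records lacking the property, and the "node_stats" key in the usage are assumptions.

```python
from typing import Any, Dict


def facet_counts(data: Dict[str, Any], stats: Dict, x: str, y: str,
                 facet_property: str) -> Dict:
    """Count each value of facet_property under stats[x][y][facet_property]."""
    bucket = stats.setdefault(x, {}).setdefault(y, {}).setdefault(facet_property, {})
    values = data.get(facet_property, "unknown")
    # Treat a scalar as a single-element collection so both shapes are counted alike.
    if not isinstance(values, (list, set, tuple)):
        values = [values]
    for value in values:
        bucket[value] = bucket.get(value, 0) + 1
    return stats


stats: Dict = {}
facet_counts({"provided_by": ["infores:a", "infores:b"]}, stats,
             "node_stats", "biolink:Gene", "provided_by")
facet_counts({"provided_by": "infores:a"}, stats,
             "node_stats", "biolink:Gene", "provided_by")
facet_counts({}, stats, "node_stats", "biolink:Gene", "provided_by")
print(stats)
```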
- get_graph_summary(name: Optional[str] = None, **kwargs) Dict [source]#
Similar to summarize_graph except that the node and edge statistics are already captured in the GraphSummary class instance (perhaps by Transformer.process() stream inspection) and therefore, the data structure simply needs to be ‘finalized’ for saving or similar use.
- Parameters:
name (Optional[str]) – Name for the graph (if being renamed)
kwargs (Dict) – Any additional arguments (ignored in this method at present)
- Returns:
A knowledge map dictionary corresponding to the graph
- Return type:
Dict
- get_node_stats() Dict[str, Any] [source]#
- Returns:
Statistics for the nodes in the graph.
- Return type:
Dict[str, Any]
- log_error(entity: str, error_type: ErrorType, message: str, message_level: MessageLevel = MessageLevel.ERROR)#
Log an error to the list of such errors.
- Parameters:
entity – source of parse error
error_type – ValidationError ErrorType
message – message string describing the error
message_level – ValidationError MessageLevel
- save(file, name: Optional[str] = None, file_format: str = 'yaml')[source]#
Save the current GraphSummary to a specified (open) file (device).
- summarize_graph(graph: BaseGraph) Dict [source]#
Summarize the entire graph.
- Parameters:
graph (kgx.graph.base_graph.BaseGraph) – The graph
- Returns:
The stats dictionary
- Return type:
Dict
- summarize_graph_edges(graph: BaseGraph) Dict [source]#
Summarize the edges in a graph.
- Parameters:
graph (kgx.graph.base_graph.BaseGraph) – The graph
- Returns:
The edge stats
- Return type:
Dict
- summarize_graph_nodes(graph: BaseGraph) Dict [source]#
Summarize the nodes in a graph.
- Parameters:
graph (kgx.graph.base_graph.BaseGraph) – The graph
- Returns:
The node stats
- Return type:
Dict
- kgx.graph_operations.summarize_graph.generate_graph_stats(graph: BaseGraph, graph_name: str, filename: str, node_facet_properties: Optional[List] = None, edge_facet_properties: Optional[List] = None) None [source]#
Generate stats from Graph.
- Parameters:
graph (kgx.graph.base_graph.BaseGraph) – The graph
graph_name (str) – Name for the graph
filename (str) – Filename to write the stats to
node_facet_properties (Optional[List]) – A list of properties to facet on. For example, ['provided_by']
edge_facet_properties (Optional[List]) – A list of properties to facet on. For example, ['knowledge_source']
Deprecated since version 1.5.8: Default is to use streaming graph_summary with inspector.
- kgx.graph_operations.summarize_graph.gs_default(o)[source]#
JSONEncoder ‘default’ function override to properly serialize ‘Set’ objects (into ‘List’).
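The general technique of a ‘default’ override that turns sets into lists can be sketched with the standard json module as follows. This is an independent illustration, not the body of gs_default: the function name set_to_list_default is hypothetical, and sorting the set (which assumes its members are mutually comparable) is an added choice to keep the output deterministic.

```python
import json
from typing import Any


def set_to_list_default(o: Any) -> Any:
    """Fallback used by json.dumps for objects it cannot serialize natively."""
    if isinstance(o, (set, frozenset)):
        # Sets have no JSON equivalent; emit a sorted list instead.
        return sorted(o)
    # Mirror the standard behaviour for anything else that is unsupported.
    raise TypeError(f"Object of type {type(o).__name__} is not JSON serializable")


summary = {"categories": {"biolink:Gene", "biolink:Disease"}}
encoded = json.dumps(summary, default=set_to_list_default)
print(encoded)
```

The function is passed through the default keyword of json.dumps, which calls it only for values the encoder does not already know how to serialize.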
- kgx.graph_operations.summarize_graph.summarize_graph(graph: BaseGraph, name: Optional[str] = None, node_facet_properties: Optional[List] = None, edge_facet_properties: Optional[List] = None) Dict [source]#
Summarize the entire graph.
- Parameters:
graph (kgx.graph.base_graph.BaseGraph) – The graph
name (str) – Name for the graph
node_facet_properties (Optional[List]) – A list of properties to facet on. For example, ['provided_by']
edge_facet_properties (Optional[List]) – A list of properties to facet on. For example, ['knowledge_source']
- Returns:
The stats dictionary
- Return type:
Dict