summarization.py 23 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407408409410411412413414415416417418419420421422423424425426427428429430431432433434435436437438439440441442443444445446447448449450451452453454455456457458459460461462463464465466467468469470471472473474475476477478479480481482483484485486487488489490491492493494495496497498499500501502503504505506507508509510511512513514515516517518519520521522523524525526527528529530531532533534535536537538539540541542543544545546547548549550551552553554555556557558559560561562563564
  1. """
  2. Graph summarization finds smaller representations of graphs resulting in faster
  3. runtime of algorithms, reduced storage needs, and noise reduction.
  4. Summarization has applications in areas such as visualization, pattern mining,
  5. clustering and community detection, and more. Core graph summarization
  6. techniques are grouping/aggregation, bit-compression,
  7. simplification/sparsification, and influence based. Graph summarization
  8. algorithms often produce either summary graphs in the form of supergraphs or
  9. sparsified graphs, or a list of independent structures. Supergraphs are the
  10. most common product, which consist of supernodes and original nodes and are
  11. connected by edges and superedges, which represent aggregate edges between
  12. nodes and supernodes.
  13. Grouping/aggregation based techniques compress graphs by representing
  14. close/connected nodes and edges in a graph by a single node/edge in a
  15. supergraph. Nodes can be grouped together into supernodes based on their
  16. structural similarities or proximity within a graph to reduce the total number
  17. of nodes in a graph. Edge-grouping techniques group edges into lossy/lossless
  18. nodes called compressor or virtual nodes to reduce the total number of edges in
  19. a graph. Edge-grouping techniques can be lossless, meaning that they can be
  20. used to re-create the original graph, or techniques can be lossy, requiring
  21. less space to store the summary graph, but at the expense of lower
  22. reconstruction accuracy of the original graph.
  23. Bit-compression techniques minimize the amount of information needed to
  24. describe the original graph, while revealing structural patterns in the
  25. original graph. The two-part minimum description length (MDL) is often used to
  26. represent the model and the original graph in terms of the model. A key
  27. difference between graph compression and graph summarization is that graph
  28. summarization focuses on finding structural patterns within the original graph,
  29. whereas graph compression focuses on compressions the original graph to be as
  30. small as possible. **NOTE**: Some bit-compression methods exist solely to
  31. compress a graph without creating a summary graph or finding comprehensible
  32. structural patterns.
  33. Simplification/Sparsification techniques attempt to create a sparse
  34. representation of a graph by removing unimportant nodes and edges from the
  35. graph. Sparsified graphs differ from supergraphs created by
  36. grouping/aggregation by only containing a subset of the original nodes and
  37. edges of the original graph.
  38. Influence based techniques aim to find a high-level description of influence
  39. propagation in a large graph. These methods are scarce and have been mostly
  40. applied to social graphs.
  41. *dedensification* is a grouping/aggregation based technique to compress the
  42. neighborhoods around high-degree nodes in unweighted graphs by adding
  43. compressor nodes that summarize multiple edges of the same type to
  44. high-degree nodes (nodes with a degree greater than a given threshold).
  45. Dedensification was developed for the purpose of increasing performance of
  46. query processing around high-degree nodes in graph databases and enables direct
  47. operations on the compressed graph. The structural patterns surrounding
  48. high-degree nodes in the original is preserved while using fewer edges and
  49. adding a small number of compressor nodes. The degree of nodes present in the
  50. original graph is also preserved. The current implementation of dedensification
  51. supports graphs with one edge type.
  52. For more information on graph summarization, see `Graph Summarization Methods
  53. and Applications: A Survey <https://dl.acm.org/doi/abs/10.1145/3186727>`_
  54. """
  55. from collections import Counter, defaultdict
  56. import networkx as nx
  57. __all__ = ["dedensify", "snap_aggregation"]
  58. @nx._dispatchable(mutates_input={"not copy": 3}, returns_graph=True)
  59. def dedensify(G, threshold, prefix=None, copy=True):
  60. """Compresses neighborhoods around high-degree nodes
  61. Reduces the number of edges to high-degree nodes by adding compressor nodes
  62. that summarize multiple edges of the same type to high-degree nodes (nodes
  63. with a degree greater than a given threshold). Dedensification also has
  64. the added benefit of reducing the number of edges around high-degree nodes.
  65. The implementation currently supports graphs with a single edge type.
  66. Parameters
  67. ----------
  68. G: graph
  69. A networkx graph
  70. threshold: int
  71. Minimum degree threshold of a node to be considered a high degree node.
  72. The threshold must be greater than or equal to 2.
  73. prefix: str or None, optional (default: None)
  74. An optional prefix for denoting compressor nodes
  75. copy: bool, optional (default: True)
  76. Indicates if dedensification should be done inplace
  77. Returns
  78. -------
  79. dedensified networkx graph : (graph, set)
  80. 2-tuple of the dedensified graph and set of compressor nodes
  81. Notes
  82. -----
  83. According to the algorithm in [1]_, removes edges in a graph by
  84. compressing/decompressing the neighborhoods around high degree nodes by
  85. adding compressor nodes that summarize multiple edges of the same type
  86. to high-degree nodes. Dedensification will only add a compressor node when
  87. doing so will reduce the total number of edges in the given graph. This
  88. implementation currently supports graphs with a single edge type.
  89. Examples
  90. --------
  91. Dedensification will only add compressor nodes when doing so would result
  92. in fewer edges::
  93. >>> original_graph = nx.DiGraph()
  94. >>> original_graph.add_nodes_from(
  95. ... ["1", "2", "3", "4", "5", "6", "A", "B", "C"]
  96. ... )
  97. >>> original_graph.add_edges_from(
  98. ... [
  99. ... ("1", "C"), ("1", "B"),
  100. ... ("2", "C"), ("2", "B"), ("2", "A"),
  101. ... ("3", "B"), ("3", "A"), ("3", "6"),
  102. ... ("4", "C"), ("4", "B"), ("4", "A"),
  103. ... ("5", "B"), ("5", "A"),
  104. ... ("6", "5"),
  105. ... ("A", "6")
  106. ... ]
  107. ... )
  108. >>> c_graph, c_nodes = nx.dedensify(original_graph, threshold=2)
  109. >>> original_graph.number_of_edges()
  110. 15
  111. >>> c_graph.number_of_edges()
  112. 14
  113. A dedensified, directed graph can be "densified" to reconstruct the
  114. original graph::
  115. >>> original_graph = nx.DiGraph()
  116. >>> original_graph.add_nodes_from(
  117. ... ["1", "2", "3", "4", "5", "6", "A", "B", "C"]
  118. ... )
  119. >>> original_graph.add_edges_from(
  120. ... [
  121. ... ("1", "C"), ("1", "B"),
  122. ... ("2", "C"), ("2", "B"), ("2", "A"),
  123. ... ("3", "B"), ("3", "A"), ("3", "6"),
  124. ... ("4", "C"), ("4", "B"), ("4", "A"),
  125. ... ("5", "B"), ("5", "A"),
  126. ... ("6", "5"),
  127. ... ("A", "6")
  128. ... ]
  129. ... )
  130. >>> c_graph, c_nodes = nx.dedensify(original_graph, threshold=2)
  131. >>> # re-densifies the compressed graph into the original graph
  132. >>> for c_node in c_nodes:
  133. ... all_neighbors = set(nx.all_neighbors(c_graph, c_node))
  134. ... out_neighbors = set(c_graph.neighbors(c_node))
  135. ... for out_neighbor in out_neighbors:
  136. ... c_graph.remove_edge(c_node, out_neighbor)
  137. ... in_neighbors = all_neighbors - out_neighbors
  138. ... for in_neighbor in in_neighbors:
  139. ... c_graph.remove_edge(in_neighbor, c_node)
  140. ... for out_neighbor in out_neighbors:
  141. ... c_graph.add_edge(in_neighbor, out_neighbor)
  142. ... c_graph.remove_node(c_node)
  143. ...
  144. >>> nx.is_isomorphic(original_graph, c_graph)
  145. True
  146. References
  147. ----------
  148. .. [1] Maccioni, A., & Abadi, D. J. (2016, August).
  149. Scalable pattern matching over compressed graphs via dedensification.
  150. In Proceedings of the 22nd ACM SIGKDD International Conference on
  151. Knowledge Discovery and Data Mining (pp. 1755-1764).
  152. http://www.cs.umd.edu/~abadi/papers/graph-dedense.pdf
  153. """
  154. if threshold < 2:
  155. raise nx.NetworkXError("The degree threshold must be >= 2")
  156. degrees = G.in_degree if G.is_directed() else G.degree
  157. # Group nodes based on degree threshold
  158. high_degree_nodes = {n for n, d in degrees if d > threshold}
  159. low_degree_nodes = G.nodes() - high_degree_nodes
  160. auxiliary = {}
  161. for node in G:
  162. high_degree_nbrs = frozenset(high_degree_nodes & set(G[node]))
  163. if high_degree_nbrs:
  164. if high_degree_nbrs in auxiliary:
  165. auxiliary[high_degree_nbrs].add(node)
  166. else:
  167. auxiliary[high_degree_nbrs] = {node}
  168. if copy:
  169. G = G.copy()
  170. compressor_nodes = set()
  171. for index, (high_degree_nodes, low_degree_nodes) in enumerate(auxiliary.items()):
  172. low_degree_node_count = len(low_degree_nodes)
  173. high_degree_node_count = len(high_degree_nodes)
  174. old_edges = high_degree_node_count * low_degree_node_count
  175. new_edges = high_degree_node_count + low_degree_node_count
  176. if old_edges <= new_edges:
  177. continue
  178. compression_node = "".join(str(node) for node in high_degree_nodes)
  179. if prefix:
  180. compression_node = str(prefix) + compression_node
  181. for node in low_degree_nodes:
  182. for high_node in high_degree_nodes:
  183. if G.has_edge(node, high_node):
  184. G.remove_edge(node, high_node)
  185. G.add_edge(node, compression_node)
  186. for node in high_degree_nodes:
  187. G.add_edge(compression_node, node)
  188. compressor_nodes.add(compression_node)
  189. return G, compressor_nodes
  190. def _snap_build_graph(
  191. G,
  192. groups,
  193. node_attributes,
  194. edge_attributes,
  195. neighbor_info,
  196. edge_types,
  197. prefix,
  198. supernode_attribute,
  199. superedge_attribute,
  200. ):
  201. """
  202. Build the summary graph from the data structures produced in the SNAP aggregation algorithm
  203. Used in the SNAP aggregation algorithm to build the output summary graph and supernode
  204. lookup dictionary. This process uses the original graph and the data structures to
  205. create the supernodes with the correct node attributes, and the superedges with the correct
  206. edge attributes
  207. Parameters
  208. ----------
  209. G: networkx.Graph
  210. the original graph to be summarized
  211. groups: dict
  212. A dictionary of unique group IDs and their corresponding node groups
  213. node_attributes: iterable
  214. An iterable of the node attributes considered in the summarization process
  215. edge_attributes: iterable
  216. An iterable of the edge attributes considered in the summarization process
  217. neighbor_info: dict
  218. A data structure indicating the number of edges a node has with the
  219. groups in the current summarization of each edge type
  220. edge_types: dict
  221. dictionary of edges in the graph and their corresponding attributes recognized
  222. in the summarization
  223. prefix: string
  224. The prefix to be added to all supernodes
  225. supernode_attribute: str
  226. The node attribute for recording the supernode groupings of nodes
  227. superedge_attribute: str
  228. The edge attribute for recording the edge types represented by superedges
  229. Returns
  230. -------
  231. summary graph: Networkx graph
  232. """
  233. output = G.__class__()
  234. node_label_lookup = {}
  235. for index, group_id in enumerate(groups):
  236. group_set = groups[group_id]
  237. supernode = f"{prefix}{index}"
  238. node_label_lookup[group_id] = supernode
  239. supernode_attributes = {
  240. attr: G.nodes[next(iter(group_set))][attr] for attr in node_attributes
  241. }
  242. supernode_attributes[supernode_attribute] = group_set
  243. output.add_node(supernode, **supernode_attributes)
  244. for group_id in groups:
  245. group_set = groups[group_id]
  246. source_supernode = node_label_lookup[group_id]
  247. for other_group, group_edge_types in neighbor_info[
  248. next(iter(group_set))
  249. ].items():
  250. if group_edge_types:
  251. target_supernode = node_label_lookup[other_group]
  252. summary_graph_edge = (source_supernode, target_supernode)
  253. edge_types = [
  254. dict(zip(edge_attributes, edge_type))
  255. for edge_type in group_edge_types
  256. ]
  257. has_edge = output.has_edge(*summary_graph_edge)
  258. if output.is_multigraph():
  259. if not has_edge:
  260. for edge_type in edge_types:
  261. output.add_edge(*summary_graph_edge, **edge_type)
  262. elif not output.is_directed():
  263. existing_edge_data = output.get_edge_data(*summary_graph_edge)
  264. for edge_type in edge_types:
  265. if edge_type not in existing_edge_data.values():
  266. output.add_edge(*summary_graph_edge, **edge_type)
  267. else:
  268. superedge_attributes = {superedge_attribute: edge_types}
  269. output.add_edge(*summary_graph_edge, **superedge_attributes)
  270. return output
  271. def _snap_eligible_group(G, groups, group_lookup, edge_types):
  272. """
  273. Determines if a group is eligible to be split.
  274. A group is eligible to be split if all nodes in the group have edges of the same type(s)
  275. with the same other groups.
  276. Parameters
  277. ----------
  278. G: graph
  279. graph to be summarized
  280. groups: dict
  281. A dictionary of unique group IDs and their corresponding node groups
  282. group_lookup: dict
  283. dictionary of nodes and their current corresponding group ID
  284. edge_types: dict
  285. dictionary of edges in the graph and their corresponding attributes recognized
  286. in the summarization
  287. Returns
  288. -------
  289. tuple: group ID to split, and neighbor-groups participation_counts data structure
  290. """
  291. nbr_info = {node: {gid: Counter() for gid in groups} for node in group_lookup}
  292. for group_id in groups:
  293. current_group = groups[group_id]
  294. # build nbr_info for nodes in group
  295. for node in current_group:
  296. nbr_info[node] = {group_id: Counter() for group_id in groups}
  297. edges = G.edges(node, keys=True) if G.is_multigraph() else G.edges(node)
  298. for edge in edges:
  299. neighbor = edge[1]
  300. edge_type = edge_types[edge]
  301. neighbor_group_id = group_lookup[neighbor]
  302. nbr_info[node][neighbor_group_id][edge_type] += 1
  303. # check if group_id is eligible to be split
  304. group_size = len(current_group)
  305. for other_group_id in groups:
  306. edge_counts = Counter()
  307. for node in current_group:
  308. edge_counts.update(nbr_info[node][other_group_id].keys())
  309. if not all(count == group_size for count in edge_counts.values()):
  310. # only the nbr_info of the returned group_id is required for handling group splits
  311. return group_id, nbr_info
  312. # if no eligible groups, complete nbr_info is calculated
  313. return None, nbr_info
  314. def _snap_split(groups, neighbor_info, group_lookup, group_id):
  315. """
  316. Splits a group based on edge types and updates the groups accordingly
  317. Splits the group with the given group_id based on the edge types
  318. of the nodes so that each new grouping will all have the same
  319. edges with other nodes.
  320. Parameters
  321. ----------
  322. groups: dict
  323. A dictionary of unique group IDs and their corresponding node groups
  324. neighbor_info: dict
  325. A data structure indicating the number of edges a node has with the
  326. groups in the current summarization of each edge type
  327. edge_types: dict
  328. dictionary of edges in the graph and their corresponding attributes recognized
  329. in the summarization
  330. group_lookup: dict
  331. dictionary of nodes and their current corresponding group ID
  332. group_id: object
  333. ID of group to be split
  334. Returns
  335. -------
  336. dict
  337. The updated groups based on the split
  338. """
  339. new_group_mappings = defaultdict(set)
  340. for node in groups[group_id]:
  341. signature = tuple(
  342. frozenset(edge_types) for edge_types in neighbor_info[node].values()
  343. )
  344. new_group_mappings[signature].add(node)
  345. # leave the biggest new_group as the original group
  346. new_groups = sorted(new_group_mappings.values(), key=len)
  347. for new_group in new_groups[:-1]:
  348. # Assign unused integer as the new_group_id
  349. # ids are tuples, so will not interact with the original group_ids
  350. new_group_id = len(groups)
  351. groups[new_group_id] = new_group
  352. groups[group_id] -= new_group
  353. for node in new_group:
  354. group_lookup[node] = new_group_id
  355. return groups
  356. @nx._dispatchable(
  357. node_attrs="[node_attributes]", edge_attrs="[edge_attributes]", returns_graph=True
  358. )
  359. def snap_aggregation(
  360. G,
  361. node_attributes,
  362. edge_attributes=(),
  363. prefix="Supernode-",
  364. supernode_attribute="group",
  365. superedge_attribute="types",
  366. ):
  367. """Creates a summary graph based on attributes and connectivity.
  368. This function uses the Summarization by Grouping Nodes on Attributes
  369. and Pairwise edges (SNAP) algorithm for summarizing a given
  370. graph by grouping nodes by node attributes and their edge attributes
  371. into supernodes in a summary graph. This name SNAP should not be
  372. confused with the Stanford Network Analysis Project (SNAP).
  373. Here is a high-level view of how this algorithm works:
  374. 1) Group nodes by node attribute values.
  375. 2) Iteratively split groups until all nodes in each group have edges
  376. to nodes in the same groups. That is, until all the groups are homogeneous
  377. in their member nodes' edges to other groups. For example,
  378. if all the nodes in group A only have edge to nodes in group B, then the
  379. group is homogeneous and does not need to be split. If all nodes in group B
  380. have edges with nodes in groups {A, C}, but some also have edges with other
  381. nodes in B, then group B is not homogeneous and needs to be split into
  382. groups have edges with {A, C} and a group of nodes having
  383. edges with {A, B, C}. This way, viewers of the summary graph can
  384. assume that all nodes in the group have the exact same node attributes and
  385. the exact same edges.
  386. 3) Build the output summary graph, where the groups are represented by
  387. super-nodes. Edges represent the edges shared between all the nodes in each
  388. respective groups.
  389. A SNAP summary graph can be used to visualize graphs that are too large to display
  390. or visually analyze, or to efficiently identify sets of similar nodes with similar connectivity
  391. patterns to other sets of similar nodes based on specified node and/or edge attributes in a graph.
  392. Parameters
  393. ----------
  394. G: graph
  395. Networkx Graph to be summarized
  396. node_attributes: iterable, required
  397. An iterable of the node attributes used to group nodes in the summarization process. Nodes
  398. with the same values for these attributes will be grouped together in the summary graph.
  399. edge_attributes: iterable, optional
  400. An iterable of the edge attributes considered in the summarization process. If provided, unique
  401. combinations of the attribute values found in the graph are used to
  402. determine the edge types in the graph. If not provided, all edges
  403. are considered to be of the same type.
  404. prefix: str
  405. The prefix used to denote supernodes in the summary graph. Defaults to 'Supernode-'.
  406. supernode_attribute: str
  407. The node attribute for recording the supernode groupings of nodes. Defaults to 'group'.
  408. superedge_attribute: str
  409. The edge attribute for recording the edge types of multiple edges. Defaults to 'types'.
  410. Returns
  411. -------
  412. networkx.Graph: summary graph
  413. Examples
  414. --------
  415. SNAP aggregation takes a graph and summarizes it in the context of user-provided
  416. node and edge attributes such that a viewer can more easily extract and
  417. analyze the information represented by the graph
  418. >>> nodes = {
  419. ... "A": dict(color="Red"),
  420. ... "B": dict(color="Red"),
  421. ... "C": dict(color="Red"),
  422. ... "D": dict(color="Red"),
  423. ... "E": dict(color="Blue"),
  424. ... "F": dict(color="Blue"),
  425. ... }
  426. >>> edges = [
  427. ... ("A", "E", "Strong"),
  428. ... ("B", "F", "Strong"),
  429. ... ("C", "E", "Weak"),
  430. ... ("D", "F", "Weak"),
  431. ... ]
  432. >>> G = nx.Graph()
  433. >>> for node in nodes:
  434. ... attributes = nodes[node]
  435. ... G.add_node(node, **attributes)
  436. >>> for source, target, type in edges:
  437. ... G.add_edge(source, target, type=type)
  438. >>> node_attributes = ("color",)
  439. >>> edge_attributes = ("type",)
  440. >>> summary_graph = nx.snap_aggregation(
  441. ... G, node_attributes=node_attributes, edge_attributes=edge_attributes
  442. ... )
  443. Notes
  444. -----
  445. The summary graph produced is called a maximum Attribute-edge
  446. compatible (AR-compatible) grouping. According to [1]_, an
  447. AR-compatible grouping means that all nodes in each group have the same
  448. exact node attribute values and the same exact edges and
  449. edge types to one or more nodes in the same groups. The maximal
  450. AR-compatible grouping is the grouping with the minimal cardinality.
  451. The AR-compatible grouping is the most detailed grouping provided by
  452. any of the SNAP algorithms.
  453. References
  454. ----------
  455. .. [1] Y. Tian, R. A. Hankins, and J. M. Patel. Efficient aggregation
  456. for graph summarization. In Proc. 2008 ACM-SIGMOD Int. Conf.
  457. Management of Data (SIGMOD’08), pages 567–580, Vancouver, Canada,
  458. June 2008.
  459. """
  460. edge_types = {
  461. edge: tuple(attrs.get(attr) for attr in edge_attributes)
  462. for edge, attrs in G.edges.items()
  463. }
  464. if not G.is_directed():
  465. if G.is_multigraph():
  466. # list is needed to avoid mutating while iterating
  467. edges = [((v, u, k), etype) for (u, v, k), etype in edge_types.items()]
  468. else:
  469. # list is needed to avoid mutating while iterating
  470. edges = [((v, u), etype) for (u, v), etype in edge_types.items()]
  471. edge_types.update(edges)
  472. group_lookup = {
  473. node: tuple(attrs[attr] for attr in node_attributes)
  474. for node, attrs in G.nodes.items()
  475. }
  476. groups = defaultdict(set)
  477. for node, node_type in group_lookup.items():
  478. groups[node_type].add(node)
  479. eligible_group_id, nbr_info = _snap_eligible_group(
  480. G, groups, group_lookup, edge_types
  481. )
  482. while eligible_group_id:
  483. groups = _snap_split(groups, nbr_info, group_lookup, eligible_group_id)
  484. eligible_group_id, nbr_info = _snap_eligible_group(
  485. G, groups, group_lookup, edge_types
  486. )
  487. return _snap_build_graph(
  488. G,
  489. groups,
  490. node_attributes,
  491. edge_attributes,
  492. nbr_info,
  493. edge_types,
  494. prefix,
  495. supernode_attribute,
  496. superedge_attribute,
  497. )