使用Python基于多列生成图形结构

2 投票
2 回答
74 浏览
提问于 2025-04-13 15:55

我有一个数据框,里面有以下几列:Node1、Node2、Node1_REV 和 Node2_REV。在这个结构中,Node1 是父节点,而 Node2 是子节点。Node1_REV 和 Node2_REV 则记录了各自节点的不同版本。子节点的值可能还有自己的子值,这些在 ChildNode 和 ChildNode_Rev 列中表示。

举个例子,B1/1、B2/1 和 B2/2 在 ChildNode/ChildNode_Rev 列中都有自己的子值。此外,像 B1/1 这样的值可能会出现在不同的父节点中,比如 A1/1 和 A2/1,而像 B2/1 的值可能会出现在 A1/2 中,B2/2 这样的不同版本可能会出现在另一个父节点 A2/2 中。

这种层级关系会一直延续,直到找到一个父节点,它不再作为任何其他值的子节点。

为了生成所需的输出,需要遍历所有的值,识别出所有可能的层级和组合,直到到达顶层节点。在检查可能的连接时,Node 和 Rev 的组合应该被视为唯一,而不仅仅是 Node 的值。

  • 示例输入数据框:
Node1 Node1_Rev Node2 Node2_REV REF DES
A1 1 B1 1 11 QR
A1 2 B2 1 12 SG
A2 1 B1 1 11 QR
A2 2 B2 2 12 SG
B1 1 K5 1 11 QR
B1 1 K1 1 11 QR
B2 2 T1 1 12 S1
A3 2 G7 2 12 G2A3
A3 3 H9 1 23 ...
B2 1 J1 3 22 SSD
import pandas as pd

df1 = pd.DataFrame({'Node1': ['A1','A1','A2','A2','B1','B2','A3','A3','B2'],
                    'Node1_Rev': ['1','2','1','2','1','2','2','3','1'],
                    'Node2': ['B1','B2','B1','B2','K5','T1','G7','H9','J1'],
                    'Node2_Rev': ['1','1', '1','2','1','1','2','1','3'],
                    'REF' : ['11','12','11','12','11','12','12','23','22'],
                    'DES' : ['QR','SG','QW','SG','QR','S1','G2A3','...','SSD']
}
)
  • 示例输出数据框:
Node0 Node0_Rev Node1 Node1_REV Node2 Node2_REV REF DES
A1 1 A1 1 B1 1 11 QR
A1 2 A1 2 B2 1 12 SG
A2 1 A2 1 B1 1 11 QR
A2 2 A2 2 B2 2 12 SG
A1 1 B1 1 K5 1 11 QR
A2 1 B1 1 K1 1 11 QR
A2 2 B2 2 T1 1 12 S1
A3 2 A3 2 G7 2 12 G2A3
A3 3 A3 3 H9 1 23 ...
A1 2 B2 1 J1 3 22 SSD

在这里输入图片描述

有什么有效的方法可以为更大的数据集生成输出吗?

https://stackoverflow.com/users/16343464/mozway 开了这个新问题,以增强 如何基于多层数据构建层次结构

2 个回答

0

接着我之前的回答,你可以调整代码,从元组而不是单个值来构建数据框(DataFrame):

import networkx as nx

G = nx.from_edgelist((zip(zip(df['Node1'], df['Node1_Rev']),
                          zip(df['Node2'], df['Node2_Rev']))),
                    create_using=nx.DiGraph)

roots = (v for v, d in G.in_degree() if d == 0)
leaves = [v for v, d in G.out_degree() if d == 0]

out = (pd.DataFrame(path for root in roots for path in
                    nx.all_simple_paths(G, root, leaves))
        .add_prefix('Node_')
      )

输出结果:

    Node_0   Node_1   Node_2
0  (A1, 1)  (B1, 1)  (K5, 1)
1  (A1, 2)  (B2, 1)  (J1, 3)
2  (A2, 1)  (B1, 1)  (K5, 1)
3  (A2, 2)  (B2, 2)  (T1, 1)
4  (A3, 2)  (G7, 2)     None
5  (A3, 3)  (H9, 1)     None

图表:

在这里输入图片描述


优化和以列的形式输出

为了提高效率,我们可以只测试连接组件的子图中的根节点和叶子节点的组合。为此,有两种方法:一种是先生成组、根和叶子,然后用集合操作进行过滤;另一种是动态生成组、根和叶子。哪种方法更好可能取决于图的结构、子图的数量等。

无论如何,如果有大的子图(连接组件),并且有多个根和叶子,那么每种组合都会生成一行,这样会迅速增加行数。例如,如果你有一个包含10个根和10个叶子的组,就会生成100行。

第一种方法:

import networkx as nx

G = nx.from_edgelist((zip(zip(df['Node1'], df['Node1_Rev']),
                          zip(df['Node2'], df['Node2_Rev']))),
                    create_using=nx.DiGraph)

roots = {v for v, d in G.in_degree() if d == 0}
leaves = {v for v, d in G.out_degree() if d == 0}

c = nx.weakly_connected_components(G)

out = (pd.concat(dict(enumerate(
                   pd.DataFrame(path, columns=['', '_rev'])
                    for g in c
                    for root in roots&g
                    for l in [leaves&g]
                    for path in
                    nx.all_simple_paths(G, root, l))
                     ), axis=1)
        .stack().T
      )
out.columns = out.columns.map(lambda x: f'Node{x[0]}{x[1]}')

第二种方法:

import networkx as nx

G = nx.from_edgelist((zip(zip(df['Node1'], df['Node1_Rev']),
                          zip(df['Node2'], df['Node2_Rev']))),
                    create_using=nx.DiGraph)

out = (pd.concat(dict(enumerate(
                   pd.DataFrame(path, columns=['', '_rev'])
                    for g in map(G.subgraph, nx.weakly_connected_components(G))
                    for root in (v for v, d in g.in_degree() if d == 0)
                    for path in
                    nx.all_simple_paths(G, root, {v for v, d in g.out_degree() if d == 0}))
                     ), axis=1)
        .stack().T
      )
out.columns = out.columns.map(lambda x: f'Node{x[0]}{x[1]}')

输出结果:

  Node0 Node0_rev Node1 Node1_rev Node2 Node2_rev
0    A1         1    B1         1    K5         1
1    A1         2    B2         1    J1         3
2    A2         1    B1         1    K5         1
3    A2         2    B2         2    T1         1
4    A3         2    G7         2   NaN       NaN
5    A3         3    H9         1   NaN       NaN
0

这里有一种方法可以把数据收集到你的有向网络中,不过这并不会生成最终的表格,因为它支持任意深度的图。

import io
import json

rows= """
| A1 | 1 | B1 | 1 | 11 | QR |
| A1 | 2 | B2 | 1 | 12 | SG |
| A2 | 1 | B1 | 1 | 11 | QR |
| A2 | 2 | B2 | 2 | 12 | SG |
| B1 | 1 | K5 | 1 | 11 | QR |
| B1 | 1 | K1 | 1 | 11 | QR |
| B2 | 2 | T1 | 1 | 12 | S1 |
| A3 | 2 | G7 | 2 | 12 | G2A3 |
| A3 | 3 | H9 | 1 | 23 | ... |
| B2 | 1 | J1 | 3 | 22 | SSD |
""".strip()

rows_collector = {}
with io.StringIO(rows) as file_in:
    for row in file_in:
        fields = [f.strip() for f in row.strip(" |").split("|")]
        parent_key = fields[0:2]
        parent_node, parent_rev = parent_key
        parent_key = "~".join(parent_key)

        child_key = fields[2:4]
        child_node, child_rev = child_key
        child_key = "~".join(child_key)

        child_data = fields[4:6]

        child = rows_collector.setdefault(child_key, {
            "node": child_node,
            "rev": child_rev,
            "children": [],
        })
        child["REF"] = child_data[0]
        child["DES"] = child_data[1]
        child["has_parent"] = True

        parent = rows_collector.setdefault(parent_key, {
            "node": parent_node,
            "rev": parent_rev,
            "REF" : None,
            "DES" : None,
            "children" : [],
            "has_parent": False
        })
        parent["children"].append(child)

for key, value in sorted(rows_collector.items()):
    if not value["has_parent"]:
        print(json.dumps(value, indent=4))

这样你就能得到:

{
    "node": "A1",
    "rev": "1",
    "REF": null,
    "DES": null,
    "children": [
        {
            "node": "B1",
            "rev": "1",
            "children": [
                {
                    "node": "K5",
                    "rev": "1",
                    "children": [],
                    "REF": "11",
                    "DES": "QR",
                    "has_parent": true
                },
                {
                    "node": "K1",
                    "rev": "1",
                    "children": [],
                    "REF": "11",
                    "DES": "QR",
                    "has_parent": true
                }
            ],
            "REF": "11",
            "DES": "QR",
            "has_parent": true
        }
    ],
    "has_parent": false
}
{
    "node": "A1",
    "rev": "2",
    "REF": null,
    "DES": null,
    "children": [
        {
            "node": "B2",
            "rev": "1",
            "children": [
                {
                    "node": "J1",
                    "rev": "3",
                    "children": [],
                    "REF": "22",
                    "DES": "SSD",
                    "has_parent": true
                }
            ],
            "REF": "12",
            "DES": "SG",
            "has_parent": true
        }
    ],
    "has_parent": false
}
{
    "node": "A2",
    "rev": "1",
    "REF": null,
    "DES": null,
    "children": [
        {
            "node": "B1",
            "rev": "1",
            "children": [
                {
                    "node": "K5",
                    "rev": "1",
                    "children": [],
                    "REF": "11",
                    "DES": "QR",
                    "has_parent": true
                },
                {
                    "node": "K1",
                    "rev": "1",
                    "children": [],
                    "REF": "11",
                    "DES": "QR",
                    "has_parent": true
                }
            ],
            "REF": "11",
            "DES": "QR",
            "has_parent": true
        }
    ],
    "has_parent": false
}
{
    "node": "A2",
    "rev": "2",
    "REF": null,
    "DES": null,
    "children": [
        {
            "node": "B2",
            "rev": "2",
            "children": [
                {
                    "node": "T1",
                    "rev": "1",
                    "children": [],
                    "REF": "12",
                    "DES": "S1",
                    "has_parent": true
                }
            ],
            "REF": "12",
            "DES": "SG",
            "has_parent": true
        }
    ],
    "has_parent": false
}
{
    "node": "A3",
    "rev": "2",
    "REF": null,
    "DES": null,
    "children": [
        {
            "node": "G7",
            "rev": "2",
            "children": [],
            "REF": "12",
            "DES": "G2A3",
            "has_parent": true
        }
    ],
    "has_parent": false
}
{
    "node": "A3",
    "rev": "3",
    "REF": null,
    "DES": null,
    "children": [
        {
            "node": "H9",
            "rev": "1",
            "children": [],
            "REF": "23",
            "DES": "...",
            "has_parent": true
        }
    ],
    "has_parent": false
}

撰写回答