使用Python基于多列生成图形结构
我有一个数据框,里面有以下几列:Node1、Node2、Node1_REV 和 Node2_REV。在这个结构中,Node1 是父节点,而 Node2 是子节点。Node1_REV 和 Node2_REV 则记录了各自节点的不同版本。子节点的值可能还有自己的子值,这些在 ChildNode 和 ChildNode_Rev 列中表示。
举个例子,B1/1、B2/1 和 B2/2 在 ChildNode/ChildNode_Rev 列中都有自己的子值。此外,像 B1/1 这样的值可能会出现在不同的父节点中,比如 A1/1 和 A2/1,而像 B2/1 的值可能会出现在 A1/2 中,B2/2 这样的不同版本可能会出现在另一个父节点 A2/2 中。
这种层级关系会一直延续,直到找到一个父节点,它不再作为任何其他值的子节点。
为了生成所需的输出,需要遍历所有的值,识别出所有可能的层级和组合,直到到达顶层节点。在检查可能的连接时,Node 和 Rev 的组合应该被视为唯一,而不仅仅是 Node 的值。
- 示例输入数据框:
Node1 | Node1_Rev | Node2 | Node2_REV | REF | DES |
---|---|---|---|---|---|
A1 | 1 | B1 | 1 | 11 | QR |
A1 | 2 | B2 | 1 | 12 | SG |
A2 | 1 | B1 | 1 | 11 | QR |
A2 | 2 | B2 | 2 | 12 | SG |
B1 | 1 | K5 | 1 | 11 | QR |
B1 | 1 | K1 | 1 | 11 | QR |
B2 | 2 | T1 | 1 | 12 | S1 |
A3 | 2 | G7 | 2 | 12 | G2A3 |
A3 | 3 | H9 | 1 | 23 | ... |
B2 | 1 | J1 | 3 | 22 | SSD |
import pandas as pd
df1 = pd.DataFrame({'Node1': ['A1','A1','A2','A2','B1','B2','A3','A3','B2'],
'Node1_Rev': ['1','2','1','2','1','2','2','3','1'],
'Node2': ['B1','B2','B1','B2','K5','T1','G7','H9','J1'],
'Node2_Rev': ['1','1', '1','2','1','1','2','1','3'],
'REF' : ['11','12','11','12','11','12','12','23','22'],
'DES' : ['QR','SG','QW','SG','QR','S1','G2A3','...','SSD']
}
)
- 示例输出数据框:
Node0 | Node0_Rev | Node1 | Node1_REV | Node2 | Node2_REV | REF | DES |
---|---|---|---|---|---|---|---|
A1 | 1 | A1 | 1 | B1 | 1 | 11 | QR |
A1 | 2 | A1 | 2 | B2 | 1 | 12 | SG |
A2 | 1 | A2 | 1 | B1 | 1 | 11 | QR |
A2 | 2 | A2 | 2 | B2 | 2 | 12 | SG |
A1 | 1 | B1 | 1 | K5 | 1 | 11 | QR |
A2 | 1 | B1 | 1 | K1 | 1 | 11 | QR |
A2 | 2 | B2 | 2 | T1 | 1 | 12 | S1 |
A3 | 2 | A3 | 2 | G7 | 2 | 12 | G2A3 |
A3 | 3 | A3 | 3 | H9 | 1 | 23 | ... |
A1 | 2 | B2 | 1 | J1 | 3 | 22 | SSD |
有什么有效的方法可以为更大的数据集生成输出吗?
https://stackoverflow.com/users/16343464/mozway 开了这个新问题,以增强 如何基于多层数据构建层次结构
2 个回答
接着我之前的回答,你可以调整代码,从元组而不是单个值来构建数据框(DataFrame):
import networkx as nx
G = nx.from_edgelist((zip(zip(df['Node1'], df['Node1_Rev']),
zip(df['Node2'], df['Node2_Rev']))),
create_using=nx.DiGraph)
roots = (v for v, d in G.in_degree() if d == 0)
leaves = [v for v, d in G.out_degree() if d == 0]
out = (pd.DataFrame(path for root in roots for path in
nx.all_simple_paths(G, root, leaves))
.add_prefix('Node_')
)
输出结果:
Node_0 Node_1 Node_2
0 (A1, 1) (B1, 1) (K5, 1)
1 (A1, 2) (B2, 1) (J1, 3)
2 (A2, 1) (B1, 1) (K5, 1)
3 (A2, 2) (B2, 2) (T1, 1)
4 (A3, 2) (G7, 2) None
5 (A3, 3) (H9, 1) None
图表:
优化和以列的形式输出
为了提高效率,我们可以只测试连接组件的子图中的根节点和叶子节点的组合。为此,有两种方法:一种是先生成组、根和叶子,然后用集合操作进行过滤;另一种是动态生成组、根和叶子。哪种方法更好可能取决于图的结构、子图的数量等。
无论如何,如果有大的子图(连接组件),并且有多个根和叶子,那么每种组合都会生成一行,这样会迅速增加行数。例如,如果你有一个包含10个根和10个叶子的组,就会生成100行。
第一种方法:
import networkx as nx
G = nx.from_edgelist((zip(zip(df['Node1'], df['Node1_Rev']),
zip(df['Node2'], df['Node2_Rev']))),
create_using=nx.DiGraph)
roots = {v for v, d in G.in_degree() if d == 0}
leaves = {v for v, d in G.out_degree() if d == 0}
c = nx.weakly_connected_components(G)
out = (pd.concat(dict(enumerate(
pd.DataFrame(path, columns=['', '_rev'])
for g in c
for root in roots&g
for l in [leaves&g]
for path in
nx.all_simple_paths(G, root, l))
), axis=1)
.stack().T
)
out.columns = out.columns.map(lambda x: f'Node{x[0]}{x[1]}')
第二种方法:
import networkx as nx
G = nx.from_edgelist((zip(zip(df['Node1'], df['Node1_Rev']),
zip(df['Node2'], df['Node2_Rev']))),
create_using=nx.DiGraph)
out = (pd.concat(dict(enumerate(
pd.DataFrame(path, columns=['', '_rev'])
for g in map(G.subgraph, nx.weakly_connected_components(G))
for root in (v for v, d in g.in_degree() if d == 0)
for path in
nx.all_simple_paths(G, root, {v for v, d in g.out_degree() if d == 0}))
), axis=1)
.stack().T
)
out.columns = out.columns.map(lambda x: f'Node{x[0]}{x[1]}')
输出结果:
Node0 Node0_rev Node1 Node1_rev Node2 Node2_rev
0 A1 1 B1 1 K5 1
1 A1 2 B2 1 J1 3
2 A2 1 B1 1 K5 1
3 A2 2 B2 2 T1 1
4 A3 2 G7 2 NaN NaN
5 A3 3 H9 1 NaN NaN
这里有一种方法可以把数据收集到你的有向网络中,不过这并不会生成最终的表格,因为它支持任意深度的图。
import io
import json
rows= """
| A1 | 1 | B1 | 1 | 11 | QR |
| A1 | 2 | B2 | 1 | 12 | SG |
| A2 | 1 | B1 | 1 | 11 | QR |
| A2 | 2 | B2 | 2 | 12 | SG |
| B1 | 1 | K5 | 1 | 11 | QR |
| B1 | 1 | K1 | 1 | 11 | QR |
| B2 | 2 | T1 | 1 | 12 | S1 |
| A3 | 2 | G7 | 2 | 12 | G2A3 |
| A3 | 3 | H9 | 1 | 23 | ... |
| B2 | 1 | J1 | 3 | 22 | SSD |
""".strip()
rows_collector = {}
with io.StringIO(rows) as file_in:
for row in file_in:
fields = [f.strip() for f in row.strip(" |").split("|")]
parent_key = fields[0:2]
parent_node, parent_rev = parent_key
parent_key = "~".join(parent_key)
child_key = fields[2:4]
child_node, child_rev = child_key
child_key = "~".join(child_key)
child_data = fields[4:6]
child = rows_collector.setdefault(child_key, {
"node": child_node,
"rev": child_rev,
"children": [],
})
child["REF"] = child_data[0]
child["DES"] = child_data[1]
child["has_parent"] = True
parent = rows_collector.setdefault(parent_key, {
"node": parent_node,
"rev": parent_rev,
"REF" : None,
"DES" : None,
"children" : [],
"has_parent": False
})
parent["children"].append(child)
for key, value in sorted(rows_collector.items()):
if not value["has_parent"]:
print(json.dumps(value, indent=4))
这样你就能得到:
{
"node": "A1",
"rev": "1",
"REF": null,
"DES": null,
"children": [
{
"node": "B1",
"rev": "1",
"children": [
{
"node": "K5",
"rev": "1",
"children": [],
"REF": "11",
"DES": "QR",
"has_parent": true
},
{
"node": "K1",
"rev": "1",
"children": [],
"REF": "11",
"DES": "QR",
"has_parent": true
}
],
"REF": "11",
"DES": "QR",
"has_parent": true
}
],
"has_parent": false
}
{
"node": "A1",
"rev": "2",
"REF": null,
"DES": null,
"children": [
{
"node": "B2",
"rev": "1",
"children": [
{
"node": "J1",
"rev": "3",
"children": [],
"REF": "22",
"DES": "SSD",
"has_parent": true
}
],
"REF": "12",
"DES": "SG",
"has_parent": true
}
],
"has_parent": false
}
{
"node": "A2",
"rev": "1",
"REF": null,
"DES": null,
"children": [
{
"node": "B1",
"rev": "1",
"children": [
{
"node": "K5",
"rev": "1",
"children": [],
"REF": "11",
"DES": "QR",
"has_parent": true
},
{
"node": "K1",
"rev": "1",
"children": [],
"REF": "11",
"DES": "QR",
"has_parent": true
}
],
"REF": "11",
"DES": "QR",
"has_parent": true
}
],
"has_parent": false
}
{
"node": "A2",
"rev": "2",
"REF": null,
"DES": null,
"children": [
{
"node": "B2",
"rev": "2",
"children": [
{
"node": "T1",
"rev": "1",
"children": [],
"REF": "12",
"DES": "S1",
"has_parent": true
}
],
"REF": "12",
"DES": "SG",
"has_parent": true
}
],
"has_parent": false
}
{
"node": "A3",
"rev": "2",
"REF": null,
"DES": null,
"children": [
{
"node": "G7",
"rev": "2",
"children": [],
"REF": "12",
"DES": "G2A3",
"has_parent": true
}
],
"has_parent": false
}
{
"node": "A3",
"rev": "3",
"REF": null,
"DES": null,
"children": [
{
"node": "H9",
"rev": "1",
"children": [],
"REF": "23",
"DES": "...",
"has_parent": true
}
],
"has_parent": false
}