如何从Pandas数据框生成完整层次图?

3 投票
2 回答
82 浏览
提问于 2025-04-14 18:15

有一个Pandas的数据框,里面存放着一些层级数据,也就是有上下级关系的数据。

data = pd.DataFrame({
"manager_id": ["A", "A", "B", "A", "C", "A", "B"],
"employee_id": ["B", "C", "C", "D", "E", "E", "E"]
} )

这些数据包含了每个经理的所有下属关系。比如说,对于每个经理的ID(比如“ A”),员工的ID包括了直接由经理“A”管理的员工(比如“ B”),以及由员工“ B”管理的员工(比如“ C”、“ D”)。那么,如何把这些关系表示成一个有向图呢?这个图的边应该是

[('A', 'B'), ('A', 'D'), ('B', 'C'), ('C', 'E')]

2 个回答

1

一开始我没有完全明白你想要实现什么,现在我已经想出了一个循环的方法来做到这一点:

import networkx as nx
import pandas as pd

man_id ='A'
NodeCons=[]
employees =[]
data = pd.DataFrame({
"manager_id": ["A", "A", "B", "A", "C", "A", "B"],
"employee_id": ["B", "C", "C", "D", "E", "E", "E"]
} )

def searchEmployees(man):
    employees =[]
    for index,row in data.iterrows():
        
        if row['manager_id'] == man:
            NodeCons.append((row['manager_id'],row['employee_id']))
            
            employees.append(row['employee_id'])
     
            searchEmployees(row['employee_id'])
       
searchEmployees(man_id) 

然后从这个列表中创建一个图:

G = nx.Graph()
G.add_edges_from(NodeCons)


nx.draw_networkx(G)
2

首先,把数据框(DataFrame)当作一个有向图来读取,可以使用from_pandas_edgelist这个方法。接着,用topological_sort来对节点进行排序,只保留每个子节点中,父节点具有最高拓扑索引的边。

G = nx.from_pandas_edgelist(data, source='manager_id', target='employee_id',
                            create_using=nx.DiGraph)

# topological order
order = {n: i for i,n in enumerate(nx.topological_sort(G))}
# {'A': 0, 'B': 1, 'D': 2, 'C': 3, 'E': 4}

# for each node, only keep the parent that has the greatest topological index
parents = {}
for source, target in G.edges():
    p = parents.setdefault(target, source)
    if order[source] > order[p]:
        parents[target] = source

# parents
# {'C': 'B', 'E': 'C', 'B': 'A', 'D': 'A'}

# remove shorter edges
G.remove_edges_from(G.edges - set(zip(parents.values(), parents.keys())))
# or
# G = G.edge_subgraph(list(zip(parents.values(), parents.keys())))

输出结果:

networkx remove extra edges with topological sort

过滤前的图:

networkx input graph with duplicated paths

变体

你也可以使用topological_generations来计算一个顺序,这样同一代的节点会有相同的编号。

order = {n: i for i, l in enumerate(nx.topological_generations(G)) for n in l}
# {'A': 0, 'B': 1, 'D': 1, 'C': 2, 'E': 3}

关于你提的另一个问题,你也可以计算相对代际差异,只保留两个节点之间代际差异为1的边。

G = nx.from_pandas_edgelist(data, source='manager_id', target='employee_id',
                            create_using=nx.DiGraph)

order = {n: i for i, l in enumerate(nx.topological_generations(G)) for n in l}
# {'A': 0, 'B': 1, 'D': 1, 'C': 2, 'E': 3}

# compute the relative generation difference
# keep the edges with a difference of 1
keep = [e for e in G.edges if order[e[1]]-order[e[0]] == 1]
# [('A', 'B'), ('A', 'D'), ('B', 'C'), ('C', 'E'), ('F', 'G')]

G = G.edge_subgraph(list(zip(parents.values(), parents.keys())))

撰写回答