在Python中使用Pandas处理父子路径层级

0 投票
2 回答
67 浏览
提问于 2025-04-14 16:06

我正在尝试在新的 pandas 列中获取子父层级关系。目前我的代码是这样的,但它会搜索所有的父项,而应该只搜索对应行的父项。

下面是它应该如何工作的示例。

基础表格(输入):

子项ID 父项ID
ITEM_1 ITEM_A
ITEM_2 ITEM_B
ITEM_A ITEM_X
ITEM_A ITEM_Y
ITEM_B ITEM_Z

第一次迭代

子项ID 父项ID 层级
ITEM_1 ITEM_A ITEM_1, ITEM_A
ITEM_2 ITEM_B ITEM_2, ITEM_B
ITEM_A ITEM_X ITEM_A, ITEM_X
ITEM_A ITEM_Y ITEM_A, ITEM_Y
ITEM_B ITEM_Z ITEM_B, ITEM_Z

第二次迭代(在这种情况下是期望的输出,因为没有更多的父项)

子项ID 父项ID 层级
ITEM_1 ITEM_A ITEM_1, ITEM_A, ITEM_X
ITEM_1 ITEM_A ITEM_1, ITEM_A, ITEM_Y
ITEM_2 ITEM_B ITEM_2, ITEM_B, ITEM_Z
ITEM_A ITEM_X ITEM_A, ITEM_X
ITEM_A ITEM_Y ITEM_A, ITEM_Y
ITEM_B ITEM_Z ITEM_B, ITEM_Z

我在实现这个结果时遇到困难,这是我的代码:

def get_parents(child_id):
    list_of_parents = []

    def dfs(child_id, parents_list):
        parent_ids = df[df["child_id"] == child_id]["parent_id"].values
        if len(parent_ids) == 0:
            return
        for parent_id in parent_ids:
            if parent_id not in parents_list:
                parents_list.append(parent_id)
                dfs(parent_id, parents_list)

    dfs(child_id, list_of_parents)
    return list_of_parents

df["parent_hierarchy"] = df["child_id"].apply(get_parents)

示例问题:

当尝试在整个数据框上运行代码时,会出现这个错误: NetworkXNoPath: 4614837_12_13200 和 4975995_5_13200 之间没有路径。

在过滤数据框只保留两边都有的 ID 后,代码可以运行,但仍然无法在输入数据框上工作。

父项ID 子项ID
4975995_5_13200 4789551_5_13200
5003373_3_13200 4614837_12_13200
4975995_5_13200 4602153_19_13200
4975995_5_13200 4789551_7_13200
4974894_5_13200 4614837_12_13200
4975995_5_13200 4789551_10_13200
4975995_5_13200 4789551_3_13200
4975995_5_13200 4789551_6_13200
4975995_5_13200 4789551_2_13200

用于草稿的 SQL 代码,但最终会用 Python 实现。

SELECT
CONCAT_WS(
  "|", 
  concat(
    bl1.parent_item_id 
  ), 
  IF(
    bl2.parent_item_id IS NULL, 
    NULL, 
    concat(
      bl2.parent_item_id 
    )
  ), 
  IF(
    bl3.parent_item_id IS NULL, 
    NULL, 
    concat(
      bl3.parent_item_id 
    )
  ), 
  IF(
    bl4.parent_item_id IS NULL, 
    NULL, 
    concat(
      bl4.parent_item_id 
    )
  ), 
  IF(
    bl5.parent_item_id IS NULL, 
    NULL, 
    concat(
      bl5.parent_item_id 
    )
  ), 
  IF(
    bl6.parent_item_id IS NULL, 
    NULL, 
    concat(
      bl6.parent_item_id 
    )
  ), 
  IF(
    bl7.parent_item_id IS NULL, 
    NULL, 
    concat(
      bl7.parent_item_id 
    )
  ), 
  IF(
    bl8.parent_item_id IS NULL, 
    NULL, 
    concat(
      bl8.parent_item_id 
    )
  ), 
  IF(
    bl9.parent_item_id IS NULL, 
    NULL, 
    concat(
      bl9.parent_item_id 
    )
  ), 
  IF(
    bl10.parent_item_id IS NULL, 
    NULL, 
    concat(
      bl10.parent_item_id 
    )
  )
) AS PATH -- joins to find 10 levels of items
FROM 
  baseln AS bl1 
  LEFT JOIN baseln AS bl2 ON bl1.parent_item_id = bl2.child_item_id 
  AND bl1.child_item_id IS NOT NULL 
  LEFT JOIN baseln AS bl3 ON bl2.parent_item_id = bl3.child_item_id 
  AND bl2.child_item_id IS NOT NULL 
  LEFT JOIN baseln AS bl4 ON bl2.parent_item_id = bl4.child_item_id 
  AND bl3.child_item_id IS NOT NULL 
  LEFT JOIN baseln AS bl5 ON bl2.parent_item_id = bl5.child_item_id 
  AND bl4.child_item_id IS NOT NULL 
  LEFT JOIN baseln AS bl6 ON bl2.parent_item_id = bl6.child_item_id 
  AND bl5.child_item_id IS NOT NULL 
  LEFT JOIN baseln AS bl7 ON bl2.parent_item_id = bl7.child_item_id 
  AND bl6.child_item_id IS NOT NULL 
  LEFT JOIN baseln AS bl8 ON bl2.parent_item_id = bl8.child_item_id 
  AND bl7.child_item_id IS NOT NULL 
  LEFT JOIN baseln AS bl9 ON bl2.parent_item_id = bl9.child_item_id 
  AND bl8.child_item_id IS NOT NULL 
  LEFT JOIN baseln AS bl10 ON bl2.parent_item_id = bl10.child_item_id 
  AND bl9.child_item_id IS NOT NULL 

NetworkX 在下面的数据框中确实抛出了一个错误

{'parent_id':['4974894_5_13200','5003373_3_13200','4974894_5_13200','4974894_5_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','4974894_5_13200','4974894_5_13200','5003373_3_13200','4974894_5_13200','4974894_5_13200','5003373_3_13200','4974894_5_13200','4974894_5_13200','4974894_5_13200','5003373_3_13200','5003373_3_13200','4974894_5_13200','5003373_3_13200','4975995_5_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','4974894_5_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','4974894_5_13200','4975995_5_13200','4974894_5_13200','5003373_3_13200','5003373_3_13200','4974894_5_13200','4974894_5_13200','4975995_5_13200','5003373_3_13200','4974894_5_13200','4974894_5_13200','5003373_3_13200','4974894_5_13200','4974894_5_13200','5003373_3_13200','5003373_3_13200','4974894_5_13200','4974894_5_13200','4974894_5_13200','5003373_3_13200','4975995_5_13200','5003373_3_13200','4974894_5_13200','5003373_3_13200','4974894_5_13200','4975995_5_13200','5003373_3_13200','4974894_5_13200','4974894_5_13200','5003373_3_13200','4975995_5_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200','4974894_5_13200','4975995_5_13200','5003373_3_13200','4974894_5_13200','4974894_5_13200','5003373_3_13200','5003373_3_13200','5003373_3_13200'],'child_id':['4602363_15_13200','4613145_13_13200','4602528_21_13200','4613145_11_13200','4614837_8_13200','4613145_11_13200','4613370_11_13200','4595322_17_13200','4602576_18_13200','4595310_14_13200','4595310_14_13200','4602528_23_13200','4602528_22_13200','4595322_16_13200','4595310_13_13200','4602384_16_13200','4602153_18_13200','4602384_17_13200','4602576_17_13200','4602528_18_13200','4789551_5_13200','4613046_18_13200','4602576_17_13200','4595310_9_13200','4602363_10_13200','4595322_16_13200','4602528_21_13200','4602171_17_13200','4602153_18_13200','4602153_12_13200','4595322_17_13200','4602576_12_13200','4613145_9_13200','4602576_11_13200','4602363_15_13200','4602171_12_13200','4602528_15_13200','4613370_15_13200','4614837_12_13200','4613370_14_13200','4602363_16_13200','4595322_11_13200','4595322_10_13200','4613370_15_13200','4602153_19_13200','4614837_11_13200','4613046_13_13200','4602384_10_13200','4602363_16_13200','4613370_14_13200','4789551_7_13200','4602171_10_13200','4595322_13_13200','4613046_18_13200','4602363_14_13200','4614837_12_13200','4602384_17_13200','4602576_16_13200','4613370_16_13200','4602528_18_13200','4602528_22_13200','4613145_14_13200','4602384_16_13200','4789551_10_13200','4602528_23_13200','4613046_17_13200','4602153_19_13200','4602363_14_13200','4789551_3_13200','4595310_13_13200','4602153_19_13200','4613370_16_13200','4602363_9_13200','4789551_6_13200','4602576_18_13200','4613145_14_13200','4595322_13_13200','4602384_13_13200','4602576_16_13200','4789551_2_13200','4614837_11_13200','4613145_13_13200','4602171_17_13200','4613046_17_13200','4614837_7_13200','4602153_11_13200']}

2 个回答

-2

使用数据框操作和递归函数的组合。这里有一段新代码,你在加载数据后应该能够实现这个。

def get_hierarchy(child_id, hierarchy=[]):
    hierarchy.append(child_id)
    parent_id = df.loc[df["CHILD_ID"] == child_id, "PARENT_ID"].values
    if len(parent_id) > 0:
        get_hierarchy(parent_id[0], hierarchy)
    return hierarchy

df["Hierarchy"] = df["CHILD_ID"].apply(get_hierarchy)
1

如果我理解得没错的话,可以使用 networkx 来创建层级结构:

import networkx as nx

# create graph
G = nx.from_pandas_edgelist(df, create_using=nx.DiGraph,
                            source='CHILD_ID', target='PARENT_ID')

out = {}
# for each subgraph, form pairs of nodes -> leaf
# find the path
for c in nx.weakly_connected_components(G):
    H = G.subgraph(c)
    leaves = {n for n, d in H.out_degree() if d==0}
    for n in c-leaves:
        for l in leaves:
            try:
                out.setdefault(n, []).append(nx.shortest_path(H, n, l))
            except nx.NetworkXNoPath:
                pass

# merge to unique CHILD_ID
# recreate PARENT_ID
# reorder based on original DataFrame
out = df.merge(df[['CHILD_ID']].drop_duplicates()
                 .merge(pd.Series(out, name='Hierarchy').explode(),
                        left_on='CHILD_ID', right_index=True)
                 .assign(PARENT_ID=lambda d: d['Hierarchy'].str[1]),
               how='left')

输出结果:

  CHILD_ID PARENT_ID                 Hierarchy
0   ITEM_1    ITEM_A  [ITEM_1, ITEM_A, ITEM_X]
1   ITEM_1    ITEM_A  [ITEM_1, ITEM_A, ITEM_Y]
2   ITEM_2    ITEM_B  [ITEM_2, ITEM_B, ITEM_Z]
3   ITEM_A    ITEM_X          [ITEM_A, ITEM_X]
4   ITEM_A    ITEM_Y          [ITEM_A, ITEM_Y]
5   ITEM_B    ITEM_Z          [ITEM_B, ITEM_Z]

图示:

在这里输入图片描述

撰写回答