Flattening a "tree" of relationships represented in two columns with Pandas or PySpark

Posted 2024-05-23 20:39:16


I have a collection of family trees in a format similar to the following:

    A
   / \
  B   C
 / \ / \
D  E F  G
  / \
 .. ..

They are represented by the following two columns (which contain multiple trees):

^{tb1}$

What is the most efficient way to flatten this so that a new column gives the topmost parent of each row?

That is, B=A and D=A:

^{tb2}$

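Ideally I would like to do this in Spark (given the size of the dataset), but I can also try Pandas.

So far I have not found an efficient way to do this without running a very expensive recursive function for each level (even though my trees are at most 3 levels deep).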


3 answers

In pandas, you can do this with networkx:

import networkx as nx

# Drop the root rows (missing parent) so only real parent -> child edges remain
df = df.dropna()
G = nx.from_pandas_edgelist(df, 'parent', 'child', create_using=nx.DiGraph())

def find_root(G, node):
    # Follow the first predecessor upward until a node without a parent is reached
    parents = list(G.predecessors(node))
    if parents:
        return find_root(G, parents[0])
    return node

df['child'].apply(lambda x: find_root(G, x))

Out[109]: 
1    A
2    A
3    A

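Here is a way to do it in PySpark that handles only 3 levels. Note that the last row of the example is 4 levels deep, so it fails; hopefully that is not your case, but it is included to show the limitation.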

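import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# reuse the active session, or create one when running outside the pyspark shell
spark = SparkSession.builder.getOrCreate()

# create toy data
pdf = pd.DataFrame({'child':list('ABCDEFGHIJKLM'), 
                    'parent':['','A','A','B','B','C','C','E', '', 'I','J','K','L']})
# convert to spark dataframe
df = spark.createDataFrame(pdf)

# roots have an empty parent: make each root its own parent so the joins below keep it
df = df.withColumn('parent', F.when(F.col('parent')!='', F.col('parent'))
                              .otherwise(F.col('child')))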

# do self join using alias to direct to the right columns
res = (
    df.alias('df1')
      .join(df.alias('df2'), F.col('df1.parent') == F.col('df2.child'))
      .join(df.alias('df3'), F.col('df2.parent') == F.col('df3.child'))
      .select(['df1.child', 'df1.parent',F.col('df3.parent').alias('highest_parent')])
)

This gives:

res.orderBy('child').show()
+-----+------+--------------+
|child|parent|highest_parent|
+-----+------+--------------+
|    A|     A|             A|
|    B|     A|             A|
|    C|     A|             A|
|    D|     B|             A|
|    E|     B|             A|
|    F|     C|             A|
|    G|     C|             A|
|    H|     E|             A|
|    I|     I|             I|
|    J|     I|             I|
|    K|     J|             I|
|    L|     K|             I|
|    M|     L|             J| <- this row is 4 levels deep, so it fails; add another join if needed
+-----+------+--------------+
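
The fixed chain of joins above can be turned into a loop that adds one join per pass until no row changes. The sketch below uses pandas `merge` rather than PySpark so that it runs without a Spark session; that swap is an assumption made for illustration, and each pass corresponds to one more `.join` in the chain. The function name `highest_parent_by_joins` and the `max_levels` cap are hypothetical, and `child` values are assumed to be unique.

```python
import pandas as pd

def highest_parent_by_joins(df, max_levels=50):
    """Climb one level per self-join until no row changes (or max_levels is hit)."""
    # As in the answer above, roots (empty parent) are made their own parent.
    lookup = df.assign(parent=df["parent"].where(df["parent"] != "", df["child"]))
    result = lookup.rename(columns={"parent": "highest_parent"})
    # lookup table: current ancestor -> the ancestor one level further up
    step = lookup.rename(columns={"child": "highest_parent", "parent": "next_up"})
    for _ in range(max_levels):
        joined = result.merge(step, on="highest_parent", how="left")
        # ancestors that are not listed as a child stay as they are
        climbed = joined["next_up"].fillna(joined["highest_parent"])
        if (climbed == joined["highest_parent"]).all():
            break  # nobody moved up, so every row already points at its root
        result = joined.assign(highest_parent=climbed)[["child", "highest_parent"]]
    return result

# same toy data as in the answer above
pdf = pd.DataFrame({"child": list("ABCDEFGHIJKLM"),
                    "parent": ["", "A", "A", "B", "B", "C", "C", "E", "", "I", "J", "K", "L"]})
print(highest_parent_by_joins(pdf))
```

With this loop, row M resolves to I rather than J, because the number of joins is no longer fixed at two. If the data contained a cycle, the loop would stop at `max_levels` with a partial result.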


import pandas as pd
import numpy as np

df = pd.DataFrame({
    "child": ['A','B','C','D','E','F','G','H','I','A1','B1','C1','D1','E1','F1','G1','H1','I1'],
    "parent": [np.nan,'A','A','B','B','C','C','G','G',np.nan,'A1','A1','B1','B1','C1','C1','G1','G1']
})
# roots are the rows that have no parent
upper_parent_list = list(df[df['parent'].isna()]['child'])
# ['A', 'A1']

# start from the direct parent; roots point to themselves
df['upper_parent'] = df['parent'].fillna(df['child'])

   child parent upper_parent
0      A    NaN            A
1      B      A            A
2      C      A            A
3      D      B            B
4      E      B            B
5      F      C            C
6      G      C            C
7      H      G            G
8      I      G            G
9     A1    NaN           A1
10    B1     A1           A1
11    C1     A1           A1
12    D1     B1           B1
13    E1     B1           B1
14    F1     C1           C1
15    G1     C1           C1
16    H1     G1           G1
17    I1     G1           G1
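# repeat until every row's upper_parent is one of the roots
while df['upper_parent'].isin(upper_parent_list).sum()!=df.shape[0]:
    for up_par in upper_parent_list:
        # children already assigned to this root
        child_list = list(df[df['upper_parent'].isin([up_par])]['child'])
        # rows whose parent is one of those children belong to the same root
        df['upper_parent'] = np.where(df['parent'].isin(child_list), up_par, df['upper_parent'])
print(df)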

   child parent upper_parent
0      A    NaN            A
1      B      A            A
2      C      A            A
3      D      B            A
4      E      B            A
5      F      C            A
6      G      C            A
7      H      G            A
8      I      G            A
9     A1    NaN           A1
10    B1     A1           A1
11    C1     A1           A1
12    D1     B1           A1
13    E1     B1           A1
14    F1     C1           A1
15    G1     C1           A1
16    H1     G1           A1
17    I1     G1           A1
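
A possible vectorized variant of the loop above replaces each node's pointer with its pointer's pointer, so the distance to the root roughly halves on every pass. This is a sketch, not the answer's method: it assumes unique `child` values and no cycles, and the function name `roots_by_pointer_doubling` and the `max_passes` cap are hypothetical.

```python
import pandas as pd

def roots_by_pointer_doubling(df, max_passes=64):
    """Resolve each child's root by repeatedly jumping two levels at a time."""
    # pointer[node] is some ancestor of node; roots point at themselves
    pointer = df.set_index("child")["parent"]
    pointer = pointer.fillna(pointer.index.to_series())
    for _ in range(max_passes):
        # look up the pointer of each pointer; unknown ancestors keep their value
        jumped = pointer.map(pointer).fillna(pointer)
        if jumped.equals(pointer):
            break  # nothing moved, so every pointer is a root
        pointer = jumped
    return df["child"].map(pointer).rename("upper_parent")

# hypothetical demo data: two trees rooted at A and A1
demo = pd.DataFrame({
    "child":  ["A", "B", "C", "D", "H", "A1", "B1"],
    "parent": [None, "A", "A", "B", "D", None, "A1"],
})
demo["upper_parent"] = roots_by_pointer_doubling(demo)
print(demo)
```

Each pass is a single `Series.map` over the whole column, so the number of passes grows with the logarithm of the tree depth rather than with the number of roots.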
