如何根据数据帧列的序列关系对其进行分组

2024-03-28 21:15:38 发布

您现在位置:Python中文网/ 问答频道 /正文

我试图根据它们在两列之间的顺序关系进行分组。在

d = {'df1':[10,20, 30, 60, 70, 40, 30, 70], 'df2':[20, 30, 40, 80, 70, 50, 90, 100]}

df = pd.DataFrame(data = d)
df

   df1  df2
0   10  20
1   20  30
2   30  40
3   60  80
4   80  70
5   40  50
6   30  90
7   70  100

我期望结果如下:

让它变得更多清除:-df1和df2有一个基于它们的序列的关系。例如,10与20有直接关系,10与30到20有间接关系。10与40、20、30有间接关系。再举一个例子,80和70有直接关系,和100到70有间接关系。这对其余的列值有效。在

^{pr2}$

我正在尝试使用下面的脚本,但未能成功。在

^{3}$

有人能帮上忙吗?如果在堆栈溢出处有类似的解决方案,请告诉我。在

编辑: 第一个答案与上面的例子完美地结合在一起,但是当我尝试使用我想要做的数据时,它不能正确地工作。我的真实数据如下所示。在

    df1 df2
0   10  20
1   10  30
2   10  80
3   10  90
4   10  120
5   10  140
6   10  170
7   20  180
8   30  40
9   30  165
10  30  175
11  40  20
12  40  50
13  50  60
14  60  70
15  70  180
16  80  180
17  90  100
18  100 110
19  110 180
20  120 130
21  130 180
22  140 150
23  150 160
24  160 165
25  165 180
26  165 200
27  170 175
28  175 180
29  175 200
30  180 190
31  190 200
32  200 210
33  210 220
34  220 230
35  230 240
36  240 -

Tags: 数据脚本dataframedfdata关系顺序堆栈
3条回答

一种可能的解决方案:

import pandas as pd
from itertools import chain

l1 = [10, 20, 30, 60, 80, 40, 30, 70]
l2 = [20, 30, 40, 80, 70, 50, 90, 100]

d = dict()
for i, j in zip(l1, l2):
    if i == j:
        continue
    d.setdefault(i, []).append(j)

for k in d:
    d[k].extend(chain.from_iterable(d.get(v, []) for v in d[k]))

df = pd.DataFrame({'df1': list(d.keys()), 'df2': [', '.join(str(v) for v in d[k]) for k in d]})
print(df)

印刷品:

^{pr2}$

编辑:基于新输入数据的其他解决方案。现在我正在检查路径中可能的圆:

import pandas as pd

data = '''
0   10  20
1   10  30
2   10  80
3   10  90
4   10  120
5   10  140
6   10  170
7   20  180
8   30  40
9   30  165
10  30  175
11  40  20
12  40  50
13  50  60
14  60  70
15  70  180
16  80  180
17  90  100
18  100 110
19  110 180
20  120 130
21  130 180
22  140 150
23  150 160
24  160 165
25  165 180
26  165 200
27  170 175
28  175 180
29  175 200
30  180 190
31  190 200
32  200 210
33  210 220
34  220 230
35  230 240
36  240 -
'''

df1, df2 = [], []
for line in data.splitlines()[:-1]: # < - get rid of last `-` character
    line = line.strip().split()
    if not line:
        continue

    df1.append(int(line[1]))
    df2.append(int(line[2]))

from pprint import pprint

d = dict()
for i, j in zip(df1, df2):
    if i == j:
        continue
    d.setdefault(i, []).append(j)

for k in d:
    seen = set()
    for v in d[k]:
        for val in d.get(v, []):
            if val not in seen:
                seen.add(val)
                d[k].append(val)


df = pd.DataFrame({'df1': list(d.keys()), 'df2': [', '.join(str(v) for v in d[k]) for k in d]})
print(df)

印刷品:

    df1                                                df2
0    10  20, 30, 80, 90, 120, 140, 170, 180, 40, 165, 1...
1    20                  180, 190, 200, 210, 220, 230, 240
2    30  40, 165, 175, 20, 50, 180, 200, 190, 210, 220,...
3    40  20, 50, 180, 190, 200, 210, 220, 230, 240, 60, 70
4    50          60, 70, 180, 190, 200, 210, 220, 230, 240
5    60              70, 180, 190, 200, 210, 220, 230, 240
6    70                  180, 190, 200, 210, 220, 230, 240
7    80                  180, 190, 200, 210, 220, 230, 240
8    90        100, 110, 180, 190, 200, 210, 220, 230, 240
9   100             110, 180, 190, 200, 210, 220, 230, 240
10  110                  180, 190, 200, 210, 220, 230, 240
11  120             130, 180, 190, 200, 210, 220, 230, 240
12  130                  180, 190, 200, 210, 220, 230, 240
13  140   150, 160, 165, 180, 200, 190, 210, 220, 230, 240
14  150        160, 165, 180, 200, 190, 210, 220, 230, 240
15  160             165, 180, 200, 190, 210, 220, 230, 240
16  165             180, 200, 190, 210, 200, 220, 230, 240
17  170             175, 180, 200, 190, 210, 220, 230, 240
18  175             180, 200, 190, 210, 200, 220, 230, 240
19  180                       190, 200, 210, 220, 230, 240
20  190                            200, 210, 220, 230, 240
21  200                                 210, 220, 230, 240
22  210                                      220, 230, 240
23  220                                           230, 240
24  230                                                240

pprint(d, width=250)

{10: [20, 30, 80, 90, 120, 140, 170, 180, 40, 165, 175, 100, 130, 150, 190, 20, 50, 200, 110, 160, 60, 210, 70, 220, 230, 240],
 20: [180, 190, 200, 210, 220, 230, 240],
 30: [40, 165, 175, 20, 50, 180, 200, 190, 210, 220, 230, 240, 60, 70],
 40: [20, 50, 180, 190, 200, 210, 220, 230, 240, 60, 70],
 50: [60, 70, 180, 190, 200, 210, 220, 230, 240],
 60: [70, 180, 190, 200, 210, 220, 230, 240],
 70: [180, 190, 200, 210, 220, 230, 240],
 80: [180, 190, 200, 210, 220, 230, 240],
 90: [100, 110, 180, 190, 200, 210, 220, 230, 240],
 100: [110, 180, 190, 200, 210, 220, 230, 240],
 110: [180, 190, 200, 210, 220, 230, 240],
 120: [130, 180, 190, 200, 210, 220, 230, 240],
 130: [180, 190, 200, 210, 220, 230, 240],
 140: [150, 160, 165, 180, 200, 190, 210, 220, 230, 240],
 150: [160, 165, 180, 200, 190, 210, 220, 230, 240],
 160: [165, 180, 200, 190, 210, 220, 230, 240],
 165: [180, 200, 190, 210, 200, 220, 230, 240],
 170: [175, 180, 200, 190, 210, 220, 230, 240],
 175: [180, 200, 190, 210, 200, 220, 230, 240],
 180: [190, 200, 210, 220, 230, 240],
 190: [200, 210, 220, 230, 240],
 200: [210, 220, 230, 240],
 210: [220, 230, 240],
 220: [230, 240],
 230: [240]}

编辑2:如果df是带有“df1”和“df2”列的输入数据帧:

from pprint import pprint

d = dict()
for i, j in zip(df.df1, df.df2):
    if i == j:
        continue
    if j == '-':   # <  this will remove the `-` character in df2
        continue
    d.setdefault(i, []).append(j)

for k in d:
    seen = set()
    for v in d[k]:
        for val in d.get(v, []):
            if val not in seen:
                seen.add(val)
                d[k].append(val)


df = pd.DataFrame({'df1': list(d.keys()), 'df2': [', '.join(str(v) for v in d[k]) for k in d]})
print(df)

这应该可以做到:

def recursive_walk(df, node):
    parents=df.loc[(df['df1']==node) & (df['df2']!=node), 'df2'].tolist()
    if(len(parents)==0):
        yield node
    else:
        for parent in parents:
            yield parent
            lst=[el for el in recursive_walk(df, parent)]
            for el in lst:
                yield el

df['tree']=df.apply(lambda x: list(set([el for el in recursive_walk(df, x['df2'])]+[x['df2']])), axis=1)

输出:

^{pr2}$

(*)我还检查了extended dataframe-它非常快,我不会共享输出,因为我的IDE正在截断它;)

嗨,谢谢你的澄清,我有一个递归函数的解决方案,你可以试试。对于大数据帧可能效率不高,但似乎效果良好。 该函数返回一个列表,但您可以编辑结果序列,以将该列表联接为字符串。在

def get_related(df1, related):
    # get directly related values
    next_vals = df.loc[df['df1'] == df1, 'df2'].values.tolist()
    # remove links to self (will cause recursion issues)
    next_vals = list(set(next_vals) - set([df1]))
    # add to running list
    related = related + next_vals
    # continue to next level
    if any(next_val in df['df1'].unique() for next_val in next_vals):
        for next_val in next_vals:
            related = related + get_related(next_val, related)
    # get unique list
    return list(set(related))

df['df1'].apply(lambda x: get_related(x, []))

相关问题 更多 >