如何从比较函数生成ngroups?

2 投票

1 回答

53 浏览

提问于 2025-04-14 15:18

假设我有一个函数，用来比较数据表中的行：

def comp(lhs: pandas.Series, rhs: pandas.Series) -> bool:
  if lhs.id == rhs.id:
    return True

  if abs(lhs.val1 - rhs.val1) < 1e-8:
    if abs(lhs.val2 - rhs.val2) < 1e-8:
      return True

  return False

现在我有一个数据表，里面有 id、val1 和 val2 这几列。我想生成一些组的编号，让任何两行如果通过 comp 函数比较结果为真，就属于同一个组。请问我该如何使用 pandas 来实现这个功能呢？我一直在想办法用 groupby 来做到这一点，但总是想不出合适的方法。

最小可重现示例：

example_input = pandas.DataFrame({
    'id' : [0, 1, 2, 2, 3],
    'value1' : [1.1, 1.2, 1.3, 1.4, 1.1],
    'value2' : [2.1, 2.2, 2.3, 2.4, 2.1]
})

example_output = example_input.copy()
example_output.index = [0, 1, 2, 2, 0]
example_output.index.name = 'groups'

数据处理 pandas 数据表数据分组比较函数

1 个回答

你想把一些行分成不同的组，要么是因为它们属于同一组，要么是因为它们之间的距离很近。为此，你可以使用 scipy.spatial.distance.pdist 来计算这些行之间的距离，从而找出那些距离很近的点。接着，你可以用 networkx 来创建一个图，帮助你识别哪些点是相互连接的：

import networkx as nx
import pandas as pd
from scipy.spatial.distance import pdist
from itertools import combinations

# example input
df = pandas.DataFrame({
    'id' : [0, 1, 2, 2, 3],
    'value1' : [1.1, 1.2, 1.3, 1.4, 1.1],
    'value2' : [2.1, 2.2, 2.3, 2.4, 2.1]
})

thresh = 1e-8
cols = ['value1', 'value2']

# create graph based on already connected ids
G = nx.compose_all(map(nx.path_graph, df.index.groupby(df['id']).values()))

# add pairs of values with distance below threshold as edges
G.add_edges_from(pd.Series(combinations(df.index, 2))
                 [pdist(df[cols])<thresh]
                )

# form groups based on the connected components
groups = {n: i for i, c in enumerate(nx.connected_components(G))
          for n in c}
# {0: 0, 4: 0, 1: 1, 2: 2, 3: 2}

# update index based on above dictionary
df.index = df.index.map(groups)

输出结果：

   id  value1  value2
0   0     1.1     2.1
1   1     1.2     2.2
2   2     1.3     2.3
2   2     1.4     2.4
0   3     1.1     2.1

图（仅基于ID；数字是原始索引）：

图（考虑距离后）：

回答于 2025-04-14 由 Python大师

分享举报

如何从比较函数生成ngroups?

1 个回答

撰写回答