Efficient operations on grouped DataFrames

2 answers

User

#1 · edited 2024-05-14 21:54:31

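If speed is what you are after, the following should work quite well, although it is a little involved because it relies on sorting complex numbers in numpy. It is similar to the approach I used when writing the aggregate sort method in a package of mine.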

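import numpy as np

# work on plain NumPy arrays; IDs must be non-negative integers for np.bincount
ids = df['ID'].to_numpy()
prices = df['price'].to_numpy()

# get global sort order, for sorting by ID then price
# (complex numbers sort by real part first, then by imaginary part)
full_idx = np.argsort(ids + 1j * prices, kind='stable')

# get the position in the sorted order where each ID's block starts
# (note that there are multiple ways of doing this)
n_for_id = np.bincount(ids)
first_of_idx = np.cumsum(n_for_id) - n_for_id

# subtract each row's block start from its global sorted position
rank = np.empty(len(df), dtype=int)
rank[full_idx] = np.arange(len(df)) - first_of_idx[ids[full_idx]]
df['rank'] = rank + 1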

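On my machine this takes 2 seconds for 5 million rows, which is roughly 100 times faster than pandas' groupby.rank (although I did not actually run the pandas version on 5 million rows, because it would take too long; I'm not sure how @ayhan managed it in 30 seconds, perhaps a different pandas version?).

If you use this, I recommend testing it thoroughly, because I have not.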
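Since the answer recommends testing, here is a minimal sketch of one way to check the NumPy approach against groupby.rank on a small random frame. The function name `numpy_group_rank`, the column defaults, and the test data are illustrative assumptions, not part of the original answer. It assumes non-negative integer IDs and prices without missing values.

```python
import numpy as np
import pandas as pd


def numpy_group_rank(df, id_col="ID", value_col="price"):
    """Rank value_col within each id_col group (1-based, ties by row order).

    Hypothetical helper wrapping the complex-sort approach; assumes
    non-negative integer IDs and no NaN values.
    """
    ids = df[id_col].to_numpy()
    values = df[value_col].to_numpy()
    # Stable sort so tied rows keep their original order,
    # which is what rank(method="first") does.
    order = np.argsort(ids + 1j * values, kind="stable")
    counts = np.bincount(ids)
    starts = np.cumsum(counts) - counts  # start of each ID's block
    rank = np.empty(len(df), dtype=int)
    rank[order] = np.arange(len(df)) - starts[ids[order]]
    return rank + 1


rng = np.random.default_rng(0)
small = pd.DataFrame({
    "ID": rng.integers(0, 50, size=2_000),
    # Few distinct prices, so that ties within a group are common.
    "price": rng.integers(0, 20, size=2_000).astype(float),
})

expected = (
    small.groupby("ID")["price"].rank(method="first").astype(int).to_numpy()
)
actual = numpy_group_rank(small)
assert np.array_equal(actual, expected)
```

A check like this only covers the cases it generates. Negative IDs, non-integer IDs, and NaN prices would need separate handling.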

User

#2 · edited 2024-05-14 21:54:31

You can use rank:

df["order"] = df.groupby("ID")["price"].rank(method="first")
df
Out[47]: 
   ID  price  order
0   1  100.0    3.0
1   1   80.0    1.0
2   1   90.0    2.0
3   2   40.0    1.0
4   2   40.0    2.0
5   2   50.0    3.0
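In the output above, the two rows with price 40.0 in ID 2 receive ranks 1 and 2 because method="first" breaks ties by order of appearance. As a small sketch on the same six rows, the following compares that with method="min" and shows one way to get integer ranks. The cast to int is an optional extra that assumes there are no missing prices, and the column name `order_min` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "price": [100.0, 80.0, 90.0, 40.0, 40.0, 50.0],
})

grouped = df.groupby("ID")["price"]

# Ties are numbered in order of appearance; cast because rank returns floats.
df["order"] = grouped.rank(method="first").astype(int)

# Ties share the lowest rank of their group instead.
df["order_min"] = grouped.rank(method="min").astype(int)

print(df)
```

With method="first" every row in a group gets a distinct rank, which is what the NumPy version in the other answer produces as well.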

On a dataset of 5 million rows with 250,000 IDs (on an i5-3330), it takes roughly 30 seconds:

df = pd.DataFrame({"price": np.random.rand(5000000), "ID": np.random.choice(np.arange(250000), size = 5000000)})
%time df["order"] = df.groupby("ID")["price"].rank(method="first")
Wall time: 36.3 s
