A faster alternative to the Pandas `isin` function

34 votes

2 answers

39941 views

Asked 2025-04-18 08:00

I have a very large DataFrame df that looks like this:

ID       Value1    Value2
1345      3.2      332
1355      2.2      32
2346      1.0      11
3456      8.9      322

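I also have a list, ID_list, containing a subset of the IDs. I need to extract the rows of df whose ID appears in ID_list.

Currently I do this with df_sub=df[df.ID.isin(ID_list)], but it takes a lot of time. The IDs in ID_list follow no pattern, so they do not fall within any particular range. (I also need to apply the same operation to many similar DataFrames.) Is there a faster way to do this? Would it help much to set ID as the index?

Thanks!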


2 Answers

Yes, isin is indeed fairly slow.

Instead, set ID as the index and select with loc, which is faster. For example:

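df.set_index('ID', inplace=True)
# Note: since pandas 1.0, loc raises KeyError if any label in the list
# is missing from the index.
df.loc[list_of_indices]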
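If ID_list may contain IDs that do not occur in df, a plain loc lookup is not quite equivalent to isin. Here is a minimal sketch of one way to handle that, using the sample rows from the question. The contents of id_list are made up for illustration, and filtering through Index.intersection is my own addition, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question; id_list is a made-up example.
df = pd.DataFrame(
    {
        "ID": [1345, 1355, 2346, 3456],
        "Value1": [3.2, 2.2, 1.0, 8.9],
        "Value2": [332, 32, 11, 322],
    }
)
id_list = [3456, 1355, 9999]  # 9999 is deliberately absent from df

indexed = df.set_index("ID")

# Keep only the labels that actually exist, so loc cannot raise KeyError.
present = indexed.index.intersection(id_list)
df_sub = indexed.loc[present]
print(df_sub)
```

Whether this beats isin on your data is something to measure, since the intersection step has its own cost.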

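I actually landed on this page because I needed to create a label in one DataFrame based on the index of another: "if an index value of df_1 also appears in the index of df_2, label it 1, otherwise NaN". I did it like this: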

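df_2['label'] = 1  # Create a label column
# join returns a new DataFrame (left join on the index); rows of df_1
# without a match in df_2 get NaN in 'label'.
df_1 = df_1.join(df_2['label'])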

This is also very fast.
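To make the labelling trick concrete, here is a small self-contained sketch. The two frames and their column names ("value", "other") are invented for illustration; only the 'label' column and the join call come from the answer.

```python
import pandas as pd

# Two toy frames; only index values 2 and 3 are shared.
df_1 = pd.DataFrame({"value": [10, 20, 30]}, index=[1, 2, 3])
df_2 = pd.DataFrame({"other": [0.5, 0.7]}, index=[2, 3])

df_2["label"] = 1  # constant marker column
labeled = df_1.join(df_2["label"])  # default left join on the index

print(labeled)
```

Because the left join introduces NaN, the label column comes back as float rather than int.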

Answered 2025-04-18 by Python大师


Edit 2: Here is a link to a more recent study of the performance of various pandas operations, although it does not yet seem to cover merge and join.

https://github.com/mm-mansour/Fast-Pandas

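Edit 1: These benchmarks were run against a fairly old version of pandas and may no longer be relevant. See Mike's comment below regarding merge.

It depends on the size of your data. For large datasets, DataFrame.join seems to be the better choice. It requires that the index of your DataFrame is your 'ID', and that the index of the Series or DataFrame you join against is your 'ID_list'. The Series must also have a name, which is brought in as a new field named after it. You also need to specify an inner join to get isin-like behavior, because join defaults to a left join. For large datasets, the query 'in' syntax seems to have the same speed characteristics as isin.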
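As a rough sketch of the join-based filter described above, assuming the sample rows from the question: the Series name "in_list" and the contents of id_list are hypothetical, and dropping the helper column afterwards is my own addition.

```python
import pandas as pd

# Sample rows from the question, with ID moved into the index.
df = pd.DataFrame(
    {
        "ID": [1345, 1355, 2346, 3456],
        "Value1": [3.2, 2.2, 1.0, 8.9],
        "Value2": [332, 32, 11, 322],
    }
).set_index("ID")
id_list = [1355, 3456, 9999]  # made-up IDs; 9999 is not in df

# A named Series indexed by the wanted IDs; the values are irrelevant.
marker = pd.Series(True, index=id_list, name="in_list")

# The inner join keeps only index values present on both sides.
df_sub = df.join(marker, how="inner").drop(columns="in_list")
print(df_sub)
```

Note that this only matches isin when id_list has no duplicates. A repeated ID would produce repeated rows in the result.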

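For small datasets the picture is different: a list comprehension, or apply against a dictionary, is actually faster than isin.

Otherwise, you can try Cython to speed things up.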
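For the small-dataset case, a minimal sketch of the list-comprehension approach might look like this. I use a set here instead of the dictionary from the benchmarks, which is my own substitution (both give hash-based membership tests), and the data is invented.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4]})
id_set = set(range(0, 10000, 2))  # made-up lookup values: even numbers

# Build a plain Python boolean mask with hash lookups, then index with it.
mask = [x in id_set for x in df["ID"]]
df_sub = df[mask]
print(df_sub)
```

This avoids the fixed overhead of isin, which is why it can win on tiny frames. On large frames the per-element Python loop is likely to lose.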

import pandas as pd

# I'm ignoring that the index is defaulting to a sequential number. You
# would need to explicitly assign your IDs to the index here, e.g.:
# >>> l_series.index = ID_list
mil = range(1000000)
l = mil
l_series = pd.Series(l)

df = pd.DataFrame(l_series, columns=['ID'])
# DataFrame.join below needs a named Series; set the name after building df.
l_series.name = 'ID_list'


In [247]: %timeit df[df.index.isin(l)]
1 loops, best of 3: 1.12 s per loop

In [248]: %timeit df[df.index.isin(l_series)]
1 loops, best of 3: 549 ms per loop

# index vs column doesn't make a difference here
In [304]: %timeit df[df.ID.isin(l_series)]
1 loops, best of 3: 541 ms per loop

In [305]: %timeit df[df.index.isin(l_series)]
1 loops, best of 3: 529 ms per loop

# query 'in' syntax has the same performance as 'isin'
In [249]: %timeit df.query('index in @l')
1 loops, best of 3: 1.14 s per loop

In [250]: %timeit df.query('index in @l_series')
1 loops, best of 3: 564 ms per loop

# ID must be the index for DataFrame.join and l_series must have a name.
# join defaults to a left join so we need to specify inner for existence.
In [251]: %timeit df.join(l_series, how='inner')
10 loops, best of 3: 93.3 ms per loop

# Smaller datasets.
df = pd.DataFrame([1,2,3,4], columns=['ID'])
l = range(10000)
l_dict = dict(zip(l, l))
l_series = pd.Series(l)
l_series.name = 'ID_list'


In [363]: %timeit df.join(l_series, how='inner')
1000 loops, best of 3: 733 µs per loop

In [291]: %timeit df[df.ID.isin(l_dict)]
1000 loops, best of 3: 742 µs per loop

In [292]: %timeit df[df.ID.isin(l)]
1000 loops, best of 3: 771 µs per loop

In [294]: %timeit df[df.ID.isin(l_series)]
100 loops, best of 3: 2 ms per loop

# It's actually faster to use apply or a list comprehension for these small cases.
In [296]: %timeit df[[x in l_dict for x in df.ID]]
1000 loops, best of 3: 203 µs per loop

In [299]: %timeit df[df.ID.apply(lambda x: x in l_dict)]
1000 loops, best of 3: 297 µs per loop
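Since these timings come from an old pandas version, it is probably worth re-measuring on your own setup. Below is a small harness sketch using the standard timeit module. The data size, the names (candidates, marker, "in_list") and the choice of three candidates are all assumptions for illustration, not the setup used for the numbers above.

```python
import timeit

import pandas as pd

n = 100_000  # arbitrary size; adjust to resemble your data
df = pd.DataFrame({"ID": range(n)})
id_list = list(range(0, n, 2))  # every second ID, all present in df

indexed = df.set_index("ID")
marker = pd.Series(True, index=id_list, name="in_list")

# Each candidate returns the filtered frame; only the runtime is compared.
candidates = {
    "isin": lambda: df[df["ID"].isin(id_list)],
    "join": lambda: indexed.join(marker, how="inner"),
    "loc": lambda: indexed.loc[id_list],
}

# Best of 3 repeats, 3 calls each, reported as seconds per call.
timings = {
    name: min(timeit.repeat(fn, number=3, repeat=3)) / 3
    for name, fn in candidates.items()
}
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds * 1e3:.2f} ms")
```

The ranking can change with the pandas version, the dtype of the IDs, and the ratio between the size of df and the size of the list, so treat any single run as indicative only.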

Answered 2025-04-18 by Python大师
