Pandas: difference between `try df.loc[x]` and `x in df.index`

Asked 2025-04-18 06:28

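I have a DataFrame with a single column. I want to write a function that takes a key and returns the value in that column, or a different (fixed) value if the key is not in the index. I can think of at least two reasonable ways to do this. Apart from speed, is there any reason to prefer one over the other?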

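As for speed: with a DataFrame of length 10,000 and 20,000 IDs to check, the try/except approach is about 2x slower. That surprised me, because the other approach has to go through the index twice. Is there an intuitive explanation for this?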

Using a try/except block

def attempt_1(id_val, df):
    # constant_val is the fixed fallback value, assumed to be defined at module level
    try:
        return df.loc[id_val]
    except KeyError:
        return constant_val

%timeit [attempt_1(i,df) for i in ids_to_check]

1 loops, best of 3: 480 ms per loop

Using `in` to test whether `id_val` is in the index

def attempt_2(id_val, df):
    # constant_val is the fixed fallback value, assumed to be defined at module level
    if id_val in df.index:
        return df.loc[id_val]
    else:
        return constant_val

%timeit [attempt_2(i,df) for i in ids_to_check]

1 loops, best of 3: 235 ms per loop

1 Answer

Create a test frame

In [22]: df = DataFrame(dict(A = np.random.randn(10000)))

Pick some IDs

In [21]: ids_to_check = np.random.choice(np.arange(0,20000),size=10000,replace=False)

Your methods

In [18]: %timeit [attempt_2(i,df) for i in ids_to_check]
1 loops, best of 3: 409 ms per loop

In [16]: %timeit [attempt_1(i,df) for i in ids_to_check]
1 loops, best of 3: 620 ms per loop
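
One plausible reading of the gap (an assumption on my part, not something timed here): `id_val in df.index` is a cheap membership test rather than a full scan of the index, while every miss in the try/except version has to raise and catch a `KeyError`, which is comparatively expensive in Python. With IDs drawn from 0..19999 against an index of 0..9999, roughly half of the lookups miss. Below is a minimal sketch that counts how often the miss path is taken and checks that both functions agree. `constant_val = -100`, the seed, and the names `misses`, `hit_id` and `miss_id` are illustrative choices, not from the original post.

```python
import numpy as np
import pandas as pd

constant_val = -100  # hypothetical fallback; the question does not say what it is

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
df = pd.DataFrame({"A": rng.standard_normal(10000)})
ids_to_check = rng.choice(np.arange(0, 20000), size=10000, replace=False)

def attempt_1(id_val, df):
    try:
        return df.loc[id_val]
    except KeyError:
        return constant_val

def attempt_2(id_val, df):
    if id_val in df.index:
        return df.loc[id_val]
    return constant_val

# Count how many lookups take the miss path, i.e. how often attempt_1 raises.
misses = sum(1 for i in ids_to_check if i not in df.index)

# One ID that is in the index and one that is not.
hit_id = int(ids_to_check[ids_to_check < 10000][0])
miss_id = int(ids_to_check[ids_to_check >= 10000][0])

print("misses:", misses, "of", len(ids_to_check))
print("same result on a hit:", attempt_1(hit_id, df).equals(attempt_2(hit_id, df)))
print("same result on a miss:", attempt_1(miss_id, df) == attempt_2(miss_id, df))
```

If most keys were present, the exception would rarely fire and the comparison could come out differently, so it is worth timing against your own hit rate.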

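An efficient approach is a vectorized lookup. `isin` returns a boolean array marking which index values appear among the IDs; indexing with that array is very fast.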

I then reindex to restore the original index, and fill the missing entries with the fallback value

In [19]: %timeit df.A.loc[df.index.isin(ids_to_check)].reindex(df.index).fillna(-100)
100 loops, best of 3: 6.74 ms per loop
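
To make the chained expression easier to follow, here is the same idea broken into steps on a tiny frame. The four-row data, the ID list, and the intermediate names `mask`, `found` and `result` are made up for illustration; `-100` is the fill value used in the timing above.

```python
import pandas as pd

df = pd.DataFrame({"A": [0.1, 0.2, 0.3, 0.4]})  # default index 0..3
ids_to_check = [1, 3, 7]                         # 7 is not in the index

# Step 1: boolean array, True where an index label is among the IDs.
mask = df.index.isin(ids_to_check)

# Step 2: keep only the matching rows (index labels 1 and 3).
found = df.A.loc[mask]

# Step 3: restore the full index; rows that were dropped come back as NaN.
# Step 4: replace those NaN values with the fixed fallback.
result = found.reindex(df.index).fillna(-100)

print(result)
```

Note that `fillna` also replaces any NaN that was already present in a matched row, so this sketch assumes the column itself has no missing values.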

This returns a Series; it could just as well return a DataFrame

In [20]: df.A.loc[df.index.isin(np.random.choice(np.arange(0,20000),size=10000,replace=False))].reindex(df.index).fillna(-100)
Out[20]: 
0    -100.000000
1      -0.485421
2      -0.397338
3    -100.000000
4       0.573031
5    -100.000000
6       0.359699
7       0.298462
8    -100.000000
9      -1.274819
10   -100.000000
11      0.112869
12   -100.000000
13     -2.251186
14     -0.846211
...
9985   -100.000000
9986     -0.988055
9987     -0.080460
9988   -100.000000
9989      1.007490
9990     -1.454466
9991      0.875455
9992   -100.000000
9993   -100.000000
9994      0.194506
9995   -100.000000
9996   -100.000000
9997   -100.000000
9998     -0.477828
9999     -0.777487
Name: A, Length: 10000, dtype: float64
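
A minimal sketch of the DataFrame variant mentioned above, assuming the same pattern is applied to the frame instead of to the column `A`. The small frame and the name `result_df` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"A": [0.1, 0.2, 0.3, 0.4]})  # default index 0..3
ids_to_check = [1, 3, 7]

mask = df.index.isin(ids_to_check)

# Indexing the frame (not df.A) keeps the result two-dimensional.
result_df = df.loc[mask].reindex(df.index).fillna(-100)

print(type(result_df).__name__)
print(result_df)
```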

So the takeaway is: use vectorized methods rather than loops whenever a vectorized form exists.
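
The `isin` result above is aligned with the rows of `df`. If what you need is one value per looked-up ID, in the order of `ids_to_check` (which is closer to what the loops in the question produce), a different technique is to reindex the column by the IDs directly with `fill_value`. This is a sketch of that alternative, not the method timed above; it assumes the index of `df` has no duplicate labels, and `constant_val` and `per_id` are illustrative names.

```python
import numpy as np
import pandas as pd

constant_val = -100.0  # hypothetical fixed fallback value

df = pd.DataFrame({"A": [0.1, 0.2, 0.3, 0.4]})  # default index 0..3
ids_to_check = np.array([3, 7, 1])               # 7 is not in the index

# One row per requested ID, in request order; unknown IDs get constant_val.
per_id = df["A"].reindex(ids_to_check, fill_value=constant_val)

print(per_id)
```

Because `fill_value` is applied only to labels that are missing from the index, a NaN stored under an existing label is left as it is.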

Answered 2025-04-18 by Python大师
