Pandas query doesn't work if a column contains null values

Asked 2025-04-14 17:03

I have a Pandas DataFrame that contains some null values, and I want to filter it with query.

import pandas as pd

data = {'Title': ['Title1', 'Title2', 'Title3', 'Title4'],
        'Subjects': ['Math; Science', 'English; Math', pd.NA, 'English']}

df_test = pd.DataFrame(data)

print(df_test)
#     Title       Subjects
# 0  Title1  Math; Science
# 1  Title2  English; Math
# 2  Title3           <NA>
# 3  Title4        English

But this query raises an error:

df_test.query('Title.str.startswith("T") and Subjects.str.contains("Math")')

KeyError                                  Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/computation/scope.py in resolve(self, key, is_local)
    197             if self.has_resolvers:
--> 198                 return self.resolvers[key]
    199 

36 frames

KeyError: 'Series_2_0xe00x4a0x2f0xf50x420x7a0x00x0'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
KeyError: 'Series_2_0xe00x4a0x2f0xf50x420x7a0x00x0'

The above exception was the direct cause of the following exception:

UndefinedVariableError                    Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/pandas/core/computation/scope.py in resolve(self, key, is_local)
    209                 return self.temps[key]
    210             except KeyError as err:
--> 211                 raise UndefinedVariableError(key, is_local) from err
    212 
    213     def swapkey(self, old_key: str, new_key: str, new_value=None) -> None:

UndefinedVariableError: name 'Series_2_0xe00x4a0x2f0xf50x420x7a0x00x0' is not defined

This query fails in the same way:

df_test.query('Title.str.startswith("T") and Subjects.notna() and Subjects.str.contains("Math")')

This query, however, gives me the result I want:

df_test[df_test['Subjects'].notna()].query('Title.str.startswith("T") and Subjects.str.contains("Math")')
    Title       Subjects
0  Title1  Math; Science
1  Title2  English; Math

I'm wondering whether this is a limitation of query or whether I'm doing something wrong.

pd.__version__
# '1.5.3'


1 Answer

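I get the same error when testing with Python 3.10 (3.10.9, to be precise) and Pandas 2.2.1. Somewhat surprisingly, @ewz93 says in the comments that they do not run into the problem with these versions.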

I can suggest two workarounds:

Use Series.fillna to replace the null values first, and only then apply Series.str.contains("Math"):

df_test.query('Title.str.startswith("T") and Subjects.fillna("").str.contains("Math")')

    Title       Subjects
0  Title1  Math; Science
1  Title2  English; Math

Change the engine used by df.query from numexpr (the default when numexpr is installed) to python:

df_test.query('Title.str.startswith("T") and Subjects.str.contains("Math")', 
              engine='python')

    Title       Subjects
0  Title1  Math; Science
1  Title2  English; Math
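
As a possible third option, not part of the two workarounds above: Series.str.contains accepts an na parameter that sets the result for missing values. The sketch below skips query and builds a plain boolean mask instead; it reuses the question's df_test, and the names mask and result are only illustrative.

```python
import pandas as pd

data = {'Title': ['Title1', 'Title2', 'Title3', 'Title4'],
        'Subjects': ['Math; Science', 'English; Math', pd.NA, 'English']}
df_test = pd.DataFrame(data)

# na=False makes str.contains return False for missing values,
# so the combined mask is a plain boolean Series with no <NA> entries.
mask = (df_test['Title'].str.startswith('T')
        & df_test['Subjects'].str.contains('Math', na=False))

result = df_test[mask]
print(result)
```

This keeps the original column untouched, whereas fillna("") substitutes an empty string before matching. Whether passing na=False inside a query string also avoids the error with the numexpr engine is not something the answer tested.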

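Note

The documentation for df.query advises against engine='python' because it is less efficient than using numexpr as the engine. However, when I tested both approaches on a df of shape (1_000_000, 2), the second approach was, somewhat surprisingly, slightly faster (and repeated runs gave the same picture):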

# `fillna` method
573 ms ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# `engine='python'` method
556 ms ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
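
A comparison like this could be reproduced with something along the lines of the sketch below. The answer does not show its benchmark setup, so the frame construction, the helper names make_frame and benchmark, and the repeat count are all assumptions; absolute timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows, seed=0):
    """Build a hypothetical test frame: two string columns, some Subjects missing."""
    rng = np.random.default_rng(seed)
    choices = ['Math; Science', 'English; Math', pd.NA, 'English']
    idx = rng.integers(0, len(choices), size=n_rows)
    return pd.DataFrame({
        'Title': [f'Title{i}' for i in range(n_rows)],
        'Subjects': [choices[i] for i in idx],
    })


# The two workarounds from the answer, as query expressions.
EXPR_FILLNA = 'Title.str.startswith("T") and Subjects.fillna("").str.contains("Math")'
EXPR_PLAIN = 'Title.str.startswith("T") and Subjects.str.contains("Math")'


def benchmark(df, repeats=3):
    """Return the average seconds per call for each workaround."""
    t_fillna = timeit.timeit(lambda: df.query(EXPR_FILLNA), number=repeats) / repeats
    t_python = timeit.timeit(
        lambda: df.query(EXPR_PLAIN, engine='python'), number=repeats
    ) / repeats
    return {'fillna': t_fillna, 'engine_python': t_python}


# Small size so the sketch runs quickly; use 1_000_000 to match the answer's test.
df_big = make_frame(100_000)
timings = benchmark(df_big)
print(timings)
```

In IPython or Jupyter, %timeit on each query call gives the mean ± std. dev. format shown above.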

Answered 2025-04-14 by Python大师
