Querying with a dynamic list in Pandas

2 votes

2 answers

814 views

Asked 2025-04-18 04:35

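Suppose I have several columns holding different kinds of rates (for example "annual rate", "1/2 annual rate", and so on). I want to use query on my DataFrame to find the rows where any of these rates is greater than 1.

First, I find the columns I want to use in the query: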

cols = [x for ix, x in enumerate(df.columns) if 'rate' in x]

Suppose cols contains:

["annual rate", "1/2 annual rate", "monthly rate"]

Next, I want to do something like this:

df.query('any of my cols > 1')

How can I format this into a form that query accepts?


2 Answers

Something like this should do the trick.

df.query('|'.join('(%s > 1)' % col for col in cols))

However, I'm not sure how to handle the spaces in the column names, so you may need to rename the columns.
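One possible way around the spaces without renaming anything, assuming pandas 1.0 or later: DataFrame.query accepts column names wrapped in backticks, including names that contain characters such as "/" or start with a digit. The sketch below only builds the query string; the helper name build_any_gt_query is made up for illustration and is not part of pandas.

```python
def build_any_gt_query(cols, threshold=1):
    """Build a query string that is true when any listed column exceeds threshold.

    Hypothetical helper. Each column name is wrapped in backticks so that
    DataFrame.query (pandas 1.0+ assumed) can resolve names that are not
    valid Python identifiers.
    """
    return " | ".join("(`%s` > %r)" % (col, threshold) for col in cols)


cols = ["annual rate", "1/2 annual rate", "monthly rate"]
query_string = build_any_gt_query(cols)
print(query_string)
```

The result can then be passed along as df.query(query_string). This sketch does not handle column names that themselves contain a backtick.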

Answered 2025-04-18 by Python大师


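query parses its argument as a full Python expression, with some restrictions: for example, you cannot use lambda expressions or ternary if/else expressions. This means that any column name you mention in the query string must be a valid Python identifier (the formal term for a "variable name"). One way to check this is to use the Name pattern from the tokenize module: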

In [156]: tokenize.Name
Out[156]: '[a-zA-Z_]\\w*'

In [157]: def isidentifier(x):
   .....:     return re.match(tokenize.Name + '$', x) is not None
   .....:

In [158]: isidentifier('adsf')
Out[158]: True

In [159]: isidentifier('1adsf')
Out[159]: False
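The session above reflects the Python 2 value of tokenize.Name. On Python 3, a simpler check is the built-in str.isidentifier() method, combined with keyword.iskeyword to rule out reserved words. The following is a minimal sketch; the helper name is_valid_column_identifier is hypothetical.

```python
import keyword


def is_valid_column_identifier(name):
    """Return True if name could appear bare in a query string.

    Hypothetical helper: a usable name must be a valid Python identifier
    and must not be a reserved keyword such as 'class' or 'if'.
    """
    return name.isidentifier() and not keyword.iskeyword(name)


cols = ["annual rate", "1/2 annual rate", "monthly rate", "annual_rate"]
# Collect the column names that would need renaming (or quoting) before querying.
needs_renaming = [c for c in cols if not is_valid_column_identifier(c)]
print(needs_renaming)
```

Note that this checks the whole string, so a name with a space in it such as "annual rate" is rejected rather than matched on its first word.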

Now, because your column names contain spaces, each space-separated word is treated as a separate identifier, so you end up with something like

df.query("annual rate > 1")

which is invalid syntax in Python. Try typing annual rate into a Python interpreter and you will get a SyntaxError.

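To sum up: you need to change the column names into valid variable names. Unless your column names follow some kind of structure, this can be hard to do programmatically. In your case, you could do this: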

In [166]: cols
Out[166]: ['annual rate', '1/2 annual rate', 'monthly rate']

In [167]: list(map(lambda x: '_'.join(x.split()).replace('1/2', 'half'), cols))
Out[167]: ['annual_rate', 'half_annual_rate', 'monthly_rate']
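For column names without a known structure, a more generic (and lossier) approach is to replace every run of non-word characters with an underscore. The sketch below is one way to do that under stated assumptions: the helper name to_identifier is hypothetical, it produces less readable names than the hand-written mapping above, and it does not guard against two columns collapsing to the same name or against Python keywords.

```python
import re


def to_identifier(name):
    """Turn an arbitrary column name into a valid Python identifier.

    Hypothetical helper, not part of pandas.
    """
    # Replace each run of non-word characters with a single underscore.
    cleaned = re.sub(r"\W+", "_", name.strip())
    # Identifiers cannot be empty or start with a digit, so prefix an underscore.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


cols = ["annual rate", "1/2 annual rate", "monthly rate"]
# Mapping from old to new names, suitable for df.rename(columns=rename_map).
rename_map = {c: to_identifier(c) for c in cols}
print(rename_map)
```

The mapping can be applied with df.rename(columns=rename_map) before building the query string.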

Then you can format the query string as in @acushner's example:

In [173]: newcols
Out[173]: ['annual_rate', 'half_annual_rate', 'monthly_rate']

In [174]: ' or '.join('%s > 1' % c for c in newcols)
Out[174]: 'annual_rate > 1 or half_annual_rate > 1 or monthly_rate > 1'

Note: you don't actually need `query` here:

In [180]: df = DataFrame(randn(10, 3), columns=cols)

In [181]: df
Out[181]:
   annual rate  1/2 annual rate  monthly rate
0      -0.6980           0.6322        2.5695
1      -0.1413          -0.3285       -0.9856
2       0.8189           0.7166       -1.4302
3       1.3300          -0.9596       -0.8934
4      -1.7545          -0.9635        2.8515
5      -1.1389           0.1055        0.5423
6       0.2788          -1.3973       -0.9073
7      -1.8570           1.3781        0.0501
8      -0.6842          -0.2012       -0.5083
9      -0.3270          -1.5280        0.2251

[10 rows x 3 columns]

In [182]: df.gt(1).any(axis=1)
Out[182]:
0     True
1    False
2    False
3     True
4     True
5    False
6    False
7     True
8    False
9    False
dtype: bool

In [183]: df[df.gt(1).any(axis=1)]
Out[183]:
   annual rate  1/2 annual rate  monthly rate
0      -0.6980           0.6322        2.5695
3       1.3300          -0.9596       -0.8934
4      -1.7545          -0.9635        2.8515
7      -1.8570           1.3781        0.0501

[4 rows x 3 columns]
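If the DataFrame also contains columns that are not rates, the same idea can be restricted to the dynamically selected columns. The following is a minimal sketch with made-up data; the "name" column and all values are purely illustrative and do not come from the question.

```python
import pandas as pd

# Illustrative data: one non-rate column plus the three rate columns.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "annual rate": [0.5, 1.3, -0.2, 0.9],
    "1/2 annual rate": [0.6, -0.9, 0.1, 1.4],
    "monthly rate": [2.5, -0.8, 0.5, 0.0],
})

# Pick the rate columns dynamically, as in the question.
rate_cols = [c for c in df.columns if "rate" in c]

# True for each row where at least one rate column is greater than 1.
mask = df[rate_cols].gt(1).any(axis=1)
result = df[mask]
print(result)
```

Because the comparison runs only on df[rate_cols], non-numeric columns such as "name" never take part in it, and no column needs to be renamed.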

As @Jeff mentioned in the comments, you can refer to non-identifier column names in a clunky way:

pd.eval('df[df["annual rate"]>0]')

If you want to save kittens, I don't recommend writing code like this.

Answered 2025-04-18 by Python大师
