How do I iterate over the rows of a DataFrame in Pandas?

3 answers

User

Answer 1 · edited 2024-04-22 13:44:29

First consider whether you really need to iterate over the rows of a DataFrame. See this answer for alternatives.

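If you still need to iterate over rows, you can use the methods below. Note some important caveats that are not mentioned in any of the other answers.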

DataFrame.iterrows()

for index, row in df.iterrows():
    print(row["c1"], row["c2"])

DataFrame.itertuples()

for row in df.itertuples(index=True, name='Pandas'):
    print(getattr(row, "c1"), getattr(row, "c2"))

itertuples() is supposed to be faster than iterrows().

But be aware, according to the docs (pandas 0.24.2 at the time of writing):

iterrows: dtype might not match from row to row
Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally much faster than iterrows()
iterrows: do not modify rows
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Use DataFrame.apply() instead:
```
new_df = df.apply(lambda x: x * 2)
```
itertuples:
The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
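
The caveats quoted above can be reproduced on a small frame. This is a minimal sketch, not part of the original answer: the column names `qty`, `price` and `_tag` are made up for illustration, and the exact scalar types may vary between pandas versions.

```python
import pandas as pd

# Hypothetical frame: an int column, a float column, and a name starting with "_".
df = pd.DataFrame({"qty": [1, 2], "price": [0.5, 1.5], "_tag": [10, 20]})

# iterrows() packs each row into one Series, so the ints are upcast to float64.
_, first_series = next(df.iterrows())
print(first_series.dtype)        # float64

# itertuples() keeps each value's own type, and renames "_tag" positionally.
first_tuple = next(df.itertuples(index=False))
print(first_tuple._fields)       # ('qty', 'price', '_2')
print(type(first_tuple.qty))

# Writing to the row yielded by iterrows() modifies a copy here, not the frame.
for _, row in df.iterrows():
    row["qty"] = 99
print(df["qty"].tolist())        # [1, 2]
```

With `index=True` (the default) the index occupies position 0, so the renamed field would be `_3` instead of `_2`.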

See the pandas docs on iteration for more details.

User

Answer 2 · edited 2024-04-22 13:44:29

DataFrame.iterrows is a generator that yields both the index and the row.

import pandas as pd
import numpy as np

df = pd.DataFrame([{'c1':10, 'c2':100}, {'c1':11,'c2':110}, {'c1':12,'c2':120}])

for index, row in df.iterrows():
    print(row['c1'], row['c2'])

Output: 
   10 100
   11 110
   12 120

User

Answer 3 · edited 2024-04-22 13:44:29

How to iterate over rows in a DataFrame in Pandas?

Answer: DON'T!

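Iteration in pandas is an anti-pattern and is something you should do only when you have exhausted every other option. You should not use any function with "iter" in its name for more than a few thousand rows, or you will have to get used to a lot of waiting.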

Do you want to print a DataFrame? Use ^{}.

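Do you want to compute something? In that case, search for methods in this order (list modified from here):

1. Vectorization
2. Cython routines
3. List comprehensions (vanilla for loop)
4. ^{}: i) reductions that can be performed in Cython, ii) iteration in Python space
5. ^{} and ^{}
6. ^{}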

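iterrows and itertuples (both of which received many votes in answers to this question) should be used only in very rare circumstances, such as generating row objects/namedtuples for sequential processing, which is really the only thing these functions are useful for.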

Appeal to Authority: the docs page on iteration has a huge red warning box that says:

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed [...].

Faster than Looping: Vectorization, Cython

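A good number of basic operations and computations are "vectorized" by pandas (either through NumPy or through Cythonized functions). This includes arithmetic, comparisons, (most) reductions, reshaping (such as pivoting), joins, and groupby operations. Look through the documentation on Essential Basic Functionality to find a suitable vectorized method for your problem.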

If none exists, feel free to write your own using custom Cython extensions.
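
To make the idea of vectorized operations concrete, here is a minimal sketch covering arithmetic, a comparison, and a groupby reduction. The frame and its column names (`group`, `A`, `B`, `total`) are made up for illustration.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({"group": ["a", "a", "b"], "A": [1, 2, 3], "B": [10, 20, 30]})

# Arithmetic on whole columns, with no Python-level loop.
df["total"] = df["A"] + df["B"]

# A comparison yields a boolean column that can be used as a row filter.
large = df[df["total"] > 15]

# A reduction per group via groupby.
group_sums = df.groupby("group")["total"].sum()

print(df["total"].tolist())   # [11, 22, 33]
print(large["A"].tolist())    # [2, 3]
print(group_sums)
```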

Next Best Thing: List Comprehensions

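List comprehensions should be your next port of call if 1) there is no vectorized solution available, 2) performance is important, but not important enough to go through the hassle of cythonizing your code, and 3) you are trying to perform an elementwise transformation on your data. There is a good amount of evidence to suggest that list comprehensions are sufficiently fast (and sometimes even faster) for many common pandas tasks.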

The formula is simple:

# iterating over one column - `f` is some function that processes your data
result = [f(x) for x in df['col']]
# iterating over two columns, use `zip`
result = [f(x, y) for x, y in zip(df['col1'], df['col2'])]
# iterating over multiple columns
result = [f(row[0], ..., row[n]) for row in df[['col1', ...,'coln']].values]

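If you can encapsulate your business logic into a function, you can use a list comprehension that calls it. You can make arbitrarily complex things work through the simplicity and speed of raw Python.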
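
A runnable version of the three patterns above might look like the following sketch. The data, the `initials` helper, and the derived column names are all hypothetical, and `to_numpy()` is used where the formula above uses `.values`; the two behave the same way here.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "first": ["Ada", "Alan"],
    "last": ["Lovelace", "Turing"],
    "year": [1815, 1912],
})

def initials(first, last):
    """Hypothetical business logic: build initials from two name columns."""
    return f"{first[0]}.{last[0]}."

# One column.
df["century"] = [year // 100 + 1 for year in df["year"]]

# Two columns, paired up with zip.
df["initials"] = [initials(f, l) for f, l in zip(df["first"], df["last"])]

# Several columns, iterating over the rows of the underlying array.
df["label"] = [
    f"{row[1]}, {row[0]} ({row[2]})"
    for row in df[["first", "last", "year"]].to_numpy()
]

print(df["century"].tolist())    # [19, 20]
print(df["initials"].tolist())   # ['A.L.', 'A.T.']
print(df["label"].tolist())      # ['Lovelace, Ada (1815)', 'Turing, Alan (1912)']
```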

An Obvious Example

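Let's demonstrate the difference with a simple example of adding two pandas columns, A + B. This is a vectorizable operation, so it is easy to contrast the performance of the methods discussed above.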

Benchmarking code, for your reference.
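
The original benchmarking code is not reproduced on this page. As a stand-in, the sketch below times the A + B example with each approach mentioned in this thread. It is not the author's benchmark: the row count, the random data, and the function names are assumptions, and the timings will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 2,000 rows of random integers.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "A": rng.integers(0, 100, 2_000),
    "B": rng.integers(0, 100, 2_000),
})

def vectorized(df):
    return (df["A"] + df["B"]).tolist()

def list_comp(df):
    return [a + b for a, b in zip(df["A"], df["B"])]

def apply_rows(df):
    return df.apply(lambda row: row["A"] + row["B"], axis=1).tolist()

def with_itertuples(df):
    return [row.A + row.B for row in df.itertuples(index=False)]

def with_iterrows(df):
    return [row["A"] + row["B"] for _, row in df.iterrows()]

methods = [vectorized, list_comp, apply_rows, with_itertuples, with_iterrows]
expected = vectorized(df)

for method in methods:
    # Every approach must produce the same sums before its timing means anything.
    assert method(df) == expected
    seconds = timeit.timeit(lambda: method(df), number=3) / 3
    print(f"{method.__name__:>16}: {seconds * 1e3:8.2f} ms per call")
```

Rerunning this with a larger row count, or with your own columns and operation, is a quick way to see how the gap between the approaches changes with data size.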

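I should mention, however, that it isn't always this cut and dried. Sometimes the answer to "what is the best method for an operation" is "it depends on your data". My advice is to test different approaches on your data before settling on one.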

References

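10 Minutes to pandas, and Essential Basic Functionality - useful links that introduce you to pandas and its library of vectorized*/cythonized functions.
Enhancing Performance - a primer from the documentation on enhancing standard pandas operations.
Are for-loops in pandas really bad? When should I care? - a detailed write-up by me on list comprehensions and their suitability for various operations (mainly ones involving non-numeric data).
When should I ever want to use pandas apply() in my code? - apply is slow (but not as slow as the iter* family). There are, however, situations where one can (or should) consider apply as a serious alternative, especially in some GroupBy operations.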

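* pandas string methods are "vectorized" in the sense that they are specified on the Series but operate on each element. The underlying mechanisms are still iterative, because string operations are inherently hard to vectorize.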
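
The footnote's point can be seen in a small sketch: a `.str` method and a plain list comprehension do the same per-element work and give the same result. The sample names are made up for illustration.

```python
import pandas as pd

# Hypothetical data, for illustration only.
names = pd.Series(["ada", "alan", "grace"])

# "Vectorized" string method: one call on the Series, applied element by element.
via_str_accessor = names.str.upper()

# The equivalent explicit loop as a list comprehension.
via_list_comp = [name.upper() for name in names]

print(via_str_accessor.tolist() == via_list_comp)   # True
```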
