Combining multiple rows into a single array in a pandas DataFrame
Suppose I have a DataFrame that looks like this:
In [41]: df.columns
Out[41]: Index([u'Date Time', u'Open', u'High', u'Low', u'Last'], dtype='object')
In [42]: df
Out[42]:
Date Time Open High Low Last
0 12/02/2007 23:23:00 1443.75 1444.00 1443.75 1444.00
1 12/02/2007 23:25:00 1444.00 1444.00 1444.00 1444.00
2 12/02/2007 23:26:00 1444.25 1444.25 1444.25 1444.25
3 12/02/2007 23:27:00 1444.25 1444.25 1444.25 1444.25
4 12/02/2007 23:28:00 1444.25 1444.25 1444.25 1444.25
5 12/02/2007 23:29:00 1444.25 1444.25 1444.00 1444.00
6 12/02/2007 23:30:00 1444.25 1444.25 1444.00 1444.00
7 12/02/2007 23:31:00 1444.25 1444.25 1443.75 1444.00
8 12/02/2007 23:32:00 1444.00 1444.00 1443.75 1443.75
9 12/02/2007 23:33:00 1444.00 1444.00 1443.50 1443.50
I want to create an array that pairs the current row's Date Time with the other columns of that row and of the preceding n rows. For example, with index 9 and n = 2, the target result would take these rows:
7 12/02/2007 23:31:00 1444.25 1444.25 1443.75 1444.00
8 12/02/2007 23:32:00 1444.00 1444.00 1443.75 1443.75
9 12/02/2007 23:33:00 1444.00 1444.00 1443.50 1443.50
and turn them into a list whose values 1 to 4 come from row 9, 5 to 8 from row 8, and 9 to 12 from row 7:
['12/02/2007 23:33:00', 1444.00, 1444.00, 1443.50, 1443.50, 1444.00, 1444.00, 1443.75, 1443.75, 1444.25, 1444.25, 1443.75, 1444.00]
I'm sure I could iterate over slices of the DataFrame to build this array, but I'm hoping there is a more efficient way to do it.
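As a baseline, iterating over a slice for each row might look roughly like this (a minimal sketch; window_features is a hypothetical helper, and the newest-row-first ordering follows the example above):

# Minimal sketch of the slice-per-row idea; assumes an integer index and
# the newest-row-first ordering shown in the example output.
def window_features(df, idx, n):
    rows = df.iloc[idx - n: idx + 1]          # rows idx-n .. idx
    out = [df.iloc[idx, 0]]                   # Date Time of the newest row
    for _, row in rows[::-1].iterrows():      # newest row first
        out.extend(row.iloc[1:].tolist())     # Open, High, Low, Last
    return out

# window_features(df, 9, 2) should reproduce the list above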
Edit:
Here is some code that produces the result I'm after. A couple of replies suggested looking at rolling_apply or rolling_window, but I haven't been able to work out how to use them.
import pandas as pd
import numpy as np
data = pd.DataFrame([
['12/02/2007 23:23:00', 1443.75, 1444.00, 1443.75, 1444.00],
['12/02/2007 23:25:00', 1444.00, 1444.00, 1444.00, 1444.00],
['12/02/2007 23:26:00', 1444.25, 1444.25, 1444.25, 1444.25],
['12/02/2007 23:27:00', 1444.25, 1444.25, 1444.25, 1444.25],
['12/02/2007 23:28:00', 1444.25, 1444.25, 1444.25, 1444.25],
['12/02/2007 23:29:00', 1444.25, 1444.25, 1444.00, 1444.00],
['12/02/2007 23:30:00', 1444.25, 1444.25, 1444.00, 1444.00],
['12/02/2007 23:31:00', 1444.25, 1444.25, 1443.75, 1444.00],
['12/02/2007 23:32:00', 1444.00, 1444.00, 1443.75, 1443.75],
['12/02/2007 23:33:00', 1444.00, 1444.00, 1443.50, 1443.50]
])
window_size = 6
# Prime the DataFrame using the date as the index
result = pd.DataFrame(
    [data.iloc[0:window_size, 1:].values.flatten()],
    [data.iloc[window_size - 1, 0]])
# iterate over the remaining rows, keeping the date in t[0]
for t in data.iloc[window_size:, :].itertuples(index=False):
    # drop the oldest values and append the new ones
    new_features = result.tail(1).iloc[:, 4:].values.flatten()
    new_features = np.append(new_features, list(t[1:]), 0)
    # turn it into a DataFrame and append it to the ongoing result
    new_df = pd.DataFrame([new_features], [t[0]])
    result = pd.concat([result, new_df])
This approach isn't very fast, so I'm still looking for ways to improve it.
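One direction that may avoid the Python loop entirely is to build the whole wide table from shifted copies of the price columns; a rough sketch, assuming the data and window_size defined above and keeping the same oldest-first ordering the priming step uses:

# Build every window at once from shifted copies of the four price columns.
values = data.iloc[:, 1:]
shifted = [values.shift(i) for i in range(window_size - 1, -1, -1)]
wide = pd.concat(shifted, axis=1)            # oldest row's values first
wide.columns = range(wide.shape[1])          # 0 .. window_size * 4 - 1
wide.index = data.iloc[:, 0]                 # date of the newest row in each window
result = wide.dropna()                       # rows without a full window fall out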
2 Answers
0
This simple function works for me:
import itertools

def collapse(df, index_loc, number):
    # chain the values of rows index_loc - number .. index_loc into one flat list
    return list(itertools.chain(*[list(df.loc[x].values)
                                  for x in range(index_loc - number, index_loc + 1)]))
Here df is your DataFrame, index_loc is the starting index (this assumes an integer index, as in your example) and number is your 'n'. It uses the values attribute to grab the values at each index location and chains them together into a single list...
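For example, with the data frame defined in the question:

# index_loc=9, number=2 collects rows 7, 8 and 9 (oldest first),
# including each row's Date Time value
features = collapse(data, 9, 2)
print(features)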
0
Here is some code that produces the result I was after. A couple of replies suggested looking at rolling_apply or rolling_window, but I never worked out how to use them.
import pandas as pd
import numpy as np
data = pd.DataFrame([
['12/02/2007 23:23:00', 1443.75, 1444.00, 1443.75, 1444.00],
['12/02/2007 23:25:00', 1444.00, 1444.00, 1444.00, 1444.00],
['12/02/2007 23:26:00', 1444.25, 1444.25, 1444.25, 1444.25],
['12/02/2007 23:27:00', 1444.25, 1444.25, 1444.25, 1444.25],
['12/02/2007 23:28:00', 1444.25, 1444.25, 1444.25, 1444.25],
['12/02/2007 23:29:00', 1444.25, 1444.25, 1444.00, 1444.00],
['12/02/2007 23:30:00', 1444.25, 1444.25, 1444.00, 1444.00],
['12/02/2007 23:31:00', 1444.25, 1444.25, 1443.75, 1444.00],
['12/02/2007 23:32:00', 1444.00, 1444.00, 1443.75, 1443.75],
['12/02/2007 23:33:00', 1444.00, 1444.00, 1443.50, 1443.50]
])
window_size = 6
# Prime the DataFrame using the date as the index
result = pd.DataFrame(
    [data.iloc[0:window_size, 1:].values.flatten()],
    [data.iloc[window_size - 1, 0]])
# iterate over the remaining rows, keeping the date in t[0]
for t in data.iloc[window_size:, :].itertuples(index=False):
    # drop the oldest values and append the new ones
    new_features = result.tail(1).iloc[:, 4:].values.flatten()
    new_features = np.append(new_features, list(t[1:]), 0)
    # turn it into a DataFrame and append it to the ongoing result
    new_df = pd.DataFrame([new_features], [t[0]])
    result = pd.concat([result, new_df])
This is probably not particularly efficient, but it solved my problem.
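If speed still matters, another route worth trying might be numpy's sliding_window_view (available in numpy 1.20+); a rough sketch that reuses data and window_size from above and keeps the same oldest-first ordering:

from numpy.lib.stride_tricks import sliding_window_view

# Each window is window_size consecutive rows of the four price columns,
# flattened oldest-first and indexed by the date of its newest row.
vals = data.iloc[:, 1:].to_numpy(dtype=float)             # shape (rows, 4)
windows = sliding_window_view(vals, window_size, axis=0)  # (rows - window_size + 1, 4, window_size)
flat = windows.transpose(0, 2, 1).reshape(len(windows), -1)
result = pd.DataFrame(flat, index=data.iloc[window_size - 1:, 0])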