Combining multiple rows into one array in a pandas DataFrame

-1 votes
2 answers
1857 views
Asked 2025-04-18 10:09

Suppose I have a DataFrame that looks like this:

In [41]: df.columns
Out[41]: Index([u'Date Time', u'Open', u'High', u'Low', u'Last'], dtype='object')

In [42]: df
Out[42]: 
              Date Time     Open     High      Low     Last
0   12/02/2007 23:23:00  1443.75  1444.00  1443.75  1444.00
1   12/02/2007 23:25:00  1444.00  1444.00  1444.00  1444.00
2   12/02/2007 23:26:00  1444.25  1444.25  1444.25  1444.25
3   12/02/2007 23:27:00  1444.25  1444.25  1444.25  1444.25
4   12/02/2007 23:28:00  1444.25  1444.25  1444.25  1444.25
5   12/02/2007 23:29:00  1444.25  1444.25  1444.00  1444.00
6   12/02/2007 23:30:00  1444.25  1444.25  1444.00  1444.00
7   12/02/2007 23:31:00  1444.25  1444.25  1443.75  1444.00
8   12/02/2007 23:32:00  1444.00  1444.00  1443.75  1443.75
9   12/02/2007 23:33:00  1444.00  1444.00  1443.50  1443.50

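I want to build an array that pairs the "Date Time" of the current row with the other columns of that row and of the n rows before it. For example, at index 9 with n = 2, the target result takes these rows: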

7   12/02/2007 23:31:00  1444.25  1444.25  1443.75  1444.00
8   12/02/2007 23:32:00  1444.00  1444.00  1443.75  1443.75
9   12/02/2007 23:33:00  1444.00  1444.00  1443.50  1443.50

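and turns them into a list laid out as follows: index 0 is the Date Time of row 9, indices 1 to 4 come from row 9, 5 to 8 from row 8, and 9 to 12 from row 7: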

['12/02/2007 23:33:00', 1444.00, 1444.00, 1443.50, 1443.50, 1444.00, 1444.00, 1443.75, 1443.75, 1444.25, 1444.25, 1443.75, 1444.00]
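
For a single row, that layout can be sketched as below. This is only an illustration: `row_with_history` is a hypothetical helper name, the three-row frame is just rows 7 to 9 of the sample, and the sketch assumes a unique index.

```python
import pandas as pd

# Rows 7 to 9 of the sample data, keeping their original index labels.
df = pd.DataFrame(
    [
        ["12/02/2007 23:31:00", 1444.25, 1444.25, 1443.75, 1444.00],
        ["12/02/2007 23:32:00", 1444.00, 1444.00, 1443.75, 1443.75],
        ["12/02/2007 23:33:00", 1444.00, 1444.00, 1443.50, 1443.50],
    ],
    columns=["Date Time", "Open", "High", "Low", "Last"],
    index=[7, 8, 9],
)


def row_with_history(frame, label, n):
    """Date Time of row `label`, then the prices of that row and the n rows
    before it, newest row first (hypothetical helper, unique index assumed)."""
    pos = frame.index.get_loc(label)
    if pos < n:
        raise ValueError("not enough earlier rows for this window")
    window = frame.iloc[pos - n : pos + 1]
    # Drop the date column and reverse the row order so the newest row leads.
    prices = window.drop(columns="Date Time").to_numpy()[::-1]
    return [frame["Date Time"].iloc[pos]] + prices.ravel().tolist()


target = row_with_history(df, 9, 2)
```

Here `target` holds 13 items: the date of row 9 followed by the four price columns of rows 9, 8 and 7, in that order.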

I'm sure I could build this array easily by iterating over slices of the DataFrame, but I'm hoping there is a more efficient way to do it.

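Addendum:

Here is some code that produces the result I want. A few replies suggested looking at the rolling_apply or rolling_window functions, but I haven't figured out how they work.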

import pandas as pd
import numpy as np

data = pd.DataFrame([
    ['12/02/2007 23:23:00', 1443.75,  1444.00, 1443.75, 1444.00],
    ['12/02/2007 23:25:00', 1444.00,  1444.00, 1444.00, 1444.00],
    ['12/02/2007 23:26:00', 1444.25,  1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:27:00', 1444.25,  1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:28:00', 1444.25,  1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:29:00', 1444.25,  1444.25, 1444.00, 1444.00],
    ['12/02/2007 23:30:00', 1444.25,  1444.25, 1444.00, 1444.00],
    ['12/02/2007 23:31:00', 1444.25,  1444.25, 1443.75, 1444.00],
    ['12/02/2007 23:32:00', 1444.00,  1444.00, 1443.75, 1443.75],
    ['12/02/2007 23:33:00', 1444.00,  1444.00, 1443.50, 1443.50]
])

window_size = 6

# Prime the DataFrame using the date as the index
result = pd.DataFrame(
    [data.iloc[0:window_size, 1:].values.flatten()],
    [data.iloc[window_size - 1, 0]])

for t in data.iloc[window_size:, 1:].itertuples(index=True):
    # drop the oldest values and append the new ones
    new_features = result.tail(1).iloc[:, 4:].values.flatten()
    new_features = np.append(new_features, list(t[1:]), 0)
    # turn it into a DataFrame indexed by this row's date (column 0 of data),
    # matching the primed first row, and append it to the ongoing result
    new_df = pd.DataFrame([new_features], [data.loc[t[0], 0]])
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    result = pd.concat([result, new_df])

This approach isn't very fast, so I'm still hoping to find a way to improve it.

2 Answers

0

This simple function works for me:

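import itertools
def collapse(df, index_loc, number):
    # range replaces Python 2's xrange; rows run from index_loc - number through index_loc
    rows = (df.loc[x].values for x in range(index_loc - number, index_loc + 1))
    return list(itertools.chain.from_iterable(rows))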

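Here df is your DataFrame, index_loc is the index of the current row, i.e. the last row of the window (assuming an integer index, as in the example), and number is your 'n'. The function uses the values attribute to get the values at each index label and then chains them into one list. Note that each row's values include its Date Time, and the rows come out oldest first.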
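
As a rough usage sketch, here is the same idea under Python 3 on rows 7 to 9 of the question's sample. The small frame and the `flat` variable are assumptions made for illustration, and `collapse` is rewritten with `range` and `chain.from_iterable`.

```python
import itertools

import pandas as pd

# Rows 7 to 9 of the sample data, keeping their original integer labels.
df = pd.DataFrame(
    [
        ["12/02/2007 23:31:00", 1444.25, 1444.25, 1443.75, 1444.00],
        ["12/02/2007 23:32:00", 1444.00, 1444.00, 1443.75, 1443.75],
        ["12/02/2007 23:33:00", 1444.00, 1444.00, 1443.50, 1443.50],
    ],
    columns=["Date Time", "Open", "High", "Low", "Last"],
    index=[7, 8, 9],
)


def collapse(df, index_loc, number):
    # Walk the labels index_loc - number .. index_loc and chain every row's values.
    rows = (df.loc[x].tolist() for x in range(index_loc - number, index_loc + 1))
    return list(itertools.chain.from_iterable(rows))


flat = collapse(df, 9, 2)
```

With n = 2 this gives 15 items: three rows of five values each, starting with the Date Time of row 7. Getting the exact layout from the question would still require dropping the extra dates and reversing the row order.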

0

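Here is some code that produces the result I want. A few replies mentioned that I could look at the rolling_apply or rolling_window functions, but I couldn't figure out how they work.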

import pandas as pd
import numpy as np

data = pd.DataFrame([
    ['12/02/2007 23:23:00', 1443.75,  1444.00, 1443.75, 1444.00],
    ['12/02/2007 23:25:00', 1444.00,  1444.00, 1444.00, 1444.00],
    ['12/02/2007 23:26:00', 1444.25,  1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:27:00', 1444.25,  1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:28:00', 1444.25,  1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:29:00', 1444.25,  1444.25, 1444.00, 1444.00],
    ['12/02/2007 23:30:00', 1444.25,  1444.25, 1444.00, 1444.00],
    ['12/02/2007 23:31:00', 1444.25,  1444.25, 1443.75, 1444.00],
    ['12/02/2007 23:32:00', 1444.00,  1444.00, 1443.75, 1443.75],
    ['12/02/2007 23:33:00', 1444.00,  1444.00, 1443.50, 1443.50]
])

window_size = 6

# Prime the DataFrame using the date as the index
result = pd.DataFrame(
    [data.iloc[0:window_size, 1:].values.flatten()],
    [data.iloc[window_size - 1, 0]])

for t in data.iloc[window_size:, 1:].itertuples(index=True):
    # drop the oldest values and append the new ones
    new_features = result.tail(1).iloc[:, 4:].values.flatten()
    new_features = np.append(new_features, list(t[1:]), 0)
    # turn it into a DataFrame indexed by this row's date (column 0 of data),
    # matching the primed first row, and append it to the ongoing result
    new_df = pd.DataFrame([new_features], [data.loc[t[0], 0]])
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    result = pd.concat([result, new_df])

This may not be especially efficient, but it solves my problem.
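
The loop above grows the result one row at a time. One possible faster alternative, sketched here on the assumption that NumPy 1.20 or later is available, is to build all windows at once with `numpy.lib.stride_tricks.sliding_window_view`. `flatten_windows` is a hypothetical helper name, and the sketch assumes the first column holds the date and the remaining columns are numeric.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

data = pd.DataFrame([
    ['12/02/2007 23:23:00', 1443.75, 1444.00, 1443.75, 1444.00],
    ['12/02/2007 23:25:00', 1444.00, 1444.00, 1444.00, 1444.00],
    ['12/02/2007 23:26:00', 1444.25, 1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:27:00', 1444.25, 1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:28:00', 1444.25, 1444.25, 1444.25, 1444.25],
    ['12/02/2007 23:29:00', 1444.25, 1444.25, 1444.00, 1444.00],
    ['12/02/2007 23:30:00', 1444.25, 1444.25, 1444.00, 1444.00],
    ['12/02/2007 23:31:00', 1444.25, 1444.25, 1443.75, 1444.00],
    ['12/02/2007 23:32:00', 1444.00, 1444.00, 1443.75, 1443.75],
    ['12/02/2007 23:33:00', 1444.00, 1444.00, 1443.50, 1443.50],
])


def flatten_windows(frame, window_size):
    """One row per full window: the values of `window_size` consecutive rows,
    oldest first, indexed by the date of the newest row (hypothetical helper)."""
    values = frame.iloc[:, 1:].to_numpy(dtype=float)  # shape (rows, 4)
    # The window axis is appended last: shape (rows - window_size + 1, 4, window_size)
    windows = sliding_window_view(values, window_size, axis=0)
    # Put the window axis before the column axis, then flatten each window.
    flat = windows.transpose(0, 2, 1).reshape(len(windows), -1)
    dates = frame.iloc[window_size - 1:, 0].to_numpy()
    return pd.DataFrame(flat, index=dates)


result = flatten_windows(data, 6)
```

With `window_size = 6` the sample yields 5 rows of 24 values each, laid out oldest row first like the loop's output. The reshape copies the data, so memory use grows with the window size.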
