Combining multiple rows into a single array in a pandas DataFrame
Suppose I have a DataFrame that looks like this:
In [41]: df.columns
Out[41]: Index([u'Date Time', u'Open', u'High', u'Low', u'Last'], dtype='object')
In [42]: df
Out[42]:
Date Time Open High Low Last
0 12/02/2007 23:23:00 1443.75 1444.00 1443.75 1444.00
1 12/02/2007 23:25:00 1444.00 1444.00 1444.00 1444.00
2 12/02/2007 23:26:00 1444.25 1444.25 1444.25 1444.25
3 12/02/2007 23:27:00 1444.25 1444.25 1444.25 1444.25
4 12/02/2007 23:28:00 1444.25 1444.25 1444.25 1444.25
5 12/02/2007 23:29:00 1444.25 1444.25 1444.00 1444.00
6 12/02/2007 23:30:00 1444.25 1444.25 1444.00 1444.00
7 12/02/2007 23:31:00 1444.25 1444.25 1443.75 1444.00
8 12/02/2007 23:32:00 1444.00 1444.00 1443.75 1443.75
9 12/02/2007 23:33:00 1444.00 1444.00 1443.50 1443.50
I want to create an array that pairs the current row's Date Time with the other columns of that row and of the preceding n rows. For example, with index 9 and n = 2, the target result would take these rows:
7 12/02/2007 23:31:00 1444.25 1444.25 1443.75 1444.00
8 12/02/2007 23:32:00 1444.00 1444.00 1443.75 1443.75
9 12/02/2007 23:33:00 1444.00 1444.00 1443.50 1443.50
and turn them into a list whose values 1 to 4 come from row 9, 5 to 8 from row 8, and 9 to 12 from row 7:
['12/02/2007 23:33:00', 1444.00, 1444.00, 1443.50, 1443.50, 1444.00, 1444.00, 1443.75, 1443.75, 1444.25, 1444.25, 1443.75, 1444.00]
I'm sure I could iterate over slices of the DataFrame to build this array, but I'm hoping there is a more efficient way to do it.
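As a baseline, iterating over a slice for each row might look roughly like this (a minimal sketch; window_features is a hypothetical helper, and the newest-row-first ordering follows the example above):

# Minimal sketch of the slice-per-row idea; assumes an integer index and
# the newest-row-first ordering shown in the example output.
def window_features(df, idx, n):
    rows = df.iloc[idx - n: idx + 1]          # rows idx-n .. idx
    out = [df.iloc[idx, 0]]                   # Date Time of the newest row
    for _, row in rows[::-1].iterrows():      # newest row first
        out.extend(row.iloc[1:].tolist())     # Open, High, Low, Last
    return out

# window_features(df, 9, 2) should reproduce the list above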
Edit:
Here is some code that produces the result I'm after. A couple of replies suggested looking at rolling_apply or rolling_window, but I haven't been able to work out how to use them.
import pandas as pd
import numpy as np
data = pd.DataFrame([
['12/02/2007 23:23:00', 1443.75, 1444.00, 1443.75, 1444.00],
['12/02/2007 23:25:00', 1444.00, 1444.00, 1444.00, 1444.00],
['12/02/2007 23:26:00', 1444.25, 1444.25, 1444.25, 1444.25],
['12/02/2007 23:27:00', 1444.25, 1444.25, 1444.25, 1444.25],
['12/02/2007 23:28:00', 1444.25, 1444.25, 1444.25, 1444.25],
['12/02/2007 23:29:00', 1444.25, 1444.25, 1444.00, 1444.00],
['12/02/2007 23:30:00', 1444.25, 1444.25, 1444.00, 1444.00],
['12/02/2007 23:31:00', 1444.25, 1444.25, 1443.75, 1444.00],
['12/02/2007 23:32:00', 1444.00, 1444.00, 1443.75, 1443.75],
['12/02/2007 23:33:00', 1444.00, 1444.00, 1443.50, 1443.50]
])
window_size = 6
# Prime the DataFrame using the date as the index
result = pd.DataFrame(
    [data.iloc[0:window_size, 1:].values.flatten()],
    [data.iloc[window_size - 1, 0]])
# iterate over the remaining rows, keeping the date in t[0]
for t in data.iloc[window_size:, :].itertuples(index=False):
    # drop the oldest values and append the new ones
    new_features = result.tail(1).iloc[:, 4:].values.flatten()
    new_features = np.append(new_features, list(t[1:]), 0)
    # turn it into a DataFrame and append it to the ongoing result
    new_df = pd.DataFrame([new_features], [t[0]])
    result = pd.concat([result, new_df])
This approach isn't very fast, so I'm still looking for ways to improve it.
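One direction that may avoid the Python loop entirely is to build the whole wide table from shifted copies of the price columns; a rough sketch, assuming the data and window_size defined above and keeping the same oldest-first ordering the priming step uses:

# Build every window at once from shifted copies of the four price columns.
values = data.iloc[:, 1:]
shifted = [values.shift(i) for i in range(window_size - 1, -1, -1)]
wide = pd.concat(shifted, axis=1)            # oldest row's values first
wide.columns = range(wide.shape[1])          # 0 .. window_size * 4 - 1
wide.index = data.iloc[:, 0]                 # date of the newest row in each window
result = wide.dropna()                       # rows without a full window fall out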
2 Answers
0
This simple function works for me:
import itertools

def collapse(df, index_loc, number):
    # chain the values of rows index_loc - number .. index_loc into one flat list
    return list(itertools.chain(*[list(df.loc[x].values)
                                  for x in range(index_loc - number, index_loc + 1)]))
Here df is your DataFrame, index_loc is the starting index (this assumes an integer index, as in your example) and number is your 'n'. It uses the values attribute to grab the values at each index location and chains them together into a single list...
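For example, with the data frame defined in the question:

# index_loc=9, number=2 collects rows 7, 8 and 9 (oldest first),
# including each row's Date Time value
features = collapse(data, 9, 2)
print(features)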
0
Here is some code that produces the result I was after. A couple of replies suggested looking at rolling_apply or rolling_window, but I never worked out how to use them.
import pandas as pd
import numpy as np
data = pd.DataFrame([
['12/02/2007 23:23:00', 1443.75, 1444.00, 1443.75, 1444.00],
['12/02/2007 23:25:00', 1444.00, 1444.00, 1444.00, 1444.00],
['12/02/2007 23:26:00', 1444.25, 1444.25, 1444.25, 1444.25],
['12/02/2007 23:27:00', 1444.25, 1444.25, 1444.25, 1444.25],
['12/02/2007 23:28:00', 1444.25, 1444.25, 1444.25, 1444.25],
['12/02/2007 23:29:00', 1444.25, 1444.25, 1444.00, 1444.00],
['12/02/2007 23:30:00', 1444.25, 1444.25, 1444.00, 1444.00],
['12/02/2007 23:31:00', 1444.25, 1444.25, 1443.75, 1444.00],
['12/02/2007 23:32:00', 1444.00, 1444.00, 1443.75, 1443.75],
['12/02/2007 23:33:00', 1444.00, 1444.00, 1443.50, 1443.50]
])
window_size = 6
# Prime the DataFrame using the date as the index
result = pd.DataFrame(
    [data.iloc[0:window_size, 1:].values.flatten()],
    [data.iloc[window_size - 1, 0]])
# iterate over the remaining rows, keeping the date in t[0]
for t in data.iloc[window_size:, :].itertuples(index=False):
    # drop the oldest values and append the new ones
    new_features = result.tail(1).iloc[:, 4:].values.flatten()
    new_features = np.append(new_features, list(t[1:]), 0)
    # turn it into a DataFrame and append it to the ongoing result
    new_df = pd.DataFrame([new_features], [t[0]])
    result = pd.concat([result, new_df])
This is probably not particularly efficient, but it solved my problem.
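If speed still matters, another route worth trying might be numpy's sliding_window_view (available in numpy 1.20+); a rough sketch that reuses data and window_size from above and keeps the same oldest-first ordering:

from numpy.lib.stride_tricks import sliding_window_view

# Each window is window_size consecutive rows of the four price columns,
# flattened oldest-first and indexed by the date of its newest row.
vals = data.iloc[:, 1:].to_numpy(dtype=float)             # shape (rows, 4)
windows = sliding_window_view(vals, window_size, axis=0)  # (rows - window_size + 1, 4, window_size)
flat = windows.transpose(0, 2, 1).reshape(len(windows), -1)
result = pd.DataFrame(flat, index=data.iloc[window_size - 1:, 0])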