Vectorizing over multiple rows and columns of a DataFrame instead of a single row
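I am building a csv file of historical stock information that needs to contain not only historical prices but also some momentum indicators. I have managed to add the indicators by looping over the entire dataframe (more than 25 million rows), but this is far too slow: it took 30 to 36 hours.
What I want to do is first compute the highest price over 3 days: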
high = stock_df.loc[x+1:x+4, "High"].max(axis=0)
and then divide that high by the low of day 0:
low = stock_df.loc[x, "Low"]
I would like to avoid looping over everything the way I do below:
stock_ticker_list = df.Symbol.unique()
for ticker in stock_ticker_list:
    # return the dataframe that holds the historical info for one stock
    print(ticker)
    stock_df = df.loc[df.Symbol == ticker]
    start = stock_df.index[stock_df['Symbol'] == ticker][0]
    for x in range(start, start + len(stock_df) - 2):
        try:
            high = stock_df.loc[x+1:x+4, "High"].max(axis=0)
            low = stock_df.loc[x, "Low"]
            df2.loc[x,"H/L"] = high/low
        except:
            df2.loc[x,"H/L"] = pd.NA
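I looked through the documentation and found methods such as pandas.Series.pct_change and pandas.Series.div, but these functions do not seem usable without first creating a 3-day-high column. I tried to create such a column: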
s = stock_df["High"]
stock_df['Three_day_high'] = max([s.diff(-1),s.diff(-2),s.diff(-3)]) + s
but ran into a ValueError (ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().)
1 Answer
Consider window operations: https://pandas.pydata.org/pandas-docs/stable/user_guide/window.html
import pandas as pd
import numpy as np
# for testing, generate a 5 million row dataframe with random numbers in the range 500 to 1500
l1 = np.round(500 + 1000 * np.random.rand(5000000,1), 0)
df = pd.DataFrame(l1, columns = ["Val"])
# rolling window applied to the Val column
%timeit df["Val"].rolling(window=3).max() / df["Val"] # optional timing; %timeit is an IPython/Jupyter magic, drop it in a plain script
# closed='left' excludes the current row, so each window covers the 3 preceding rows
df["H"] = df["Val"].rolling(window=3, closed = 'left').max() # optional to show 3 day high
df["HL"] = df["Val"].rolling(window=3, closed = 'left').max() / df["Val"]
df[:7]
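One caveat: rolling with closed='left' looks back at the three preceding rows, while the loop in the question looks forward at the rows after day 0 and never mixes tickers. If that forward-looking, per-ticker behaviour is what you need, the following is a minimal sketch of one way to get it with grouped shifts. The toy data, the N_DAYS constant and the Fwd_high column name are made up for illustration; only Symbol, High, Low and H/L come from the question.

```python
import pandas as pd

# Hypothetical toy data: two tickers with five rows each
df = pd.DataFrame({
    "Symbol": ["AAA"] * 5 + ["BBB"] * 5,
    "High": [10.0, 12.0, 11.0, 15.0, 13.0, 20.0, 19.0, 22.0, 21.0, 18.0],
    "Low": [9.0, 10.0, 10.0, 12.0, 11.0, 18.0, 17.0, 19.0, 20.0, 16.0],
})

N_DAYS = 3  # assumed window: days 1..3 after the current row

# Shift High upwards by 1..N_DAYS rows *within each ticker*, so the last
# rows of one stock never pick up prices from the next stock.
high_by_ticker = df.groupby("Symbol", sort=False)["High"]
offsets = range(1, N_DAYS + 1)
future_highs = pd.concat(
    [high_by_ticker.shift(-k) for k in offsets], axis=1, keys=offsets
)

# Element-wise (row-wise) maximum across the shifted columns. NaN values are
# skipped, so rows near the end of a ticker use a shorter window, and the
# last row of each ticker stays NaN.
df["Fwd_high"] = future_highs.max(axis=1)
df["H/L"] = df["Fwd_high"] / df["Low"]

print(df)
```

The row-wise max over shifted copies is also the element-wise version of what the max([...]) attempt in the question was aiming for; the built-in max compares whole Series with each other, which is what triggers the ValueError. Two things to check against your own data: .loc slices by label and includes both endpoints, so x+1:x+4 in the loop actually covers four rows (set N_DAYS = 4 to reproduce that), and if you only want values where the full window is available you can pass skipna=False to max.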