Vectorizing over multiple rows and columns of a dataframe instead of a single row

0 votes

1 answer

34 views

Asked by 数据工程师 on 2025-04-12 04:31

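I am building a CSV file of historical stock information that should contain not only historical prices but also some momentum indicators. I have managed to add these indicators by iterating over the entire dataframe (more than 25 million rows), but that is far too slow: it took 30 to 36 hours.

What I want to do is first compute the highest high over 3 days: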

high = stock_df.loc[x+1:x+4, "High"].max(axis=0)

and then divide that high by the low of day 0:

low = stock_df.loc[x, "Low"]

I would like to avoid looping over everything like this:

stock_ticker_list = df.Symbol.unique()
for ticker in stock_ticker_list:
    # Select the rows holding the historical info for one stock
    print(ticker)
    stock_df = df.loc[df.Symbol == ticker]
    start = stock_df.index[stock_df['Symbol'] == ticker][0]
    for x in range(start, start + len(stock_df) - 2):
        try:
            high = stock_df.loc[x+1:x+4, "High"].max(axis=0)  
            low =  stock_df.loc[x, "Low"] 
            df2.loc[x,"H/L"] = high/low
        except:
            df2.loc[x,"H/L"] = pd.NA

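I looked through the documentation and found methods such as pandas.Series.pct_change and pandas.Series.div, but these functions do not seem usable without first creating a 3-day-high column. I tried to create such a column: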

s = stock_df["High"] 
stock_df['Three_day_high'] = max([s.diff(-1),s.diff(-2),s.diff(-3)]) + s

but got a ValueError (ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().)

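Tags: performance optimization, pandas, dataframe, vectorization, momentum indicators, historical stock data, high, low

1 Answer

Consider window operations: https://pandas.pydata.org/pandas-docs/stable/user_guide/window.html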

import pandas as pd
import numpy as np

# For testing: a 5-million-row dataframe of random values in the range 500 to 1500
l1 = np.round(500 + 1000 * np.random.rand(5000000, 1), 0)
df = pd.DataFrame(l1, columns=["Val"])

# Rolling window applied to the Val column
%timeit df["Val"].rolling(window=3).max() / df["Val"]  # optional timing; %timeit is an IPython/Jupyter magic
# closed='left' excludes the current row, so each window covers the 3 preceding rows
df["H"] = df["Val"].rolling(window=3, closed='left').max()  # optional, shows the 3-day high
df["HL"] = df["Val"].rolling(window=3, closed='left').max() / df["Val"]
df[:7]

1.07 s ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
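Note that the rolling windows above look backward (the rows before the current one), while the question asks for the high of the following three days, computed separately per ticker. A minimal sketch of one way to do that is below. It assumes the rows are already sorted by date within each Symbol and that each row is one trading day. The function name add_forward_high_low and the days parameter are made up for illustration, and the column names Symbol, High, Low, Three_day_high and H/L are taken from the question. It uses three rows ahead, as described in the question's text; the loop in the question slices .loc[x+1:x+4], which is label-inclusive and may actually span four rows, so adjust days if that was intended.

```python
import pandas as pd


def add_forward_high_low(df: pd.DataFrame, days: int = 3) -> pd.DataFrame:
    """Return a copy of df with the forward-looking high and the H/L ratio added.

    Hypothetical helper: assumes rows are sorted by date within each Symbol.
    """
    out = df.copy()
    highs = out.groupby("Symbol", sort=False)["High"]

    # shift(-k) inside each group lines up the High from k rows ahead with the
    # current row, and never reaches into the next ticker's rows.
    future_highs = pd.concat(
        [highs.shift(-k) for k in range(1, days + 1)],
        axis=1,
        keys=range(1, days + 1),
    )

    # The row-wise max skips NaN, so the last rows of each ticker use a shorter
    # window, and the final row of each ticker (no future data) stays NaN.
    out["Three_day_high"] = future_highs.max(axis=1)
    out["H/L"] = out["Three_day_high"] / out["Low"]
    return out


# Tiny made-up dataset, only to show the call
demo = pd.DataFrame(
    {
        "Symbol": ["A"] * 5 + ["B"] * 3,
        "High": [10, 12, 11, 15, 9, 20, 18, 25],
        "Low": [8, 9, 10, 11, 7, 16, 15, 20],
    }
)
result = add_forward_high_low(demo)
print(result)
```

This is also roughly what the Three_day_high attempt in the question was reaching for: Python's built-in max tries to compare whole Series with each other, which raises the "truth value of a Series is ambiguous" error, whereas a row-wise max over the shifted columns stays element-wise.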

Answered 2025-04-12 by Python大师
