Vectorizing over multiple rows and columns of a DataFrame instead of a single row

0 votes
1 answer
34 views
Asked 2025-04-12 04:31

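I am building a CSV file of historical stock information that contains not only historical prices but also some momentum indicators. I managed to add these indicators by iterating over the entire DataFrame (more than 25 million rows), but the process was far too slow: it took 30 to 36 hours.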

What I want to do: first, compute the highest price over the next 3 days:

high = stock_df.loc[x+1:x+4, "High"].max(axis=0)

Then divide that high by the low on day 0:

low = stock_df.loc[x, "Low"]

I would like to avoid looping over everything like this:

stock_ticker_list = df.Symbol.unique()
for ticker in stock_ticker_list:
    # select the historical rows for one stock
    print(ticker)
    stock_df = df.loc[df.Symbol == ticker]
    start = stock_df.index[stock_df['Symbol'] == ticker][0]
    for x in range(start, start + len(stock_df) - 2):
        try:
            high = stock_df.loc[x+1:x+4, "High"].max(axis=0)  
            low =  stock_df.loc[x, "Low"] 
            df2.loc[x,"H/L"] = high/low
        except:
            df2.loc[x,"H/L"] = pd.NA

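I looked through the documentation and found methods such as pandas.Series.pct_change and pandas.Series.div, but these do not seem usable without first creating a 3-day-high column. I tried to create that column: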

s = stock_df["High"] 
stock_df['Three_day_high'] = max([s.diff(-1),s.diff(-2),s.diff(-3)]) + s

but got a ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
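For reference, the error comes from Python's built-in max(): it compares the Series objects with `>`, which yields a boolean Series, and then tries to reduce that Series to a single True/False. A minimal sketch of an element-wise alternative follows, assuming the goal is the highest "High" over the next three rows; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for one ticker's "High" column.
s = pd.Series([10.0, 12.0, 11.0, 15.0, 9.0, 8.0], name="High")

# shift(-k) aligns the value k rows ahead with the current row.
ahead = pd.concat([s.shift(-1), s.shift(-2), s.shift(-3)], axis=1)

# Row-wise maximum across the three shifted columns.
# skipna=False leaves NaN wherever fewer than three future rows exist.
three_day_high = ahead.max(axis=1, skipna=False)
```

Dropping skipna=False would instead take the maximum of whatever future rows are available near the end of the series.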

1 Answer

0

Consider windowing operations: https://pandas.pydata.org/pandas-docs/stable/user_guide/window.html

import pandas as pd
import numpy as np

# for testing, generate a 5 million row dataframe with random numbers in the range 500 to 1500
l1 = np.round(500 + 1000 * np.random.rand(5000000,1), 0)
df = pd.DataFrame(l1, columns = ["Val"])

# rolling window applied to the Val column
# closed='left' excludes the current row, so each window covers the 3 preceding rows
%timeit df["Val"].rolling(window=3).max() / df["Val"]  # optional timing (IPython/Jupyter magic, not plain Python)
df["H"] = df["Val"].rolling(window=3, closed='left').max()  # optional, shows the 3-day high
df["HL"] = df["Val"].rolling(window=3, closed='left').max() / df["Val"]
df[:7]

1.07 s ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
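Note that the rolling window above looks at preceding rows and runs over a single column, whereas the question asks for the high over the following days, per ticker, divided by the current day's low. A rough sketch of that variant is below. It assumes the columns are named Symbol, High and Low as in the question and that rows are already sorted by date within each ticker; the function name add_forward_high_low and the sample data are hypothetical. Also note that the question's slice .loc[x+1:x+4] is label-inclusive and so spans four rows, while this sketch defaults to n=3 as the text describes; adjust n if four rows are intended.

```python
import pandas as pd

def add_forward_high_low(df: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Return a copy of df with an "H/L" column:
    max High over the next n rows of the same Symbol / current Low."""
    out = df.copy()
    highs = out.groupby("Symbol", sort=False)["High"]
    # Shifting inside each group never reads values from another ticker.
    ahead = pd.concat([highs.shift(-k) for k in range(1, n + 1)], axis=1)
    # skipna=False gives NaN when fewer than n future rows exist for the ticker.
    out["H/L"] = ahead.max(axis=1, skipna=False) / out["Low"]
    return out

# Made-up data for two tickers, used only to exercise the function.
sample = pd.DataFrame({
    "Symbol": ["A"] * 5 + ["B"] * 4,
    "High": [10.0, 12.0, 11.0, 15.0, 9.0, 20.0, 22.0, 21.0, 25.0],
    "Low": [5.0, 6.0, 5.0, 10.0, 3.0, 10.0, 11.0, 10.0, 20.0],
})
result = add_forward_high_low(sample)
```

The last n rows of each ticker come out as NaN here, which differs from the original loop, where a partial slice still returns a maximum; removing skipna=False would mimic that behaviour.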
