How to replace lambda and grouping on a DataFrame to improve performance

Posted 2024-03-29 06:55:04


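My question may look complicated, but it is simple at heart. I am new to Python and my code is too slow. Below is the optimized version of the code. I would appreciate a short code review and suggestions on how to speed it up. I think the slowest operations are .apply(lambda ...) and the grouping, but I do not know how to replace them.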

...
for raw_file in raw_files:
    reader = pd.read_csv(raw_file, chunksize=100000)
    for chunk in reader:
        processed_data = task(chunk)
        for name, data in processed_data:
            save_data(name, data) # some method which saves DataFrame correctly
...


def task(data):
    data = data[data['Quantity'] != 0] # remove zero items
    # add date parts as columns
    data[['dt_year', 'dt_month', 'dt_day', 'dt_day_of_year', 'dt_day_of_week', 'dt_hour']] = \
                data.apply(lambda df: to_date_parts(df['SalesDate']), axis=1)
    # group by location-item to aggregate in different files
    grouped = data.groupby(['LocationID','ItemID'])
    result = []
    for name, group in grouped:
        result += [(name, group)]
    return result



def to_date_parts(str_date):
    date = dt.datetime.strptime(str_date.split(".")[0], '%Y-%m-%d %H:%M:%S')
    dt_year = date.year
    dt_month = date.month
    dt_day = date.day
    dt_day_of_year = date.toordinal() - dt.datetime(date.year, 1, 1).toordinal() + 1
    dt_day_of_week = date.weekday()
    dt_hour = date.hour
    return pd.Series([dt_year, dt_month, dt_day, dt_day_of_year, dt_day_of_week, dt_hour])

1 answer
User
#1 · Posted 2024-03-29 06:55:04

Python datetime vs. Pandas datetime

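There are two related reasons for the poor performance you are seeing:

  1. Using Python's built-in datetime objects instead of an efficient Pandas datetime series to store the dates.
  2. Using a Python-level for loop instead of the vectorized operations that a Pandas datetime series supports.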

So, first convert your series to a Pandas datetime series:

date_format = '%Y-%m-%d %H:%M:%S'
df['SalesDate'] = pd.to_datetime(df['SalesDate'], format=date_format, errors='coerce')
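
One caveat: `to_date_parts` in the question strips fractional seconds with `split(".")[0]` before parsing. With a fixed format and `errors='coerce'`, values that still carry a fractional part may fail to match and be coerced to `NaT`, depending on the pandas version. A minimal sketch of trimming them first, assuming `SalesDate` holds strings (the sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample values; the real column comes from the CSV chunks.
df = pd.DataFrame({'SalesDate': ['2017-03-05 14:30:00.123', '2017-12-31 23:59:59']})

date_format = '%Y-%m-%d %H:%M:%S'

# Keep only the text before the first '.', mirroring split(".")[0] in the question.
trimmed = df['SalesDate'].str.split('.').str[0]

# Parse the whole column in one vectorized call.
df['SalesDate'] = pd.to_datetime(trimmed, format=date_format, errors='coerce')
```

Checking `df['SalesDate'].isna().sum()` afterwards is a cheap way to see whether any rows failed to parse.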

Then extract the attributes directly from the series:

from operator import attrgetter

# list attributes
fields = ['year', 'month', 'day', 'dayofyear', 'dayofweek', 'hour']

# extract attributes
attributes = pd.concat(attrgetter(*fields)(df['SalesDate'].dt), axis=1, keys=fields)

# join attributes to dataframe
df = df.join(attributes)
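
If the column names from the question (`dt_year`, `dt_day_of_week`, and so on) need to be preserved, the same attributes can also be assigned one at a time. This is only a sketch with made-up sample data; `dayofyear` and `dayofweek` (Monday = 0) correspond to the `toordinal()` arithmetic and the `weekday()` call in `to_date_parts`:

```python
import pandas as pd

# Hypothetical sample data, already converted to a datetime series.
df = pd.DataFrame({
    'SalesDate': pd.to_datetime(['2017-03-05 14:30:00', '2017-12-31 23:59:59'])
})

# Map the question's column names to the matching .dt attributes.
columns = {
    'dt_year': 'year',
    'dt_month': 'month',
    'dt_day': 'day',
    'dt_day_of_year': 'dayofyear',
    'dt_day_of_week': 'dayofweek',
    'dt_hour': 'hour',
}

# Each assignment is a vectorized operation over the whole column.
for column, attribute in columns.items():
    df[column] = getattr(df['SalesDate'].dt, attribute)
```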

Pandas GroupBy objects

Appending the items to a list is unnecessary:

grouped = data.groupby(['LocationID','ItemID'])
result = []
for name, group in grouped:
    result += [(name, group)]
return result

Because data.groupby(...) is an iterable, you can simply return the GroupBy object:

return data.groupby(['LocationID','ItemID'])
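
Putting both parts together, `task` from the question might look roughly like the following. Treat it as a sketch rather than a drop-in replacement: it assumes the column names from the question, the sample chunk is invented for illustration, and the new columns are named `dt_` plus the pandas attribute name (for example `dt_dayofyear`), which differs slightly from the original names. The `.copy()` is an addition so that the new columns are written to an independent frame rather than to a filtered slice of the chunk.

```python
import pandas as pd

FIELDS = ['year', 'month', 'day', 'dayofyear', 'dayofweek', 'hour']


def task(data):
    # Remove zero items; copy so the assignments below do not touch the input chunk.
    data = data[data['Quantity'] != 0].copy()
    # Parse the dates once for the whole chunk instead of once per row.
    data['SalesDate'] = pd.to_datetime(
        data['SalesDate'], format='%Y-%m-%d %H:%M:%S', errors='coerce')
    # Add date parts as columns, one vectorized assignment per attribute.
    for field in FIELDS:
        data['dt_' + field] = getattr(data['SalesDate'].dt, field)
    # The GroupBy object is iterable, so it can be returned directly.
    return data.groupby(['LocationID', 'ItemID'])


# Hypothetical chunk standing in for one piece of pd.read_csv(..., chunksize=...).
chunk = pd.DataFrame({
    'LocationID': [1, 1, 2, 2],
    'ItemID': [10, 10, 20, 20],
    'Quantity': [3, 1, 5, 0],
    'SalesDate': ['2017-03-05 14:30:00', '2017-03-06 09:00:00',
                  '2017-12-31 23:59:59', '2017-01-01 00:00:00'],
})

# The caller's loop from the question works unchanged on the GroupBy object.
for name, group in task(chunk):
    print(name, len(group))
```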
