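I am working on a program that handles a large amount of data, and I use the Python pandas module to look for errors in that data. This is usually very fast. However, the piece of code I have written now seems to be much slower than it should be, and I am looking for a way to speed it up.
So that you can test it properly, I have uploaded a fairly large piece of code. You should be able to run it as is. The comments in the code should explain what I am trying to do. Any help would be greatly appreciated.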
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
# Filling dataframe with data
# Just ignore this part for now, real data comes from csv files, this is an example of how it looks
TimeOfDay_options = ['Day','Evening','Night']
TypeOfCargo_options = ['Goods','Passengers']
np.random.seed(1234)
n = 10000
df = pd.DataFrame()
df['ID_number'] = np.random.randint(3, size=n)
df['TimeOfDay'] = np.random.choice(TimeOfDay_options, size=n)
df['TypeOfCargo'] = np.random.choice(TypeOfCargo_options, size=n)
df['TrackStart'] = np.random.randint(400, size=n) * 900
df['SectionStart'] = np.nan
df['SectionStop'] = np.nan
grouped_df = df.groupby(['ID_number','TimeOfDay','TypeOfCargo','TrackStart'])
for index, group in grouped_df:
    if len(group) == 1:
        df.loc[group.index,['SectionStart']] = group['TrackStart']
        df.loc[group.index,['SectionStop']] = group['TrackStart'] + 899
    if len(group) > 1:
        track_start = group.loc[group.index[0],'TrackStart']
        track_end = track_start + 899
        section_stops = np.random.randint(track_start, track_end, size=len(group))
        section_stops[-1] = track_end
        section_stops = np.sort(section_stops)
        section_starts = np.insert(section_stops, 0, track_start)
        for i,start,stop in zip(group.index,section_starts,section_stops):
            df.loc[i,['SectionStart']] = start
            df.loc[i,['SectionStop']] = stop
#%% This is what a random group looks like without errors
#Note that each section neatly starts where the previous section ended
#There are no gaps (The whole track is defined)
grouped_df.get_group((2, 'Night', 'Passengers', 323100))
#%% Introducing errors to the data
df.loc[2640,'SectionStart'] += 100
df.loc[5390,'SectionStart'] += 7
#%% This is what the same group looks like after introducing errors
#Note that the 'SectionStop' of row 1525 is no longer similar to the 'SectionStart' of row 2640
#This track now has a gap of 100, it is not completely defined from start to end
grouped_df.get_group((2, 'Night', 'Passengers', 323100))
#%% Try to locate the errors
#This is the part of the code I need to speed up
def Full_coverage(group):
    if len(group) > 1:
        #Sort the grouped data by column 'SectionStart' from low to high
        #Updated for newer pandas version
        #group.sort('SectionStart', ascending=True, inplace=True)
        group.sort_values('SectionStart', ascending=True, inplace=True)
        #Some initial values, overwritten at the end of each loop
        #These variables correspond to the first row of the group
        start_km = group.iloc[0,4]
        end_km = group.iloc[0,5]
        end_km_index = group.index[0]
        #Loop through all the rows in the group
        #index is the index of the row
        #i is the 'SectionStart' of the row
        #j is the 'SectionStop' of the row
        #The loop starts from the 2nd row in the group
        for index, (i, j) in group.iloc[1:,[4,5]].iterrows():
            #The start of the next row must be equal to the end of the previous row in the group
            if i != end_km:
                #Add the faulty data to the error list
                incomplete_coverage.append(('Expected startpoint: '+str(end_km)+' (row '+str(end_km_index)+')', \
                                            'Found startpoint: '+str(i)+' (row '+str(index)+')'))
            #Overwrite these values for the next loop
            start_km = i
            end_km = j
            end_km_index = index
    return group
#Check if the complete track is completely defined (from start to end) for each combination of:
#'ID_number','TimeOfDay','TypeOfCargo','TrackStart'
incomplete_coverage = [] #Create empty list for storing the error messages
df_grouped = df.groupby(['ID_number','TimeOfDay','TypeOfCargo','TrackStart']).apply(lambda x: Full_coverage(x))
#Print the error list
print('\nFound incomplete coverage in the following rows:')
for i,j in incomplete_coverage:
print(i)
print(j)
print()
#%%Time the procedure -- It is very slow, taking about 6.6 seconds on my pc
%timeit df.groupby(['ID_number','TimeOfDay','TypeOfCargo','TrackStart']).apply(lambda x: Full_coverage(x))
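The problem, I believe, is that your data has 5300 distinct groups, so anything slow inside your function gets magnified. You could probably save time by using vectorized operations instead of the for loop in your function, but a much simpler way is to return 0 instead of return group. When you return group, pandas actually creates a new data object that combines the sorted groups, which you do not appear to use. When you return 0, pandas combines 5300 zeros instead, which is much faster.
Just combining the unused results takes about 2.4 seconds; the rest of the time is the actual computation in the loop, which you should try to vectorize.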
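To make the "return 0" suggestion concrete, here is a minimal sketch of it, not the code from the original answer. The tiny data set, the `KEYS` list and the name `full_coverage_light` are made up for illustration; the check itself mirrors `Full_coverage` from the question, but looks columns up by name and returns a scalar so that pandas has nothing large to stitch back together.

```python
import pandas as pd

# Grouping columns from the question.
KEYS = ['ID_number', 'TimeOfDay', 'TypeOfCargo', 'TrackStart']

# Hypothetical sample: one track in three sections (with a gap of 100 before
# the last section) and a second track consisting of a single section.
df = pd.DataFrame({
    'ID_number':    [2, 2, 2, 1],
    'TimeOfDay':    ['Night', 'Night', 'Night', 'Day'],
    'TypeOfCargo':  ['Passengers', 'Passengers', 'Passengers', 'Goods'],
    'TrackStart':   [323100, 323100, 323100, 900],
    'SectionStart': [323400.0, 323100.0, 323800.0, 900.0],
    'SectionStop':  [323700.0, 323400.0, 323999.0, 1799.0],
})

incomplete_coverage = []  # filled as a side effect, as in the question

def full_coverage_light(group):
    """Same check as Full_coverage, but returns 0 instead of the group."""
    if len(group) > 1:
        # Sort a copy rather than sorting the group in place.
        group = group.sort_values('SectionStart')
        end_km = group['SectionStop'].iloc[0]
        end_km_index = group.index[0]
        for index, row in group.iloc[1:].iterrows():
            # Each section must start where the previous one ended.
            if row['SectionStart'] != end_km:
                incomplete_coverage.append(
                    (int(end_km_index), float(end_km),
                     int(index), float(row['SectionStart'])))
            end_km = row['SectionStop']
            end_km_index = index
    # A scalar per group: pandas only has to combine one value per group.
    return 0

result = df.groupby(KEYS).apply(full_coverage_light)
print(incomplete_coverage)
```

The returned `result` is just one zero per group and can be ignored; the errors are collected in `incomplete_coverage` as (previous row, expected start, row, found start) tuples. Recent pandas versions may emit a deprecation warning about `apply` operating on the grouping columns, which does not affect this check.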
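Edit:
By doing a quick vectorized check before the for loop and returning 0 instead of group, I got the time down to about 2 seconds, which is basically the cost of sorting each group.
Edit 2:
Putting my suggestions together means moving the sort outside the groupby and removing the unnecessary loop. For the given example, removing the loop does not make it much faster, but it will be faster if there are many incomplete entries.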
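I found that the pandas indexing commands (.loc and .iloc) were also slowing the process down. By moving the sort out of the loop and converting the data to NumPy arrays at the start of the function, I got a much faster result. I know the data is no longer a DataFrame, but the indices returned in the list can be used to look up the data in the original df.
If there is any way to speed this up further, I would really appreciate your help.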
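For readers who want to try the NumPy route, a rough sketch follows. It is not the poster's code: `find_gaps_numpy` and the sample data are hypothetical, and it goes one step further than described above by comparing neighbouring array elements directly instead of looping. It returns row labels that can be fed back into the original df.

```python
import numpy as np
import pandas as pd

KEYS = ['ID_number', 'TimeOfDay', 'TypeOfCargo', 'TrackStart']

def find_gaps_numpy(df):
    """Return (prev_row, expected_start, row, found_start) tuples for each gap."""
    # One sort up front, then work on plain arrays (no .loc/.iloc afterwards).
    ordered = df.sort_values(KEYS + ['SectionStart'])
    rows = ordered.index.to_numpy()
    starts = ordered['SectionStart'].to_numpy()
    stops = ordered['SectionStop'].to_numpy()
    # Integer id per group, so group boundaries are cheap to detect.
    group_ids = ordered.groupby(KEYS).ngroup().to_numpy()
    # Compare each row with the row directly before it in the sorted order.
    same_group = group_ids[1:] == group_ids[:-1]
    mismatch = starts[1:] != stops[:-1]
    bad = np.flatnonzero(same_group & mismatch) + 1
    return [(int(rows[k - 1]), float(stops[k - 1]), int(rows[k]), float(starts[k]))
            for k in bad]

# Hypothetical sample with a gap of 100 before the section in row 2.
sample = pd.DataFrame({
    'ID_number':    [2, 2, 2, 1],
    'TimeOfDay':    ['Night', 'Night', 'Night', 'Day'],
    'TypeOfCargo':  ['Passengers', 'Passengers', 'Passengers', 'Goods'],
    'TrackStart':   [323100, 323100, 323100, 900],
    'SectionStart': [323400.0, 323100.0, 323800.0, 900.0],
    'SectionStop':  [323700.0, 323400.0, 323999.0, 1799.0],
})
gaps = find_gaps_numpy(sample)
# The returned labels index straight back into the original frame.
faulty_rows = sample.loc[[g[2] for g in gaps]]
print(gaps)
```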