Pandas groupby apply performance: question and answer



I'm working on a program that handles a large amount of data, and I use the Python pandas module to look for errors in that data. That is usually fast. However, the piece of code I have written now seems to be much slower, and I'm looking for a way to speed it up.

So that you can test it properly, I have posted a fairly large piece of code. You should be able to run it as is. The comments in the code should explain what I'm trying to do. Any help would be greatly appreciated.

<pre><code># -*- coding: utf-8 -*-
import pandas as pd
import numpy as np

# Filling dataframe with data
# Just ignore this part for now, real data comes from csv files,
# this is an example of how it looks
TimeOfDay_options = ['Day', 'Evening', 'Night']
TypeOfCargo_options = ['Goods', 'Passengers']
np.random.seed(1234)
n = 10000

df = pd.DataFrame()
df['ID_number'] = np.random.randint(3, size=n)
df['TimeOfDay'] = np.random.choice(TimeOfDay_options, size=n)
df['TypeOfCargo'] = np.random.choice(TypeOfCargo_options, size=n)
df['TrackStart'] = np.random.randint(400, size=n) * 900
df['SectionStart'] = np.nan
df['SectionStop'] = np.nan

grouped_df = df.groupby(['ID_number', 'TimeOfDay', 'TypeOfCargo', 'TrackStart'])
for index, group in grouped_df:
    if len(group) == 1:
        df.loc[group.index, ['SectionStart']] = group['TrackStart']
        df.loc[group.index, ['SectionStop']] = group['TrackStart'] + 899
    if len(group) > 1:
        track_start = group.loc[group.index[0], 'TrackStart']
        track_end = track_start + 899
        section_stops = np.random.randint(track_start, track_end, size=len(group))
        section_stops[-1] = track_end
        section_stops = np.sort(section_stops)
        section_starts = np.insert(section_stops, 0, track_start)
        for i, start, stop in zip(group.index, section_starts, section_stops):
            df.loc[i, ['SectionStart']] = start
            df.loc[i, ['SectionStop']] = stop

#%% This is what a random group looks like without errors
# Note that each section neatly starts where the previous section ended
# There are no gaps (the whole track is defined)
grouped_df.get_group((2, 'Night', 'Passengers', 323100))

#%% Introducing errors to the data
df.loc[2640, 'SectionStart'] += 100
df.loc[5390, 'SectionStart'] += 7

#%% This is what the same group looks like after introducing errors
# Note that the 'SectionStop' of row 1525 no longer equals the 'SectionStart' of row 2640
# This track now has a gap of 100, it is not completely defined from start to end
grouped_df.get_group((2, 'Night', 'Passengers', 323100))

#%% Try to locate the errors
# This is the part of the code I need to speed up
def Full_coverage(group):
    if len(group) > 1:
        # Sort the grouped data by column 'SectionStart' from low to high
        # Updated for newer pandas version
        # group.sort('SectionStart', ascending=True, inplace=True)
        group.sort_values('SectionStart', ascending=True, inplace=True)

        # Some initial values, overwritten at the end of each loop
        # These variables correspond to the first row of the group
        start_km = group.iloc[0, 4]
        end_km = group.iloc[0, 5]
        end_km_index = group.index[0]

        # Loop through all the rows in the group
        # index is the index of the row
        # i is the 'SectionStart' of the row
        # j is the 'SectionStop' of the row
        # The loop starts from the 2nd row in the group
        for index, (i, j) in group.iloc[1:, [4, 5]].iterrows():
            # The start of the next row must be equal to the end of the previous row in the group
            if i != end_km:
                # Add the faulty data to the error list
                incomplete_coverage.append(
                    ('Expected startpoint: ' + str(end_km) + ' (row ' + str(end_km_index) + ')',
                     'Found startpoint: ' + str(i) + ' (row ' + str(index) + ')'))
            # Overwrite these values for the next loop
            start_km = i
            end_km = j
            end_km_index = index
    return group

# Check if the complete track is completely defined (from start to end) for each combination of:
# 'ID_number', 'TimeOfDay', 'TypeOfCargo', 'TrackStart'
incomplete_coverage = []  # Create empty list for storing the error messages
df_grouped = df.groupby(['ID_number', 'TimeOfDay', 'TypeOfCargo', 'TrackStart']).apply(lambda x: Full_coverage(x))

# Print the error list
print('\nFound incomplete coverage in the following rows:')
for i, j in incomplete_coverage:
    print(i)
    print(j)
    print()

#%% Time the procedure -- It is very slow, taking about 6.6 seconds on my pc
# (%timeit is an IPython magic)
%timeit df.groupby(['ID_number', 'TimeOfDay', 'TypeOfCargo', 'TrackStart']).apply(lambda x: Full_coverage(x))
</code></pre>


1 answer

Anonymous, 1 day ago


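The problem, I believe, is that your data has roughly 5,300 distinct groups, so anything slow inside your function gets magnified. You could probably save time by using vectorized operations instead of the <code>for</code> loop in your function, but a much easier way to shave off a few seconds is to <code>return 0</code> rather than <code>return group</code>. When you <code>return group</code>, pandas actually builds a new data object that combines your sorted groups, which you don't appear to use. When you <code>return 0</code>, pandas combines about 5,300 zeros instead, which is much faster.

For example:

<pre><code>cols = ['ID_number', 'TimeOfDay', 'TypeOfCargo', 'TrackStart']
groups = df.groupby(cols)
print(len(groups))
# 5353

%timeit df.groupby(cols).apply(lambda group: group)
# 1 loops, best of 3: 2.41 s per loop

%timeit df.groupby(cols).apply(lambda group: 0)
# 10 loops, best of 3: 64.3 ms per loop
</code></pre>

Just combining the results you don't use takes about 2.4 seconds; the rest of the time is the actual computation in your loop, which you should try to vectorize.

<hr/>

Edit: With a quick vectorized check before the <code>for</code> loop, and returning <code>0</code> instead of <code>group</code>, I got the time down to about 2 seconds, which is basically the cost of sorting each group. Try this function:

<pre><code>def Full_coverage(group):
    if len(group) > 1:
        # sort_values replaces the DataFrame.sort method removed from newer pandas
        group = group.sort_values('SectionStart', ascending=True)

        # this condition is sufficient to find when the loop
        # will add to the list
        if np.any(group.values[1:, 4] != group.values[:-1, 5]):
            start_km = group.iloc[0, 4]
            end_km = group.iloc[0, 5]
            end_km_index = group.index[0]

            for index, (i, j) in group.iloc[1:, [4, 5]].iterrows():
                if i != end_km:
                    incomplete_coverage.append(
                        ('Expected startpoint: ' + str(end_km) + ' (row ' + str(end_km_index) + ')',
                         'Found startpoint: ' + str(i) + ' (row ' + str(index) + ')'))
                start_km = i
                end_km = j
                end_km_index = index
    return 0

cols = ['ID_number', 'TimeOfDay', 'TypeOfCargo', 'TrackStart']
%timeit df.groupby(cols).apply(Full_coverage)
# 1 loops, best of 3: 1.74 s per loop
</code></pre>

<hr/>

Edit 2: Here is an example that incorporates my suggestions to move the sort outside the groupby and to remove the unnecessary loop. For the given example, removing the loop is not much faster, but it will be faster if there are many incomplete tracks. Note that each error is now stored as a single string rather than as a pair of strings:

<pre><code>def Full_coverage_nosort(group):
    # Assumes the rows are already sorted by 'SectionStart'
    if len(group) > 1:
        # True wherever a row's start differs from the previous row's stop
        mask = group.values[1:, 4] != group.values[:-1, 5]
        if np.any(mask):
            err = ('Expected startpoint: {0} (row {1}) '
                   'Found startpoint: {2} (row {3})')
            incomplete_coverage.extend([err.format(group.iloc[i, 5],
                                                   group.index[i],
                                                   group.iloc[i + 1, 4],
                                                   group.index[i + 1])
                                        for i in np.where(mask)[0]])
    return 0

incomplete_coverage = []
cols = ['ID_number', 'TimeOfDay', 'TypeOfCargo', 'TrackStart']

# Sort once, up front; groupby keeps the row order within each group
df_s = df.sort_values(['SectionStart', 'SectionStop'])
df_s.groupby(cols).apply(Full_coverage_nosort)
</code></pre>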
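Going one step further, the per-group Python function can probably be dropped altogether. The following is only a minimal sketch and was not timed against the figures above: it sorts once, uses `groupby(...).shift()` to line up each row with the previous row of its group, and compares whole columns at once. The helper name `find_gaps`, the temporary column `prev_row`, and the small `demo` frame are hypothetical names introduced for illustration; the column names match the question.

```python
import pandas as pd


def find_gaps(df, cols, start="SectionStart", stop="SectionStop"):
    """Return one row per section whose start differs from the stop of the
    previous section in the same group (hypothetical helper, a sketch only)."""
    # Sort once; groupby keeps the row order within each group.
    df_s = df.sort_values([start, stop])

    # Carry the row label along so it can be shifted together with `stop`.
    grouped = df_s.assign(prev_row=df_s.index).groupby(cols)
    prev = grouped[[stop, "prev_row"]].shift()

    # The first row of each group has no predecessor, so skip it.
    has_prev = prev["prev_row"].notna()
    bad = has_prev & (df_s[start] != prev[stop])

    return pd.DataFrame({
        "expected_start": prev.loc[bad, stop],
        "expected_from_row": prev.loc[bad, "prev_row"].astype(int),
        "found_start": df_s.loc[bad, start],
    })


# Tiny made-up example: group 0 has a gap between rows 1 and 2.
demo = pd.DataFrame({
    "g": [0, 0, 0, 1, 1],
    "SectionStart": [0, 10, 25, 0, 5],
    "SectionStop": [10, 20, 30, 5, 9],
})
gaps = find_gaps(demo, ["g"])
print(gaps)
```

With the question's data, the call would be `find_gaps(df, cols)`. The result is indexed by the label of the offending row, so the error messages can be built from it afterwards if the string format is still needed.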
