python pandas: compare consecutive rows and update/concatenate cells into the first row of consecutive duplicates
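In a DataFrame sorted by date and record number (df.sort_values(['Service Date', 'Record Number'])), I want to do the following:
- Compare the 'Service Date' and 'Record Number' of the current row with those of the row below it.
- If the two rows are duplicates, append the second row's 'Desc' to the first row's 'Desc'.
- If a third, fourth, etc. row is also a duplicate, keep appending its 'Desc' as well.
In the end, the 'Desc' values of every run of consecutive duplicates should be merged into the row where the run first appears. Here is the code I have so far. I tried using .shift(1) together with a loop, but without success: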
    import pandas

    with open('ALL.CSV') as inc:
        indf = pandas.read_csv(inc, usecols=['Record Number', 'Service Date', 'Desc'])
    indf['Service Date'] = pandas.to_datetime(indf['Service Date'])
    # sort by date, then by record number
    indf.sort_values(['Service Date', 'Record Number'], inplace=True)
    # a row is a duplicate when both its record number and its date match the row above
    indf['NUM'] = indf['Record Number'].shift(1)
    msk = (indf['NUM'] == indf['Record Number']) & (indf['Service Date'].shift(1) == indf['Service Date'])
    indf['MASK'] = msk
    # print(indf)
    # print(indf.dtypes)
    # print(msk)
    for i, row in indf.iterrows():
        if not row['MASK']:
            print('Unique.', row['Record Number'], row['Service Date'], row['Desc'])
        else:
            print('Dupe...', row['Record Number'], row['Service Date'], row['Desc'])
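To make the shift-based comparison concrete, here is a minimal sketch on a tiny hand-made frame (the four rows are hypothetical, chosen only for illustration). It assumes the frame is already sorted and flags a row only when both key columns match the row above, which matters for the last row: same record number, different date.

```python
import pandas as pd

# Hypothetical rows, already sorted by date and then record number.
df = pd.DataFrame({
    'Service Date': pd.to_datetime(['2014-06-09', '2014-06-09', '2014-06-09', '2014-06-10']),
    'Record Number': [369750, 828060, 828060, 828060],
    'Desc': ['cochleate', 'jiber', 'hemiacetal', 'manawyddan'],
})

keys = ['Service Date', 'Record Number']
# True where a row has the same date AND record number as the row above it.
# The first row is compared against missing values, so it is never flagged.
is_dupe = (df[keys] == df[keys].shift(1)).all(axis=1)
print(is_dupe.tolist())  # [False, False, True, False]
```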
Sample data:
    Record Number,Service Date,Desc
    746611,05/26/2014,jiber
    361783,05/27/2014,manawyddan
    231485,06/02/2014,montespan
    254004,06/03/2014,peshawar
    369750,06/09/2014,cochleate
    757701,06/10/2014,verticity
    586983,06/16/2014,psychotherapist
    643669,06/17/2014,discreation
    252213,06/23/2014,hemiacetal
    863001,06/24/2014,jiber
    563798,06/30/2014,manawyddan
    229226,07/01/2014,montespan
    772189,07/07/2014,peshawar
    412939,07/08/2014,cochleate
    230209,07/14/2014,verticity
    723012,07/15/2014,psychotherapist
    455138,07/21/2014,discreation
    605876,07/22/2014,hemiacetal
    565893,07/28/2014,jiber
    760420,07/29/2014,manawyddan
    667002,05/27/2014,montespan
    676209,06/17/2014,peshawar
    828060,06/24/2014,cochleate
    582821,07/01/2014,verticity
    275503,07/15/2014,psychotherapist
    667002,05/26/2014,discreation
    676209,06/02/2014,hemiacetal
    828060,06/09/2014,jiber
    667002,06/10/2014,manawyddan
    676209,06/17/2014,montespan
    828060,06/23/2014,peshawar
    667002,06/24/2014,cochleate
    676209,06/30/2014,verticity
    828060,07/21/2014,psychotherapist
    667002,07/28/2014,discreation
    676209,05/27/2014,hemiacetal
    828060,06/03/2014,jiber
    667002,06/10/2014,manawyddan
    676209,06/16/2014,montespan
    828060,06/24/2014,peshawar
    667002,07/01/2014,cochleate
    676209,07/07/2014,verticity
    828060,07/28/2014,psychotherapist
    667002,07/29/2014,discreation
    828060,06/09/2014,hemiacetal
    667002,06/10/2014,jiber
    676209,06/17/2014,manawyddan
    828060,06/23/2014,montespan
    667002,06/24/2014,peshawar
    676209,06/30/2014,cochleate
    828060,07/21/2014,verticity
    828060,06/09/2014,psychotherapist
    667002,06/10/2014,discreation
    676209,06/17/2014,hemiacetal
    828060,06/23/2014,jiber
    667002,06/24/2014,manawyddan
    676209,06/30/2014,montespan
Edit: I think I may have found a way. Does anyone see a better approach? Thanks!
    import pandas

    with open('ALL.CSV') as inc:
        indf = pandas.read_csv(inc, usecols=['Record Number', 'Service Date', 'Desc'])
    indf['Service Date'] = pandas.to_datetime(indf['Service Date'])
    # sort by date, then by record number
    indf.sort_values(['Service Date', 'Record Number'], inplace=True)
    # a row is a duplicate when both its record number and its date match the row above
    indf['NUM'] = indf['Record Number'].shift(1)
    msk = (indf['NUM'] == indf['Record Number']) & (indf['Service Date'].shift(1) == indf['Service Date'])
    indf['MASK'] = msk
    # renumber the rows 0..n-1 so the labels follow the sorted order
    indf.reset_index(drop=True, inplace=True)
    # print(indf)
    # print(indf.dtypes)
    # print(msk)
    first = None  # label of the first row in the current run of duplicates
    for i, row in indf.iterrows():
        if not row['MASK']:
            # a new run starts here
            first = i
        else:
            # append this duplicate's Desc to the first row of its run
            indf.loc[first, 'Desc'] = '. '.join([indf.loc[first, 'Desc'], row['Desc']])
    # print(indf)
    # keep only the first row of each run, which now holds the merged Desc
    indf.drop_duplicates(['Service Date', 'Record Number'], inplace=True)
    del indf['NUM']
    del indf['MASK']
    indf.to_csv('ALL_fixed.csv', date_format='%m/%d/%Y', index=False)
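One possible loop-free variant of the same idea, offered as a sketch rather than a definitive answer: give every run of consecutive duplicates an id by taking the cumulative sum of the "not a duplicate" flag, then aggregate per run. The small frame and the names `is_dupe`, `run_id` and `merged` are made up for illustration; the '. ' separator follows the code above.

```python
import pandas as pd

# Hypothetical rows, already sorted by date and then record number.
df = pd.DataFrame({
    'Service Date': pd.to_datetime(['2014-06-09', '2014-06-09', '2014-06-09', '2014-06-10']),
    'Record Number': [369750, 828060, 828060, 828060],
    'Desc': ['cochleate', 'jiber', 'hemiacetal', 'manawyddan'],
})

keys = ['Service Date', 'Record Number']
# True where a row repeats the date and record number of the row above it.
is_dupe = (df[keys] == df[keys].shift(1)).all(axis=1)
# Every non-duplicate row starts a new run, so the cumulative sum is a run id.
run_id = (~is_dupe).cumsum()

# Keep the keys of the first row in each run and join the Desc values.
merged = (
    df.groupby(run_id)
      .agg({'Service Date': 'first', 'Record Number': 'first', 'Desc': '. '.join})
      .reset_index(drop=True)
)
print(merged['Desc'].tolist())  # ['cochleate', 'jiber. hemiacetal', 'manawyddan']
```

Unlike a plain groupby on the key columns, this only merges rows that are adjacent, so it does not depend on equal keys having been brought together by the sort.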
1 Answer
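If you want a single 'desc' column that holds all of the values (which is how I read your question), you can group the data by 'date' and 'record' and then join all of the 'desc' values in each group into one string. Here 'date', 'record' and 'desc' stand in for your 'Service Date', 'Record Number' and 'Desc' columns: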
    # aggregation function: join all descriptions in a group into one string
    def desc_concat(x):
        return ", ".join(x)

    # apply it to the data grouped by date and record number
    df.groupby(['date', 'record']).agg({'desc': desc_concat})
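As a rough illustration of how this could look with the column names from the question, here is a self-contained sketch that runs on six rows copied from the sample data. Reading from an in-memory string, the explicit date format, and the '. ' separator are assumptions made for the example. No prior sort is needed, because groupby collects all rows with equal keys whether or not they are adjacent.

```python
import io
import pandas as pd

# Six rows copied from the question's sample data.
csv_text = """Record Number,Service Date,Desc
828060,06/09/2014,jiber
667002,06/10/2014,manawyddan
828060,06/09/2014,hemiacetal
667002,06/10/2014,jiber
828060,06/09/2014,psychotherapist
757701,06/10/2014,verticity
"""
df = pd.read_csv(io.StringIO(csv_text))
df['Service Date'] = pd.to_datetime(df['Service Date'], format='%m/%d/%Y')

# One output row per (date, record number) pair; Desc values are joined
# in the order in which they appear in the input.
result = (
    df.groupby(['Service Date', 'Record Number'], as_index=False)
      .agg({'Desc': '. '.join})
)

# With no path given, to_csv returns the CSV text instead of writing a file.
print(result.to_csv(index=False, date_format='%m/%d/%Y'))
```

The result is ordered by the group keys, so the merged rows come out sorted by date and then by record number.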