python pandas: compare consecutive rows and update/concatenate a cell in the first row of consecutive duplicates

0 votes
1 answer
527 views
Asked 2025-04-18 07:32

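In a DataFrame sorted by date and record number (df.sort_values(['Service Date', 'Record Number'])), I want to do the following:

  1. Compare the 'Service Date' and 'Record Number' of the current row with those of the row directly below it.
  2. If the two rows are duplicates, append the second row's 'Desc' to the first row's 'Desc'.
  3. If a third, fourth, etc. row is also a duplicate, keep appending its 'Desc' in the same way.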
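
The three steps above can be sketched roughly as follows. This is only an illustration, not the asker's code: the function name merge_consecutive_duplicates and its parameters (keys, text_col, sep) are hypothetical, and it assumes the frame is already sorted so that duplicates sit next to each other.

```python
import pandas as pd

def merge_consecutive_duplicates(df, keys, text_col, sep=". "):
    """Merge text_col of consecutive rows that share the same keys into the first row.

    Hypothetical helper for illustration; assumes df is already sorted by keys.
    """
    df = df.reset_index(drop=True)
    # Step 1: a row starts a new run when any key differs from the row above it.
    # The first row always starts a run because it is compared against missing values.
    starts_run = (df[keys] != df[keys].shift()).any(axis=1)
    # Steps 2 and 3: number the runs, then join every description within a run.
    run_id = starts_run.cumsum()
    joined = df.groupby(run_id)[text_col].agg(sep.join)
    # Keep only the first row of each run and give it the joined text.
    out = df[starts_run].copy()
    out[text_col] = joined.to_numpy()
    return out.reset_index(drop=True)

# Small made-up frame, used only to exercise the helper
demo = pd.DataFrame({
    "Record Number": [1, 1, 2, 1],
    "Service Date": ["d1", "d1", "d1", "d2"],
    "Desc": ["a", "b", "c", "d"],
})
result = merge_consecutive_duplicates(demo, ["Record Number", "Service Date"], "Desc")
```

Because the comparison is against the previous row only, rows with the same keys that are not adjacent stay separate, which is why the sort has to happen first.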

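In the end, the 'Desc' values of all consecutive duplicates should be merged into the row where the duplicate first appears. Below is the code I have so far. I tried using .shift(1) inside the loop, but that did not work: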

import pandas

with open('ALL.CSV') as inc:
    indf = pandas.read_csv(inc, usecols=['Record Number', 'Service Date', 'Desc'], parse_dates=True)
    indf['Service Date'] = pandas.to_datetime(indf['Service Date'])
    indf.sort_values(['Service Date', 'Record Number'], inplace=True)
    indf['NUM'] = indf['Record Number'].shift(1)
    msk = indf['NUM'] == indf['Record Number']
    indf['MASK'] = msk
#    print(indf)
#    print(indf.dtypes)
#    print(msk)
    for i, row in indf.iterrows():
        if not row['MASK']:
            print('Unique.', row['Record Number'], row['Service Date'], row['Desc'])
        else:
            print('Dupe...', row['Record Number'], row['Service Date'], row['Desc'])

Sample data:

Record Number,Service Date,Desc
746611,05/26/2014,jiber
361783,05/27/2014,manawyddan
231485,06/02/2014,montespan
254004,06/03/2014,peshawar
369750,06/09/2014,cochleate
757701,06/10/2014,verticity
586983,06/16/2014,psychotherapist
643669,06/17/2014,discreation
252213,06/23/2014,hemiacetal
863001,06/24/2014,jiber
563798,06/30/2014,manawyddan
229226,07/01/2014,montespan
772189,07/07/2014,peshawar
412939,07/08/2014,cochleate
230209,07/14/2014,verticity
723012,07/15/2014,psychotherapist
455138,07/21/2014,discreation
605876,07/22/2014,hemiacetal
565893,07/28/2014,jiber
760420,07/29/2014,manawyddan
667002,05/27/2014,montespan
676209,06/17/2014,peshawar
828060,06/24/2014,cochleate
582821,07/01/2014,verticity
275503,07/15/2014,psychotherapist
667002,05/26/2014,discreation
676209,06/02/2014,hemiacetal
828060,06/09/2014,jiber
667002,06/10/2014,manawyddan
676209,06/17/2014,montespan
828060,06/23/2014,peshawar
667002,06/24/2014,cochleate
676209,06/30/2014,verticity
828060,07/21/2014,psychotherapist
667002,07/28/2014,discreation
676209,05/27/2014,hemiacetal
828060,06/03/2014,jiber
667002,06/10/2014,manawyddan
676209,06/16/2014,montespan
828060,06/24/2014,peshawar
667002,07/01/2014,cochleate
676209,07/07/2014,verticity
828060,07/28/2014,psychotherapist
667002,07/29/2014,discreation
828060,06/09/2014,hemiacetal
667002,06/10/2014,jiber
676209,06/17/2014,manawyddan
828060,06/23/2014,montespan
667002,06/24/2014,peshawar
676209,06/30/2014,cochleate
828060,07/21/2014,verticity
828060,06/09/2014,psychotherapist
667002,06/10/2014,discreation
676209,06/17/2014,hemiacetal
828060,06/23/2014,jiber
667002,06/24/2014,manawyddan
676209,06/30/2014,montespan

Edit: I think I may have found a way. Does anyone see a better approach? Thanks!

import pandas

with open('ALL.CSV') as inc:
    indf = pandas.read_csv(inc, usecols=['Record Number', 'Service Date', 'Desc'])
indf['Service Date'] = pandas.to_datetime(indf['Service Date'])
indf.sort_values(['Service Date', 'Record Number'], inplace=True)
# Renumber rows 0..n-1 so the loop index matches the sorted position
indf.reset_index(drop=True, inplace=True)
# A row is a duplicate when both keys match the row directly above it
keys = indf[['Service Date', 'Record Number']]
indf['MASK'] = (keys == keys.shift(1)).all(axis=1)
first = 0  # index of the first row in the current run of duplicates
for i, row in indf.iterrows():
    if not row['MASK']:
        first = i
    else:
        # .loc avoids chained assignment, which may write to a temporary copy
        indf.loc[first, 'Desc'] = '. '.join([indf.loc[first, 'Desc'], row['Desc']])
# Only the first row of each run (now holding the merged text) is kept
indf.drop_duplicates(['Service Date', 'Record Number'], inplace=True)
del indf['MASK']
indf.to_csv('ALL_fixed.csv', date_format='%m/%d/%Y', index=False)

1 Answer

0

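If what you want is a single 'desc' column that contains all the values (that is how I understand your question), you can group the data by 'date' and 'record' and join all the 'desc' values in each group into one string: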

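# Aggregation function: join all descriptions in a group into one string
def desc_concat(x):
    return ", ".join(x)

# Apply it to the data grouped by date and record number.
# 'date', 'record' and 'desc' stand in for the question's
# 'Service Date', 'Record Number' and 'Desc' columns.
df.groupby(['date', 'record']).agg({'desc': desc_concat})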
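
For reference, here is a minimal sketch of the same groupby idea written against the question's own column names. The three data rows are copied from the sample data in the question; the variable names (csv_text, merged, csv_out) and the '. ' separator are my own choices for illustration, so adjust them as needed.

```python
import io
import pandas as pd

# Header plus three rows copied from the question's sample data
csv_text = """Record Number,Service Date,Desc
828060,06/09/2014,jiber
667002,06/10/2014,manawyddan
828060,06/09/2014,hemiacetal
"""

df = pd.read_csv(io.StringIO(csv_text))
df["Service Date"] = pd.to_datetime(df["Service Date"], format="%m/%d/%Y")

# Group on both keys and join the descriptions of each group.
# groupby sorts the groups by key and keeps the original row order inside each group.
merged = (
    df.groupby(["Service Date", "Record Number"])["Desc"]
      .agg(". ".join)
      .reset_index()
)

# Same date format the question uses when writing the result
csv_out = merged.to_csv(index=False, date_format="%m/%d/%Y")
```

Since the question sorts by both keys anyway, rows with equal keys always end up adjacent, so grouping on the keys should give the same result as merging consecutive duplicates, without an explicit loop.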
