Removing duplicates in Python

Posted 2024-04-28 21:58:20


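I have a question about removing duplicates in Python. I have read many posts but have not been able to solve it. I have the following CSV file:

Edit

Input: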

ID, Source, 1.A, 1.B, 1.C, 1.D
1, ESPN, 5,7,,,M
1, NY Times,,10,12,W
1, ESPN, 10,,Q,,M

The output should be:

^{pr2}$

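In other words, if the ID is the same, take the values from the row whose source is "NY Times". If the "NY Times" row has an empty value and a duplicate row from the "ESPN" source has a value for that cell, take the value from the "ESPN" row. For the output, flag the original rows as duplicates and create an additional merged row.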

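To clarify further: because I need to run this script on many different CSV files with different column headers, I cannot do something like the following: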

    import csv

    # input_csv and output_csv are file paths defined elsewhere
    def main():
        with open(input_csv, newline="") as infile:
            # The column names are hard-coded here, which is the problem
            input_fields = ("ID", "Source", "1.A", "1.B", "1.C", "1.D")
            reader = csv.DictReader(infile, fieldnames=input_fields)
            with open(output_csv, "w", newline="") as outfile:
                output_fields = ("ID", "Source", "1.A", "1.B", "1.C", "1.D", "d_flag")
                writer = csv.DictWriter(outfile, fieldnames=output_fields)
                writer.writeheader()
                next(reader)  # skip the header row of the input file
                first_row = next(reader)
                for next_row in reader:
                    pass  # stuff

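This is because I want the program to operate on the first two columns without being affected by the other columns in the table. In other words, "ID" and "Source" will appear in every input file, but the remaining columns will change from file to file.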
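One way to avoid hard-coding the column names is to let `csv.DictReader` take them from the first row of each file. The following is only a minimal sketch: the helper name `read_rows` and the sample columns `2.X` and `2.Y` are made up for illustration, and `skipinitialspace=True` is assumed to be appropriate because the sample input puts a space after each comma.

```python
import csv
import io

def read_rows(stream):
    """Return (fieldnames, rows); the header comes from the file itself."""
    # With no fieldnames argument, DictReader uses the first row as the header.
    reader = csv.DictReader(stream, skipinitialspace=True)
    rows = list(reader)
    return reader.fieldnames, rows

# Hypothetical file with different extra columns than the example above
sample = io.StringIO("ID, Source, 2.X, 2.Y\n1, ESPN, 5,\n1, NY Times, , 10\n")
fieldnames, rows = read_rows(sample)

# Everything except the two key columns varies from file to file
extra_fields = [f for f in fieldnames if f not in ("ID", "Source")]
```

Only "ID" and "Source" are referred to by name, so the same code can run on files whose remaining columns differ.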

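Any help would be greatly appreciated! For reference, "Source" can only be one of "NY Times", "ESPN", or "Wall Street Journal", and the order of precedence for duplicates is: take "NY Times" if available, otherwise "ESPN", otherwise "Wall Street Journal". This holds for every input file.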
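The precedence rule described above could be sketched as a small helper. This is an illustration under assumptions, not code from the thread: `PRIORITY` and `merge_by_priority` are hypothetical names, and each row is assumed to be a dictionary such as those produced by `csv.DictReader`.

```python
PRIORITY = ["NY Times", "ESPN", "Wall Street Journal"]

def merge_by_priority(rows, priority=PRIORITY):
    """Merge rows that share an ID; for each cell the highest-priority
    source with a non-empty value wins."""
    rank = {name: i for i, name in enumerate(priority)}
    # Unknown sources sort last; sorted() is stable, so file order is
    # kept among rows from the same source.
    ordered = sorted(rows, key=lambda r: rank.get(r["Source"], len(priority)))
    merged = {}
    for row in ordered:
        for field, value in row.items():
            # Only fill cells that are still missing or empty
            if not merged.get(field):
                merged[field] = value
    return merged

example_rows = [
    {"ID": "1", "Source": "ESPN", "1.A": "5", "1.B": "7"},
    {"ID": "1", "Source": "NY Times", "1.A": "", "1.B": "10"},
]
merged = merge_by_priority(example_rows)
```

Here `1.B` comes from the "NY Times" row, while `1.A` falls back to the "ESPN" row because the "NY Times" cell is empty.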


1 answer
User
#1 · Posted 2024-04-28 21:58:20

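The code below reads all records into one large dictionary. Its keys are the record IDs, and each value is a dictionary that maps a source name to the full data row and also keeps an "all_rows" list of the rows as they were read. The code then iterates over the dictionary and prints the desired output.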

import csv

header = None
idfld = None      # index of the "ID" column
sourcefld = None  # index of the "Source" column

# Maps each ID to {source name: merged row for that source, "all_rows": [rows as read]}
record_table = {}

# Python 3: open CSV files in text mode with newline=''
with open('input.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        row = [x.strip() for x in row]

        if header is None:
            # First row is the header; find the two key columns by name
            header = row
            for i, fld in enumerate(header):
                if fld == 'ID':
                    idfld = i
                elif fld == 'Source':
                    sourcefld = i
            continue

        key = row[idfld]
        sourcename = row[sourcefld]

        if key not in record_table:
            # Store a copy so later merging does not alter the original row
            record_table[key] = {sourcename: list(row), "all_rows": [row]}
        else:
            if sourcename in record_table[key]:
                # Same ID and source seen before: fill its empty cells
                cur_row = record_table[key][sourcename]
                for i, fld in enumerate(row):
                    if cur_row[i] == '':
                        cur_row[i] = fld
            else:
                record_table[key][sourcename] = list(row)
            record_table[key]["all_rows"].append(row)

print(', '.join(header) + ', duplicate_flag')

for recordid in record_table:
    rowdict = record_table[recordid]

    final_row = [''] * len(header)

    # Count the original rows, not the dictionary keys ("all_rows" is a key too)
    rowcount = len(rowdict["all_rows"])

    # Fill each cell from the highest-priority source that has a value
    for sourcetype in ['NY Times', 'ESPN', 'Wall Street Journal']:
        if sourcetype in rowdict:
            row = rowdict[sourcetype]
            for i, fld in enumerate(row):
                if final_row[i] != '':
                    continue
                if fld != '':
                    final_row[i] = fld

    if rowcount > 1:
        for row in rowdict["all_rows"]:
            print(', '.join(row) + ', duplicate')

    print(', '.join(final_row) + ', not_duplicate')
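
The code above builds each output line by joining fields with ', ', which would produce a malformed line if a field itself contained a comma. A possible alternative, shown here only as a sketch, is to write the rows with `csv.writer`, which quotes such fields. The function name `write_flagged` and the sample value "Times, NY" are hypothetical and chosen only to show the quoting.

```python
import csv
import io

def write_flagged(stream, header, flagged_rows):
    """Write each row followed by its flag in a trailing duplicate_flag column."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(list(header) + ["duplicate_flag"])
    for row, flag in flagged_rows:
        writer.writerow(list(row) + [flag])

out = io.StringIO()
write_flagged(
    out,
    ["ID", "Source", "1.A"],
    [(["1", "Times, NY", "5"], "duplicate")],  # made-up field containing a comma
)
csv_text = out.getvalue()
```

When writing to a real file instead of `io.StringIO`, open it with `newline=''` as the `csv` module documentation recommends.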
