How to iterate over a CSV in Python and write rows that meet new criteria to a new file

Asked 2025-04-17 16:13

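I have been struggling with this problem for a while and figured it was best to ask the experts. I know my code is not well written, and I have managed to confuse myself.

I have a CSV file; actually, I have many of them. That part is not the problem.

The first few lines of each CSV file are not actually data, but they contain one important piece of information: the date the data is valid for. For some report types this date is on one line, and for other types it is on a different line.

My data usually starts 10 or 11 lines from the top, but I cannot always be sure. I do know that the first column always holds the same information (the title of the data table).

I want to extract the report date from the leading lines, do operation A for type A files and operation B for type B files, and then write each row to a new file. I ran into trouble when advancing through the rows, and I have no idea what is going wrong.

Sample data: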

"Attribute ""OPSURVEYLEVEL2_O"" [Category = ""Retail v1""]"
Date exported: 2/16/13
Exported by user: William
Project: 
Classification: Online Retail v1
Report type: Attributes
Date range: from 12/14/12 to 12/14/12
"Filter OpSurvey Level 2(mine):  [ LEVEL:SENTENCE TYPE:KEYWORD {OPSURVEYLEVEL2_O:""gift certificate redemption"", OPSURVEYLEVEL2_O:""combine accounts"", OPSURVEYLEVEL2_O:""cancel account"", OPSURVEYLEVEL2_O:""saved project moved to purchased project"", OPSURVEYLEVEL2_O:""unlock account"", OPSURVEYLEVEL2_O:""affiliate promotions"", OPSURVEYLEVEL2_O:""print to store coupons"", OPSURVEYLEVEL2_O:""disclaimer not clear"", OPSURVEYLEVEL2_O:""prepaid issue"", OPSURVEYLEVEL2_O:""customer wants to use coupons for print to store"", OPSURVEYLEVEL2_O:""customer received someone else's order"", OPSURVEYLEVEL2_O:""hi-res images unavailable"", OPSURVEYLEVEL2_O:""how to re-order"", OPSURVEYLEVEL2_O:""missing items"", OPSURVEYLEVEL2_O:""missing envelopes: print to store"", OPSURVEYLEVEL2_O:""missing envelopes: mail order"", OPSURVEYLEVEL2_O:""group rooms"", OPSURVEYLEVEL2_O:""print to store"", OPSURVEYLEVEL2_O:""print to store coupons"", OPSURVEYLEVEL2_O:""publisher: card not available for print to store"", OPSURVEYLEVEL2_O:publisher}]"
Total: 905
OPSURVEYLEVEL2_O,Distinct Document,% of Document,Sentiment Score
PRINT TO STORE,297,32.82,-0.1
...

Sample code:

#!/usr/bin/python

import csv, os, glob, sys, errno

path = '/path/to/Downloads'
for infile in glob.glob(os.path.join(path,'report_ATTRIBUTE_OP*.csv')):
    if 'OPSURVEYLEVEL2' in infile:
        prime_column = 'ops2'
    elif 'OPSURVEYLEVEL3' in infile:
        prime_column = 'ops3'
    else:
        sys.exit(errno.ENOENT)
    with open(infile, "r") as csvfile:
        reader = csv.reader(csvfile)
        report_date = 'DATE NOT FOUND'
        # import pdb; pdb.set_trace()
        for row in reader:
            foo = 0
            while foo < 1: 
                if row[0][0:].find('OPSURVEYLEVEL') == 0:
                    foo = 1
                if "Date range" in row:
                    report_date = row[0][-8:]
                break
            if foo >= 1:
                if row[0][0:].find('OPSURVEYLEVEL') == 0:
                    break
                if 'ops2' in prime_column:
                    dup_col = row[0]
                    row.insert(0,dup_col)
                    row.append(report_date)
                elif 'ops3' in prime_column:
                    row.append(report_date)
                with open('report_merge.csv', 'a') as outfile:
                    outfile.write(row)
            reader.next()

1 Answer

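I see two problems in this code.

The first problem is that the code never finds the date range. Here row is a list of fields, so the in operator looks for a field equal to the whole string "Date range" rather than for a substring. The original line: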

if "Date range" in row:

... should be changed to:

if "Date range" in row[0]:
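
To make the difference concrete, here is a minimal sketch (Python 3 assumed) that parses one preamble line from the sample data and runs both tests. The variable names are illustrative and do not come from the question.

```python
import csv
import io

# One preamble line from the sample data, parsed the way csv.reader sees it.
line = "Date range: from 12/14/12 to 12/14/12\n"
row = next(csv.reader(io.StringIO(line)))

# `in` on a list compares whole elements, so no field matches.
found_in_list = "Date range" in row
# `in` on a string is a substring test, so the first field matches.
found_in_field = "Date range" in row[0]

print(row)             # a one-element list holding the whole line
print(found_in_list)   # False
print(found_in_field)  # True
```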

The second problem is that this code:

if row[0][0:].find('OPSURVEYLEVEL') == 0:
    break

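... breaks out of the for loop as soon as it reaches the data table's header row, because the for loop is the nearest enclosing loop. I suspect an earlier version of this code had another while loop around it.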
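
The effect can be reproduced in isolation. This is a simplified sketch of the question's control flow, not the original code. The function name and the shortened sample rows are made up for illustration.

```python
def rows_seen_before_break(rows):
    """Collect rows until the second `break` below fires."""
    seen = []
    for row in rows:
        checked = False
        while not checked:
            checked = True
            break  # leaves only the while loop
        if row.startswith("OPSURVEYLEVEL"):
            break  # no while loop encloses this line, so the for loop ends
        seen.append(row)
    return seen


sample = [
    "Total: 905",
    "OPSURVEYLEVEL2_O,Distinct Document,% of Document,Sentiment Score",
    "PRINT TO STORE,297,32.82,-0.1",
]
result = rows_seen_before_break(sample)
print(result)  # ['Total: 905']
```

The data row after the header is never reached, which matches the symptom described in the question.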

Replacing the while loop with an if statement makes the code simpler (and removes these errors), as follows:

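    foo = 0  # 0 while still in the preamble; set once, before the loop
    for row in reader:
        if not row:
            continue  # skip blank lines so row[0] cannot raise IndexError
        if foo < 1:
            if row[0].startswith('OPSURVEYLEVEL'):
                foo = 1  # header row found; data rows start on the next line
            if "Date range" in row[0]:  # Changed this line
                print("found report date")
                report_date = row[0][-8:]
        else:
            print(row)
            if row[0].startswith('OPSURVEYLEVEL'):
                break  # a second header row ends this table
            if 'ops2' in prime_column:
                dup_col = row[0]
                row.insert(0, dup_col)
                row.append(report_date)
            elif 'ops3' in prime_column:
                row.append(report_date)
            # row is a list, so join it into one line of text before writing
            with open('report_merge.csv', 'a') as outfile:
                outfile.write(','.join(row) + '\n')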
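
As a possible variation, the same logic can be written as a small function that returns the output rows, which csv.writer then writes so that any field containing a comma gets quoted. This is only a sketch under some assumptions: Python 3, a hypothetical function name extract_rows, an in-memory buffer standing in for report_merge.csv, and the date taken as the last whitespace-separated token of the "Date range" line instead of the last 8 characters.

```python
import csv
import io


def extract_rows(lines, prime_column):
    """Return the data rows of one report, each with the report date appended."""
    report_date = "DATE NOT FOUND"
    in_table = False
    out = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if not in_table:
            if row[0].startswith("Date range"):
                # Assumption: the date is the last token on this line.
                report_date = row[0].split()[-1]
            elif row[0].startswith("OPSURVEYLEVEL"):
                in_table = True  # header row; data starts on the next line
            continue
        if prime_column == "ops2":
            row = [row[0]] + row  # duplicate the first column
        out.append(row + [report_date])
    return out


sample = [
    "Date exported: 2/16/13",
    "Date range: from 12/14/12 to 12/14/12",
    "Total: 905",
    "OPSURVEYLEVEL2_O,Distinct Document,% of Document,Sentiment Score",
    "PRINT TO STORE,297,32.82,-0.1",
]
buffer = io.StringIO()  # stands in for the merged output file
csv.writer(buffer, lineterminator="\n").writerows(extract_rows(sample, "ops2"))
print(buffer.getvalue())
```

With a real file, the buffer would be replaced by a file opened once in append mode with newline='', rather than reopened for every row.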
