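How to iterate through a CSV in Python and write rows that match new criteria to a new file
I have been stuck on this for a while and figured it was best to ask the experts. I know my code is not well written, and I have managed to confuse myself.
I have a CSV file, many of them actually. That part is not the problem.
The first several lines of each CSV file are not data, but they contain one important piece of information: the date the data is valid for. For some report types that date is on one line, and for other types it is on a different line.
My data usually starts 10 or 11 lines from the top, but I cannot always be sure. I do know that the first column of that line always holds the same information (the header of the data table).
I want to pull the report date out of the preamble lines, do operation A for files of type A and operation B for files of type B, and then write each row to a new file. I am having trouble advancing through the rows and have no idea where it is going wrong.
Sample data: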
"Attribute ""OPSURVEYLEVEL2_O"" [Category = ""Retail v1""]"
Date exported: 2/16/13
Exported by user: William
Project:
Classification: Online Retail v1
Report type: Attributes
Date range: from 12/14/12 to 12/14/12
"Filter OpSurvey Level 2(mine): [ LEVEL:SENTENCE TYPE:KEYWORD {OPSURVEYLEVEL2_O:""gift certificate redemption"", OPSURVEYLEVEL2_O:""combine accounts"", OPSURVEYLEVEL2_O:""cancel account"", OPSURVEYLEVEL2_O:""saved project moved to purchased project"", OPSURVEYLEVEL2_O:""unlock account"", OPSURVEYLEVEL2_O:""affiliate promotions"", OPSURVEYLEVEL2_O:""print to store coupons"", OPSURVEYLEVEL2_O:""disclaimer not clear"", OPSURVEYLEVEL2_O:""prepaid issue"", OPSURVEYLEVEL2_O:""customer wants to use coupons for print to store"", OPSURVEYLEVEL2_O:""customer received someone else's order"", OPSURVEYLEVEL2_O:""hi-res images unavailable"", OPSURVEYLEVEL2_O:""how to re-order"", OPSURVEYLEVEL2_O:""missing items"", OPSURVEYLEVEL2_O:""missing envelopes: print to store"", OPSURVEYLEVEL2_O:""missing envelopes: mail order"", OPSURVEYLEVEL2_O:""group rooms"", OPSURVEYLEVEL2_O:""print to store"", OPSURVEYLEVEL2_O:""print to store coupons"", OPSURVEYLEVEL2_O:""publisher: card not available for print to store"", OPSURVEYLEVEL2_O:publisher}]"
Total: 905
OPSURVEYLEVEL2_O,Distinct Document,% of Document,Sentiment Score
PRINT TO STORE,297,32.82,-0.1
...
Sample code:
#!/usr/bin/python
import csv, os, glob, sys, errno

path = '/path/to/Downloads'

for infile in glob.glob(os.path.join(path,'report_ATTRIBUTE_OP*.csv')):
    if 'OPSURVEYLEVEL2' in infile:
        prime_column = 'ops2'
    elif 'OPSURVEYLEVEL3' in infile:
        prime_column = 'ops3'
    else:
        sys.exit(errno.ENOENT)
    with open(infile, "r") as csvfile:
        reader = csv.reader(csvfile)
        report_date = 'DATE NOT FOUND'
        # import pdb; pdb.set_trace()
        for row in reader:
            foo = 0
            while foo < 1:
                if row[0][0:].find('OPSURVEYLEVEL') == 0:
                    foo = 1
                if "Date range" in row:
                    report_date = row[0][-8:]
                break
            if foo >= 1:
                if row[0][0:].find('OPSURVEYLEVEL') == 0:
                    break
                if 'ops2' in prime_column:
                    dup_col = row[0]
                    row.insert(0,dup_col)
                    row.append(report_date)
                elif 'ops3' in prime_column:
                    row.append(report_date)
                with open('report_merge.csv', 'a') as outfile:
                    outfile.write(row)
            reader.next()
1 Answer
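I see two problems in this code.
The first is that the code never finds the date range in row, because row is a list of fields and the in operator tests for a field that equals "Date range" exactly. The original line: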
if "Date range" in row:
... should be changed to:
if "Date range" in row[0]:
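The difference between the two tests can be checked in isolation. The following is a minimal sketch that parses the "Date range" line from the sample data; the variable names list_match and text_match are only illustrative and do not appear in the original script.

```python
import csv
import io

# One preamble line from the sample data, parsed the same way csv.reader would.
line = 'Date range: from 12/14/12 to 12/14/12\n'
row = next(csv.reader(io.StringIO(line)))

# Membership on a list compares whole fields, so this is False.
list_match = 'Date range' in row

# Membership on a string is a substring search, so this is True.
text_match = 'Date range' in row[0]

# The last eight characters of the field hold the end date.
report_date = row[0][-8:]

print(list_match, text_match, report_date)  # False True 12/14/12
```

Note that the slice assumes the date always has the form MM/DD/YY; a date written with a different width would need a different extraction.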
The second problem is that this code:
if row[0][0:].find('OPSURVEYLEVEL') == 0:
    break
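... breaks out of the for loop as soon as it reaches the header row of the data table, because the for loop is the nearest enclosing loop, so no data row is ever written. I suspect an earlier version of this code had another while loop there.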
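To illustrate the rule that break only leaves the innermost enclosing loop, here is a small sketch. The function name rows_reaching_end and the shortened rows are made up for the demonstration; only the loop structure mirrors the question's code.

```python
def rows_reaching_end(rows):
    """Return the rows that get all the way through the for body."""
    reached = []
    for row in rows:
        while True:
            break  # leaves only the while loop; the for loop carries on
        if row[0].startswith('OPSURVEYLEVEL'):
            break  # no enclosing while here, so this ends the for loop
        reached.append(row)
    return reached


rows = [
    ['Total: 905'],
    ['OPSURVEYLEVEL2_O', 'Distinct Document'],
    ['PRINT TO STORE', '297'],
]
print(rows_reaching_end(rows))  # [['Total: 905']]
```

The data row after the header is never processed, which matches the behavior described above.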
Replacing the while and the if with a single if/else statement makes the code simpler (and bug free), as follows:
foo = 0  # set once, before the loop, so it is not reset on every row
for row in reader:
    if foo < 1:
        if row[0][0:].find('OPSURVEYLEVEL') == 0:
            foo = 1
        if "Date range" in row[0]: # Changed this line
            print("found report date")
            report_date = row[0][-8:]
    else:
        print(row)
        if row[0][0:].find('OPSURVEYLEVEL') == 0:
            break
        if 'ops2' in prime_column:
            dup_col = row[0]
            row.insert(0,dup_col)
            row.append(report_date)
        elif 'ops3' in prime_column:
            row.append(report_date)
        with open('report_merge.csv', 'a') as outfile:
            outfile.write(','.join(row)+'\n')
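For completeness, here is one way the whole per-file step might look in Python 3. This is a sketch under assumptions rather than a drop-in replacement: the function name merge_report and its parameters are hypothetical, it swaps the manual ','.join for csv.writer so that fields containing commas are quoted properly, and it assumes the header row is the first row whose first field starts with OPSURVEYLEVEL.

```python
import csv


def merge_report(in_path, out_path, prime_column):
    """Append the data rows of one report to out_path, tagged with the report date."""
    report_date = 'DATE NOT FOUND'
    in_data = False  # becomes True once the table header row has been seen
    with open(in_path, newline='') as src, \
            open(out_path, 'a', newline='') as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            if not in_data:
                if row[0].startswith('Date range'):
                    report_date = row[0][-8:]
                elif row[0].startswith('OPSURVEYLEVEL'):
                    in_data = True  # the header row itself is not written
                continue
            if prime_column == 'ops2':
                row.insert(0, row[0])  # duplicate the first column
            row.append(report_date)
            writer.writerow(row)
```

It would be called once per input file from the existing glob loop, for example merge_report(infile, 'report_merge.csv', prime_column). Opening the output file once per input file, instead of once per row, also avoids reopening it for every line.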