Memory leak parsing TSV and writing CSV in Python
I'm writing a simple script in Python as a learning exercise. I downloaded a TSV file from the Ohio Board of Elections, and I want to manipulate some of the data and then output a CSV file for import into another system.
My problem is that the script's memory usage is extremely high; it leaks like a sieve. On a 154MB TSV file it had consumed 2GB of memory by the time I killed it.
Here's my code. Can anyone help me spot what I'm missing in Python?
import csv
import datetime
import re

def formatAddress(row):
    address = ''
    if str(row['RES_HOUSE']).strip():
        address += str(row['RES_HOUSE']).strip()
    if str(row['RES_FRAC']).strip():
        address += '-' + str(row['RES_FRAC']).strip()
    if str(row['RES STREET']).strip():
        address += ' ' + str(row['RES STREET']).strip()
    if str(row['RES_APT']).strip():
        address += ' APT ' + str(row['RES_APT']).strip()
    return address
vote_type_map = {
    'G': 'General',
    'P': 'Primary',
    'L': 'Special'
}
def formatRow(row, fieldnames):
    basic_dict = {
        'Voter ID': str(row['VOTER ID']).strip(),
        'Date Registered': str(row['REGISTERED']).strip(),
        'First Name': str(row['FIRSTNAME']).strip(),
        'Last Name': str(row['LASTNAME']).strip(),
        'Middle Initial': str(row['MIDDLE']).strip(),
        'Name Suffix': str(row['SUFFIX']).strip(),
        'Voter Status': str(row['STATUS']).strip(),
        'Current Party Affiliation': str(row['PARTY']).strip(),
        'Year Born': str(row['DATE OF BIRTH']).strip(),
        #'Voter Address': formatAddress(row),
        'Voter Address': formatAddress({'RES_HOUSE': row['RES_HOUSE'], 'RES_FRAC': row['RES_FRAC'], 'RES STREET': row['RES STREET'], 'RES_APT': row['RES_APT']}),
        'City': str(row['RES_CITY']).strip(),
        'State': str(row['RES_STATE']).strip(),
        'Zip Code': str(row['RES_ZIP']).strip(),
        'Precinct': str(row['PRECINCT']).strip(),
        'Precinct Split': str(row['PRECINCT SPLIT']).strip(),
        'State House District': str(row['HOUSE']).strip(),
        'State Senate District': str(row['SENATE']).strip(),
        'Federal Congressional District': str(row['CONGRESSIONAL']).strip(),
        'City or Village Code': str(row['CITY OR VILLAGE']).strip(),
        'Township': str(row['TOWNSHIP']).strip(),
        'School District': str(row['SCHOOL']).strip(),
        'Fire': str(row['FIRE']).strip(),
        'Police': str(row['POLICE']).strip(),
        'Park': str(row['PARK']).strip(),
        'Road': str(row['ROAD']).strip()
    }
    for field in fieldnames:
        m = re.search('(\d{2})(\d{4})-([GPL])', field)
        if m:
            vote_type = vote_type_map[m.group(3)] or 'Other'
            #print { 'k1': m.group(1), 'k2': m.group(2), 'k3': m.group(3)}
            d = datetime.date(year=int(m.group(2)), month=int(m.group(1)), day=1)
            csv_label = d.strftime('%B %Y') + ' ' + vote_type + ' Ballot Requested'
            d = None
            basic_dict[csv_label] = row[field]
        m = None
    return basic_dict
output_rows = []
output_fields = []

with open('data.tsv', 'r') as f:
    r = csv.DictReader(f, delimiter='\t')
    #f.seek(0)
    fieldnames = r.fieldnames
    for row in r:
        output_rows.append(formatRow(row, fieldnames))
    f.close()

if output_rows:
    output_fields = sorted(output_rows[0].keys())
    with open('data_out.csv', 'wb') as f:
        w = csv.DictWriter(f, output_fields, quotechar='"')
        w.writeheader()
        for row in output_rows:
            w.writerow(row)
        f.close()
3 Answers
0
Maybe this will help someone with a similar problem.
While reading an ordinary CSV file line by line, I was deciding, based on one field, whether to save each row to file A or file B, and I ran into a memory overflow that crashed my system. So I profiled my memory usage, made one small change, and ended up with runs roughly three times faster and no memory leak.
This was my earlier code, with the memory leak and the long runtime:
with open('input_file.csv', 'r') as input_file, open('file_A.csv', 'w') as file_A, open('file_B.csv', 'w') as file_B:
    input_csv = csv.reader(input_file)
    file_A_csv = csv.writer(file_A)
    file_B_csv = csv.writer(file_B)
    for row in input_csv:
        condition_row = row[1]
        if condition_row == 'condition':
            file_A_csv.writerow(row)
        else:
            file_B_csv.writerow(row)
But if you skip the intermediate variable (or variables) while reading the file, like this:
with open('input_file.csv', 'r') as input_file, open('file_A.csv', 'w') as file_A, open('file_B.csv', 'w') as file_B:
    input_csv = csv.reader(input_file)
    file_A_csv = csv.writer(file_A)
    file_B_csv = csv.writer(file_B)
    for row in input_csv:
        if row[1] == 'condition':
            file_A_csv.writerow(row)
        else:
            file_B_csv.writerow(row)
I can't explain why, but after some testing I found my runs averaged about three times faster and my memory usage was close to zero.
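If you want to check numbers like these yourself rather than take them on faith, the standard library's tracemalloc module (Python 3.4+) reports peak allocations; a minimal sketch, where split_csv() is just a hypothetical wrapper around whichever loop variant you're measuring:

import time
import tracemalloc

tracemalloc.start()
start = time.time()
split_csv()  # hypothetical wrapper around the loop being measured
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print('elapsed %.1fs, peak %.1f MB' % (time.time() - start, peak / 1e6))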
1
You're not "leaking" memory, you're just using a ton of memory.
You're turning every line of text into a dictionary of Python strings, which takes far more memory than the single string it came from. For the details, see: Why does my 100mb file take 1gb of memory?
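To get a rough feel for the blow-up, you can compare a raw line against the dict a DictReader builds from it; a small sketch (field names invented for illustration, and sys.getsizeof measures only shallow sizes, so the true footprint is larger still):

import sys

line = '12345\tSMITH\tJOHN\t1970\t43215'
fields = ['VOTER ID', 'LASTNAME', 'FIRSTNAME', 'DATE OF BIRTH', 'RES_ZIP']
row = dict(zip(fields, line.split('\t')))

# one dict object plus a separate str object per key and per value
per_row = sys.getsizeof(row) + sum(
    sys.getsizeof(k) + sys.getsizeof(v) for k, v in row.items())
print(sys.getsizeof(line), per_row)  # the dict costs several times the raw line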
The solution is to process the data incrementally. You don't actually need the whole list, because you never look back at an earlier value. So:
with open('data.tsv', 'r') as fin, open('data_out.csv', 'w') as fout:
    r = csv.DictReader(fin, delimiter='\t')
    # the output columns depend only on the input header, so derive them
    # by formatting a blank row instead of buffering the real ones
    blank = dict.fromkeys(r.fieldnames, '')
    output_fields = sorted(formatRow(blank, r.fieldnames).keys())
    w = csv.DictWriter(fout, output_fields, quotechar='"')
    w.writeheader()
    for row in r:
        w.writerow(formatRow(row, r.fieldnames))
Or, even more simply (writerows will happily consume a generator, one row at a time):
    w.writerows(formatRow(row, r.fieldnames) for row in r)
Of course this is slightly different from your original code in that it creates the output file even when the input file is empty. If that matters, it's easy enough to fix:
with open('data.tsv', 'r') as fin:
    r = csv.DictReader(fin, delimiter='\t')
    first_row = next(r, None)
    if first_row is not None:
        formatted = formatRow(first_row, r.fieldnames)
        output_fields = sorted(formatted.keys())
        with open('data_out.csv', 'w') as fout:
            w = csv.DictWriter(fout, output_fields, quotechar='"')
            w.writeheader()
            w.writerow(formatted)
            for row in r:
                w.writerow(formatRow(row, r.fieldnames))
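An alternative sketch (my own variation, not part of the original answer): if the duplicated writerow call for the first row bothers you, itertools.chain can stitch the peeked row back onto the reader so one loop handles everything:

import csv
import itertools

with open('data.tsv', 'r') as fin:
    r = csv.DictReader(fin, delimiter='\t')
    first_row = next(r, None)
    if first_row is not None:
        formatted = formatRow(first_row, r.fieldnames)  # formatRow from the question
        with open('data_out.csv', 'w') as fout:
            w = csv.DictWriter(fout, sorted(formatted.keys()), quotechar='"')
            w.writeheader()
            # chain() glues the first row back in front of the remaining rows
            for row in itertools.chain([first_row], r):
                w.writerow(formatRow(row, r.fieldnames))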
1
You're putting all of your data into one giant list called output_rows. You need to process each row as you read it, instead of storing them all in a memory-hungry Python list.
with open('data.tsv', 'rb') as fin, open('data_out.csv', 'wb') as fout:
    reader = csv.DictReader(fin, delimiter='\t')
    fieldnames = reader.fieldnames
    firstrow = next(reader)
    basic_dict = formatRow(firstrow, fieldnames)
    output_fields = sorted(basic_dict.keys())
    writer = csv.DictWriter(fout, output_fields, quotechar='"')
    writer.writeheader()
    writer.writerow(basic_dict)
    for row in reader:
        basic_dict = formatRow(row, fieldnames)
        writer.writerow(basic_dict)
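One caveat worth adding: the 'rb'/'wb' modes above follow the Python 2 csv convention. On Python 3 the csv module wants text-mode files opened with newline='' instead, so the equivalent (assuming the question's formatRow) would look something like:

import csv

# Python 3: text mode plus newline='', so the csv module
# controls line endings itself
with open('data.tsv', 'r', newline='') as fin, \
        open('data_out.csv', 'w', newline='') as fout:
    reader = csv.DictReader(fin, delimiter='\t')
    first = formatRow(next(reader), reader.fieldnames)
    writer = csv.DictWriter(fout, sorted(first.keys()), quotechar='"')
    writer.writeheader()
    writer.writerow(first)
    for row in reader:
        writer.writerow(formatRow(row, reader.fieldnames))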