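How do I split a UTF-8 file into rows and comma-separated fields using Python?
I'm trying to convert a UTF-16 file to UTF-8 because I'm using Python's csv module, which doesn't seem to support UTF-16 files. Once the file is converted, I want to split each line into fields so that I can import it into a Postgres database with a simple row.strip() on each value. My Python code looks roughly like this: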
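    # Python 2
    import codecs
    import csv

    # Step 1: transcode the UTF-16 source into a UTF-8 copy.
    with codecs.open(sourcefile, 'rU', 'UTF-16') as infile:
        with open(sourcefile + '.utf8', 'wb') as outfile:
            for line in infile:
                outfile.write(line.encode('utf8'))

    # Step 2: parse the UTF-8 copy with the csv module.
    with open(sourcefile + '.utf8', 'rb') as f:
        reader = csv.reader(f, delimiter=',')
        for row in reader:
            print row[1]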
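However, I can't split the rows: each row seems to contain only one element, so printing row[1] raises an index-out-of-range error. How should I split this file?
Rows of data as they appear in Excel: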
15,"1/2 TYPE A","98","MCDS, TX","XA","852","TX","955","148","HAPPY, TX",,"$0.00","0","0.00","$1,504","179","0.00%","100.00%","0"
32,"1/2 TYPE B","98","MCDS, MI","XA","252","MI","72","925","HAPPY, MI",,"$0.00","0","0.00","$2,504","225","0.00%","100.00%","0"
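Sorry that I didn't describe this clearly. In short, the input file is a UTF-16 file. Previously I opened it in Excel, split the single column into multiple columns on commas, and saved it as a csv file. I then processed that csv file with a Python script, which read it, stripped leading and trailing whitespace from each value, and imported the data into the Postgres database.
The original import part of the Python script (from when the file had already been split on commas) looked roughly like this (simplified):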
    for row in reader:
        # Map each CSV column to a named query parameter.
        arg = {
            'item_number': row[0].strip(),
            'item_size': row[1].strip(),
            'description': row[2].strip(),
            #etc...
        }
        cur.execute(
            """INSERT INTO
                   "Sales"("ITEM_NUMBER","ITEM_SIZE","DESCRIPTION")
               SELECT
                   %(item_number)s,
                   %(item_size)s,
                   %(description)s;""", arg)
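Now, though, I'd like my Python script to process the UTF-16 file directly and import the data into Postgres, so that I no longer need to open the file in Excel. My plan is to convert the file to UTF-8 first, then somehow strip the whitespace from each line and import it into the database.
I've managed to convert the file to UTF-8, but the problem now is that the UTF-8 file is effectively a set of lines that are each treated as a single column. How do I strip the whitespace from each line? I can't simply use row[0].strip(), because some of the commas in the file are part of a description.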
1 Answer
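Rather than creating an intermediate file, you can use the conversion approach described in the csv module documentation directly; search it for unicode_csv_reader. For convenience, I turned the generators into generator expressions: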
    # Python 2
    import codecs
    import csv

    sourcefile = 'csv16.csv'

    with codecs.open(sourcefile, 'rU', 'UTF-16') as infile:
        # The Python 2 csv module works on byte strings, so feed it UTF-8.
        reader = csv.reader((line.encode('utf-8') for line in infile),
                            delimiter=',')
        # Decode every field back to unicode after parsing.
        for row in ([item.decode('utf-8') for item in row]
                    for row in reader):
            print u'/'.join(row)
I tested the code above with the following file, saved as big-endian UTF-16:
1,2,3,4
5,6,7,8
"98°","①", "®©§™"
Output:
1/2/3/4
5/6/7/8
98°/①/ "®©§™"
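For readers on Python 3, where the csv module works on text rather than bytes, the encode/decode round trip should not be needed. The following is only a sketch under that assumption, not part of the tested code above: the helper names read_utf16_rows and to_insert_args are made up for illustration, and the sample data is a shortened version of the rows shown in the question plus the third test line. It also shows skipinitialspace=True, which should avoid the stray quotes seen in the last output line when a space follows a comma.

```python
import csv
import os
import tempfile


def read_utf16_rows(path):
    """Yield each CSV record of a UTF-16 file as a list of stripped strings.

    Hypothetical helper, Python 3 only.
    """
    # newline='' leaves line-ending handling to the csv module.
    with open(path, 'r', encoding='utf-16', newline='') as infile:
        # skipinitialspace=True makes `, "x"` parse like `,"x"`.
        reader = csv.reader(infile, delimiter=',', skipinitialspace=True)
        for row in reader:
            yield [field.strip() for field in row]


def to_insert_args(row):
    """Map the first three columns to the parameter names from the question."""
    return {
        'item_number': row[0],
        'item_size': row[1],
        'description': row[2],
    }


# Small demonstration: write a UTF-16 sample file, then read it back.
sample = '15,"1/2 TYPE A","98","MCDS, TX"\r\n"98°","①", "®©§™"\r\n'
with tempfile.NamedTemporaryFile(suffix='.csv', delete=False) as tmp:
    tmp.write(sample.encode('utf-16'))  # the codec prepends a BOM

rows = list(read_utf16_rows(tmp.name))
os.remove(tmp.name)

# Commas inside quoted fields stay part of the field ("MCDS, TX").
print(to_insert_args(rows[0]))
```

Each dictionary returned by to_insert_args could then be passed as the second argument to cur.execute() with the INSERT statement from the question, assuming the same cursor and table are in place.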