How do I split a UTF-8 file into comma-delimited rows with Python?

Asked 2025-04-18 18:38

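I'm trying to convert a UTF-16 file to UTF-8, because I'm using Python's csv module, which doesn't seem to support UTF-16 files. Once the file is converted to UTF-8, I want to split it into delimited rows so that I can import it into a Postgres database with a simple row.strip() approach. My Python code looks roughly like this: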

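import codecs
import csv

# Re-encode the UTF-16 source file as UTF-8 (Python 2)
with codecs.open(sourcefile, 'rU', 'UTF-16') as infile:
    with open(sourcefile + '.utf8', 'wb') as outfile:
        for line in infile:
            outfile.write(line.encode('utf8'))

# Read the UTF-8 copy back with the csv module
with open(sourcefile + '.utf8', 'rb') as f:
    reader = csv.reader(f, delimiter=',')

    for row in reader:
        print row[1]  # fails here: row has only one element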

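However, I can't split the rows: each row seems to contain only one element, and printing row[1] raises an index out of range error. How should I split this file?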

Sample rows as they appear in Excel:

15,"1/2 TYPE A","98","MCDS, TX","XA","852","TX","955","148","HAPPY, TX",,"$0.00","0","0.00","$1,504","179","0.00%","100.00%","0"
32,"1/2 TYPE B","98","MCDS, MI","XA","252","MI","72","925","HAPPY, MI",,"$0.00","0","0.00","$2,504","225","0.00%","100.00%","0"

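Sorry for not describing this clearly. In short, the input file is a UTF-16 file. I used to open it in Excel, split the single column into multiple columns on commas, and save the result as a csv file. I then processed that csv file with a Python script, which read it, stripped leading and trailing whitespace, and imported the data into a Postgres database.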

The original import part of the Python script (when the file was already comma-delimited) looked roughly like this (simplified):

for row in reader:
    arg = {
        'item_number': row[0].strip(),
        'item_size': row[1].strip(),
        'description': row[2].strip(),
        # etc...
    }
    cur.execute(
        """INSERT INTO
        "Sales"("ITEM_NUMBER","ITEM_SIZE","DESCRIPTION")
        SELECT
            %(item_number)s,
            %(item_size)s,
            %(description)s;""", arg)

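Now, however, I'd like my Python script to process the UTF-16 file directly and import the data into Postgres, so that I no longer have to open the file in Excel. My plan is to convert the file to UTF-8 first, then somehow strip the whitespace from each row, and then import it into the database.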

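I've managed to convert the file to UTF-8, but the problem now is that the UTF-8 file is effectively a set of lines treated as a single column. How can I strip the whitespace from each row? I can't simply use row[0].strip(), because some of the commas in the file are part of the description.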

1 Answer

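Rather than creating an intermediate file, you can use the conversion approach described in the documentation; search for unicode_csv_reader. For convenience, I turned the generators into generator expressions: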

import codecs
import csv

sourcefile = 'csv16.csv'
with codecs.open(sourcefile, 'rU', 'UTF-16') as infile:
    # The Python 2 csv module works on byte strings, so feed it UTF-8 lines
    reader = csv.reader((line.encode('utf-8')
                         for line in infile),
                        delimiter=',')
    # Decode each parsed field back to unicode
    for row in ([item.decode('utf-8')
                 for item in row]
                for row in reader):
        print u'/'.join(row)

I tested the code above with the following file, saved as big-endian UTF-16:

1,2,3,4
5,6,7,8
"98°","①", "®©§™"

Output:

1/2/3/4
5/6/7/8
98°/①/ "®©§™"
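
The code above targets Python 2. On Python 3 (an assumption, since the question does not say which version is in use), the csv module works on text, so the encode/decode round trip should not be needed. The following is only a minimal sketch: read_utf16_csv and the temporary sample file are hypothetical names introduced for illustration, and the sample line is a shortened version of the first row from the question.

```python
import csv
import os
import tempfile


def read_utf16_csv(path):
    """Return the rows of a UTF-16 CSV file with every field stripped."""
    # encoding='utf-16' uses the BOM to pick the byte order;
    # newline='' lets the csv module handle line endings itself.
    with open(path, encoding='utf-16', newline='') as infile:
        return [[field.strip() for field in row]
                for row in csv.reader(infile, delimiter=',')]


# Build a small UTF-16 sample file (hypothetical; shortened row from the question).
sample = '15,"1/2 TYPE A","98","MCDS, TX","XA"\r\n'
with tempfile.NamedTemporaryFile('w', encoding='utf-16', newline='',
                                 suffix='.csv', delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

rows = read_utf16_csv(sample_path)
os.remove(sample_path)
print(rows)
```

Because the comma in "MCDS, TX" sits inside quotes, the csv reader keeps it within a single field, so each element of a row can be stripped individually before it is passed to the INSERT statement.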
