Python CSV模块处理字段内引号中的逗号

1 投票
1 回答
1423 浏览
提问于 2025-04-18 18:44

我在应用程序中使用Python的csv模块来解析CSV文件中的数据。在测试应用程序时,我的同事从一个随机网站复制粘贴了一段示例文本。

这段示例文本在字段中有双引号,并且双引号内还有一个逗号。csv模块能够正确处理双引号外的逗号,但双引号内的逗号却被分到了下一个列。我查看了csv的规范,发现这个字段符合规范,因为它用另一组双引号来转义了双引号。

我在libreoffice中检查了这个文件,发现它处理得很好。

这是我在csv数据中遇到问题的一行:

company_name,company_revenue,company_start_year,company_website,company_description,company_email
Acme Inc,80000000000000,2004,http://google.com,"The company is never clearly defined in Road Runner cartoons but appears to be a conglomerate which produces every product type imaginable, no matter how elaborate or extravagant - most of which never work as desired or expected. In the Road Runner cartoon Beep, Beep, it was referred to as ""Acme Rocket-Powered Products, Inc."" based in Fairfield, New Jersey. Many of its products appear to be produced specifically for Wile E. Coyote; for example, the Acme Giant Rubber Band, subtitled ""(For Tripping Road Runners)"".

Sometimes, Acme can also send living creatures through the mail, though that isn't done very often. Two examples of this are the Acme Wild-Cat, which had been used on Elmer Fudd and Sam Sheepdog (which doesn't maul its intended victim); and Acme Bumblebees in one-fifth bottles (which sting Wile E. Coyote). The Wild Cat was used in the shorts Don't Give Up the Sheep and A Mutt in a Rut, while the bees were used in the short Zoom and Bored.

While their products leave much to be desired, Acme delivery service is second to none; Wile E. can merely drop an order into a mailbox (or enter an order on a website, as in the Looney Tunes: Back in Action movie), and have the product in his hands within seconds.",roadrunner@acme.com

这是在调试日志中看到的样子:

2014-08-27 21:35:53,922 - DEBUG: company_website=http://google.com
2014-08-27 21:35:53,923 - DEBUG: company_revenue=80000000000000
2014-08-27 21:35:53,923 - DEBUG: company_start_year=2004
2014-08-27 21:35:53,923 - DEBUG: account_description=The company is never clearly defined in Road Runner cartoons but appears to be a conglomerate which produces every product type imaginable, no matter how elaborate or extravagant - most of which never work as desired or expected. In the Road Runner cartoon Beep, Beep, it was referred to as "Acme Rocket-Powered Products
2014-08-27 21:35:53,924 - DEBUG: company_name=Acme Inc
2014-08-27 21:35:53,925 - DEBUG: company_email=Inc."" based in Fairfield

处理csv解析的相关代码:

with open(csvfile, 'rU') as contactsfile:
    # sniff for dialect of csvfile so we can automatically determine
    # what delimiters to use
    try:
        dialect = csv.Sniffer().sniff(contactsfile.read(2048))
    except:
        dialect = 'excel'
    get_total_jobs(contactsfile, dialect)
    contacts = csv.DictReader(contactsfile, dialect=dialect, skipinitialspace=True, quoting=csv.QUOTE_MINIMAL)
    # Start reading the rows
    for row in contacts:
        process_job()
        for key, value in row.iteritems():
            logging.debug("{}={}".format(key,value))

我明白这只是一些无用的数据,我们可能永远不会遇到这样的数据,但我们收到的csv文件不在我们的控制之中,可能会出现这种边缘情况。而且既然这是一个有效的csv文件,libreoffice能够正确处理它,我也希望能正确处理它。

我搜索了其他关于csv处理的问题,发现人们在处理引号或字段内的逗号时遇到了问题。我这两方面都处理得很好,但我的问题是当逗号嵌套在字段内的引号中时。这里有一个相同问题的问题,解决了这个问题 CSV文件中的双引号内的逗号,但那是一种不太优雅的解决方法,因为我没有保留原始内容,而这在RFC4180中是有效的处理方式。

1 个回答

2

Dialect.doublequote 属性

这个属性控制在一个字段内出现的引号字符应该如何被处理。当设置为 True 时,这个字符会被加倍;当设置为 False 时,会用一个转义字符作为引号字符的前缀。默认情况下是 True。

这个“嗅探器”把 doublequote 属性设置成了 False,但你发的 CSV 文件应该用 doublequote = True 来解析:

import csv
with open(csvfile, 'rb') as contactsfile:
    # sniff for dialect of csvfile so we can automatically determine
    # what delimiters to use
    try:
        dialect = csv.Sniffer().sniff(contactsfile.read(2048))
    except:
        dialect = 'excel'
    # get_total_jobs(contactsfile, dialect)
    contactsfile.seek(0)
    contacts = csv.DictReader(contactsfile, dialect=dialect, skipinitialspace=True,
                              quoting=csv.QUOTE_MINIMAL, doublequote=True)
    # Start reading the rows
    for row in contacts:
        for key, value in row.iteritems():
            print("{}={}".format(key,value))

结果是

company_description=The company is never clearly defined in Road Runner cartoons but appears to be a conglomerate which produces every product type imaginable, no matter how elaborate or extravagant - most of which never work as desired or expected. In the Road Runner cartoon Beep, Beep, it was referred to as "Acme Rocket-Powered Products, Inc." based in Fairfield, New Jersey. Many of its products appear to be produced specifically for Wile E. Coyote; for example, the Acme Giant Rubber Band, subtitled "(For Tripping Road Runners)".

Sometimes, Acme can also send living creatures through the mail, though that isn't done very often. Two examples of this are the Acme Wild-Cat, which had been used on Elmer Fudd and Sam Sheepdog (which doesn't maul its intended victim); and Acme Bumblebees in one-fifth bottles (which sting Wile E. Coyote). The Wild Cat was used in the shorts Don't Give Up the Sheep and A Mutt in a Rut, while the bees were used in the short Zoom and Bored.

While their products leave much to be desired, Acme delivery service is second to none; Wile E. can merely drop an order into a mailbox (or enter an order on a website, as in the Looney Tunes: Back in Action movie), and have the product in his hands within seconds.
company_website=http://google.com
company_start_year=2004
company_name=Acme Inc
company_revenue=80000000000000
company_email=roadrunner@acme.com

另外,根据文档,在 Python2 中,文件句柄应该以 'rb' 模式打开,而不是 'rU' 模式:

如果 csvfile 是一个文件对象,它必须在某些平台上用 'b' 标志打开,这样才有效。

撰写回答