How to split a comma-separated string in Python, ignoring commas inside quotes

Asked 2025-04-16 11:44

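I want to split a comma-separated string in Python. The tricky part for me is that some of the data fields contain commas themselves, and those fields are enclosed in quotes (" or '). The resulting split strings should also have the quotes around the fields removed. In addition, some fields can be empty.

For example: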

hey,hello,,"hello,world",'hey,world'

needs to be split into the following 5 parts:

['hey', 'hello', '', 'hello,world', 'hey,world']

I would appreciate any thoughts, suggestions, or help on how to solve this problem in Python.

Thanks, Vish

Tags: string processing, quote handling, text parsing, data cleaning, comma-separated, empty-field handling

4 Answers

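The csv module cannot handle the case where both the double quote (") and the single quote (') act as quote characters at the same time. Without a module that provides that kind of dialect, we have to parse the data ourselves. To avoid depending on a third-party module, we can use the re module to do the lexical analysis, using the re.MatchObject.lastindex trick to associate each matched pattern with its token type.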
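To make the lastindex trick concrete, here is a minimal sketch in Python 3 syntax. The names TOKEN_SPECS, TOKEN_RE and tokenize are illustrative and do not come from the answer's code; the idea is only that each alternative gets its own capturing group, so m.lastindex identifies which alternative matched.

```python
import re

# Illustrative token table: one capturing group per alternative, so that
# m.lastindex tells us which alternative matched.
TOKEN_SPECS = (
    (r'"[^"]*"', "DQUOTED"),
    (r"'[^']*'", "SQUOTED"),
    (r",", "COMMA"),
    (r"[^,]+", "UNQUOTED"),  # must come after the quoted alternatives
)
TOKEN_RE = re.compile("|".join("(%s)" % pattern for pattern, _ in TOKEN_SPECS))


def tokenize(text):
    """Yield (token_type, matched_text) pairs for text."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError("cannot tokenize at offset %d" % pos)
        # Groups are numbered from 1, hence the -1 when indexing the table.
        yield TOKEN_SPECS[m.lastindex - 1][1], m.group(0)
        pos = m.end()
```

Calling list(tokenize('a,"b,c"')) yields an UNQUOTED token, a COMMA token and a DQUOTED token whose text still includes its surrounding quotes; stripping them is left to the caller.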

The following code is written for Python 2. When run as a script, it passes all of its tests under Python 2.7 and 2.2.

import re

# lexical token symbols
DQUOTED, SQUOTED, UNQUOTED, COMMA, NEWLINE = xrange(5)

_pattern_tuples = (
    (r'"[^"]*"', DQUOTED),
    (r"'[^']*'", SQUOTED),
    (r",", COMMA),
    (r"$", NEWLINE), # matches end of string OR \n just before end of string
    (r"[^,\n]+", UNQUOTED), # order in the above list is important
    )
_matcher = re.compile(
    '(' + ')|('.join([i[0] for i in _pattern_tuples]) + ')',
    ).match
_toktype = [None] + [i[1] for i in _pattern_tuples]
# need dummy at start because re.MatchObject.lastindex counts from 1 

def csv_split(text):
    r"""Split a csv string into a list of fields.
    Fields may be quoted with " or ' or be unquoted.
    An unquoted string can contain both a " and a ', provided neither is at
    the start of the string.
    A trailing \n will be ignored if present.
    """
    fields = []
    pos = 0
    want_field = True
    while 1:
        m = _matcher(text, pos)
        if not m:
            raise ValueError("Problem at offset %d in %r" % (pos, text))
        ttype = _toktype[m.lastindex]
        if want_field:
            if ttype in (DQUOTED, SQUOTED):
                fields.append(m.group(0)[1:-1])
                want_field = False
            elif ttype == UNQUOTED:
                fields.append(m.group(0))
                want_field = False
            elif ttype == COMMA:
                fields.append("")
            else:
                assert ttype == NEWLINE
                fields.append("")
                break
        else:
            if ttype == COMMA:
                want_field = True
            elif ttype == NEWLINE:
                break
            else:
                print "*** Error dump ***", ttype, repr(m.group(0)), fields
                raise ValueError("Missing comma at offset %d in %r" % (pos, text))
        pos = m.end(0)
    return fields

if __name__ == "__main__":
    tests = (
        ("""hey,hello,,"hello,world",'hey,world'\n""", ['hey', 'hello', '', 'hello,world', 'hey,world']),
        ("""\n""", ['']),
        ("""""", ['']),
        ("""a,b\n""", ['a', 'b']),
        ("""a,b""", ['a', 'b']),
        (""",,,\n""", ['', '', '', '']),
        ("""a,contains both " and ',c""", ['a', 'contains both " and \'', 'c']),
        ("""a,'"starts with "...',c""", ['a', '"starts with "...', 'c']),
        )
    for text, expected in tests:
        result = csv_split(text)
        print
        print repr(text)
        print repr(result)
        print repr(expected)
        print result == expected
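Because the script above relies on xrange and print statements, it will not run unchanged on Python 3. A possible port is sketched below. It keeps the answer's patterns and its two-state loop, but the function name csv_split_py3 and the use of string labels for token types are choices made for this sketch, not part of the original answer.

```python
import re

# Same patterns and ordering as the answer; token types are plain strings here.
_TOKEN_PATTERNS = (
    (r'"[^"]*"', "DQUOTED"),
    (r"'[^']*'", "SQUOTED"),
    (r",", "COMMA"),
    (r"$", "NEWLINE"),         # end of string, or just before a final \n
    (r"[^,\n]+", "UNQUOTED"),  # must come last
)
_match = re.compile("|".join("(%s)" % p for p, _ in _TOKEN_PATTERNS)).match
# Dummy first entry because lastindex counts groups from 1.
_TOKEN_TYPES = [None] + [t for _, t in _TOKEN_PATTERNS]


def csv_split_py3(text):
    """Split one csv line into fields; fields may be quoted with " or '."""
    fields = []
    pos = 0
    want_field = True
    while True:
        m = _match(text, pos)
        if not m:
            raise ValueError("Problem at offset %d in %r" % (pos, text))
        ttype = _TOKEN_TYPES[m.lastindex]
        if want_field:
            if ttype in ("DQUOTED", "SQUOTED"):
                fields.append(m.group(0)[1:-1])  # strip the quotes
                want_field = False
            elif ttype == "UNQUOTED":
                fields.append(m.group(0))
                want_field = False
            else:
                # COMMA or NEWLINE while expecting a field: the field is empty.
                fields.append("")
                if ttype == "NEWLINE":
                    break
        elif ttype == "COMMA":
            want_field = True
        elif ttype == "NEWLINE":
            break
        else:
            raise ValueError("Missing comma at offset %d in %r" % (pos, text))
        pos = m.end(0)
    return fields
```

On the question's sample line this returns the five expected fields, with or without a trailing newline.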

Answered 2025-04-16 by Python大师

Sounds like you want the csv module.
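As a quick check of how far the csv module gets on the question's sample line, the sketch below parses it twice. csv.reader accepts a single quotechar, so each run honors only one of the two quote styles; the variable names are illustrative.

```python
import csv

line = "hey,hello,,\"hello,world\",'hey,world'"

# Default dialect: only the double quote is treated as a quote character.
default_row = next(csv.reader([line]))

# quotechar="'" makes the single quote the quote character instead.
single_row = next(csv.reader([line], quotechar="'"))

print(default_row)  # the single-quoted field is split at its comma
print(single_row)   # the double-quoted field is split at its comma
```

If the input only ever uses one quote style, csv.reader with the matching quotechar may be all that is needed; mixed quoting calls for custom parsing.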

Answered 2025-04-16 by Python大师

(Edit: the original answer had trouble with empty fields at the edges because of the way re.findall works, so I refactored it a bit and added some tests.)

import re

def parse_fields(text):
    r"""
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\''))
    ['hey', 'hello', '', 'hello,world', 'hey,world']
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\','))
    ['hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(',hey,hello,,"hello,world",\'hey,world\','))
    ['', 'hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(''))
    ['']
    >>> list(parse_fields(','))
    ['', '']
    >>> list(parse_fields('testing,quotes not at "the" beginning \'of\' the,string'))
    ['testing', 'quotes not at "the" beginning \'of\' the', 'string']
    >>> list(parse_fields('testing,"unterminated quotes'))
    ['testing', '"unterminated quotes']
    """
    pos = 0
    exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")
    while True:
        m = exp.search(text, pos)
        result = m.group(2)
        separator = m.group(3)

        yield result

        if not separator:
            break

        pos = m.end(0)

if __name__ == "__main__":
    import doctest
    doctest.testmod()

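(['"]?) matches an optional single or double quote.

(.*?) matches the field itself. The match is non-greedy: it consumes only as much as necessary instead of eating the whole string. This group is assigned to result, and it is what the generator actually yields.

\1 is a backreference that matches the same single or double quote matched earlier (if any).

(,|$) matches the comma separating each entry, or the end of the line. This group is assigned to separator.

If separator is false (for example, empty), there is no separator, so we have reached the end of the string and we are done. Otherwise, we update the start position to where the regex match ended (m.end(0)) and continue the loop.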
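The walk-through above can be traced step by step. The sketch below reuses the answer's regular expression on a shortened, illustrative input and records the three groups of each match: the opening quote, the field contents, and the separator.

```python
import re

exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")
text = "hey,\"hello,world\",'hey,world'"

pos = 0
steps = []
while True:
    m = exp.search(text, pos)
    # Each entry is (opening quote, field contents, separator).
    steps.append(m.groups())
    if not m.group(3):
        break  # empty separator: end of the string reached
    pos = m.end(0)

for quote, field, separator in steps:
    print(repr(quote), repr(field), repr(separator))
```

The last entry has an empty separator, which is what ends the loop.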

Answered 2025-04-16 by Python大师
