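How to split a comma-separated string in Python, ignoring commas inside quotes
I want to split a comma-separated string in Python. The tricky part for me is that some data fields contain commas themselves, and those fields are enclosed in quotes (" or '). The quotes also need to be removed from the fields in the resulting strings. In addition, some fields may be empty.
For example: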
hey,hello,,"hello,world",'hey,world'
needs to be split into the following 5 parts:
['hey', 'hello', '', 'hello,world', 'hey,world']
If anyone can give me ideas, suggestions, or help on how to solve this in Python, I would greatly appreciate it.
Thanks, Vish
4 Answers
2
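The csv module cannot handle a case where both " and ' act as quote characters at the same time. Absent a module that provides that capability, we have to parse the data ourselves. To avoid depending on a third-party module, we can use the re module to do the lexical analysis, using the re.MatchObject.lastindex trick to associate each matched pattern with its token type.
The following code, when run as a script, passes all the tests shown, with Python 2.7 and 2.2.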
import re

# lexical token symbols
DQUOTED, SQUOTED, UNQUOTED, COMMA, NEWLINE = xrange(5)

_pattern_tuples = (
    (r'"[^"]*"', DQUOTED),
    (r"'[^']*'", SQUOTED),
    (r",", COMMA),
    (r"$", NEWLINE), # matches end of string OR \n just before end of string
    (r"[^,\n]+", UNQUOTED), # order in the above list is important
    )
_matcher = re.compile(
    '(' + ')|('.join([i[0] for i in _pattern_tuples]) + ')',
    ).match
_toktype = [None] + [i[1] for i in _pattern_tuples]
# need dummy at start because re.MatchObject.lastindex counts from 1

def csv_split(text):
    """Split a csv string into a list of fields.
    Fields may be quoted with " or ' or be unquoted.
    An unquoted string can contain both a " and a ', provided neither is at
    the start of the string.
    A trailing \n will be ignored if present.
    """
    fields = []
    pos = 0
    want_field = True
    while 1:
        m = _matcher(text, pos)
        if not m:
            raise ValueError("Problem at offset %d in %r" % (pos, text))
        ttype = _toktype[m.lastindex]
        if want_field:
            if ttype in (DQUOTED, SQUOTED):
                # strip the surrounding quote characters
                fields.append(m.group(0)[1:-1])
                want_field = False
            elif ttype == UNQUOTED:
                fields.append(m.group(0))
                want_field = False
            elif ttype == COMMA:
                # a comma where a field was expected means an empty field
                fields.append("")
            else:
                assert ttype == NEWLINE
                fields.append("")
                break
        else:
            if ttype == COMMA:
                want_field = True
            elif ttype == NEWLINE:
                break
            else:
                print "*** Error dump ***", ttype, repr(m.group(0)), fields
                raise ValueError("Missing comma at offset %d in %r" % (pos, text))
        pos = m.end(0)
    return fields

if __name__ == "__main__":
    tests = (
        ("""hey,hello,,"hello,world",'hey,world'\n""", ['hey', 'hello', '', 'hello,world', 'hey,world']),
        ("""\n""", ['']),
        ("""""", ['']),
        ("""a,b\n""", ['a', 'b']),
        ("""a,b""", ['a', 'b']),
        (""",,,\n""", ['', '', '', '']),
        ("""a,contains both " and ',c""", ['a', 'contains both " and \'', 'c']),
        ("""a,'"starts with "...',c""", ['a', '"starts with "...', 'c']),
        )
    for text, expected in tests:
        result = csv_split(text)
        print
        print repr(text)
        print repr(result)
        print repr(expected)
        print result == expected
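The lastindex trick used above may be easier to see in isolation. The following is only a minimal sketch, not part of the answer's code: the token kinds (NUMBER, WORD, SPACE) and the tokenize helper are hypothetical names made up for this illustration. It relies on each alternative being wrapped in exactly one capturing group, with no capturing groups inside the individual patterns, so that lastindex identifies which alternative matched.

```python
import re

# Hypothetical token kinds, used only for this illustration.
NUMBER, WORD, SPACE = range(3)

_patterns = (
    (r"\d+", NUMBER),
    (r"[A-Za-z]+", WORD),
    (r"\s+", SPACE),
)

# One capturing group per alternative: group 1 is NUMBER, group 2 is WORD, ...
_match = re.compile("|".join("(%s)" % p for p, _ in _patterns)).match

# lastindex counts from 1, so pad index 0 with a dummy entry.
_kinds = [None] + [kind for _, kind in _patterns]


def tokenize(text):
    """Return a list of (kind, lexeme) pairs; raise ValueError on bad input."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = _match(text, pos)
        if m is None:
            raise ValueError("No token at offset %d in %r" % (pos, text))
        # lastindex is the number of the group that matched, i.e. the alternative.
        tokens.append((_kinds[m.lastindex], m.group(0)))
        pos = m.end(0)
    return tokens
```

For instance, tokenize("abc 42") should give [(WORD, 'abc'), (SPACE, ' '), (NUMBER, '42')]. The lookup table keeps the pattern list and the token types in one place, which is the same design the answer's _pattern_tuples uses.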
8
Sounds like you want the csv module.
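As a rough illustration of what the csv module does with the question's sample line, here is a small sketch. The split_with helper is a hypothetical name for this example only. Note that csv.reader takes a single quotechar, so each call below handles only one of the two quote styles, which may or may not be enough for the asker's data.

```python
import csv

line = "hey,hello,,\"hello,world\",'hey,world'"


def split_with(quotechar):
    """Parse the sample line with csv.reader using one quote character."""
    # csv.reader accepts any iterable of lines, so a one-element list works.
    return next(csv.reader([line], quotechar=quotechar))


double_only = split_with('"')
# ['hey', 'hello', '', 'hello,world', "'hey", "world'"]

single_only = split_with("'")
# ['hey', 'hello', '', '"hello', 'world"', 'hey,world']
```

In both cases the field quoted with the other character is split at its inner comma and keeps its quote marks.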
5
(Edit: The original answer had trouble with empty fields on the edges because of the way re.findall works, so I refactored it a bit and added some tests.)
import re

def parse_fields(text):
    r"""
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\''))
    ['hey', 'hello', '', 'hello,world', 'hey,world']
    >>> list(parse_fields('hey,hello,,"hello,world",\'hey,world\','))
    ['hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(',hey,hello,,"hello,world",\'hey,world\','))
    ['', 'hey', 'hello', '', 'hello,world', 'hey,world', '']
    >>> list(parse_fields(''))
    ['']
    >>> list(parse_fields(','))
    ['', '']
    >>> list(parse_fields('testing,quotes not at "the" beginning \'of\' the,string'))
    ['testing', 'quotes not at "the" beginning \'of\' the', 'string']
    >>> list(parse_fields('testing,"unterminated quotes'))
    ['testing', '"unterminated quotes']
    """
    pos = 0
    exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")
    while True:
        m = exp.search(text, pos)
        result = m.group(2)
        separator = m.group(3)

        yield result

        if not separator:
            break

        pos = m.end(0)

if __name__ == "__main__":
    import doctest
    doctest.testmod()

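(['"]?) matches an optional single or double quote.
(.*?) matches the string itself. This is a non-greedy match: it matches only as much as necessary instead of eating the whole string. It is assigned to result, and it is what actually gets yielded.
\1 is a backreference that matches the same single or double quote matched earlier (if any).
(,|$) matches the comma separating each entry, or the end of the line. It is assigned to separator.
If separator is false (e.g. empty), there is no separator, which means we have reached the end of the string and are done. Otherwise, we update the start position to where the regex match ended (m.end(0)) and continue the loop.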
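To make the role of each group concrete, here is a small sketch that applies the same regular expression to two consecutive fields and inspects the groups. The show_groups helper and the sample string are assumptions introduced for this illustration and are not part of the answer.

```python
import re

exp = re.compile(r"""(['"]?)(.*?)\1(,|$)""")


def show_groups(text, pos=0):
    """Return (quote, result, separator, next_pos) for the field starting at pos."""
    m = exp.search(text, pos)
    return m.group(1), m.group(2), m.group(3), m.end(0)


sample = "\"hello,world\",'hey,world'"

first = show_groups(sample)
# ('"', 'hello,world', ',', 14)

second = show_groups(sample, first[3])
# ("'", 'hey,world', '', 25)
```

The first call shows the backreference forcing the lazy group to run past the inner comma until the closing double quote. The second call ends with an empty separator, which is the condition parse_fields uses to stop.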