Matching nested structures with regular expressions in Python


1 answer

Anonymous, 1 day ago

Edit: <a href="https://stackoverflow.com/a/17141899/190597">falsetru's nested parser</a> is faster and simpler than my original solution. I modified it slightly so that it accepts arbitrary regex patterns for the delimiters and the item separator:
<pre><code>import re

def parse_nested(text, left=r'[(]', right=r'[)]', sep=r','):
    """ https://stackoverflow.com/a/17141899/190597 (falsetru) """
    # The capturing group makes re.split keep the delimiters as tokens
    pat = r'({}|{}|{})'.format(left, right, sep)
    tokens = re.split(pat, text)
    stack = [[]]
    for x in tokens:
        # Skip empty strings and item separators
        if not x or re.match(sep, x):
            continue
        if re.match(left, x):
            # Nest a new list inside the current list
            current = []
            stack[-1].append(current)
            stack.append(current)
        elif re.match(right, x):
            stack.pop()
            if not stack:
                raise ValueError('error: opening bracket is missing')
        else:
            stack[-1].append(x)
    if len(stack) > 1:
        raise ValueError('error: closing bracket is missing')
    return stack.pop()

text = "a {{c1::group {{c2::containing::HINT}} a few}} {{c3::words}} or three"
print(parse_nested(text, r'\s*{{', r'}}\s*'))
</code></pre>
which yields
<pre><code>['a', ['c1::group', ['c2::containing::HINT'], 'a few'], ['c3::words'], 'or three']
</code></pre>
<hr/>
Nested structures cannot be matched with Python regex alone, but it is remarkably easy to build a basic parser (one that can handle nested structures) using <a href="http://mail.python.org/pipermail/python-dev/2003-April/035075.html" rel="nofollow noreferrer">re.Scanner</a>:
<pre><code>import re

class Node(list):
    def __init__(self, parent=None):
        self.parent = parent

class NestedParser(object):
    def __init__(self, left=r'\(', right=r'\)'):
        # Rules are tried in order; a None action discards the token
        self.scanner = re.Scanner([
            (left, self.left),
            (right, self.right),
            (r"\s+", None),
            (r".+?(?=(%s|%s|$))" % (right, left), self.other),
        ])
        self.result = Node()
        self.current = self.result

    def parse(self, content):
        self.scanner.scan(content)
        return self.result

    def left(self, scanner, token):
        # Open a new child node and descend into it
        new = Node(self.current)
        self.current.append(new)
        self.current = new

    def right(self, scanner, token):
        # Close the current node and return to its parent
        self.current = self.current.parent

    def other(self, scanner, token):
        self.current.append(token.strip())
</code></pre>
It can be used like this:
<pre><code>p = NestedParser()
print(p.parse("((a+b)*(c-d))"))
# [[['a+b'], '*', ['c-d']]]

p = NestedParser()
print(p.parse("( (a ( ( c ) b ) ) ( d ) e )"))
# [[['a', [['c'], 'b']], ['d'], 'e']]
</code></pre>
By default, <code>NestedParser</code> matches nested parentheses. You can pass other regular expressions to match other nested patterns, such as square brackets, <code>[]</code>. <a href="https://stackoverflow.com/questions/14712046/regex-to-extract-nested-patterns#14712046">For example</a>:
<pre><code>p = NestedParser(r'\[', r'\]')
print(p.parse("Lorem ipsum dolor sit amet [@a xxx yyy [@b xxx yyy [@c xxx yyy]]] lorem ipsum sit amet"))
# ['Lorem ipsum dolor sit amet', ['@a xxx yyy', ['@b xxx yyy', ['@c xxx yyy']]],
# 'lorem ipsum sit amet']

p = NestedParser('&lt;foo&gt;', '&lt;/foo&gt;')
print(p.parse("&lt;foo&gt;BAR&lt;foo&gt;BAZ&lt;/foo&gt;&lt;/foo&gt;"))
# [['BAR', ['BAZ']]]
</code></pre>
<hr/>
Of course, <code>pyparsing</code> can do a whole lot more than the code above. But for this single purpose, the <code>NestedParser</code> above is about 5x faster for small strings:
<pre><code>In [27]: import pyparsing as pp

In [28]: data = "( (a ( ( c ) b ) ) ( d ) e )"

In [32]: %timeit pp.nestedExpr().parseString(data).asList()
1000 loops, best of 3: 1.09 ms per loop

In [33]: %timeit NestedParser().parse(data)
1000 loops, best of 3: 234 us per loop
</code></pre>
and about 28x faster for larger strings:
<pre><code>In [44]: %timeit pp.nestedExpr().parseString('({})'.format(data*10000)).asList()
1 loops, best of 3: 8.27 s per loop

In [45]: %timeit NestedParser().parse('({})'.format(data*10000))
1 loops, best of 3: 297 ms per loop
</code></pre>
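To illustrate why both parsers above pair a regex tokenizer with explicit nesting state, here is a minimal sketch that compares a plain non-greedy regex with a depth counter driven by <code>re.finditer</code>. The helper name <code>top_level_groups</code> and the sample string are assumptions made up for this illustration and are not part of either parser above. The sketch only extracts the outermost balanced groups; it does not build a tree.

```python
import re

def top_level_groups(text, left=r'\(', right=r'\)'):
    """Return the outermost balanced groups in text (hypothetical helper)."""
    # Named groups tell us which delimiter matched, even if the
    # caller's patterns contain capturing groups of their own.
    token = re.compile('(?P<left>{})|(?P<right>{})'.format(left, right))
    groups, depth, start = [], 0, None
    for m in token.finditer(text):
        if m.group('left') is not None:
            if depth == 0:
                start = m.start()      # remember where the outer group opens
            depth += 1
        else:
            if depth == 0:
                raise ValueError('opening bracket is missing')
            depth -= 1
            if depth == 0:
                groups.append(text[start:m.end()])
    if depth:
        raise ValueError('closing bracket is missing')
    return groups

sample = "f(a, (b, c)) + g(d)"         # made-up input for illustration
naive = re.findall(r'\(.*?\)', sample) # stops at the first ')' it sees
balanced = top_level_groups(sample)
print(naive)     # ['(a, (b, c)', '(d)']
print(balanced)  # ['(a, (b, c))', '(d)']
```

The non-greedy pattern cuts the first group short because it has no notion of depth, while the counter keeps going until the opening delimiter has been closed. The same idea underlies the stack in <code>parse_nested</code> and the parent pointers in <code>NestedParser</code>.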
