Pyparsing: attempt at non-greedy matching causes an infinite loop

Asked 2025-04-17 00:32

I am trying to write a parser for the RCS file format, but I hit an infinite loop when RCSid is parsed in the context of RCSadmin. When I remove the offending line

        Group(ZeroOrMore(RCSid)).setResultsName('access') + \

the hang goes away. Parsing RCSid on its own succeeds, and string parsing works too. Any suggestions?

Here is my code so far:

from   pyparsing import *
import string

# Special characters in the RCS file format
special = '$,.:;@'

RCSdigit = Word(nums, min=1, max=1).setName('RCSdigit')
RCSnum = Word(nums + '.').setName('RCSnum')
RCSidchar = CharsNotIn(special + string.whitespace).setName('RCSidchar')
RCSid = Combine(Optional(RCSnum) + ZeroOrMore(RCSidchar +
        ZeroOrMore(RCSidchar | RCSnum))).setName('RCSid')
RCSadmin = \
    Keyword('head').suppress() + \
        Optional(RCSnum).setResultsName('head') + \
        Suppress(';') + \
    Optional(Keyword('branch').suppress() +
        Optional(RCSnum).setResultsName('branch') +
        Suppress(';')
    ) + \
    Keyword('access').suppress() + \
        Group(ZeroOrMore(RCSid)).setResultsName('access') + \
        Suppress(';')

ids = ['.111abc111', '1111abc111', '1.11', '1', '1abc', 'abc',
        'abc1', 'abc1.11', 'abc.1111', '']
for i in ids:
    try:
        print i, RCSid.parseString(i)
    except ParseException, pe:
        print pe.markInputline()
for i in ids:
    line = 'head 3; branch 1; access ' + i + ';'
    try:
        print line, RCSadmin.parseString(line)
    except ParseException, pe:
        print pe.markInputline()

Output (I pressed ^C when it hung):

.111abc111 ['.111abc111']
1111abc111 ['1111abc111']
1.11 ['1.11']
1 ['1']
1abc ['1abc']
abc ['abc']
abc1 ['abc1']
abc1.11 ['abc1.11']
abc.1111 ['abc.1111']
 ['']
^Chead 3; branch 1; access .111abc111;
Traceback (most recent call last):
  File "sample.py", line 35, in <module>
    print line, RCSadmin.parseString(line)
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 1070, in parseString
    loc, tokens = self._parse( instring, 0 )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 945, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 2352, in parseImpl
    loc, exprtokens = e._parse( instring, loc, doActions )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 945, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 2604, in parseImpl
    return self.expr._parse( instring, loc, doActions, callPreParse=False )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 945, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 2724, in parseImpl
    loc, tmptokens = self.expr._parse( instring, preloc, doActions )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 945, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 2604, in parseImpl
    return self.expr._parse( instring, loc, doActions, callPreParse=False )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 945, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 2336, in parseImpl
    loc, resultlist = self.exprs[0]._parse( instring, loc, doActions, callPreParse=False )
  File "/usr/lib/pymodules/python2.6/pyparsing.py", line 943, in _parseNoCache
    if self.mayIndexError or loc >= len(instring):
KeyboardInterrupt

1 Answer

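Can an empty string really be a valid RCSid? I doubt it. Because your RCSid can match the empty string, ZeroOrMore(RCSid) keeps succeeding at the same position without consuming any input, so it never terminates. An RCSid may well be omitted from the access part of your admin statement, but you already handle that with ZeroOrMore. Define your basic elements as the format specifies them, and deal with Optional, ZeroOrMore, and so on in the higher-level constructs.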
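
To see why the empty match matters, here is a toy sketch in plain Python. It is not pyparsing's actual implementation: `zero_or_more`, `make_matcher`, and the `max_iterations` cap are made up for illustration, and the two regexes are only a simplified stand-in for RCSid.

```python
import re

# Simplified stand-ins for RCSid (assumption: not the real grammar).
ID_MAY_BE_EMPTY = re.compile(r'[^$,:;@\s]*')   # can match zero characters
ID_NON_EMPTY = re.compile(r'[^$,:;@\s]+')      # needs at least one character

def make_matcher(pattern):
    """Return a function (text, pos) -> new position, or None on failure."""
    def matcher(text, pos):
        m = pattern.match(text, pos)
        return m.end() if m else None
    return matcher

def zero_or_more(matcher, text, pos, max_iterations=1000):
    """Naive repetition: apply matcher until it fails.

    Returns (position, iterations, hit_cap). The cap exists only so this
    demo terminates; without it the empty-matching case would spin forever.
    """
    for iterations in range(max_iterations):
        new_pos = matcher(text, pos)
        if new_pos is None:
            return pos, iterations, False
        pos = new_pos
    return pos, max_iterations, True

looping = zero_or_more(make_matcher(ID_MAY_BE_EMPTY), 'abc;', 0)
stopping = zero_or_more(make_matcher(ID_NON_EMPTY), 'abc;', 0)
print(looping)
print(stopping)
```

The first call stops only because of the cap: after consuming 'abc' it keeps succeeding at ';' without moving forward. The second call fails at ';' and returns after one match, which is the behaviour you want from the repetition.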

Change RCSid to:

RCSid = Combine(RCSnum + ZeroOrMore(RCSidchar + ZeroOrMore(RCSidchar | RCSnum))
                |
                OneOrMore(RCSidchar + ZeroOrMore(RCSidchar | RCSnum))).setName('RCSid')

With this change, RCSid still matches all of your test cases (except the empty string ''), and the complete RCSadmin string parses correctly.
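
If you want to check this yourself under Python 3, something along these lines should do. It assumes a pyparsing release that still accepts the camelCase method names used in this thread (`parseString`, `setName`); the `matches` helper and the `results` dict are hypothetical and exist only for the check.

```python
import string
from pyparsing import (CharsNotIn, Combine, OneOrMore, ParseException,
                       Word, ZeroOrMore, nums)

# Special characters in the RCS file format
special = '$,.:;@'

RCSnum = Word(nums + '.').setName('RCSnum')
RCSidchar = CharsNotIn(special + string.whitespace).setName('RCSidchar')
# Revised RCSid: both alternatives require at least one character.
RCSid = Combine(RCSnum + ZeroOrMore(RCSidchar + ZeroOrMore(RCSidchar | RCSnum))
                |
                OneOrMore(RCSidchar + ZeroOrMore(RCSidchar | RCSnum))).setName('RCSid')

def matches(expr, text):
    """Hypothetical helper: True if expr parses text without raising."""
    try:
        expr.parseString(text)
        return True
    except ParseException:
        return False

ids = ['.111abc111', '1111abc111', '1.11', '1', '1abc', 'abc',
       'abc1', 'abc1.11', 'abc.1111', '']
results = {i: matches(RCSid, i) for i in ids}
for i in ids:
    print(repr(i), results[i])
```

Only the empty string should be rejected; every other id from the question should still parse.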

Edit: Here is my complete parser, which works with pyparsing 1.5.6:

from pyparsing import *
import string

# Special characters in the RCS file format
special = '$,.:;@'

RCSdigit = Word(nums, min=1, max=1).setName('RCSdigit')
RCSnum = Word(nums + '.').setName('RCSnum')
RCSidchar = CharsNotIn(special + string.whitespace).setName('RCSidchar')
#~ RCSid = Combine(Optional(RCSnum) + ZeroOrMore(RCSidchar +
        #~ ZeroOrMore(RCSidchar | RCSnum))).setName('RCSid')
RCSid = Combine(RCSnum + ZeroOrMore(RCSidchar + ZeroOrMore(RCSidchar | RCSnum))
                |
                OneOrMore(RCSidchar + ZeroOrMore(RCSidchar | RCSnum))).setName('RCSid')
RCSadmin = \
    Keyword('head').suppress() + \
        Optional(RCSnum).setResultsName('head') + \
        Suppress(';') + \
    Optional(Keyword('branch').suppress() +
        Optional(RCSnum).setResultsName('branch') +
        Suppress(';')
    ) + \
    Keyword('access').suppress() + \
        Group(ZeroOrMore(RCSid)).setResultsName('access') + \
        Suppress(';')
