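Python Lex-Yacc (PLY): Not recognizing start of line or start of string
I'm new to PLY and know only a little more than the basics of Python. I'm trying to learn using PLY-3.4 with Python 2.7. My code is below. I want to create a token called QTAG: a string made of zero or more whitespace characters, followed by 'Q' or 'q', followed by a '.', a positive integer, and finally one or more whitespace characters. Examples of valid QTAGs: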
"Q.11 "
" Q.12 "
"q.13 "
'''
Q.14
'''
Examples of invalid QTAGs:
"asdf Q.15 "
"Q. 15 "
Here is my code:
import ply.lex as lex

class LqbLexer:
    # List of token names. This is always required
    tokens = [
        'QTAG',
        'INT'
    ]

    # Regular expression rules for simple tokens
    def t_QTAG(self,t):
        r'^[ \t]*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t

    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)
        return t

    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    t_ignore = ' \t'

    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)

    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1,module=self, **kwargs)

    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok: break
            print tok

# test it
q = LqbLexer()
q.build()
#VALID inputs
q.test("Q.11 ")
q.test(" Q.12 ")
q.test("q.13 ")
q.test('''
Q.14
''')
# INVALID ones are
q.test("asdf Q.15 ")
q.test("Q. 15 ")
The output I get is as follows:
LexToken(QTAG,11,1,0)
Illegal character 'Q'
Illegal character '.'
LexToken(INT,12,1,4)
LexToken(QTAG,13,1,0)
Newline found
Illegal character 'Q'
Illegal character '.'
LexToken(INT,14,2,6)
Newline found
Illegal character 'a'
Illegal character 's'
Illegal character 'd'
Illegal character 'f'
Illegal character 'Q'
Illegal character '.'
LexToken(INT,15,3,7)
Illegal character 'Q'
Illegal character '.'
LexToken(INT,15,3,4)
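Notice that only the first and third valid inputs are tokenized correctly. I can't figure out why the other valid inputs are not. In the docstring of t_QTAG:
- Replacing '^' with '\A' had no effect.
- I tried removing the '^'. Then all the valid inputs are tokenized, but the second invalid input is tokenized as well.
Any help would be much appreciated!
Thanks
PS: I joined the Google group ply-hack and wanted to post there, but I could not post either directly in the forum or by e-mail. I'm not sure whether the group is still active. Prof. Beazley has not responded either. Any suggestions?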
1 Answer
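I finally found the answer myself. Posting it so that others can find it useful.
As @Tadgh pointed out, t_ignore = ' \t' consumes the spaces and tabs, so the leading whitespace never reaches the t_QTAG regex above, and as a result the second valid input is not tokenized correctly. After reading the PLY documentation carefully, I learned that if you want to control the order in which the token regexes are tried, they have to be defined as functions rather than as strings, which is how t_ignore is defined. Rules given as strings are sorted by PLY in order of decreasing regex length and are added after the function rules. t_ignore is special here: I think it is applied before anything else. This part is not clearly documented. The workaround is to define a new token function, e.g. t_SPACETAB, after t_QTAG, and have it return nothing. With this, all the valid inputs are now tokenized correctly, except the triple-quoted one (the multi-line string containing "Q.14"). The invalid inputs are, as the spec requires, not tokenized as QTAG.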
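The interaction between t_ignore and '^' can be reproduced with the re module alone. The sketch below assumes that the lexer tries each rule by calling the compiled pattern's match() at its current position, and that t_ignore has already advanced that position past the leading space. The variable names are only illustrative.

```python
import re

# Same pattern as the original t_QTAG rule.
qtag = re.compile(r'^[ \t]*[Qq]\.[0-9]+\s+')

data = " Q.12 "

# From position 0 the rule matches: '^' anchors at the real start of the string.
at_start = qtag.match(data, 0)

# From position 1 (assumed lexer position after t_ignore skipped the space)
# the rule fails: '^' does not match at an arbitrary start offset.
after_skip = qtag.match(data, 1)

print(at_start is not None)   # True
print(after_skip is None)     # True
```

This is why moving whitespace handling into a function rule placed after t_QTAG helps: t_QTAG then gets to see the input from position 0, leading whitespace included.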
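The multi-line string problem: it turns out that PLY uses the re module internally. In that module, by default, ^ matches only at the beginning of the string, not at the beginning of every line. To change this behavior I need to turn on the multi-line flag, which can be done inside the regex itself with (?m). So the regex that correctly handles all the valid and invalid strings in my tests is: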
r'(?m)^\s*[Qq]\.[0-9]+\s+'
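As a minimal illustration of the default versus multi-line behavior of '^', using only the re module (the names plain, multi and full are just for this sketch):

```python
import re

data = "\nQ.14\n"

plain = re.compile(r'^[Qq]\.[0-9]+')      # '^' only at the start of the string
multi = re.compile(r'(?m)^[Qq]\.[0-9]+')  # '^' also just after each newline

# Position 1 is the 'Q', directly after the first newline.
print(plain.match(data, 1))               # None
print(multi.match(data, 1).group())       # Q.14

# The full corrected rule matches from position 0, newlines included,
# and the same value extraction as in t_QTAG yields the integer.
full = re.compile(r'(?m)^\s*[Qq]\.[0-9]+\s+')
value = int(full.match(data).group().strip()[2:])
print(value)                              # 14
```

Note that because \s* and \s+ also match newlines, the QTAG match swallows the surrounding newlines, so t_newline does not fire for them.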
Here is the corrected code, with some more tests added:
import ply.lex as lex

class LqbLexer:
    # List of token names. This is always required
    tokens = [
        'QTAG',
        'INT',
        'SPACETAB'
    ]

    # Regular expression rules for simple tokens
    def t_QTAG(self,t):
        # corrected regex
        r'(?m)^\s*[Qq]\.[0-9]+\s+'
        t.value = int(t.value.strip()[2:])
        return t

    # A regular expression rule with some action code
    # Note addition of self parameter since we're in a class
    def t_INT(self,t):
        r'\d+'
        t.value = int(t.value)
        return t

    # Define a rule so we can track line numbers
    def t_newline(self,t):
        r'\n+'
        print "Newline found"
        t.lexer.lineno += len(t.value)

    # A string containing ignored characters (spaces and tabs)
    # Instead of t_ignore = ' \t'
    def t_SPACETAB(self,t):
        r'[ \t]+'
        print "Space(s) and/or tab(s)"

    # Error handling rule
    def t_error(self,t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)

    # Build the lexer
    def build(self,**kwargs):
        self.lexer = lex.lex(debug=1,module=self, **kwargs)

    # Test its output
    def test(self,data):
        self.lexer.input(data)
        while True:
            tok = self.lexer.token()
            if not tok: break
            print tok

# test it
q = LqbLexer()
q.build()
print "-============Testing some VALID inputs===========-"
q.test("Q.11 ")
q.test(" Q.12 ")
q.test("q.13 ")
q.test("""
Q.14
""")
q.test("""
qewr
dhdhg
dfhg
Q.15 asda
""")
# INVALID ones are
print "-============Testing some INVALID inputs===========-"
q.test("asdf Q.16 ")
q.test("Q. 17 ")
Here is the output:
-============Testing some VALID inputs===========-
LexToken(QTAG,11,1,0)
LexToken(QTAG,12,1,0)
LexToken(QTAG,13,1,0)
LexToken(QTAG,14,1,0)
Newline found
Illegal character 'q'
Illegal character 'e'
Illegal character 'w'
Illegal character 'r'
Newline found
Illegal character 'd'
Illegal character 'h'
Illegal character 'd'
Illegal character 'h'
Illegal character 'g'
Newline found
Illegal character 'd'
Illegal character 'f'
Illegal character 'h'
Illegal character 'g'
Newline found
LexToken(QTAG,15,6,18)
Illegal character 'a'
Illegal character 's'
Illegal character 'd'
Illegal character 'a'
Newline found
-============Testing some INVALID inputs===========-
Illegal character 'a'
Illegal character 's'
Illegal character 'd'
Illegal character 'f'
Space(s) and/or tab(s)
Illegal character 'Q'
Illegal character '.'
LexToken(INT,16,8,7)
Space(s) and/or tab(s)
Illegal character 'Q'
Illegal character '.'
Space(s) and/or tab(s)
LexToken(INT,17,8,4)
Space(s) and/or tab(s)
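A hedged side note on the multi-line flag: PLY combines the rules into one larger regex with a named group per rule, so an inline (?m) written in a rule ends up in the middle of the combined pattern. Python 2.7 accepts that, but recent Python 3 releases reject global inline flags that are not at the very start of a pattern. A possible alternative is to drop (?m) from the rule and pass the flag when the pattern is compiled; lex.lex() takes a reflags argument for this, though the exact value to pass (for example re.VERBOSE | re.MULTILINE) should be checked against the PLY version in use. The sketch below does not call PLY at all: the master pattern is a hand-written stand-in for the combined regex, used only to show the effect of the compile-time flag.

```python
import re

# Hypothetical stand-in for a combined lexer regex: one named group per rule,
# no inline (?m); the multi-line flag is supplied at compile time instead.
master = re.compile(
    r'(?P<t_QTAG>^\s*[Qq]\.[0-9]+\s+)|(?P<t_INT>\d+)',
    re.VERBOSE | re.MULTILINE,
)

# Valid input: the QTAG rule matches from position 0.
valid = master.match("\nQ.14\n")
print(valid.lastgroup)   # t_QTAG

# Invalid input "asdf Q.16 ": at position 5 (the 'Q') neither rule matches,
# because '^' is not at the start of the string or just after a newline.
invalid = master.match("asdf Q.16 ", 5)
print(invalid)           # None
```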