How to parse code (in Python)?

Asked 2025-04-16 13:09

I need to parse some special data structures. They are in a somewhat C-like format that looks roughly like this:

Group("GroupName") {
    /* C-Style comment */
    Group("AnotherGroupName") {
        Entry("some","variables",0,3.141);
        Entry("other","variables",1,2.718);
    }
    Entry("linebreaks",
          "allowed",
          3,
          1.414
         );
}

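I can think of several ways to go about this. I could "tokenize" the code with regular expressions. I could read the code one character at a time and use a state machine to build my data structure. I could strip the commas and line breaks and read the content line by line. Or I could write a conversion script that turns this code into executable Python code.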

Is there a nice Pythonic way to parse files like this?
How would you go about parsing it?

This is more of a general question about how to parse strings than about this particular file format.
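To make the first two ideas concrete, here is a minimal sketch that tokenizes the text with one regular expression and then builds the structure with a small hand-written recursive parser. It uses only the standard library. The names `TOKEN_RE`, `tokenize`, `Parser`, and `parse` are made up for this illustration, the nested-list result shape is just one possible choice, and escape sequences inside strings are not handled.

```python
import re

# One alternative per token type; comments and whitespace are matched, then dropped.
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>/\*.*?\*/)            # C-style comment
  | (?P<STRING>"[^"]*")               # double-quoted string (no escape handling)
  | (?P<NUMBER>[+-]?\d+(?:\.\d*)?)    # integer or real number
  | (?P<NAME>[A-Za-z_]\w*)            # keyword such as Group or Entry
  | (?P<PUNCT>[{}();,])               # punctuation
  | (?P<WS>\s+)                       # whitespace, including line breaks
""", re.VERBOSE | re.DOTALL)


def tokenize(text):
    """Yield (kind, value) pairs, skipping comments and whitespace."""
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind, value = m.lastgroup, m.group()
        if kind in ("COMMENT", "WS"):
            continue
        if kind == "STRING":
            value = value[1:-1]                      # strip the quotes
        elif kind == "NUMBER":
            value = float(value) if "." in value else int(value)
        yield kind, value


class Parser:
    """Recursive-descent parser over the token list (illustrative only)."""

    def __init__(self, text):
        self.tokens = list(tokenize(text))
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else (None, None)

    def take(self, kind=None, value=None):
        tok_kind, tok_value = self.peek()
        if (kind is not None and tok_kind != kind) or \
           (value is not None and tok_value != value):
            raise SyntaxError(f"expected {value or kind}, got {tok_value!r}")
        self.i += 1
        return tok_value

    def parse_group(self):
        self.take("NAME", "Group")
        self.take("PUNCT", "(")
        name = self.take("STRING")
        self.take("PUNCT", ")")
        self.take("PUNCT", "{")
        body = []
        while self.peek() != ("PUNCT", "}"):
            if self.peek() == ("NAME", "Group"):
                body.append(self.parse_group())      # groups may nest
            else:
                body.append(self.parse_entry())
        self.take("PUNCT", "}")
        return ["Group", name, body]

    def parse_entry(self):
        self.take("NAME", "Entry")
        self.take("PUNCT", "(")
        values = []
        while self.peek() != ("PUNCT", ")"):
            if self.peek()[0] not in ("STRING", "NUMBER"):
                raise SyntaxError(f"expected a value, got {self.peek()[1]!r}")
            values.append(self.take())
            if self.peek() == ("PUNCT", ","):
                self.take()
        self.take("PUNCT", ")")
        self.take("PUNCT", ";")
        return ["Entry", values]


def parse(text):
    return Parser(text).parse_group()


sample = '''Group("GroupName") {
    /* C-Style comment */
    Group("AnotherGroupName") {
        Entry("some","variables",0,3.141);
    }
    Entry("linebreaks",
          "allowed",
          3,
          1.414
         );
}'''
tree = parse(sample)
```

This works for a format this small, but every new syntax feature means more hand-written code, which is part of why I am asking whether there is a better way.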

Tags: regular expressions, data structures, string processing, code conversion, data parsing, state machine, file parsing

3 Answers

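It depends on how often you need this and whether the syntax stays the same. If the answers are "quite often" and "more or less, yes", then I would look for a way to express the syntax and write a dedicated parser for that language using a tool like PyPEG or LEPL. Defining the parser rules is the big job, so unless you parse files of this type often, it may not be worth the effort.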

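However, if you look at the PyPEG page, it shows how to output the parsed data as XML. So if that tool does not give you enough power on its own, you could use it to generate the XML and then parse that XML with something like lxml.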
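As a rough sketch of that second step, assume the XML looks something like the string below. The element and attribute names are invented for this illustration and are not what PyPEG actually emits. The example uses the standard library's `xml.etree.ElementTree` so it runs without extra installs; `lxml.etree` offers a largely compatible API. The helper name `to_tree` is also made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML shape; a real tool's output will probably differ.
xml_text = """
<Group name="GroupName">
  <Group name="AnotherGroupName">
    <Entry><v>some</v><v>variables</v><v>0</v><v>3.141</v></Entry>
  </Group>
  <Entry><v>linebreaks</v><v>allowed</v><v>3</v><v>1.414</v></Entry>
</Group>
"""


def to_tree(elem):
    """Convert an element into nested lists; values stay as strings."""
    if elem.tag == "Entry":
        return ["Entry", [v.text for v in elem.findall("v")]]
    return ["Group", elem.get("name"), [to_tree(child) for child in elem]]


root = ET.fromstring(xml_text.strip())
xml_tree = to_tree(root)
```

Note that everything comes back as text here, so numeric values would still need converting in a later pass.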

Answered 2025-04-16 by Python大师


Take a look at the pyparsing project. It comes with many parsing examples.

Answered 2025-04-16 by Python大师


Using pyparsing (Mark Tolonen, your post came through just as I was about to click "Submit"), this is pretty straightforward. See the comments embedded in the code below:

data = """Group("GroupName") { 
    /* C-Style comment */ 
    Group("AnotherGroupName") { 
        Entry("some","variables",0,3.141); 
        Entry("other","variables",1,2.718); 
    } 
    Entry("linebreaks", 
          "allowed", 
          3, 
          1.414 
         ); 
} """

from pyparsing import *

# define basic punctuation and data types
LBRACE,RBRACE,LPAREN,RPAREN,SEMI = map(Suppress,"{}();")
GROUP = Keyword("Group")
ENTRY = Keyword("Entry")

# use parse actions to do parse-time conversion of values
real = Regex(r"[+-]?\d+\.\d*").setParseAction(lambda t:float(t[0]))
integer = Regex(r"[+-]?\d+").setParseAction(lambda t:int(t[0]))

# parses a string enclosed in quotes, but strips off the quotes at parse time
string = QuotedString('"')

# define structure expressions
value = string | real | integer
entry = Group(ENTRY + LPAREN + Group(Optional(delimitedList(value)))) + RPAREN + SEMI

# since Groups can contain Groups, need to use a Forward to define recursive expression
group = Forward()
group << Group(GROUP + LPAREN + string("name") + RPAREN + 
            LBRACE + Group(ZeroOrMore(group | entry))("body") + RBRACE)

# ignore C style comments wherever they occur
group.ignore(cStyleComment)

# parse the sample text
result = group.parseString(data)

# print out the tokens as a nice indented list using pprint
from pprint import pprint
pprint(result.asList())

The output is:

[['Group',
  'GroupName',
  [['Group',
    'AnotherGroupName',
    [['Entry', ['some', 'variables', 0, 3.141]],
     ['Entry', ['other', 'variables', 1, 2.718]]]],
   ['Entry', ['linebreaks', 'allowed', 3, 1.4139999999999999]]]]]

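(Unfortunately, there is some potential for confusion here, because pyparsing defines its own "Group" class for imparting structure to the parsed tokens. Note how the list of values in each Entry ends up grouped, because the list expression is wrapped in a pyparsing Group.)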
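Once you have that nested list, plain Python is enough to work with it. Here is a small sketch that walks the structure and reports each Entry together with the path of groups containing it. The list is copied by hand from the output above (with 1.414 written in its short form), and the function name `walk` is just something I made up for this example.

```python
# Nested list copied from the printed output above.
parsed = [['Group',
           'GroupName',
           [['Group',
             'AnotherGroupName',
             [['Entry', ['some', 'variables', 0, 3.141]],
              ['Entry', ['other', 'variables', 1, 2.718]]]],
            ['Entry', ['linebreaks', 'allowed', 3, 1.414]]]]]


def walk(node, path=()):
    """Yield (group_path, values) for every Entry below node."""
    if node[0] == 'Entry':
        yield path, node[1]
    else:
        _, name, body = node
        for child in body:
            # Recurse, extending the path with this group's name.
            yield from walk(child, path + (name,))


for path, values in walk(parsed[0]):
    print("/".join(path), values)
# GroupName/AnotherGroupName ['some', 'variables', 0, 3.141]
# GroupName/AnotherGroupName ['other', 'variables', 1, 2.718]
# GroupName ['linebreaks', 'allowed', 3, 1.414]
```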

Answered 2025-04-16 by Python大师
