如何解析标记文本以便后续处理？

5 投票

4 回答

764 浏览

提问于 2025-04-15 12:43

请查看更新后的输入和输出数据，见编辑-1。

我想要做的是把

+ 1
 + 1.1
  + 1.1.1
   - 1.1.1.1
   - 1.1.1.2
 + 1.2
  - 1.2.1
  - 1.2.2
 - 1.3
+ 2
- 3

转换成一个Python的数据结构，比如说

[{'1': [{'1.1': {'1.1.1': ['1.1.1.1', '1.1.1.2']}, '1.2': ['1.2.1', '1.2.2']}, '1.3'], '2': {}}, ['3',]]

我看过很多不同的维基标记语言、Markdown、重构文本等等，但这些对我来说都太复杂了，难以理解它们是怎么工作的，因为它们需要涵盖很多标签和语法（其实我只需要这些语言中的“列表”部分，但当然是要转换成Python格式，而不是HTML）。

我也看过一些分词器、词法分析器和解析器，但这些同样比我需要的复杂多了，我也理解不了。

我完全不知道该从哪里开始，希望能得到一些帮助。谢谢！

编辑-1: 是的，行首的字符很重要，从之前和现在的输出要求来看，*表示一个有子节点的根节点，+表示有子节点，而-表示没有子节点（无论是根节点还是其他节点），只是与该节点相关的额外信息。*并不是很重要，可以用+替代（我可以用其他方法来判断根节点的状态）。

因此新的要求是只用*来表示有或没有子节点的节点，而-不能有子节点。我还改了这个规则，节点的关键不再是*后面的文本，因为这个文本以后肯定会变成实际的标题。

例如

* 1
 * 1.1
 * 1.2
  - Note for 1.2
* 2
* 3
- Note for root

会得到

[{'title': '1', 'children': [{'title': '1.1', 'children': []}, {'title': '1.2', 'children': []}]}, {'title': '2', 'children': [], 'notes': ['Note for 1.2', ]}, {'title': '3', 'children': []}, 'Note for root']

如果你有其他想法来在Python中表示这个大纲，请提出！

数据结构解析器树形结构词法分析文本解析标记语言列表格式大纲表示

4 个回答

因为你在处理一个大纲的情况，所以可以用一个栈来简化事情。简单来说，你需要创建一个栈，里面存放与大纲深度对应的 dict（字典）。当你解析到一行新内容，发现大纲的深度增加了，就把一个新的 dict 推入栈中，这个新字典是由栈顶的上一个 dict 引用的。当你解析到一行深度变浅的内容时，就把栈顶的内容弹出，回到上一级。遇到深度相同的行时，就把它添加到栈顶的 dict 中。

回答于 2025-04-15 由 Python大师

分享举报

栈是一种非常有用的数据结构，特别是在解析树形结构的时候。你可以把从最后添加的节点到根节点的路径一直保存在栈里，这样就能通过缩进的长度找到正确的父节点。像下面这样的代码应该能帮助你解析你最后的例子：

import re
line_tokens = re.compile('( *)(\\*|-) (.*)')

def parse_tree(data):
    stack = [{'title': 'Root node', 'children': []}]
    for line in data.split("\n"):
        indent, symbol, content = line_tokens.match(line).groups()        
        while len(indent) + 1 < len(stack):
            stack.pop() # Remove everything up to current parent
        if symbol == '-':
            stack[-1].setdefault('notes', []).append(content)
        elif symbol == '*':
            node = {'title': content, 'children': []}
            stack[-1]['children'].append(node)
            stack.append(node) # Add as the current deepest node
    return stack[0]

回答于 2025-04-15 由 Python大师

分享举报

编辑: 感谢澄清和规范的变化，我对我的代码进行了修改，依然使用一个明确的 Node 类作为中间步骤，以便更清楚地理解。逻辑是将行列表转换为节点列表，然后再将节点列表转换为树（通过适当地使用它们的缩进属性），最后以可读的形式打印出这棵树（这只是一个“调试帮助”步骤，用来检查树是否构建得很好，当然在脚本的最终版本中可以注释掉——而且，当然，最终版本会从文件中读取行，而不是硬编码用于调试！），最后构建所需的Python结构并打印出来。以下是代码，接下来我们会看到结果与提问者的要求几乎一致，只有一个例外——不过，先看代码:

import sys

class Node(object):
  def __init__(self, title, indent):
    self.title = title
    self.indent = indent
    self.children = []
    self.notes = []
    self.parent = None
  def __repr__(self):
    return 'Node(%s, %s, %r, %s)' % (
        self.indent, self.parent, self.title, self.notes)
  def aspython(self):
    result = dict(title=self.title, children=topython(self.children))
    if self.notes:
      result['notes'] = self.notes
    return result

def print_tree(node):
  print ' ' * node.indent, node.title
  for subnode in node.children:
    print_tree(subnode)
  for note in node.notes:
    print ' ' * node.indent, 'Note:', note

def topython(nodelist):
  return [node.aspython() for node in nodelist]

def lines_to_tree(lines):
  nodes = []
  for line in lines:
    indent = len(line) - len(line.lstrip())
    marker, body = line.strip().split(None, 1)
    if marker == '*':
      nodes.append(Node(body, indent))
    elif marker == '-':
      nodes[-1].notes.append(body)
    else:
      print>>sys.stderr, "Invalid marker %r" % marker

  tree = Node('', -1)
  curr = tree
  for node in nodes:
    while node.indent <= curr.indent:
      curr = curr.parent
    node.parent = curr
    curr.children.append(node)
    curr = node

  return tree


data = """\
* 1
 * 1.1
 * 1.2
  - Note for 1.2
* 2
* 3
- Note for root
""".splitlines()

def main():
  tree = lines_to_tree(data)
  print_tree(tree)
  print
  alist = topython(tree.children)
  print alist

if __name__ == '__main__':
  main()

运行时，这将输出:

 1
  1.1
  1.2
  Note: 1.2
 2
 3
 Note: 3

[{'children': [{'children': [], 'title': '1.1'}, {'notes': ['Note for 1.2'], 'children': [], 'title': '1.2'}], 'title': '1'}, {'children': [], 'title': '2'}, {'notes': ['Note for root'], 'children': [], 'title': '3'}]

除了键的顺序（这在字典中是无关紧要的，当然也没有保证）之外，这几乎符合要求——只是这里所有的备注都作为字典条目出现，键是 notes，值是一个字符串列表（但如果列表是空的，备注条目会被省略，这大致与问题中的示例相同）。

在当前版本的问题中，如何表示备注稍微有些不清楚；一个备注作为独立字符串出现，其他的则作为值为字符串的条目（而不是我使用的字符串列表）。不清楚是什么决定了在一种情况下备注必须作为独立字符串出现，而在其他情况下作为字典条目出现，所以我使用的这个方案更为规范；如果一个备注（如果有的话）是单个字符串而不是列表，那是否意味着如果一个节点出现多个备注就是错误的？在这方面，我使用的方案更为通用（允许一个节点有任意数量的备注，从0个开始，而不是显然在问题中暗示的只有0或1个）。

写了这么多代码（编辑前的回答差不多也是这么长，并帮助澄清和修改了规范）来提供（我希望）99% 的期望解决方案，我希望这能让原提问者满意，因为最后几处代码和/或规范的调整应该对他来说很简单！

回答于 2025-04-15 由 Python大师

分享举报

如何解析标记文本以便后续处理？

4 个回答

撰写回答