Python file parsing: Build tree from text file

20 votes

3 answers

21632 views

Asked 2025-04-16 18:02

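I have an indented text file that will be used to build a tree. Each line represents a node, and the indentation gives the node's depth as well as which node it is a child of.

For example, a file might look like this: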

ROOT
   Node1
      Node2
         Node3
            Node4
   Node5
   Node6

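This means ROOT has three children: 1, 5, and 6. Node1 has one child, 2; Node2 has one child, 3; and so on.

I have come up with a recursive algorithm and programmed it, and it works, but it is rather ugly, especially in how it handles the example above, specifically the step from node 4 to node 5.

The algorithm recurses on the indentation count: if the number of indents equals the current depth plus 1, I go one level deeper. But this means that when I read a line with fewer indents, I have to come back up one level at a time, checking the depth each time.

Here is what I have: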

def _recurse_tree(node, parent, depth):
    tabs = 0
    
    while node:
        tabs = node.count("\t")
        if tabs == depth:
            print("%s: %s" % (parent.strip(), node.strip()))
        elif tabs == depth + 1:
            node = _recurse_tree(node, prev, depth+1)
            tabs = node.count("\t")
            
            #check if we have to surface some more
            if tabs == depth:
                print("%s: %s" % (parent.strip(), node.strip()))
            else:
                return node
        else:
            return node
        
        prev = node
        node = inFile.readline().rstrip()
        
inFile = open("test.txt")
root = inFile.readline().rstrip()
node = inFile.readline().rstrip()
_recurse_tree(node, root, 1)

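Right now I am just printing the nodes to verify that the parent is correct for each line, but maybe there is a cleaner way to do this? I am especially unhappy with the elif block, where I come back from each recursive call.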


3 answers

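I would not use recursion for something like this at all (okay, maybe I would if I were coding this in a language like Scheme, but this is Python). Recursion is great for iterating over data that is shaped like a tree, and in those cases it simplifies your design greatly compared with ordinary loops.

However, that is not the case here. Your data does represent a tree, but it is laid out sequentially: it is simply a sequence of lines. Such data is most easily processed with a simple loop, although you could make the design more general, if you wish, by separating it into three layers: the sequential reader (which parses the tabs as a specification of depth level), the tree inserter (which inserts a node at a specific depth level while keeping track of the last inserted node), and the tree itself: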

import re

# *** Tree representation ***
class Node(object):
    def __init__(self, title):
        self.title = title
        self.parent = None
        self.children = []

    def add(self, child):
        self.children.append(child)
        child.parent = self

# *** Node insertion logic ***
class Inserter(object):
    def __init__(self, node, depth = 0):
        self.node = node
        self.depth = depth

    def __call__(self, title, depth):
        newNode = Node(title)
        if (depth > self.depth):
            self.node.add(newNode)
            self.depth = depth
        elif (depth == self.depth):
            self.node.parent.add(newNode)
        else:
            parent = self.node.parent
            for i in range(0, self.depth - depth):
                parent = parent.parent
            parent.add(newNode)
            self.depth = depth

        self.node = newNode

# *** File iteration logic ***
with open(r'tree.txt', 'r') as f:    
    tree = Node(f.readline().rstrip('\n'))
    inserter = Inserter(tree)

    for line in f:
        line = line.rstrip('\n')
        # note there's a bug with your original tab parsing code:
        # it would count all tabs in the string, not just the ones
        # at the beginning
        tabs = re.match('\t*', line).group(0).count('\t')
        title = line[tabs:]
        inserter(title, tabs)

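Before pasting this code here, I wrote a very simple function to pretty-print the tree that I read into memory. For that function the most natural choice was recursion, of course, because now the tree really is represented as tree-shaped data: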

def print_tree(node, depth = 0):
    print('%s%s' % ('  ' * depth, node.title))
    for child in node.children:
        print_tree(child, depth + 1)

print_tree(tree)
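
The same recursive pattern can also reproduce the "parent: child" lines that the question prints for verification. The following is only a sketch: `iter_edges` is a hypothetical helper that is not part of the code above, and `Node` is repeated in minimal form so that the snippet runs on its own (Python 3 assumed).

```python
class Node:
    """Minimal stand-in for the Node class shown earlier in this answer."""

    def __init__(self, title):
        self.title = title
        self.parent = None
        self.children = []

    def add(self, child):
        self.children.append(child)
        child.parent = self


def iter_edges(node):
    """Yield (parent_title, child_title) pairs in depth-first pre-order."""
    for child in node.children:
        yield node.title, child.title
        yield from iter_edges(child)


# Build a small tree by hand so the example does not depend on a file.
root = Node("ROOT")
node1 = Node("Node1")
root.add(node1)
node1.add(Node("Node2"))
root.add(Node("Node5"))

edges = list(iter_edges(root))
for parent_title, child_title in edges:
    print("%s: %s" % (parent_title, child_title))
```

Because the tree is already in memory, the recursion never needs to look ahead at the next line, which is what made the original version awkward.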

Answered 2025-04-16 by Python大师


If you don't insist on recursion, this works too:

from itertools import takewhile

is_tab = '\t'.__eq__

def build_tree(lines):
    lines = iter(lines)
    stack = []
    for line in lines:
        indent = len(list(takewhile(is_tab, line)))
        stack[indent:] = [line.lstrip()]
        print(stack)

source = '''ROOT
\tNode1
\t\tNode2
\t\t\tNode3
\t\t\t\tNode4
\tNode5
\tNode6'''

build_tree(source.split('\n'))

Output:

['ROOT']
['ROOT', 'Node1']
['ROOT', 'Node1', 'Node2']
['ROOT', 'Node1', 'Node2', 'Node3']
['ROOT', 'Node1', 'Node2', 'Node3', 'Node4']
['ROOT', 'Node5']
['ROOT', 'Node6']
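
The function above only prints the path to each node. If an actual tree is needed, the same stack idea can be extended so that each new node is attached to the entry one level above it. This is a rough sketch, not a tested library routine: `build_nested` and the dict layout (`"title"`, `"children"`) are made-up names for illustration, and it assumes a single root, tab indentation, and depth that never increases by more than one level at a time (Python 3).

```python
def build_nested(lines):
    """Return the root node as a dict built from tab-indented lines."""
    stack = []   # stack[d] is the most recent node seen at depth d
    root = None
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        title = line.lstrip("\t")
        indent = len(line) - len(title)  # number of leading tabs only
        node = {"title": title.strip(), "children": []}
        if indent == 0:
            root = node
        else:
            # The parent is whatever currently sits one level above.
            stack[indent - 1]["children"].append(node)
        # Drop deeper entries and record this node at its own depth.
        stack[indent:] = [node]
    return root


sample = [
    "ROOT",
    "\tNode1",
    "\t\tNode2",
    "\t\t\tNode3",
    "\t\t\t\tNode4",
    "\tNode5",
    "\tNode6",
]

tree = build_nested(sample)
print([child["title"] for child in tree["children"]])
```

The slice assignment `stack[indent:] = [node]` is what handles the step from Node4 back to Node5: everything deeper than the new node is discarded in one go, so there is no level-by-level unwinding.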

Answered 2025-04-16 by Python大师


The big issue here is the "lookahead", which I think is what makes the code look ugly. It can be simplified slightly:

def _recurse_tree(parent, depth, source):
    last_line = source.readline().rstrip()
    while last_line:
        tabs = last_line.count('\t')
        if tabs < depth:
            break
        node = last_line.strip()
        if tabs >= depth:
            if parent is not None:
                print("%s: %s" % (parent, node))
            last_line = _recurse_tree(node, tabs+1, source)
    return last_line

inFile = open("test.txt")
_recurse_tree(None, 0, inFile)

Since we are talking about recursion, I took care to avoid any global variables (source and last_line). It would be more Pythonic to make them members of some parser object.
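
One possible shape for such a parser object is sketched below. Everything here is an assumption made for illustration: the class name `IndentParser`, its `parse` method, and the choice to collect (parent, child) pairs in a list instead of printing them. It also counts only leading tabs, and uses `io.StringIO` in place of a real file so that it runs on its own (Python 3).

```python
import io


class IndentParser:
    """Hypothetical parser that keeps the source and the look-ahead line as members."""

    def __init__(self, source):
        self.source = source
        self.last_line = source.readline().rstrip()
        self.pairs = []  # collected (parent, child) tuples

    def parse(self, parent=None, depth=0):
        while self.last_line:
            # Count leading tabs only, not tabs inside the title.
            tabs = len(self.last_line) - len(self.last_line.lstrip("\t"))
            if tabs < depth:
                break  # this line belongs to an ancestor; let the caller handle it
            node = self.last_line.strip()
            if parent is not None:
                self.pairs.append((parent, node))
            # Advance the look-ahead, then descend to read this node's children.
            self.last_line = self.source.readline().rstrip()
            self.parse(node, tabs + 1)
        return self.pairs


text = "ROOT\n\tNode1\n\t\tNode2\n\tNode5\n"
pairs = IndentParser(io.StringIO(text)).parse()
for parent, child in pairs:
    print("%s: %s" % (parent, child))
```

As in the function above, a blank line ends parsing, because an empty `last_line` stops the loop.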

Answered 2025-04-16 by Python大师

