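How can I process this text file and parse out what I need?

2 votes

4 answers

1205 views

Asked 2025-04-15 13:27

I'm trying to parse the output of Python's doctest module and store it in an HTML file.

The output I get looks something like this: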

**********************************************************************
File "example.py", line 16, in __main__.factorial
Failed example:
    [factorial(n) for n in range(6)]
Expected:
    [0, 1, 2, 6, 24, 120]
Got:
    [1, 1, 2, 6, 24, 120]
**********************************************************************
File "example.py", line 20, in __main__.factorial
Failed example:
    factorial(30)
Expected:
    25252859812191058636308480000000L
Got:
    265252859812191058636308480000000L
**********************************************************************
1 items had failures:
   2 of   8 in __main__.factorial
***Test Failed*** 2 failures.

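Each failure is preceded by a line of asterisks, which delimits one test failure from the next.

What I'd like to do is extract the filename and method where the failure occurred, along with the expected and actual results. Then I'd like to build an HTML document from that information (or store it in a text file first and do a second round of parsing).

How can I do this using only Python, or Python combined with some UNIX command-line utilities?

EDIT: I wrote a shell script that matches each block I want, but I'm not sure how to redirect each match to its own file.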

python example.py | sed -n '/.*/,/^\**$/p' > `mktemp error.XXX`


4 Answers

I wrote a quick parser in pyparsing to do this.

from pyparsing import *

str = """
**********************************************************************
File "example.py", line 16, in __main__.factorial
Failed example:
    [factorial(n) for n in range(6)]
Expected:
    [0, 1, 2, 6, 24, 120]
Got:
    [1, 1, 2, 6, 24, 120]
**********************************************************************
File "example.py", line 20, in __main__.factorial
Failed example:
    factorial(30)
Expected:
    25252859812191058636308480000000L
Got:
    265252859812191058636308480000000L
**********************************************************************
"""

quote = Literal('"').suppress()
comma = Literal(',').suppress()
in_ = Keyword('in').suppress()
block = OneOrMore("**").suppress() + \
        Keyword("File").suppress() + \
        quote + Word(alphanums + ".") + quote + \
        comma + Keyword("line").suppress() + Word(nums) + comma + \
        in_ + Word(alphanums + "._") + \
        LineStart() + restOfLine.suppress() + \
        LineStart() + restOfLine + \
        LineStart() + restOfLine.suppress() + \
        LineStart() + restOfLine + \
        LineStart() + restOfLine.suppress() + \
        LineStart() + restOfLine  

# Match one or more failure blocks, grouping the fields of each block.
parser = OneOrMore(Group(block))

result = parser.parseString(str)

for section in result:
    print(section)

The result is:

['example.py', '16', '__main__.factorial', '    [factorial(n) for n in range(6)]', '    [0, 1, 2, 6, 24, 120]', '    [1, 1, 2, 6, 24, 120]']
['example.py', '20', '__main__.factorial', '    factorial(30)', '    25252859812191058636308480000000L', '    265252859812191058636308480000000L']

Answered 2025-04-15 by Python大师


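You could write a Python program to pick this output apart, but it may be better to look into modifying doctest so that it produces the report you want in the first place. From the documentation for doctest.DocTestRunner: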

                                  ... the display output
can be also customized by subclassing DocTestRunner, and
overriding the methods `report_start`, `report_success`,
`report_unexpected_exception`, and `report_failure`.
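
As a rough illustration of that suggestion, here is a minimal Python 3 sketch that subclasses DocTestRunner and overrides report_failure so that each failure is collected as a dict instead of being printed. The names CollectingRunner, SAMPLE_DOC, double, and the "sample.py" filename are hypothetical and exist only for this example; it also assumes the test is built by hand with DocTestParser rather than discovered from a module.

```python
import doctest


class CollectingRunner(doctest.DocTestRunner):
    """Hypothetical runner that records failures instead of printing them."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.failures_seen = []

    def report_failure(self, out, test, example, got):
        # test.lineno is the 0-based line where the docstring starts;
        # example.lineno is the 0-based offset of the example inside it.
        line = None
        if test.lineno is not None:
            line = test.lineno + example.lineno + 1
        self.failures_seen.append({
            "file": test.filename,
            "line": line,
            "name": test.name,
            "source": example.source.strip(),
            "expected": example.want.strip(),
            "got": got.strip(),
        })


def double(n):
    # Deliberately buggy for n == 3 so that one example fails.
    return 7 if n == 3 else 2 * n


# Hypothetical docstring text standing in for a real module's doctests.
SAMPLE_DOC = """
>>> double(2)
4
>>> double(3)
6
"""

parser = doctest.DocTestParser()
test = parser.get_doctest(SAMPLE_DOC, {"double": double}, "sample", "sample.py", 0)

runner = CollectingRunner(verbose=False)
runner.run(test)
failures = runner.failures_seen

for failure in failures:
    print(failure)
```

With the failures already held as structured data, there is no need to parse the textual report afterwards; the same dicts can be fed straight into whatever generates the HTML.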

Answered 2025-04-15 by Python大师


This is a quick and dirty script that parses the output into tuples containing the relevant information.

import re
import sys

# A separator line consists solely of asterisks.
stars_re = re.compile(r'^[*]+$', re.MULTILINE)
file_line_re = re.compile(r'^File "(.*?)", line (\d*), in (.*)$')

doctest_output = sys.stdin.read()
# Drop the empty piece before the first separator and the summary after the last.
chunks = stars_re.split(doctest_output)[1:-1]

for chunk in chunks:
    chunk_lines = chunk.strip().splitlines()
    m = file_line_re.match(chunk_lines[0])

    filename, line, module = m.groups()
    # Assumes the example, expected, and got values each fit on one line.
    failed_example = chunk_lines[2].strip()
    expected = chunk_lines[4].strip()
    got = chunk_lines[6].strip()

    print((filename, line, module, failed_example, expected, got))
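
Since the question asks for HTML in the end, here is one possible sketch of that last step: it renders six-field tuples like the ones produced above as an HTML table. The function name failures_to_html and the column headings are made up for this example, and the markup is deliberately bare (no page skeleton or styling).

```python
import html


def failures_to_html(failures):
    """Render (file, line, name, example, expected, got) tuples as an HTML table."""
    headers = ("File", "Line", "Test", "Failed example", "Expected", "Got")
    rows = ["<tr>" + "".join("<th>%s</th>" % h for h in headers) + "</tr>"]
    for failure in failures:
        # Escape every value so characters such as < or & cannot break the markup.
        cells = "".join("<td>%s</td>" % html.escape(str(v)) for v in failure)
        rows.append("<tr>" + cells + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"


sample = [
    ("example.py", "16", "__main__.factorial",
     "[factorial(n) for n in range(6)]",
     "[0, 1, 2, 6, 24, 120]",
     "[1, 1, 2, 6, 24, 120]"),
]
table = failures_to_html(sample)
print(table)
```

Escaping matters here because doctest examples routinely contain comparison operators and quotes, which would otherwise be read as markup.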

Answered 2025-04-15 by Python大师

