Trying to use a regex to parse a log file whose three-line headers mark blocks of multi-line text data

Asked 2025-04-14 17:36

The log file I'm trying to parse (that is, split into separate data sections) has the following format:

===================
DateTimeStamp - ShortSummaryOfEntry
===================

Line 1 of text
Line 2 of text 
...
Line last of text

===================
DateTimeStamp - ShortSummaryOfEntry
===================

Line 1 of text
Line 2 of text 
...
Line last of text

===================
DateTimeStamp - ShortSummaryOfEntry
===================

Line 1 of text
Line 2 of text 
...
Line last of text

....

I've tried a lot of different patterns, none of which worked. For example:

(={19}\n(.*)\n={19}\n)(\n.*)+(?=={19})

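The trailing lookahead seems to be overridden by the preceding "+". Group 0 contains the entire file; groups 1 and 2 are correct; group 3 (the multi-line text data) is empty.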

What I want is to create records with two fields each:

Record1Field1:
DateTimeStamp - ShortSummaryOfEntry
Record1Field2:
Line 1 of text
Line 2 of text 
...
Line last of text

[etc]

The data will then be imported into a database (the specific application hasn't been decided yet).


3 Answers

It looks like the following is sufficient.

^(.*\n)=+\s+((?:\d+ +.*\n)+)

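The line immediately before a row of equals signs is saved to capture group 1, and the run of lines that follows, each starting with a number, is saved to capture group 2. There seems to be no need to check that the line saved in group 1 is itself preceded by a row of equals signs. Note that this relies on every body line beginning with digits followed by at least one space.

The m (multiline) flag must be set so that ^ and $ match at the start and end of each line rather than only at the start and end of the whole string.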


The expression can be broken down as follows.

^                   # match the beginning of the line    
(.*\n)              # match a line and save to capture group 1
=+                  # match '=' one or more times
\s+                 # match one or more whitespaces
(                   # begin capture group 2
  (?:\d+ +.*\n)     # match one or more digits followed by one or more spaces,
                    # followed by the rest of the line
  +                 # execute the preceding non-capture group 1 or more times
)                   # end capture group 2
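As a rough sketch of how this pattern might be applied in Python: the sample text and its timestamps below are made up for illustration, and the body lines are numbered because the pattern depends on that.

```python
import re

# Made-up sample; body lines start with a number, as the pattern expects.
sample = (
    "===================\n"
    "2025-04-14 10:00 - First entry\n"
    "===================\n"
    "\n"
    "1 Line 1 of text\n"
    "2 Line 2 of text\n"
    "\n"
    "===================\n"
    "2025-04-14 11:00 - Second entry\n"
    "===================\n"
    "\n"
    "3 Line 1 of text\n"
    "4 Line 2 of text\n"
)

# re.M makes ^ match at the start of every line.
pattern = re.compile(r"^(.*\n)=+\s+((?:\d+ +.*\n)+)", re.M)

# Group 1 keeps its trailing newline, so strip it for the header field.
records = [(m.group(1).strip(), m.group(2)) for m in pattern.finditer(sample)]

for header, body in records:
    print(header)
    print(body, end="")
```

Because every body line must end in `\n`, a final line without a trailing newline would not be captured; the sample above ends with one to avoid that.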

Answered 2025-04-14 by Python大师

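You get that result in group 3 because you are repeating a capture group, and a repeated capture group only keeps the text of its last iteration.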
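To see this concretely, here is a small sketch that runs the pattern from the question against a shortened, made-up sample (the "Header A" / "Header B" names are placeholders). The last iteration of `(\n.*)` is the line break directly before a separator, which is why group 3 looks empty, and the greedy `+` backtracks only as far as the last separator in the text, which is why group 0 swallows the later headers.

```python
import re

sep = "=" * 19
# Shortened, made-up sample in the question's layout.
sample = (
    f"{sep}\nHeader A\n{sep}\n\nLine 1\nLine 2\n\n"
    f"{sep}\nHeader B\n{sep}\n\nLine 3\n"
)

# The pattern from the question, unchanged.
pattern = r"(={19}\n(.*)\n={19}\n)(\n.*)+(?=={19})"
m = re.search(pattern, sample)

print(repr(m.group(2)))          # 'Header A'
print(repr(m.group(3)))          # '\n' (only the last repetition is kept)
print("Header B" in m.group(0))  # True (the match ran past the second header)
```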

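You can use two capture groups and, in the second one, capture every line that is not a row of exactly 19 equals signs, using a negative lookahead.

The pattern needs multiline mode (re.M in Python) so that ^ and $ match at line boundaries: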

^={19}\n(?!={19}$)(.+)\n={19}((?:\n(?!={19}$).*)*)
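A minimal sketch of using this pattern with `re.finditer`, assuming a shortened sample with placeholder headers. Group 2 includes the blank lines around the body, so the example strips them.

```python
import re

sep = "=" * 19
# Shortened, made-up sample in the question's layout.
sample = (
    f"{sep}\nHeader A\n{sep}\n\nLine 1\nLine 2\n\n"
    f"{sep}\nHeader B\n{sep}\n\nLine 3\n"
)

pattern = re.compile(
    r"^={19}\n(?!={19}$)(.+)\n={19}((?:\n(?!={19}$).*)*)",
    re.M,  # ^ and $ match at every line boundary
)

# Group 1 is the header line; group 2 is the body plus surrounding blank lines.
records = [(m.group(1), m.group(2).strip()) for m in pattern.finditer(sample)]

for header, body in records:
    print(header)
    print(body)
    print("-" * 40)
```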

Answered 2025-04-14 by Python大师

You can rearrange the groups in the regex to reduce the number of capture groups, and use non-greedy quantifiers (regex101):

import re

text = """\
===================
DateTimeStamp - ShortSummaryOfEntry
===================

1 Line 1 of text
2 Line 2 of text
3 Line last of text

===================
DateTimeStamp - ShortSummaryOfEntry
===================

4 Line 1 of text
5 Line 2 of text
6 Line last of text

===================
DateTimeStamp - ShortSummaryOfEntry
===================

7 Line 1 of text
8 Line 2 of text
9 Line last of text"""


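# Group 1 is the header line; group 2 is the body, matched non-greedily up to
# the next separator or the end of the text. re.S lets "." match newlines.
pat = r"={19}\n(.+?)\n={19}\s*(.+?)\s*(?=={19}|\Z)"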

for title, body in re.findall(pat, text, flags=re.S):
    print(title)
    print(body)
    print("-" * 80)

Output:

DateTimeStamp - ShortSummaryOfEntry
1 Line 1 of text
2 Line 2 of text
3 Line last of text
--------------------------------------------------------------------------------
DateTimeStamp - ShortSummaryOfEntry
4 Line 1 of text
5 Line 2 of text
6 Line last of text
--------------------------------------------------------------------------------
DateTimeStamp - ShortSummaryOfEntry
7 Line 1 of text
8 Line 2 of text
9 Line last of text
--------------------------------------------------------------------------------
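Since the question mentions loading the records into a database that has not been chosen yet, here is a hedged sketch of one way the (header, body) pairs could be stored. It uses Python's built-in `sqlite3` purely as a stand-in; the `parse_log` helper, the `entries` table, and its column names are all hypothetical.

```python
import re
import sqlite3

# Same pattern as above; re.S lets "." span newlines.
PAT = re.compile(r"={19}\n(.+?)\n={19}\s*(.+?)\s*(?=={19}|\Z)", re.S)


def parse_log(text):
    """Hypothetical helper: return one (header, body) tuple per log entry."""
    return PAT.findall(text)


sep = "=" * 19
# Shortened, made-up sample in the question's layout.
sample = (
    f"{sep}\nHeader A\n{sep}\n\nLine 1\nLine 2\n\n"
    f"{sep}\nHeader B\n{sep}\n\nLine 3\n"
)

# In-memory database as a placeholder for whatever system is chosen later.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (header TEXT, body TEXT)")
conn.executemany("INSERT INTO entries VALUES (?, ?)", parse_log(sample))
conn.commit()

rows = conn.execute(
    "SELECT header, body FROM entries ORDER BY rowid"
).fetchall()
for header, body in rows:
    print(header, "->", repr(body))
```

Passing the tuples through `?` placeholders keeps the log text out of the SQL string itself, which matters if entries can contain quotes.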

Answered 2025-04-14 by Python大师
