Advanced regular expressions in Python

Posted 2024-05-14 19:17:10


I have some text that looks like this:

TTL1 | TTL2 | TTL3
some text in a line1
some text in a line2
some text in a line3
TTL1 | TTL2 | 
TTL3
some text in a line1
some text in a line2
some text in a line3
some text in a line4
some text in a line5
TTL1 | TTL2 | TTL3
some text in a line1
some text in a line2
some text in a line3
some text in a line4
...

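Explanation: I have header lines, which are sometimes split across several lines, followed by many other lines. I want to capture all the titles (even when they sit on different lines) and capture all the lines that follow the header in a single group.

I am having trouble with the multi-line headers and the multi-line content, and I don't know how to extract them with regex and Python.

Any ideas?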


2 answers

You can try this:

\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\s*\n([^\|]*)(?:\n|$)
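To make the pattern easier to read, here is the same expression written with re.VERBOSE and one comment per piece. This is only a sketch: the name HEADER_BLOCK and the short sample text are made up for illustration, and it assumes the content lines never contain a | character.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each part can be commented.
# HEADER_BLOCK is a hypothetical name chosen for this sketch.
HEADER_BLOCK = re.compile(r"""
    \s*(\w+)\s*\|     # group 1: first title, followed by a pipe
    \s*(\w+)\s*\|     # group 2: second title, followed by a pipe
    \s*(\w+)\s*\n     # group 3: third title (\s* also skips a line break before it)
    ([^|]*)           # group 4: everything up to the next pipe ...
    (?:\n|$)          # ... backed off to the last line break, or the end of the text
""", re.VERBOSE)

# Shortened sample: the second header is split across two lines.
sample = (
    "TTL1 | TTL2 | TTL3\n"
    "line a\n"
    "line b\n"
    "TTL1 | TTL2 |\n"
    "TTL3\n"
    "line c"
)

# findall returns one tuple of the four groups per header block.
blocks = HEADER_BLOCK.findall(sample)
for t1, t2, t3, body in blocks:
    print(t1, t2, t3, body.splitlines())
```

Group 4 grabs text greedily up to the next |, then backtracks to the nearest line break, which is why it stops just before the next header line.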

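Based on the OP's comment, the content lines may, oddly, contain | themselves, which makes it hard to tell headers from content lines. In that case you can try the following solution: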

^\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\n(.*?)(?=^\s*\w+\s*\n*\|\s*\n*\w+\s*\n*\|\s*\n*\w+\s*\n*)|^\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\n(.*)$
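A minimal sketch of how this second pattern might be run. The answer does not state the flags; the sketch assumes re.MULTILINE (so ^ and $ match at line boundaries) and re.DOTALL (so . crosses line breaks). The names PATTERN and parse_blocks and the sample text are hypothetical.

```python
import re

# Pattern copied from the answer, split over three string literals for readability.
# Assumed flags: MULTILINE for ^ / $, DOTALL so that .*? and .* span several lines.
PATTERN = re.compile(
    r"^\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\n(.*?)"
    r"(?=^\s*\w+\s*\n*\|\s*\n*\w+\s*\n*\|\s*\n*\w+\s*\n*)"
    r"|^\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\n(.*)$",
    re.MULTILINE | re.DOTALL,
)

def parse_blocks(text):
    """Return a list of ((title1, title2, title3), [content lines]) pairs."""
    blocks = []
    for m in PATTERN.finditer(text):
        # The first alternative fills groups 1-4; the second one (used for the
        # last block, which has no header after it) fills groups 5-8.
        t1, t2, t3, body = [g for g in m.groups() if g is not None]
        blocks.append(((t1, t2, t3), body.splitlines()))
    return blocks

# One content line contains a pipe; the second header is split across lines.
sample = (
    "TTL1 | TTL2 | TTL3\n"
    "some text in a line1\n"
    "some text | with a pipe\n"
    "TTL1 | TTL2 | \n"
    "TTL3\n"
    "some text in a line1\n"
)

result = parse_blocks(sample)
for titles, lines in result:
    print(titles, lines)
```

A content line that itself looks like a full header (three words separated by two pipes at the start of a line) would still be mistaken for one, so this only helps when content lines carry an occasional |.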


Sample code (using the first pattern):

import re

regex = r"\s*(\w+)\s*\|\s*(\w+)\s*\|\s*(\w+)\s*\n([^\|]*)(?:\n|$)"

test_str = ("TTL1 | TTL2 | TTL3\n"
    "some text in a line1\n"
    "some text in a line2\n"
    "some text in a line3\n"
    "TTL1 | TTL2 | \n"
    "TTL3\n"
    "some text in a line1\n"
    "some text in a line2\n"
    "some text in a line3\n"
    "some text in a line4\n"
    "some text in a line5\n"
    "TTL1 | TTL2 | TTL3\n"
    "some text in a line1\n"
    "some text in a line2\n"
    "some text in a line3\n"
    "some text in a line4")

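# The pattern contains no ".", so re.DOTALL would have no effect and is omitted
matches = re.finditer(regex, test_str)

for match in matches:
  print(match.group(1))  # first title
  print(match.group(2))  # second title
  print(match.group(3))  # third title
  print(match.group(4))  # all lines that follow the header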


Sample output:

TTL1
TTL2
TTL3
some text in a line1
some text in a line2
some text in a line3
TTL1
TTL2
TTL3
some text in a line1
some text in a line2
some text in a line3
some text in a line4
some text in a line5
TTL1
TTL2
TTL3
some text in a line1
some text in a line2
some text in a line3
some text in a line4

An approach using the re.findall() function:

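import re

# lines.txt is a file containing the initial text from your question
with open('lines.txt', 'r') as fh:
    t = fh.read()

# Each match pairs a header (uppercase letters, digits, whitespace, pipes)
# with the run of non-uppercase text that follows it
items = re.findall(r'([A-Z\d\s|]+)([^A-Z]+)', t)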

# 'h' contains header, 'lines' contains the lines related to current header 
for h, lines in items:
    print(h.replace('\n', ' '), lines, sep='\n')

Output:

TTL1 | TTL2 | TTL3 
some text in a line1
some text in a line2
some text in a line3

TTL1 | TTL2 | TTL3 
some text in a line1
some text in a line2
some text in a line3
some text in a line4
some text in a line5

TTL1 | TTL2 | TTL3 
some text in a line1
some text in a line2
some text in a line3
some text in a line4
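
If the titles and lines are needed as data rather than printed text, the same findall pattern can feed a small helper. This is a sketch only: BLOCK, parse and the sample string are hypothetical names, and the pattern works only while titles consist of uppercase letters and digits and the content lines contain no uppercase letters.

```python
import re

# Pattern from the answer above: header characters first, then non-uppercase text.
BLOCK = re.compile(r'([A-Z\d\s|]+)([^A-Z]+)')

def parse(text):
    """Split the text into dicts with the header titles and the content lines."""
    parsed = []
    for header, body in BLOCK.findall(text):
        # A header split across lines still separates cleanly on the pipes;
        # strip() drops the spaces and line breaks around each title.
        titles = [part.strip() for part in header.split('|')]
        parsed.append({'titles': titles, 'lines': body.splitlines()})
    return parsed

# Shortened sample: the first header is split across two lines.
sample = (
    "TTL1 | TTL2 | \n"
    "TTL3\n"
    "some text in a line1\n"
    "some text in a line2\n"
    "TTL1 | TTL2 | TTL3\n"
    "some text in a line1"
)

result = parse(sample)
for block in result:
    print(block['titles'], block['lines'])
```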
