Extracting text from a string using regular expressions

Published 2024-03-29 15:30:11

I have a very large string that contains many paragraphs. Each paragraph starts with a title, and the title follows a specific pattern.

Example:

== Title1 == // Paragraph starts
.............
............. // Some texts
.............
End of Paragraph
===Title2 === // Paragraph starts
.............
............. // Some texts
.............

The title pattern is:

1.) A new paragraph title starts with an equals sign ( = ), which can be followed by any number of further = characters.

2.) After the = characters there can be white space (optional), followed by the title text.

3.) After the title text there can again be white space (optional), followed by any number of equals signs ( = ).

4.) Then the paragraph starts. I have to extract the text until a similar pattern is encountered.

Can someone show me how to do this with a regex? Thanks in advance.


3 answers

You can use

re.findall(r'(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)', s)

See the regex demo.

Details

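  • (?m)^ - start of a line
  • =+ - one or more = characters
  • [^\S\r\n]* - zero or more whitespace characters other than CR and LF
  • (.*?) - Group 1: any zero or more characters other than line break characters, as few as possible
  • [^\S\r\n]* - zero or more whitespace characters other than CR and LF
  • =+ - one or more = characters
  • \s* - zero or more whitespace characters
  • (.*(?:\r?\n(?!=+.*?=).*)*) - Group 2:
    • .* - any zero or more characters other than line break characters, as many as possible
    • (?:\r?\n(?!=+.*?=).*)* - zero or more sequences of:
      • \r?\n(?!=+.*?=) - an optional CR followed by LF, not immediately followed by one or more = characters, then any characters other than line break characters (as few as possible), then a = character
      • .* - any zero or more characters other than line break characters, as many as possible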

Python demo

import re

rx = r"(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)"
s = "== Title1 ==\n..........................\n.............\nEnd of Paragraph\n===Title2 ===\n.............\n.............\n............."
print(re.findall(rx, s))

Output:

[('Title1', '..........................\n.............\nEnd of Paragraph'), ('Title2', '.............\n.............\n.............')]
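For readability, the same pattern can be spelled out with re.VERBOSE and named groups. This is only a sketch: the names HEADING_RX, title, body and split_sections are illustrative and not part of the answer above, and it assumes titles are unique, since later duplicates would overwrite earlier ones in the dict.

```python
import re

# Same pattern as above, written with re.VERBOSE so each part can be commented.
# The group names "title" and "body" are illustrative assumptions.
HEADING_RX = re.compile(r"""
    ^=+ [^\S\r\n]*                  # line start, one or more '=', optional horizontal whitespace
    (?P<title>.*?)                  # title text, as short as possible
    [^\S\r\n]* =+ \s*               # optional whitespace, closing '=' run, trailing whitespace
    (?P<body>
        .*                          # rest of the current line
        (?: \r?\n (?!=+.*?=) .* )*  # following lines that do not start a new heading
    )
""", re.MULTILINE | re.VERBOSE)


def split_sections(text):
    """Return a {title: body} dict built from every heading match in text."""
    return {m.group("title"): m.group("body") for m in HEADING_RX.finditer(text)}


s = "== Title1 ==\nfirst line\nEnd of Paragraph\n===Title2 ===\nsecond body"
sections = split_sections(s)
print(sections)
```

Using finditer with named groups avoids relying on tuple positions, which helps if more groups are added to the pattern later.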

Perhaps this helps to find the title of each paragraph and the lines of each paragraph.

import re

text = """== Title1 == // Paragraph starts
.............
............. // Some texts
.............
End of Paragraph
===Title2 === // Paragraph starts
.............
............. // Some texts
.............
"""

# A title line starts with a run of '=', then a word, then another run of '='.
reg = re.compile(r'=+\s*\w+\s*=+')

for i in text.split('\n'):
    if reg.match(i):  # match() only succeeds at the start of the line
        t = i.replace('=', '')  # drop the '=' markers
        print('Title:', t.strip())
    else:
        print('line:', i.strip())

 # Output like this
   Title: Title1  // Paragraph starts
   line: .............
   line: ............. // Some texts
   line: .............
   line: End of Paragraph
   Title: Title2  // Paragraph starts
   line: .............
   line: ............. // Some texts
   line: .............
   line: 
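If the goal is to collect the lines rather than print them, the same line-by-line idea can be sketched as below. The helper name group_lines and the sample string are hypothetical, and the sketch assumes each title is a single word (\w+), as in the pattern above; lines before the first title are ignored.

```python
import re

# Anchored heading test: '=' run, optional spaces, one word, optional spaces, '=' run.
heading = re.compile(r"^=+\s*(\w+)\s*=+")


def group_lines(text):
    """Map each heading word to the list of stripped lines that follow it."""
    sections = {}
    current = None  # no heading seen yet
    for line in text.splitlines():
        m = heading.match(line)
        if m:
            current = m.group(1)      # start a new section
            sections[current] = []
        elif current is not None:
            sections[current].append(line.strip())
    return sections


sample = "== Title1 ==\nline a\nline b\n===Title2 ===\nline c\n"
result = group_lines(sample)
print(result)
```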

You can try this:

import re

x = "== Title1   ==="
# Raw string so that \s and \w reach the regex engine unchanged.
ptrn = r"[=]{1,}[\s]{0,}[\w]+[\s]{0,}[=]{1,}"
if re.search(ptrn, x):
    x = x.replace('=', '').strip()

This gives you Title1. Or, if you want all matching titles in a list, you can do:

x = '== Title1   ===nansnsk fnasasklsanlkas lkaslkans \n== Title2 ==='
titles = [i.replace('=', '').strip() for i in re.findall(ptrn, x)]
# Output: ['Title1', 'Title2']

The pattern is:

"^[=]{1,}[\s]{0,}[\w]+[\s]{0,}[=]{1,}"

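^[=]{1,} - match at least one equals sign at the start of the string (or of each line, if re.MULTILINE is set)

[\s]{0,} - match zero or more whitespace characters

[\w]+ - match [a-zA-Z0-9_] once or more

[\s]{0,}[=]{1,} - match optional whitespace again, then at least one closing equals sign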

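After that, we can extract the text by replacing = with '' and stripping the surrounding whitespace. You can try this on regex101, which is very useful when testing regexes.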
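As an alternative to replacing = afterwards, the title can be captured directly with a group. The sketch below assumes single-word titles and uses an illustrative sample string; the names title_rx, titles and first_only are not from the answer. It also shows why the ^ anchor needs re.MULTILINE when titles appear on later lines.

```python
import re

# A capturing group around the title word; re.MULTILINE lets ^ match at every line start.
title_rx = re.compile(r"^=+\s*(\w+)\s*=+", re.MULTILINE)

x = "== Title1   ===\nsome body text\n== Title2 ==="

# With one capturing group, findall returns the captured text of each match.
titles = title_rx.findall(x)

# Without re.MULTILINE, ^ only matches at the very start of the string.
first_only = re.findall(r"^=+\s*(\w+)\s*=+", x)

print(titles)
print(first_only)
```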
