Extracting text from a string using regular expressions

3 Answers

User
Reply #1 · Edited on 2024-05-13 23:42:54

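You can use
re.findall(r'(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)', s)
See the regex demo.
Details
(?m)^ - start of a line
=+ - one or more = characters
[^\S\r\n]* - zero or more whitespace characters other than CR and LF
(.*?) - Group 1: zero or more characters other than line break characters, as few as possible
[^\S\r\n]* - zero or more whitespace characters other than CR and LF
=+ - one or more = characters
\s* - zero or more whitespace characters
(.*(?:\r?\n(?!=+.*?=).*)*) - Group 2:
    .* - zero or more characters other than line break characters, as many as possible
    (?:\r?\n(?!=+.*?=).*)* - zero or more repetitions of:
        \r?\n(?!=+.*?=) - an optional CR followed by an LF, provided the line break is not followed by one or more = characters, then any characters other than line breaks (as few as possible), then a = character
        .* - zero or more characters other than line break characters, as many as possible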
Python demo:

import re

rx = r"(?m)^=+[^\S\r\n]*(.*?)[^\S\r\n]*=+\s*(.*(?:\r?\n(?!=+.*?=).*)*)"
s = "== Title1 ==\n..........................\n.............\nEnd of Paragraph\n===Title2 ===\n.............\n.............\n............."
print(re.findall(rx, s))

Output:

[('Title1', '..........................\n.............\nEnd of Paragraph'), ('Title2', '.............\n.............\n.............')]
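The same pattern can be easier to maintain when written with re.VERBOSE and named groups. The following is only a sketch under that assumption: the names SECTION_RX, split_sections, title and body are made up for illustration and do not appear in the answer above.

```python
import re

# Same pattern as in the answer, spelled out with re.VERBOSE so that
# each part can carry its own comment. Group names are illustrative.
SECTION_RX = re.compile(
    r"""
    ^=+                 # opening run of = at the start of a line
    [^\S\r\n]*          # horizontal whitespace only
    (?P<title>.*?)      # title text, as short as possible
    [^\S\r\n]*          # horizontal whitespace only
    =+                  # closing run of =
    \s*                 # whitespace, including the line break after the title
    (?P<body>
        .*                       # rest of the current line
        (?:\r?\n(?!=+.*?=).*)*   # following lines that do not look like titles
    )
    """,
    re.MULTILINE | re.VERBOSE,
)


def split_sections(text):
    """Return a dict mapping each title to the text of its section."""
    return {m.group("title"): m.group("body") for m in SECTION_RX.finditer(text)}


if __name__ == "__main__":
    sample = "== Title1 ==\nline a\nline b\n===Title2 ===\nline c"
    print(split_sections(sample))
```

A dict keeps only the last section when two titles are identical; if that can happen in your input, collecting (title, body) tuples in a list, as re.findall does, is the safer choice.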

User
Reply #2 · Edited on 2024-05-13 23:42:54

Maybe this helps to find the title of each paragraph and the lines that belong to it.
text = """== Title1 == // Paragraph starts
.............
............. // Some texts
.............
End of Paragraph
===Title2 === // Paragraph starts
.............
............. // Some texts
.............
"""

import re

reg = re.compile(r'(?:[=]+\s*\w+\s*[=]+)')
for i in text.split('\n'):
    if re.search(reg, i):
        t = re.sub(r'=', '', i)
        print('Title:', t.strip())
    else:
        print('line:', i.strip())

# Output like this
# Title: Title1 // Paragraph starts
# line: .............
# line: ............. // Some texts
# line: .............
# line: End of Paragraph
# Title: Title2 // Paragraph starts
# line: .............
# line: ............. // Some texts
# line: .............
# line:
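If the goal is to collect the lines under each title rather than print them, the same line-by-line idea can fill a dict. This is a rough sketch, not the answerer's code: group_lines_by_title and TITLE_RX are hypothetical names, and the capture group around \w+ is an addition, so the key is the bare title (for example Title1) without any text that follows the closing = signs.

```python
import re

# The answer's title pattern, with a capture group added around the title word.
TITLE_RX = re.compile(r"=+\s*(\w+)\s*=+")


def group_lines_by_title(text):
    """Map each title to the list of non-empty lines that follow it."""
    sections = {}
    current = None  # title of the section being filled, if any
    for line in text.splitlines():
        match = TITLE_RX.search(line)
        if match:
            current = match.group(1)
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


if __name__ == "__main__":
    sample = "== Title1 ==\na\nb\n===Title2 ===\nc\n"
    print(group_lines_by_title(sample))
```

Because the title is matched with \w+, a title containing spaces would not be recognised; the pattern would need to be loosened for that case.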

User
Reply #3 · Edited on 2024-05-13 23:42:54

You can try this:

import re

x = "== Title1   ==="
ptrn = r"[=]{1,}[\s]{0,}[\w]+[\s]{0,}[=]{1,}"  # raw string so \s and \w reach the regex engine unchanged
if re.search(ptrn, x):
    x = x.replace('=', '').strip()  # x is now 'Title1'

This gives you Title1. Or, if you want all matching titles in a list, you can do:

x = '== Title1   ===nansnsk fnasasklsanlkas lkaslkans \n== Title2 ==='
titles = [i.replace('=', '').strip() for i in re.findall(ptrn, x)]
# Output: ['Title1', 'Title2']

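The pattern (shown here with a leading ^ anchor, which the code above leaves out) is:

"^[=]{1,}[\s]{0,}[\w]+[\s]{0,}[=]{1,}"

^[=]{1,} - match at least one equal sign at the start
[\s]{0,} - match zero or more whitespace characters
[\w]+ - match a word character ([a-zA-Z0-9_] in ASCII mode) one or more times
[\s]{0,} - match zero or more whitespace characters again
[=]{1,} - match at least one closing equal sign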

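After that, we can extract the text by replacing = with '' and stripping the surrounding whitespace. You can try it on regex101, which is very useful when testing regexes.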
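A capture group can do the replace-and-strip step inside the regex itself. The sketch below is one possible variation, assuming each title sits at the start of its own line; TITLE_RX and find_titles are illustrative names. Note that with a ^ anchor, re.findall only matches at the very start of the string unless re.MULTILINE is set, so the flag is needed to pick up titles on later lines.

```python
import re

# ^ matches at the start of every line because of re.MULTILINE.
# The single capture group makes findall return just the title text.
TITLE_RX = re.compile(r"^=+\s*(\w+)\s*=+", re.MULTILINE)


def find_titles(text):
    """Return the titles found at the start of lines in text."""
    return TITLE_RX.findall(text)


if __name__ == "__main__":
    x = "== Title1   ===\nsome body text\n== Title2 ==="
    print(find_titles(x))  # ['Title1', 'Title2']
```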
