How to match and replace multiple strings with regex in Python

Posted 2024-04-19 01:49:08


I'm trying to replace some text in Python using regex.

My text looks like this:

WORKGROUP 1. John Doe ID123, Jane Smith ID456, Ohe Keedoke ID7890
Situation paragraph 1

WORKGROUP 2. John Smith ID321, Jane Doe ID654
Situation paragraph 2

What I want to do is put the names in double square brackets and remove the IDs, so that it ends up like this.

WORKGROUP 1. [[John Doe]], [[Jane Smith]], [[Ohe Keedoke]]
Situation paragraph 1

WORKGROUP 2. [[John Smith]], [[Jane Doe]]
Situation paragraph 2

This is what I have so far (text holds the input shown above).

text = re.sub(r"(WORKGROUP\s\d\.\s)", r"\1[[", text)
text = re.sub(r"(WORKGROUP\s\d\..+?)(?:\s\b\w+\b),(?:\s)(.+\n)", r"\1]], [[\2", text)
text = re.sub(r"(WORKGROUP\s\d\..+?)(?:\s\b\w+\b)(\n)", r"\1]]\2", text)

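This works for groups with two people (workgroup 2), but if there are more than two people, it keeps every ID except the first and the last. So workgroup 1 ends up like this.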

WORKGROUP 1. [[John Doe]], [[Jane Smith ID456, Ohe Keedoke]]
Situation paragraph 1

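Unfortunately, I can't do something like this

re.sub(r"((\s\b\w+\b),(\s))+", r"\1]], [[\2", text)

because it would also match inside the situation paragraphs.

My question is: is it possible to do multiple matches/replacements within one segment of a string, without applying them to the whole string?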


3 answers

If you have the regex module installed:

(?<=\bWORKGROUP\s+\d+\.\s|,)\s*(.+?)\s*ID\d+\s*(?=,|$)

may work.

If it is not installed, just run this in a terminal:

$ pip install regex

or

$ pip3 install regex

Here we assume that other ID\d+ tokens may appear elsewhere in your text; if they do not, the problem becomes much simpler.

Test

import regex as re

regex = r"(?<=\bWORKGROUP\s+\d+\.\s|,)\s*(.+?)\s*ID\d+\s*(?=,|$)"

test_str = '''

WORKGROUP 1. John Doe ID123, Jane Smith ID456, Ohe Keedoke ID7890
Situation paragraph 1
WORKGROUP 2. John Smith ID321, Jane Doe ID654
Situation paragraph 2

WORKGROUP 11. Bob Doe ID123, Alice Doe ID123, John Doe ID123, Jane Smith ID456, Ohe Keedoke ID7890
Situation paragraph 1

WORKGROUP 21. John Smith ID321, Jane Doe ID654
Situation paragraph 2

'''


subst = "[[\\1]]"

print(re.sub(regex, subst, test_str, flags=re.MULTILINE))

Output

WORKGROUP 1. [[John Doe]],[[Jane Smith]],[[Ohe Keedoke]]
Situation paragraph 1
WORKGROUP 2. [[John Smith]],[[Jane Doe]]
Situation paragraph 2

WORKGROUP 11. [[Bob Doe]],[[Alice Doe]],[[John Doe]],[[Jane Smith]],[[Ohe Keedoke]]
Situation paragraph 1

WORKGROUP 21. [[John Smith]],[[Jane Doe]]
Situation paragraph 2

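If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also see in this link how it matches against some sample inputs.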
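As a side note on why this answer relies on the third-party regex module: the lookbehind in the pattern has variable width (\s+, \d+, and two alternatives of different lengths), which the standard library's re module does not accept. The following is a minimal sketch that only checks whether re will compile a pattern; the helper name stdlib_accepts is hypothetical and not part of the original answer.

```python
import re

# The pattern from the answer above; its lookbehind is variable-width.
PATTERN = r"(?<=\bWORKGROUP\s+\d+\.\s|,)\s*(.+?)\s*ID\d+\s*(?=,|$)"


def stdlib_accepts(pattern: str) -> bool:
    """Return True if the standard library's re module can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        # re raises an error for lookbehinds that are not fixed-width.
        return False
    return True


print(stdlib_accepts(PATTERN))           # variable-width lookbehind
print(stdlib_accepts(r"(?<=,)\s*\w+"))   # fixed-width lookbehind
```

If the first check reports that the pattern is rejected, installing regex as described above (or switching to an approach that avoids lookbehind) is the way around it.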


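You can nest the substitutions: have the outer substitution find the lines that start with WORKGROUP, then have the inner substitution find and replace the comma-separated tokens within each of those lines: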

re.sub(
    r'^(WORKGROUP\s+\d+\.\s*)(.*)',
    lambda m: m.group(1) + re.sub(r'([^,\s][^,]*)\s+\S+(?=,|$)', r'[[\1]]', m.group(2)),
    text,
    flags=re.MULTILINE
)

So that given:

text = '''WORKGROUP 1. John Doe ID123, Jane Smith ID456, Ohe Keedoke ID7890
Situation paragraph 1

WORKGROUP 2. John Smith ID321, Jane Doe ID654
Situation paragraph 2'''

the expression returns:

WORKGROUP 1. [[John Doe]], [[Jane Smith]], [[Ohe Keedoke]]
Situation paragraph 1

WORKGROUP 2. [[John Smith]], [[Jane Doe]]
Situation paragraph 2

Demo: https://repl.it/@blhsing/BoldElderlyQuerylanguage
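The nested substitution above can be wrapped into a self-contained function to check the property the question asks about, namely that situation paragraphs are left alone even when they contain comma-separated, ID-like tokens. This is only a sketch: the function name bracket_names and the extra paragraph text are assumptions made for illustration, while the two patterns are copied from the answer.

```python
import re


def bracket_names(text: str) -> str:
    """Wrap names in [[...]] and drop the trailing ID, on WORKGROUP lines only."""

    def fix_line(m: re.Match) -> str:
        # Inner substitution: runs only on the member list of one WORKGROUP line.
        members = re.sub(r'([^,\s][^,]*)\s+\S+(?=,|$)', r'[[\1]]', m.group(2))
        return m.group(1) + members

    # Outer substitution: picks out lines that start with "WORKGROUP <n>."
    return re.sub(r'^(WORKGROUP\s+\d+\.\s*)(.*)', fix_line, text,
                  flags=re.MULTILINE)


sample = (
    "WORKGROUP 1. John Doe ID123, Jane Smith ID456, Ohe Keedoke ID7890\n"
    "Situation paragraph 1, see Bob ID999, done\n"
)
print(bracket_names(sample))
```

Because the inner pattern never sees the paragraph line, the "Bob ID999" token in the sample is not rewritten.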

Code

import re

test = """
WORKGROUP 1. John Doe ID123, Jane Smith ID456, Ohe Keedoke ID7890
Situation paragraph 1

WORKGROUP 2. John Smith ID321, Jane Doe ID654
Situation paragraph 2
"""

# Raw strings avoid the invalid "\." escape warning in newer Python versions.
test = re.sub(r' ID[0-9]+, ', ']], [[', test)
test = re.sub(r'\. ', '. [[', test)
test = re.sub(r' ID[0-9]+', ']]', test)
print(test)

Output

WORKGROUP 1. [[John Doe]], [[Jane Smith]], [[Ohe Keedoke]]
Situation paragraph 1

WORKGROUP 2. [[John Smith]], [[Jane Doe]]
Situation paragraph 2
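One caveat with the three replacements above: they run over the whole text, so a ". " or an " ID123" inside a situation paragraph would be rewritten as well. A possible workaround, sketched below under the assumption that every group line starts with the literal word WORKGROUP, is to apply the same replacements line by line and skip everything else. The function name bracket_workgroup_lines is hypothetical.

```python
import re


def bracket_workgroup_lines(text: str) -> str:
    """Apply the three replacements only to lines starting with WORKGROUP."""
    out = []
    for line in text.splitlines(keepends=True):
        if line.startswith("WORKGROUP"):
            # Close one name and open the next at each "ID..., " separator.
            line = re.sub(r" ID[0-9]+, ", "]], [[", line)
            # Open the first name right after "WORKGROUP <n>. " (first match only).
            line = re.sub(r"\. ", ". [[", line, count=1)
            # Close the last name by replacing the remaining trailing ID.
            line = re.sub(r" ID[0-9]+", "]]", line)
        out.append(line)
    return "".join(out)


sample = (
    "WORKGROUP 2. John Smith ID321, Jane Doe ID654\n"
    "See Dr. Lee ID42 today. Done.\n"
)
print(bracket_workgroup_lines(sample))
```

The second line of the sample is passed through unchanged, whereas the whole-text version would insert brackets into it.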
