Multiple regular expression search and replace

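Asked 2025-04-17 13:22

I'm trying to write a simple script that reads regular expressions from one file and uses them to search and replace in another file. The code I have now doesn't work: the file is not changed at all. What am I doing wrong?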

import re, fileinput

separator = ' => '

file = open("searches.txt", "r")

for search in file:
    pattern, replacement = search.split(separator)
    pattern = 'r"""' + pattern + '"""'
    replacement = 'r"""' + replacement + '"""'
    for line in fileinput.input("test.txt", inplace=1):
        line = re.sub(pattern, replacement, line)
        print(line, end="")

searches.txt contains:

<p (class="test">.+?)</p> => <h1 \1</h1>
(<p class="not">).+?(</p>) => \1This was changed by the script\2

and test.txt contains:

<p class="test">This is an element with the test class</p>
<p class="not">This is an element without the test class</p>
<p class="test">This is another element with the test class</p>

I ran a test to check whether the expressions are read from the file correctly:

>>> separator = ' => '
>>> file = open("searches.txt", "r")
>>> for search in file:
...     pattern, replacement = search.split(separator)
...     pattern = 'r"""' + pattern + '"""'
...     replacement = 'r"""' + replacement + '"""'
...     print(pattern)
...     print(replacement)
... 
r"""<p (class="test">.+?)</p>"""
r"""<h1 \1</h1>
"""
r"""(<p class="not">).+?(</p>)"""
r"""\1This was changed by the script\2"""

The closing triple quotes of the first replacement end up on a new line. Could that be the cause of my problem?


2 Answers

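Two observations:

1) When reading the file, call .strip(), like this:

pattern, replacement = search.strip().split(separator)

This removes the trailing \n newline from each line read from the file.

2) If you need special characters in a pattern to be matched literally, use re.escape() rather than the r"""+ str +""" approach you have now. Note that re.escape() escapes every regex metacharacter, so it should not be applied to patterns such as the ones in searches.txt, which rely on groups and quantifiers.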
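A small sketch of both observations, assuming a single line shaped like the first entry in searches.txt. The variable names (raw_line, replacement_raw, escaped, text) are made up for illustration:

```python
import re

separator = ' => '
# One line as it would come back from iterating over the file, newline included.
raw_line = '<p (class="test">.+?)</p> => <h1 \\1</h1>\n'

# Without strip(), the replacement keeps the trailing newline.
_, replacement_raw = raw_line.split(separator)

# With strip(), the newline is gone before the split.
pattern, replacement = raw_line.strip().split(separator)

# re.escape() makes every metacharacter literal, so the group and the
# quantifier in the pattern stop working as regex syntax.
escaped = re.escape(pattern)

text = '<p class="test">Hello</p>'
print(repr(replacement_raw))                 # ends with \n
print(repr(replacement))                     # no trailing newline
print(re.search(pattern, text) is not None)  # the regex matches
print(re.search(escaped, text) is not None)  # the escaped version does not
```

So strip() addresses the stray newline, while re.escape() is only useful when the search text is meant to be matched character for character.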

Answered 2025-04-17 by Python大师


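You don't need

pattern = 'r"""' + pattern + '"""'

The pattern passed to re.sub should be the actual regular expression, that is, <p (class="test">.+?)</p>. Once you wrap it in r""" and """, those characters become part of the pattern itself, so it can never match the text in your file.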
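To see the effect, here is a minimal sketch using the first line of test.txt and the first pattern from searches.txt. The names wrapped, unchanged and changed are only for illustration:

```python
import re

line = '<p class="test">This is an element with the test class</p>'
pattern = '<p (class="test">.+?)</p>'

# What the question's code builds: the characters r""" and """ are now
# literally part of the regular expression.
wrapped = 'r"""' + pattern + '"""'

unchanged = re.sub(wrapped, r'<h1 \1</h1>', line)  # no match, line returned as-is
changed = re.sub(pattern, r'<h1 \1</h1>', line)    # plain pattern matches

print(unchanged)
print(changed)
```

The wrapped version would only match text that itself contains r""" followed by the paragraph, which test.txt never does.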

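You may have seen code like this:

replaced = re.sub(r"""\w+""", '-', text)

Here, r""" tells the Python interpreter that what follows is a "raw" multi-line string, that is, a string in which backslash sequences are not interpreted (\n is not turned into a newline, for example). Python programmers usually write regular expressions as raw strings because they want to use regex sequences such as \w above without escaping the backslash again. Without a raw string the regex would have to be written '\\w+', which is confusing.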

Either way, you don't need triple double quotes at all. The snippet above could simply be written as:

replaced = re.sub(r'\w+', '-', text)
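A quick check of the raw-string point, as a sketch. The sample string 'hello world' is an assumption added for the demo:

```python
import re

# The r prefix only changes how the source literal is read; both spellings
# produce the same two-character-plus string at runtime.
raw = r'\w+'
escaped = '\\w+'
print(raw == escaped)

# In a raw literal, \n stays as a backslash followed by the letter n.
print(len(r'\n'), len('\n'))

# Triple quotes make no difference to the resulting pattern either.
replaced = re.sub(r'\w+', '-', 'hello world')
replaced_triple = re.sub(r"""\w+""", '-', 'hello world')
print(replaced, replaced_triple)
```

Since the r prefix exists only in source code, a pattern read from a file is already "raw" in this sense and needs no extra quoting.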

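Finally, your other problem is that the input file has a newline separating each pattern/replacement pair. Each line is really "pattern => replacement\n", and the trailing newline ends up at the end of your replacement variable. Try this: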

for search in file:
    search = search.rstrip() #Remove the trailing \n from the input
    pattern, replacement = search.split(separator)
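Putting the two fixes together, one possible version of the whole script is sketched below. It is not the asker's code: the helper names load_rules and apply_rules are made up, the demo writes the sample files to a temporary directory so it can run anywhere, and it loads all rules first so that the target file is rewritten once instead of once per rule.

```python
import fileinput
import os
import re
import tempfile

SEPARATOR = ' => '

def load_rules(path):
    """Read 'pattern => replacement' lines into (compiled regex, replacement) pairs."""
    rules = []
    with open(path) as f:
        for raw in f:
            raw = raw.rstrip('\n')  # drop the trailing newline
            if not raw:
                continue            # skip blank lines
            pattern, replacement = raw.split(SEPARATOR, 1)
            rules.append((re.compile(pattern), replacement))
    return rules

def apply_rules(rules, target):
    """Rewrite target in place, applying every rule to each line."""
    with fileinput.input(target, inplace=True) as lines:
        for line in lines:
            for regex, replacement in rules:
                line = regex.sub(replacement, line)
            print(line, end='')  # stdout is redirected into the file

# Demo with the sample data from the question, in a temporary directory.
workdir = tempfile.mkdtemp()
searches = os.path.join(workdir, 'searches.txt')
target = os.path.join(workdir, 'test.txt')
with open(searches, 'w') as f:
    f.write('<p (class="test">.+?)</p> => <h1 \\1</h1>\n')
    f.write('(<p class="not">).+?(</p>) => \\1This was changed by the script\\2\n')
with open(target, 'w') as f:
    f.write('<p class="test">This is an element with the test class</p>\n')
    f.write('<p class="not">This is an element without the test class</p>\n')
    f.write('<p class="test">This is another element with the test class</p>\n')

apply_rules(load_rules(searches), target)
with open(target) as f:
    result = f.read()
```

Keeping the original loop order (rules outside, file inside) should also work once the quoting and the newline are fixed; it just rewrites test.txt once for every rule.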

Answered 2025-04-17 by Python大师
