How do I use re.sub in Python to add tags around specific strings?

4 votes

2 answers

3707 views

Asked 2025-04-16 07:18

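I am trying to add tags around some query strings, and the tags should wrap every matching string. For example, I want to wrap every word in a sentence that matches the query iphone games mac. The sentence I love downloading iPhone games from my mac. should become I love downloading <em>iPhone games</em> from my <em>mac</em>.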

So far I have tried

import re
sentence = "I love downloading iPhone games from my mac."
query = r'((iphone|games|mac)\s*)+'
regex = re.compile(query, re.I)
sentence = regex.sub(r'<em>\1</em> ', sentence)

The output for this sentence is

I love downloading <em>games </em> from my <em>mac</em> .

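But \1 replaces only one word (games rather than iPhone games), and there is unwanted whitespace after the word. How should I write the regular expression to get the output I want? Thanks!

Edit: I just realized that both Fred's and Chris's solutions have a problem with words that contain a query word. For example, if my query is game, the result is <em>game</em>s, and I do not want that highlighted. Likewise, the the inside either should not be highlighted.

Edit 2: I adopted Chris's new solution, and it works.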


2 Answers

>>> r = re.compile(r'(\s*)((?:\s*\b(?:iphone|games|mac)\b)+)', re.I)
>>> r.sub(r'\1<em>\2</em>', sentence)
'I love downloading <em>iPhone games</em> from my <em>mac</em>.'

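The extra group wraps the whole plus-repetition, so no word is lost. Moving the whitespace in front of each word, and consuming any leading whitespace in a separate group that stays outside the tag, takes care of the stray space. The word boundaries (\b) require each of the three words to match as a whole word. That said, natural language processing is hard, and there will still be cases that do not work as expected.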
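To see where the whitespace ends up, it can help to inspect the two groups directly. The following is a small sketch that reuses the pattern above as-is; the variable names `pattern` and `groups` are only illustrative.

```python
import re

sentence = "I love downloading iPhone games from my mac."
pattern = re.compile(r'(\s*)((?:\s*\b(?:iphone|games|mac)\b)+)', re.I)

# For every match, collect (leading whitespace, phrase to be wrapped).
# Group 1 stays outside the tag; group 2 is what goes between <em> and </em>.
groups = [(m.group(1), m.group(2)) for m in pattern.finditer(sentence)]
print(groups)  # [(' ', 'iPhone games'), (' ', 'mac')]

highlighted = pattern.sub(r'\1<em>\2</em>', sentence)
print(highlighted)
```

Because the leading whitespace is captured separately and written back before the opening tag, the tagged phrase itself never starts or ends with a space.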

Answered 2025-04-16 by Python大师


First, to get the whitespace to behave the way you want, replace \s* with \s*?, which makes it non-greedy.

First fix:

>>> re.compile(r'(((iphone|games|mac)\s*?)+)', re.I).sub(r'<em>\1</em>', sentence)
'I love downloading <em>iPhone</em> <em>games</em> from my <em>mac</em>.'

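However, once \s* is non-greedy, the phrase gets split into separate tags, as you can see. If you keep it greedy, the two words are grouped together, but the trailing space ends up inside the tag: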

>>> re.compile(r'(((iPhone|games|mac)\s*)+)').sub(r'<em>\1</em>', sentence)
'I love downloading <em>iPhone games </em>from my <em>mac</em>.'

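I cannot think of a way around this yet.

Also, I added an extra pair of parentheses around the + so that the whole repeated match is captured, not just its last repetition. That is what makes the difference.

Further update: actually, I have thought of a solution. You can decide whether it is what you want.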
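A short sketch of why the extra parentheses matter: in Python's re module, a capturing group that sits inside a repetition keeps only the text of its last iteration. The names `inner` and `outer` below are just for illustration.

```python
import re

sentence = "I love downloading iPhone games from my mac."

# Group 1 is inside the repetition, so it holds only the last iteration.
inner = re.search(r'((iphone|games|mac)\s*)+', sentence, re.I)

# Group 1 wraps the repetition, so it holds everything that was repeated.
outer = re.search(r'(((iphone|games|mac)\s*)+)', sentence, re.I)

print(repr(inner.group(0)))  # 'iPhone games '  (the full match)
print(repr(inner.group(1)))  # 'games '         (last iteration only)
print(repr(outer.group(1)))  # 'iPhone games '
```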

>>> regex = re.compile(r'((iphone|games|mac)(\s*(iphone|games|mac))*)', re.I)
>>> regex.sub(r'<em>\1</em>', sentence)
'I love downloading <em>iPhone games</em> from my <em>mac</em>.'

Update: given your point about word boundaries, we only need to add \b, the word-boundary matcher, in a few places.

>>> regex = re.compile(r'(\b(iphone|games|mac)\b(\s*(iphone|games|mac)\b)*)', re.I)
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone games from my mac')
'I love downloading <em>iPhone games</em> from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone gameses from my mac')
'I love downloading <em>iPhone</em> gameses from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhoney games from my mac')
'I love downloading iPhoney <em>games</em> from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhoney gameses from my mac')
'I love downloading iPhoney gameses from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading miPhone gameses from my mac')
'I love downloading miPhone gameses from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading miPhone games from my mac')
'I love downloading miPhone <em>games</em> from my <em>mac</em>'
>>> regex.sub(r'<em>\1</em>', 'I love downloading iPhone igames from my mac')
'I love downloading <em>iPhone</em> igames from my <em>mac</em>'
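If the query words come from user input rather than being hard-coded, the same pattern shape could be built programmatically. The following is only a sketch: `highlight`, `words`, and `tag` are hypothetical names, and it assumes every query word consists of word characters only (otherwise \b may not behave as intended).

```python
import re

def highlight(text, words, tag="em"):
    """Wrap runs of whole-word matches in <tag>...</tag> (hypothetical helper)."""
    # Escape each word so regex metacharacters in the query are taken literally.
    alternatives = "|".join(re.escape(w) for w in words)
    # One whole word, optionally followed by more whole words separated by whitespace.
    pattern = re.compile(
        rf"\b(?:{alternatives})\b(?:\s*\b(?:{alternatives})\b)*", re.I
    )
    return pattern.sub(lambda m: f"<{tag}>{m.group(0)}</{tag}>", text)

print(highlight("I love downloading iPhone games from my mac.",
                ["iphone", "games", "mac"]))
print(highlight("I love games", ["game"]))   # "games" is left alone
print(highlight("either the one", ["the"]))  # only the standalone "the" is wrapped
```

Using a replacement function instead of a \1 template avoids the need for a capturing group around the whole match.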

Answered 2025-04-16 by Python大师
