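Using a Python regular expression to extract a word from a string only where it is not inside HTML tags
I want to search for a word in a piece of text that also contains HTML code. I only want to find the word where it appears in plain text, not where it appears inside an HTML tag.
For example: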
post_content = """I have a question about xyz.
I have a question about xyz .
I have a question about xyz?
I have a question about <a href="hello">xyz</a>.
I have a question about <a href="hello">abc xyz</a>
xyz
*xyz"""
I do not want to match the xyz inside <a></a>.
Please give me a regular expression. I tried [^<.+?>]xyz.
See this demo.
Updated code:
import re

# post_content is the string defined above
keyword = "xyz"
pattern = r"(?!((?!<).)*<\/)%s" % keyword
replace = "<a href='#'>xyz</a>"
post_content = re.sub(pattern, replace, post_content)
print("post_content", post_content)
2 Answers
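You can use a technique called negative lookahead to find every xyz string that is not inside a tag. The pattern below matches xyz only when it is not followed, before any other angle bracket, by a closing tag (</):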
xyz(?![^<>]*<\/)
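For readers less familiar with lookaheads, the same expression can be written with re.VERBOSE so that each piece carries a comment. This is only an annotated sketch of the pattern above; the names PATTERN and sample are illustrative and not part of the original answer.

```python
import re

# Annotated form of xyz(?![^<>]*</); it behaves the same as the one-liner.
PATTERN = re.compile(r"""
    xyz          # the literal keyword
    (?!          # negative lookahead: reject the match if what follows is
        [^<>]*   #   any run of characters that are not angle brackets
        </       #   running directly into a closing tag
    )
""", re.VERBOSE)

sample = 'plain xyz and <a href="hello">linked xyz</a>'
matches = [m.start() for m in PATTERN.finditer(sample)]
print(matches)  # [6]: only the xyz in plain text is found
```

Because the lookahead only inspects what comes after the keyword, it does not exclude text inside an attribute value such as title="xyz"; if that matters for your input, an HTML parser is likely the safer choice.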
>>> import re
>>> s = """I have a question about xyz.
... I have a question about xyz .
... I have a question about xyz?
... I have a question about <a href="hello">xyz</a>.
... I have a question about <a href="hello">abc xyz</a>
... xyz
... *xyz"""
>>> m = re.findall(r'xyz(?![^<>]*<\/)', s)
>>> for i in m:
...     print(i)
...
xyz
xyz
xyz
xyz
xyz
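To apply this to the replacement shown in the question, the keyword can be placed in front of the same lookahead and passed to re.sub. The following is a minimal sketch rather than the answerer's own code: the function name link_keyword and the href default are made up for illustration, and re.escape is added on the assumption that the keyword should be matched literally.

```python
import re

def link_keyword(text, keyword, href="#"):
    """Wrap each keyword occurrence that is not followed by a closing tag."""
    # re.escape keeps characters such as '.' or '*' in the keyword literal.
    pattern = re.compile(re.escape(keyword) + r"(?![^<>]*</)")
    # A function replacement avoids backslash processing in the replacement text.
    return pattern.sub(lambda m: "<a href='%s'>%s</a>" % (href, m.group(0)), text)

post_content = 'About xyz.\nAbout <a href="hello">abc xyz</a>\n*xyz'
print(link_keyword(post_content, "xyz"))
```

Here the first and last xyz are wrapped in a link, while the one already inside <a href="hello">...</a> is left alone.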