Python regex: match a phrase with word boundaries only when it is not inside parentheses
What I currently have:
regex = r'\bon the\b'
But I need my regex to match the keyword (here "on the") only when it is not inside parentheses in the text:
Should match:
john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)
Should not match:
(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)
3 Answers
0
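On UNIX systems, the grep tool with the following regular expressions is enough. In grep's default (basic) syntax an unescaped ( is a literal parenthesis, while \( opens a group, so the parentheses are left unescaped here.
grep " on the " input_file_name | grep -v "(.* on the .*)"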
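Since the question is about Python, the same two-step filter can be sketched there as well. This is only an approximation of the pipeline above: the function name `grep_like` and the sample lines are assumptions made for illustration, and like the grep version it relies on the spaces around the keyword.

```python
import re

def grep_like(lines, keyword="on the"):
    """Keep lines containing ' keyword ' unless it sits between ( and )."""
    spaced = " " + keyword + " "
    # Second stage, like `grep -v`: the spaced keyword between "(" and ")"
    inside = re.compile(r"\(.*" + re.escape(spaced) + r".*\)")
    return [ln for ln in lines if spaced in ln and not inside.search(ln)]

# Hypothetical sample, taken from the question's examples
sample = [
    "john is on the beach",
    "(my son is )on the beach",
    "the spon (is on the table)",
]
print(grep_like(sample))
```

The second line is dropped only because ")on the" has no space before "on", so this approach is sensitive to spacing.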
0
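You could try something like this: ^(.*)(?:\(.*\))(.*)$
See it in action.
As you asked, it "matches only the words that are not inside parentheses in the text".
For example, from the following text:
some text (more text in parentheses) and some text not in parentheses
the match is: some text
+ and some text not in parentheses
More examples can be found at the link above.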
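To try that pattern locally, a minimal sketch with Python's `re` module could look like the following. The variable names are illustrative only. Note that the pattern requires a parenthesized part, so a line without parentheses does not match at all.

```python
import re

pattern = re.compile(r"^(.*)(?:\(.*\))(.*)$")
text = "some text (more text in parentheses) and some text not in parentheses"

m = pattern.match(text)
if m:
    # Group 1 is the text before the parentheses, group 2 the text after
    before, after = m.groups()
    print(repr(before))
    print(repr(after))

# A line with no parentheses gives no match object
print(pattern.match("john is on the beach"))
```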
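Edit: the question has changed, so I have updated my answer.
To capture every mention that is not inside parentheses, I would use a bit of code rather than one complex regex.
Something like this gets close to what you want: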
import re

pattern = r"(on the)"
test_text = '''john is on the beach
let me put this on the fridge
he (my son) is on the beach
arnold is on the road (to home)
(my son is )on the beach
john is at the beach
bob is at the pool (berkeley)
the spon (is on the table)'''

# Non-greedy, so two parenthesized groups on one line are removed separately
bracket_pattern = r"\(.*?\)"

for line in test_text.split('\n'):
    # Remove everything between ( and ), then search what is left
    stripped = re.sub(bracket_pattern, "", line)
    matches = re.findall(pattern, stripped)
    print(line, "->", " ".join(matches))
Output:
john is on the beach -> on the
let me put this on the fridge -> on the
he (my son) is on the beach -> on the
arnold is on the road (to home) -> on the
(my son is )on the beach -> on the (this is the only one that doesn't work)
john is at the beach ->
bob is at the pool (berkeley) ->
the spon (is on the table) ->
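Removing parenthesized text with a regex gets awkward once parentheses are nested. One possible alternative, in the same "use a bit of code" spirit, is a small scanner that tracks nesting depth. This is a sketch, not part of the answer above: the function name `find_outside_parens` is hypothetical, it does not check word boundaries, and like the code above it still reports "(my son is )on the beach".

```python
def find_outside_parens(text, keyword="on the"):
    """Return start indexes of keyword occurrences at parenthesis depth 0."""
    hits = []
    depth = 0
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # tolerate a stray ")"
        elif depth == 0 and text.startswith(keyword, i):
            hits.append(i)
            i += len(keyword)  # skip past the keyword just recorded
            continue
        i += 1
    return hits

print(find_outside_parens("john is on the beach"))
print(find_outside_parens("the spon (is on the table)"))
print(find_outside_parens("a (b (c) on the d) e"))
```

The last call is the nested case: the keyword sits at depth 1 there, so it is not reported.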
0
I don't think a regex will help much here, especially if you want to handle the more general case.
(?<=[^\(\)].{3})\bon the\b(?=.{3}[^\(\)])
Explanation:
(?<=[^\(\)].{3}) Positive Lookbehind - Assert that the regex below can be matched
    [^\(\)] match a single character not present in the list below
        \( matches the character ( literally
        \) matches the character ) literally
    .{3} matches any character (except newline)
        Quantifier: Exactly 3 times
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
on the matches the characters on the literally (case sensitive)
\b assert position at a word boundary (^\w|\w$|\W\w|\w\W)
(?=.{3}[^\(\)]) Positive Lookahead - Assert that the regex below can be matched
    .{3} matches any character (except newline)
        Quantifier: Exactly 3 times
    [^\(\)] match a single character not present in the list below
        \( matches the character ( literally
        \) matches the character ) literally
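The pattern explained above can be tried against the sample lines from the question with a short script. The pattern is written here with balanced parentheses, and the list name `lines` is just an illustration. Because `.{3}` ties each check to a fixed distance from the keyword, the result depends on the exact spacing of every line; in particular "(my son is )on the beach" is still accepted, since the character four positions before "on" is "i".

```python
import re

# Fixed-distance lookbehind/lookahead pattern from the explanation above
regex = re.compile(r"(?<=[^\(\)].{3})\bon the\b(?=.{3}[^\(\)])")

lines = [
    "john is on the beach",
    "let me put this on the fridge",
    "he (my son) is on the beach",
    "arnold is on the road (to home)",
    "(my son is )on the beach",
    "john is at the beach",
    "bob is at the pool (berkeley)",
    "the spon (is on the table)",
]

for line in lines:
    # True means the pattern found "on the" somewhere in the line
    print(bool(regex.search(line)), line)
```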
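This regex no longer works if you generalize the problem to arbitrary text between the parentheses and an arbitrary search string. The obstacle is length: the lookbehind would have to cover a span that depends on both the parenthesized text and the search string, and Python's re module only accepts fixed-width lookbehind patterns, so an unbounded quantifier such as * or + is not allowed there.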
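The fixed-width restriction is easy to observe in Python's built-in `re` module. In this minimal sketch the pattern is a made-up example of a variable-length lookbehind, and the flag name `compiled` is illustrative only.

```python
import re

try:
    # Variable-length lookbehind: ".*" has no fixed width
    re.compile(r"(?<=\(.*)on the")
    compiled = True
except re.error as exc:
    compiled = False
    print("rejected:", exc)
```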
My regex uses positive lookahead and lookbehind. Negative lookarounds give the same result, but the problem remains.
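For illustration, a negative-lookaround variant might look like the sketch below. It is not the regex from this answer, and it rests on two assumptions: parentheses are balanced and never nested, and a keyword directly after ")" should be rejected, as in the question's fifth "should not match" example. The variable-length part sits in the lookahead, which Python allows, but nested or unbalanced parentheses would still break it.

```python
import re

# Assumption: parentheses are balanced and never nested.
# (?<!\))        the keyword is not directly preceded by ")"
# (?![^()]*\))   the next parenthesis ahead, if any, is not a closing one
outside = re.compile(r"(?<!\))\bon the\b(?![^()]*\))")

lines = [
    "john is on the beach",
    "let me put this on the fridge",
    "he (my son) is on the beach",
    "arnold is on the road (to home)",
    "(my son is )on the beach",
    "john is at the beach",
    "bob is at the pool (berkeley)",
    "the spon (is on the table)",
]

for line in lines:
    # True means "on the" was found outside parentheses
    print(bool(outside.search(line)), line)
```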
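Suggestion: write a small piece of Python code that checks each line for the text you are looking for and confirms it is not inside parentheses, because a regex alone will not handle the general case.
Example: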
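import re

mystr = 'on the'
# Assumed sample input, taken from the question's examples
mylist = ['john is on the beach', 'he (my son) is on the beach',
          '(my son is )on the beach', 'the spon (is on the table)']
data = '\n'.join(mylist)

# Unwanted occurrences, which are easy to define with a regex:
# mystr inside parentheses, or mystr directly after a ")"
target = re.escape(mystr)
unwanted = re.findall(r'\([^()\n]*' + target + r'[^()\n]*\)|\)' + target, data)

# Drop lines containing an unwanted occurrence (build a new list
# rather than removing items from mylist while iterating over it)
kept = [line for line in mylist if not any(item in line for item in unwanted)]

# Look for what you want in the remaining lines
for line in kept:
    if mystr in line:
        print(line)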
Where:
mylist: a list containing all the lines you want to search through.
mystr: the string you want to find.
data: the full text to scan for unwanted occurrences, as a single string.
Hope this helps.