Find a string between two substrings, where the end of one match is the start of the next
I have a string like this:
...<p><noop><fademusic:23,0><26:1><wait:30> <speed:10><30:2><5D:1><color:3>August 3, 9:47 AM<b>District Court<b>Defendant Lobby No. 2<color:0><p><hidetextbox:1><5D:0> <speed:255><music:8,0><wait:30><26:0><bgcolor:513,1,31><wait:7> <person:0,0,0><bg:2><bgcolor:258,1,31><wait:15><wait:30><hidetextbox:0> <name:512><shake:30,0><color:2>(Boy am I nervous!)<color:0><p> <hidetextbox:1><wait:45><name:1792><hidetextbox:0><bgcolor:769,8,31> Wright!<p>...
What I need to do: find everything between the <p> tags. (Note that each closing tag is also the opening tag for the next piece of content.)
My code:
...
filetext = open(fn).read()
tag = '<p>'
result = re.findall(tag+"(.*?)"+tag,filetext,re.DOTALL)
print(result)
...
Expected output:
['<noop><fademusic:23,0><26:1><wait:30>\n<speed:10><30:2><5D:1><color:3>August 3, 9:47 AM<b>District \nCourt<b>Defendant Lobby No. 2<color:0>', '<hidetextbox:1><5D:0>\n<speed:255><music:8,0><wait:30><26:0><bgcolor:513,1,31><wait:7>\n<person:0,0,0><bg:2><bgcolor:258,1,31><wait:15><wait:30><hidetextbox:0>\n<name:512><shake:30,0><color:2>(Boy am I nervous!)<color:0>', '\n<hidetextbox:1><wait:45><name:1792><hidetextbox:0><bgcolor:769,8,31>\nWright!']
Actual output:
['<noop><fademusic:23,0><26:1><wait:30>\n<speed:10><30:2><5D:1><color:3>August 3, 9:47 AM<b>District \nCourt<b>Defendant Lobby No. 2<color:0>', '\n<hidetextbox:1><wait:45><name:1792><hidetextbox:0><bgcolor:769,8,31>\nWright!']
2 Answers
1
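I made a small change to your code: I used a lookahead assertion, (?=...), which matches either the next <p> tag or the end of the string ($). A lookahead checks that the tag is there without consuming it, so the same <p> can also open the next match. As a result, the pattern captures text up to the next <p> tag if there is one, and up to the end of the string otherwise. Here is the updated code: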
import re
filetext = open(fn).read()
tag = '<p>'
result = re.findall(tag + "(.*?)(?=" + tag + "|$)", filetext, re.DOTALL)
print(result)
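To see why the lookahead matters, here is a minimal sketch on a shortened, made-up sample string (not the data from the question). It compares the original consuming pattern with the lookahead version, both with and without the |$ alternative. The use of re.escape is an added precaution, not part of the answer above.

```python
import re

# Shortened, made-up sample; the real input comes from the file in the question.
text = "<p>one<p>two<p>three<p>"
tag = re.escape("<p>")  # precaution in case the tag ever contains regex metacharacters

# Original pattern: the closing <p> is consumed, so the search for the
# next match starts after it and every other section is skipped.
consuming = re.findall(tag + "(.*?)" + tag, text, re.DOTALL)

# Lookahead pattern: the closing <p> is only checked, not consumed,
# so it is still available as the opening tag of the next match.
lookahead = re.findall(tag + "(.*?)(?=" + tag + ")", text, re.DOTALL)

# With the "|$" alternative, a <p> at the very end of the string
# produces one extra, empty match.
with_end = re.findall(tag + "(.*?)(?=" + tag + "|$)", text, re.DOTALL)

print(consuming)  # ['one', 'three']
print(lookahead)  # ['one', 'two', 'three']
print(with_end)   # ['one', 'two', 'three', '']
```

If the input can end with a <p> tag, the |$ variant returns a trailing empty string, which may need to be filtered out. If it never contains text after the last tag, the plain lookahead is enough.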
0
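You don't actually need the re module; str.split('<p>') is enough. If your string starts or ends with <p>, the result will contain an empty string at that end. Here is one way to deal with that: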
s = '<p><noop><fademusic:23,0><26:1><wait:30> <speed:10><30:2><5D:1><color:3>August 3, 9:47 AM<b>District Court<b>Defendant Lobby No. 2<color:0><p><hidetextbox:1><5D:0> <speed:255><music:8,0><wait:30><26:0><bgcolor:513,1,31><wait:7> <person:0,0,0><bg:2><bgcolor:258,1,31><wait:15><wait:30><hidetextbox:0> <name:512><shake:30,0><color:2>(Boy am I nervous!)<color:0><p> <hidetextbox:1><wait:45><name:1792><hidetextbox:0><bgcolor:769,8,31> Wright!<p>'
result = s.split('<p>')
# Drop an empty first and/or last element left by a leading/trailing <p>
for n in (0, -1):
    if result and not result[n]:
        del result[n]
print(result)
Output:
['<noop><fademusic:23,0><26:1><wait:30> <speed:10><30:2><5D:1><color:3>August 3, 9:47 AM<b>District Court<b>Defendant Lobby No. 2<color:0>', '<hidetextbox:1><5D:0> <speed:255><music:8,0><wait:30><26:0><bgcolor:513,1,31><wait:7> <person:0,0,0><bg:2><bgcolor:258,1,31><wait:15><wait:30><hidetextbox:0> <name:512><shake:30,0><color:2>(Boy am I nervous!)<color:0>', ' <hidetextbox:1><wait:45><name:1792><hidetextbox:0><bgcolor:769,8,31> Wright!']
If you don't want any empty strings at all (for example, 'abc<p><p>def' would return ['abc', '', 'def']), you can use:
result = [n for n in s.split('<p>') if n]
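As a rough sketch, the two split variants can be wrapped in one small helper. The name split_on_tag and its drop_empty parameter are made up for illustration, and the sample strings are invented. Note that, unlike the regex approach, split also keeps any text before the first tag and after the last one.

```python
def split_on_tag(text, tag="<p>", drop_empty=True):
    """Split text on a literal tag, optionally discarding empty pieces.

    Hypothetical helper for illustration; not part of any library.
    """
    parts = text.split(tag)
    if drop_empty:
        parts = [part for part in parts if part]
    return parts


print(split_on_tag("abc<p><p>def"))                    # ['abc', 'def']
print(split_on_tag("abc<p><p>def", drop_empty=False))  # ['abc', '', 'def']
print(split_on_tag("<p>one<p>two<p>"))                 # ['one', 'two']
print(split_on_tag("head<p>one<p>tail"))               # ['head', 'one', 'tail']
```

If the text outside the outermost tags should be excluded, as in the question's expected output, the first and last elements would need to be dropped separately, or the regex approach used instead.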