Capturing the last occurrence of a tag

0 votes

4 answers

2355 views

Asked 2025-04-16 00:22

My text is in this format:

<Story>
 <Sentence id="1"> some text </Sentence>   
 <Sentence id="2"> some text </Sentence>   
 <Sentence id="3"> some text </Sentence>

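My task is to insert a closing </Story> tag after the last </Sentence>. In the text, every </Sentence> is followed by three spaces. I tried to capture the last </Sentence> with the regular expression </Sentence>(?!.*<Sentence), and I also used the re.DOTALL flag, but it did not work.

The actual code I used is: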
line = re.sub(re.compile('</Sentence>(?!.*<Sentence)',re.DOTALL),'</Sentence></Story>',line)

Please help. Thanks.

Tags: regular expressions, text processing, re module, programming questions, tag insertion

4 Answers

Why not match all of the <Sentence> elements (however many there are) and put them back with a group reference?

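# Python's re.sub writes the whole match as \g<0> and group 1 as \1 ($0 and $1 are not Python syntax).
re.sub(r'(?:(\r?\n) *<Sentence.*?</Sentence> *)+',
       r'\g<0>\1</Story>',
       line)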
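
To make the idea above concrete, here is a minimal runnable sketch. It assumes the whole document is held in one string (called `text` here, a name chosen for illustration) rather than being processed line by line, and it uses Python's replacement syntax, where `\g<0>` is the whole match and `\1` is the captured newline. The helper name `close_story` is hypothetical.

```python
import re

# Sample input from the question: <Story> is opened but never closed.
text = (
    '<Story>\n'
    ' <Sentence id="1"> some text </Sentence>   \n'
    ' <Sentence id="2"> some text </Sentence>   \n'
    ' <Sentence id="3"> some text </Sentence>'
)

# A run of one or more consecutive <Sentence> lines. Group 1 ends up holding
# the newline that preceded the last sentence in the run.
SENTENCE_RUN = re.compile(r'(?:(\r?\n) *<Sentence.*?</Sentence> *)+')


def close_story(s):
    """Append </Story> on its own line after the first run of sentences."""
    # \g<0> re-emits the matched run unchanged; \1 re-uses its line ending.
    return SENTENCE_RUN.sub(r'\g<0>\1</Story>', s, count=1)


result = close_story(text)
print(result)
```

Because group 1 sits inside a repeated group, it keeps only the last newline it matched, which is enough to put the closing tag on its own line with the same line ending as the input. The pattern requires a newline before the first <Sentence>, so it will not match a document that starts directly with a sentence.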

Answered 2025-04-16 by Python大师


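You really should use a parser such as BeautifulSoup for this task. BeautifulSoup can handle badly formed HTML/XML and tries to make sense of it. Your code could look like this (I am assuming there are other tags before and after your broken Story tag; otherwise you should follow David's comment):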

from BeautifulSoup import BeautifulStoneSoup

html = '''
<Document>
<PrevTag></PrevTag>
<Story>
 <Sentence id="1"> some text </Sentence>   
 <Sentence id="2"> some text </Sentence>   
 <Sentence id="3"> some text </Sentence>
<EndTag></EndTag>
</Document> 
'''
# Parse the document:
soup = BeautifulStoneSoup(html)

Look at how BeautifulSoup parsed it:

print soup.prettify()

#<document>
# <prevtag>
# </prevtag>
# <story>
#  <sentence id="1">
#   some text
#  </sentence>
#  <sentence id="2">
#   some text
#  </sentence>
#  <sentence id="3">
#   some text
#  </sentence>
#  <endtag>
#  </endtag>
# </story>
#</document>

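Notice that BeautifulSoup closed the Story tag only just before closing its enclosing tag (Document), so the EndTag element ended up inside Story. You therefore need to move the closing tag so that it sits right after the last sentence.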

# Find the last sentence:
last_sentence = soup.findAll('sentence')[-1]

# Find the Story tag:
story = soup.find('story')

# Move all tags after the last sentence outside the Story tag:
sib = last_sentence.nextSibling
while sib:
    story.parent.append(sib.extract())
    sib = last_sentence.nextSibling

print soup.prettify()

#<document>
# <prevtag>
# </prevtag>
# <story>
#  <sentence id="1">
#   some text
#  </sentence>
#  <sentence id="2">
#   some text
#  </sentence>
#  <sentence id="3">
#   some text
#  </sentence>
# </story>
# <endtag>
# </endtag>
#</document>

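The final result should be exactly what you want. Note that this code assumes there is only one Story in the document; if there is more than one, it may need slight modification. Good luck!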

Answered 2025-04-16 by Python大师


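If the same code generates the whole file, use a library designed for XML to generate it, so that all the tags are nested correctly. If not, fix the code that generates the file so that it produces valid XML.

Regular expressions and XML do not go well together.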
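
As an illustration of generating the file with an XML library, here is a minimal sketch using the standard library's `xml.etree.ElementTree`. The `sentences` list is a made-up stand-in, since the thread does not show how the real file is produced; the element and attribute names are taken from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical input data; the real generator's data source is not shown.
sentences = ["some text", "some text", "some text"]

# Build the tree in memory. Closing tags are written by the serializer,
# so <Story> cannot be left unclosed.
story = ET.Element("Story")
for number, body in enumerate(sentences, start=1):
    sentence = ET.SubElement(story, "Sentence", id=str(number))
    sentence.text = " %s " % body

# encoding="unicode" returns a str instead of bytes.
xml_text = ET.tostring(story, encoding="unicode")
print(xml_text)
```

The output is a single line without the indentation used in the question, but it parses as well-formed XML, which is the property the regex approach was trying to restore after the fact.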

Answered 2025-04-16 by Python大师
