Extracting text from nested XML tags with BeautifulSoup in Python

Asked 2025-04-17 06:49

I am trying to extract text from nested tags. For example, the XML looks like this:

<thread id = 1_1>
  <post id = 1>
    <title>
      <ne>MediaPortal</ne> Install Guide
    </title>
    <content>
      <ne>MediaPortal</ne> Install Guide 0. Introduction and pre-requisites 
      <ne>MediaPortal</ne> is an open-source and free full-fledged <ne>HTPC</ne>
      front-end. It does everything you can ask for in a media center: video 
      playback, music playback, photo viewing, weather, TV tuning and recording, 
      etc. It has wide community support and thanks to it's excellent plug-in 
      and  skinning framework, there are lots of community-developed extensions 
      you can  pick and choose to make it your own. It is far more configurable 
      than <ne>Windows Media Center</ne>, and it works out-of-the-box with the 
      <ne>MCE</ne> remote. And because it provides so much more configuration 
      some find it a daunting task to install and configure. Therefore, this 
      guide will help alleviate some of that burden and help get a 
      <ne>MediaPortal</ne> installation up &amp; running. This guide is not 
      intended to replace the wonderful <ne>MediaPortal</ne> documentation, but 
      rather to introduce the AVS community to <ne>MediaPortal</ne> and provide
      a quick and easy set-up guide. If you need more details on configuration
    </content>
  </post>
</thread>

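I need to extract the data inside the tags and save it to a separate file. I can already do that, after which I extract the tags from the Beautiful Soup object. Now I want to extract the text from these tags and write it to a separate file as well. Any suggestions on how to do this would be appreciated.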

After extracting the tags from the soup object, if I run

for title in soup.find_all('title'):
    print(title.string)

then the console shows None for the title tags that held nested tags before the extraction.

1 Answer

From the BeautifulSoup documentation:

For your convenience, if a tag has only one child node,
and that child node is a string, the child node is made
available as tag.string, as well as tag.contents[0].
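
The quoted rule can be checked with a small sketch. It assumes Python 3 with the bs4 package and the built-in "html.parser" backend; the tag names `plain` and `mixed` are made up for illustration and do not come from the question's data.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one tag with a single string child,
# and one tag that mixes a nested tag with trailing text.
markup = "<plain>only text</plain><mixed><ne>X</ne> tail</mixed>"
soup = BeautifulSoup(markup, "html.parser")

single = soup.find("plain").string  # exactly one string child
nested = soup.find("mixed").string  # a tag plus a string: no single string

print(repr(single))
print(repr(nested))
```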

However, in your case:

>>> t = soup.find('title')
>>> t
<title><ne>MediaPortal</ne> Install Guide</title>

So in your case you cannot use tag.string. You can still use tag.contents or tag.text, though:

>>> t.contents
[<ne>MediaPortal</ne>, u' Install Guide']
>>> t.text
u'MediaPortal Install Guide'
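
To tie this back to the goal of saving text to separate files, here is a minimal sketch, assuming Python 3 with bs4 and the built-in "html.parser" backend (for real XML input, BeautifulSoup's lxml-based "xml" parser may be the better fit). The shortened sample markup and the output file names `entities.txt` and `content.txt` are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the question's data.
xml = """<post id="1"><content><ne>MediaPortal</ne> is an open-source
<ne>HTPC</ne> front-end.</content></post>"""

soup = BeautifulSoup(xml, "html.parser")

# Text of every <ne> tag, one entry per tag.
entities = [ne.get_text(strip=True) for ne in soup.find_all("ne")]

# All text under <content>, including text inside nested tags;
# each piece is stripped and the pieces are joined with a single space.
content_text = soup.find("content").get_text(" ", strip=True)

# Output file names are placeholders.
with open("entities.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(entities))
with open("content.txt", "w", encoding="utf-8") as f:
    f.write(content_text)
```

Unlike tag.string, get_text() collects every string under the tag, so it does not matter whether the nested tags are still present when it is called.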

Answered 2025-04-17 by Python大师
