Extracting text from nested XML tags in Python with BeautifulSoup
I am trying to extract text from nested tags. Say the XML looks like this:
<thread id = 1_1>
<post id = 1>
<title>
<ne>MediaPortal</ne> Install Guide
</title>
<content>
<ne>MediaPortal</ne> Install Guide 0. Introduction and pre-requisites
<ne>MediaPortal</ne> is an open-source and free full-fledged <ne>HTPC</ne>
front-end. It does everything you can ask for in a media center: video
playback, music playback, photo viewing, weather, TV tuning and recording,
etc. It has wide community support and thanks to it's excellent plug-in
and skinning framework, there are lots of community-developed extensions
you can pick and choose to make it your own. It is far more configurable
than <ne>Windows Media Center</ne>, and it works out-of-the-box with the
<ne>MCE</ne> remote. And because it provides so much more configuration
some find it a daunting task to install and configure. Therefore, this
guide will help alleviate some of that burden and help get a
<ne>MediaPortal</ne> installation up & running. This guide is not
intended to replace the wonderful <ne>MediaPortal</ne> documentation, but
rather to introduce the AVS community to <ne>MediaPortal</ne> and provide
a quick and easy set-up guide. If you need more details on configuration
</content>
</post>
</thread>
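I need to extract the data in the tags and save it to a separate file. I have managed that, and I then extract the tags from the Beautiful Soup object. Now I want to extract the text from those tags and put it in a separate file. Please suggest how I could do this.
After extracting the tags from the soup object, if I run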
for title in soup.find('title'):
    print title.string
then the console prints None for title tags that contained a nested tag before the extraction.
1 Answer
From the BeautifulSoup documentation:
For your convenience, if a tag has only one child node,
and that child node is a string, the child node is made
available as tag.string, as well as tag.contents[0].
In your case, however:
>>> t = soup.find('title')
>>> t
<title><ne>MediaPortal</ne> Install Guide</title>
So in your case you cannot use tag.string, because the title tag has more than one child node. You can still use tag.contents or tag.text:
>>> t.contents
[<ne>MediaPortal</ne>, u' Install Guide']
>>> t.text
u'MediaPortalInstall Guide'
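For the end goal of getting the text into a separate file, here is a minimal sketch. It assumes Beautiful Soup 4 (the `bs4` package) on Python 3 with the built-in "html.parser" backend, which is not necessarily the version used above. The sample is a shortened copy of the question's content element, and `extract_text` and `save_lines` are hypothetical helper names, not part of the library. In Beautiful Soup 4, `get_text()` does not strip the individual strings by default, so the spacing may differ from the `t.text` output shown above.

```python
from bs4 import BeautifulSoup

# Shortened sample based on the question's <content> element.
# Assumption: Beautiful Soup 4 with the built-in "html.parser" backend.
SAMPLE = (
    "<post><content><ne>MediaPortal</ne> is an open-source and free "
    "full-fledged <ne>HTPC</ne> front-end.</content></post>"
)


def extract_text(markup, tag_name):
    """Return the text of every `tag_name` element, nested tags included."""
    soup = BeautifulSoup(markup, "html.parser")
    return [tag.get_text() for tag in soup.find_all(tag_name)]


def save_lines(lines, path):
    """Write one extracted string per line to `path` (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")


soup = BeautifulSoup(SAMPLE, "html.parser")
content = soup.find("content")

# <content> has several children (tags and strings), so .string is None.
print(content.string)

# get_text() joins the strings of all descendants instead.
print(content.get_text())

# Text of each nested <ne> tag, one list entry per tag.
print(extract_text(SAMPLE, "ne"))
```

To write the results out, something like `save_lines(extract_text(SAMPLE, "ne"), "ne.txt")` would do, where the file name is only a placeholder.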