Python web scraping involving HTML tags with attributes

8 votes

4 answers

9579 views

Asked 2025-04-15 14:08

I'm trying to write a web scraper that extracts author information from a page listing publications. The basic structure of the page is:

<html>
<body>
<div id="container">
<div id="contents">
<table>
<tbody>
<tr>
<td class="author">####I want whatever is located here ###</td>
</tr>
</tbody>
</table>
</div>
</div>
</body>
</html>

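So far I have been trying to do this with BeautifulSoup and lxml, but I'm not sure how to handle the two div tags and the td tag, because they carry attributes. I'm also unsure whether I should rely mainly on BeautifulSoup, mainly on lxml, or on a combination of the two. How should I go about this?

At the moment my code looks like this: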

    import re
    import urllib2,sys
    import lxml
    from lxml import etree
    from lxml.html.soupparser import fromstring
    from lxml.etree import tostring
    from lxml.cssselect import CSSSelector
    from BeautifulSoup import BeautifulSoup, NavigableString

    address='http://www.example.com/'
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html)
    html=soup.prettify()
    html=html.replace('&nbsp', '&#160')
    html=html.replace('&iacute','&#237')
    root=fromstring(html)

I realize that many of the import statements are probably redundant; I simply copied what I currently have in another source file.

Additional note: the page I want to scrape contains several of these td tags.

Tags: lxml, data extraction, beautifulsoup, web scraping, div tag attribute handling, HTML tags, td tags

4 Answers

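The lxml library is now the standard tool for parsing HTML in Python. Its interface can feel awkward at first, but it is very powerful.

You should let the library deal with the XML peculiarities, such as escaped & entities.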

import lxml.html

html = """<html><body><div id="container"><div id="contents"><table><tbody><tr>
          <td class="author">####I want whatever is located here, eh? &iacute; ###</td>
          </tr></tbody></table></div></div></body></html>"""

root = lxml.html.fromstring(html)
# cssselect() takes a CSS selector; recent lxml releases need the separate cssselect package for it
tds = root.cssselect("div#contents td.author")

print(tds)           # a list with one element, e.g. [<Element td at 84ee2cc>]
print(tds[0].text)   # what you want, including the 'í'
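
The manual `.replace` calls for `&nbsp` and `&iacute` in the question are work that a parser already does for you. As a small illustration of that decoding step on its own, here is a sketch that assumes Python 3 and uses only the standard library's `html.unescape` rather than lxml; the sample string is made up for the example.

```python
import html

# Made-up sample text containing the kinds of entities mentioned in the question.
raw = "Jos&eacute; Garc&iacute;a&nbsp;L&oacute;pez"

# html.unescape() converts named and numeric character references to Unicode.
decoded = html.unescape(raw)

print(decoded)
# &nbsp; becomes the non-breaking space U+00A0, not an ordinary space.
print(decoded == "Jos\u00e9 Garc\u00eda\u00a0L\u00f3pez")  # True
```

lxml and BeautifulSoup perform the same kind of decoding while parsing, so the text you read from an element normally already contains the real characters.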

Answered 2025-04-15 by Python大师

Alternatively, you could use pyquery, since BeautifulSoup 3 is no longer actively maintained; see http://www.crummy.com/software/BeautifulSoup/3.1-problems.html

First, install pyquery with:

easy_install pyquery

Then your script can be as simple as this:

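from pyquery import PyQuery
d = PyQuery(url='http://mywebpage/')  # url= makes pyquery fetch the page; a bare string is parsed as markup
# Iterating a PyQuery object yields lxml elements, so .text is an attribute, not a method.
allauthors = [td.text for td in d('td.author')]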

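pyquery uses a jQuery-like CSS selector syntax, which I find more intuitive than BeautifulSoup's. It uses lxml underneath, so it is much faster than BeautifulSoup. BeautifulSoup, however, is pure Python, so it also runs on Google's App Engine.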

Answered 2025-04-15 by Python大师

I don't quite see why you need to worry about the div tags. You can simply do this:

# html holds the page source, as in the question
soup = BeautifulSoup(html)
thetd = soup.find('td', attrs={'class': 'author'})  # first td whose class is "author"
print(thetd.string)

With the HTML you provided, running this code prints:

####I want whatever is located here ###

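That looks like exactly what you want. If this very simple code does not meet your needs, perhaps you could spell out more clearly what you need: for example, if there are several td tags whose class is author, which ones matter (all of them? only some? which?), what should happen if no such tag exists, and so on. From one simple example plus an excess of code, it is hard to infer your exact requirements ;-).

Edit: if, as the asker's latest comment says, there are several such td tags, one per author: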

thetds = soup.findAll('td', attrs={'class': 'author'})  # every matching td, in document order
for thetd in thetds:
    print(thetd.string)

...that is, it really isn't that hard! -)
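
The open questions in this answer (several matching cells, or none at all) can be made concrete with a small sketch. To keep it runnable without third-party packages it swaps in the standard library's `html.parser` in place of BeautifulSoup and assumes Python 3. The names `AuthorCellParser` and `extract_authors` and the sample markup are hypothetical, and the sketch assumes author cells do not contain nested tables.

```python
from html.parser import HTMLParser


class AuthorCellParser(HTMLParser):
    """Collect the text of every <td class="author"> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self.authors = []    # finished cell texts, in document order
        self._parts = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "author" in classes:
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._parts is not None:
            self.authors.append("".join(self._parts).strip())
            self._parts = None


def extract_authors(html_text):
    """Return a list of author strings; the list is empty if no cell matches."""
    parser = AuthorCellParser()
    parser.feed(html_text)
    parser.close()
    return parser.authors


# Made-up sample with two author cells, one of them containing inline markup.
sample = """<table><tbody>
<tr><td class="author">Ana Mar&iacute;a</td><td class="title">Paper A</td></tr>
<tr><td class="author"><b>Bob</b> Smith</td></tr>
</tbody></table>"""

authors = extract_authors(sample)
print(authors)                                   # ['Ana María', 'Bob Smith']
print(extract_authors("<p>no table here</p>"))   # []
```

Returning an empty list for a page without author cells lets the caller decide whether that is an error. With `soup.find`, by contrast, a missing tag gives `None`, and `thetd.string` would then raise an `AttributeError`.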

Answered 2025-04-15 by Python大师
