Using Beautiful Soup to parse a URL and get data from other URLs

30 votes

4 answers

61188 views

Asked 2025-04-16 08:37

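I need to parse a URL to get a list of URLs that link to detail pages. Then I need to get all the information from each of those detail pages. I'm doing it this way because the detail page URLs don't increment in a regular pattern and they change, while the event listing page stays the same.

Basically: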

example.com/events/
    <a href="http://example.com/events/1">Event 1</a>
    <a href="http://example.com/events/2">Event 2</a>

example.com/events/1
    ...some detail stuff I need

example.com/events/2
    ...some detail stuff I need

4 Answers

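Use urllib2 to fetch the page, then use Beautiful Soup to extract the list of links. You could also try scraperwiki.com.

Edit:

A recent discovery: using BeautifulSoup through lxml works much better than using BeautifulSoup on its own.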

from lxml.html.soupparser import fromstring

It lets you use dom.cssselect('your selector'), which is a lifesaver. Just make sure you have a good version of BeautifulSoup installed; 3.2.1 works well.

dom = fromstring('<html... ...')
navigation_links = [a.get('href') for a in dom.cssselect('#navigation a')]

Answered 2025-04-16 by Python大师


For the next person who comes across this: BeautifulSoup has moved on to version 4, since version 3 is no longer being updated.

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

To use it in Python:

from bs4 import BeautifulSoup
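
As a rough illustration of the 4.x package in use, here is a minimal Python 3 sketch. The markup in HTML is made up for the example and is not from any real page. It uses find_all, which is the 4.x spelling of findAll, and select, which takes a CSS selector much like the cssselect call mentioned in the other answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, standing in for a downloaded page.
HTML = """
<div id="navigation">
  <a href="/events/1">Event 1</a>
  <a href="/events/2">Event 2</a>
</div>
<a href="/about">About</a>
"""

# Name the parser explicitly so the result does not depend on
# which optional parsers happen to be installed.
soup = BeautifulSoup(HTML, "html.parser")

# Every anchor that carries an href attribute.
all_links = [a["href"] for a in soup.find_all("a", href=True)]

# Only the anchors inside the element with id="navigation".
navigation_links = [a.get("href") for a in soup.select("#navigation a")]

print(all_links)
print(navigation_links)
```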

Answered 2025-04-16 by Python大师


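# Python 2 with BeautifulSoup 3
import urllib2
from BeautifulSoup import BeautifulSoup

# Download the page and parse it
page = urllib2.urlopen('http://yahoo.com').read()
soup = BeautifulSoup(page)
# prettify() returns a formatted string; the result is not used here
soup.prettify()
# Print the href of every anchor that has one
for anchor in soup.findAll('a', href=True):
    print anchor['href']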

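This will give you a list of URLs. You can then go through those URLs one at a time and pull the data out of each page.

For example: inner_div = soup.findAll("div", {"id": "y-shade"}). Have a look at the BeautifulSoup tutorial to learn more.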
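
One way the two steps described above could fit together, sketched in Python 3 with bs4 rather than the urllib2 and BeautifulSoup 3 combination used in this answer. The names fetch, scrape_details and FAKE_SITE are made up for illustration, and the canned pages stand in for real HTTP responses so the snippet runs without network access.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch(url):
    """Download a page and return its raw bytes."""
    with urlopen(url) as response:
        return response.read()


def scrape_details(listing_url, fetch_page=fetch):
    """Return {detail_url: page_text} for every link on the listing page."""
    listing = BeautifulSoup(fetch_page(listing_url), "html.parser")
    details = {}
    for anchor in listing.find_all("a", href=True):
        # Resolve relative hrefs against the listing URL.
        detail_url = urljoin(listing_url, anchor["href"])
        detail = BeautifulSoup(fetch_page(detail_url), "html.parser")
        # Here the whole visible text is kept; a real scraper would
        # pick out specific elements instead.
        details[detail_url] = detail.get_text(" ", strip=True)
    return details


# Offline demonstration: canned pages instead of real HTTP requests.
FAKE_SITE = {
    "http://example.com/events/": (
        '<a href="http://example.com/events/1">Event 1</a>'
        '<a href="/events/2">Event 2</a>'
    ),
    "http://example.com/events/1": "<p>Details for event 1</p>",
    "http://example.com/events/2": "<p>Details for event 2</p>",
}

results = scrape_details("http://example.com/events/",
                         fetch_page=FAKE_SITE.__getitem__)
for url, text in results.items():
    print(url, "->", text)
```

Passing the download function in as a parameter keeps the parsing logic testable; calling scrape_details with only a URL would use the real fetch instead.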

Answered 2025-04-16 by Python大师

