Crawling by following the next-page link

-1 votes

1 answer

731 views

Asked 2025-04-28 10:17

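I'm writing a crawler to extract recipe links from Wikibooks. Given my implementation below, how do I keep adding links until I reach the last page? Note: the link to the next page is titled "next 200".

The links can be found here: http://en.wikibooks.org/wiki/Category:Recipes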

def fetch_links(self, proxy):
    """Extracts filtered recipe href links

    Args:
      proxy: The configured proxy address.

    Raises:
      ValueError: If proxy is not a valid address.

    """
    if not self._valid_proxy(proxy):
        raise ValueError('invalid proxy address: {}'.format(proxy))
    self.browser.set_proxies({'http': proxy})
    page = self.browser.open(self.wiki_recipes)
    html = page.read()

    link_tags = SoupStrainer('a', href=True)
    soup = BeautifulSoup(html, parse_only=link_tags)
    recipe_hrefs = r'^\/wiki\/Cookbook:(?!recipes|table_of_contents).*$'
    return [link['href'] for link in soup.find_all(
        'a', href=re.compile(recipe_hrefs, re.IGNORECASE))]


1 Answer

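Following the approach I described in my comment, here is a code sample that uses urllib and re. You can reuse the same techniques in your own code.

Write a function that takes a URL as its parameter and call it first with the start URL. Inside it, use a regular expression to grab all the recipe links and append them to a global list. Then extract the "next 200" link and call the same function again with that link as the argument. Wrap the extraction in try/except: on the last page there is no "next 200" link, so the lookup raises IndexError, and at that point you output the list.

Since you have a class but didn't show its code, I'll skip all the class and proxy parts. Here goes: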

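#!/usr/bin/env python3

import html
import re
import urllib.request

base_url = 'http://en.wikibooks.org/wiki/Category:Recipes'
next_base = 'http://en.wikibooks.org'
recipes = []

# this is just a sample function
# you should handle your proxy logic here too
def get_links(url):
    # Wikimedia sites may reject urllib's default User-Agent, so send our own
    request = urllib.request.Request(
        url, headers={'User-Agent': 'recipe-crawler-example/0.1'})
    with urllib.request.urlopen(request) as response:
        content = response.read().decode('utf-8')
    # one-off regex: Cookbook hrefs, skipping Recipes and Table_of_Contents;
    # [^"]* stops at the closing quote instead of running to the end of the line
    links = re.findall(
        r'/wiki/Cookbook:(?!Recipes)(?!Table_of_Contents)[^"]*', content)

    recipes.extend(links)

    try:
        # again, a one-off regex: the href of the "next 200" link
        next_url = re.findall(r'href="([^"]*)"[^>]*>next 200', content)[0]
    except IndexError:
        # no "next 200" link means this was the last page
        print("all recipes fetched.")
        print(recipes)
        return
    # hrefs are HTML-escaped (&amp;), so unescape before requesting
    next_url = html.unescape(next_url)
    print("fetching next url: " + next_base + next_url)
    return get_links(next_base + next_url)

if __name__ == '__main__':
    print("start fetching...")
    get_links(base_url)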
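
The function above makes one nested call per page and collects results in a global list. If you would rather avoid both, here is a minimal iterative sketch of the same idea, written for Python 3. The names `crawl`, `fetch`, `max_pages`, `RECIPE_RE` and `NEXT_RE` are my own choices for illustration, and the patterns assume the page markup looks like the one-off expressions above expect. `fetch` can be any callable that takes a URL and returns the HTML as a string, so you can plug in your proxy-aware browser there.

```python
import html
import re
from urllib.parse import urljoin

# Hypothetical patterns, modelled on the one-off expressions above.
RECIPE_RE = re.compile(
    r'href="(/wiki/Cookbook:(?!Recipes|Table_of_Contents)[^"]*)"')
NEXT_RE = re.compile(r'href="([^"]*)"[^>]*>next 200')


def crawl(start_url, fetch, max_pages=100):
    """Collect recipe hrefs from start_url and each "next 200" page after it.

    fetch is any callable that takes a URL and returns the page HTML as str.
    max_pages is a safety cap so a markup change cannot cause an endless loop.
    """
    recipes = []
    seen = set()
    url = start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        content = fetch(url)
        recipes.extend(RECIPE_RE.findall(content))
        match = NEXT_RE.search(content)
        # hrefs are HTML-escaped, so turn &amp; back into & before following
        url = urljoin(url, html.unescape(match.group(1))) if match else None
    return recipes
```

You would call it as `crawl(base_url, fetch)`, where `fetch` wraps whatever you already use to download a page. The loop stops as soon as a page has no "next 200" link, which plays the same role as the IndexError branch above.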

Hope this gives you the techniques you need.

Answered 2025-04-28 by Python大师
