Recursive crawling with Python and Scrapy

12 votes

7 answers

7783 views

Asked 2025-04-16 13:13

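I'm using scrapy to crawl a website. The site shows 15 listings per page and then has a "next" button. My problem is that the request for the next page is issued before I have finished processing all of the listings on the current page. Here is the code for my spider: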

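import urlparse  # Python 2 standard library

# import paths as in Scrapy 0.x, which this code targets
from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

# MySiteLoader is the asker's own loader class (not shown here)
class MySpider(CrawlSpider):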
    name = 'mysite.com'
    allowed_domains = ['mysite.com']
    start_url = 'http://www.mysite.com/'

    def start_requests(self):
        return [Request(self.start_url, callback=self.parse_listings)]

    def parse_listings(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select('...')

        for listing in listings:
            il = MySiteLoader(selector=listing)
            il.add_xpath('Title', '...')
            il.add_xpath('Link', '...')

            item = il.load_item()
            listing_url = listing.select('...').extract()

            if listing_url:
                yield Request(urlparse.urljoin(response.url, listing_url[0]),
                              meta={'item': item},
                              callback=self.parse_listing_details)

        next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                                   'div[@class="next-link"]/a/@href').extract()
        if next_page_url:
            yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                          callback=self.parse_listings)


    def parse_listing_details(self, response):
        hxs = HtmlXPathSelector(response)
        item = response.request.meta['item']
        details = hxs.select('...')
        il = MySiteLoader(selector=details, item=item)

        il.add_xpath('Posted_on_Date', '...')
        il.add_xpath('Description', '...')
        return il.load_item()

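The lines shown below are the problem. As I mentioned above, they are executed before the spider has finished crawling the current page. On every page of the site, this means only 3 of the 15 listings get processed; the rest never reach the pipeline.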

        if next_page_url:
            yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                          callback=self.parse_listings)

This is my first spider, so it may well be a design flaw on my part. Is there a better way to do this?

data processing, web scraping, scrapy, crawler, asynchronous processing, list items, recursive crawling

7 Answers

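See the updated answer below, in the EDIT 2 section (updated October 6, 2017).

Is there a specific reason you are using yield? A function that uses yield returns a generator, and the Request object is only returned when .next() is called on that generator.

Change your yield statements to return statements and it should work as expected.

Here is an example of a generator: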

In [1]: def foo(request):
   ...:     yield 1
   ...:     
   ...:     

In [2]: print foo(None)
<generator object foo at 0x10151c960>

In [3]: foo(None).next()
Out[3]: 1
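
To make the difference between the two keywords concrete, here is a minimal Python 3 sketch with no Scrapy involved. The function names and URL strings are made up for illustration. It shows that a generator hands out one value per loop iteration as it is consumed, while a return inside the loop ends the function at the first iteration.

```python
def with_yield(urls):
    """Generator function: produces one value per URL, lazily."""
    for url in urls:
        yield "request:" + url


def with_return(urls):
    """Plain function: returns at the first URL and never reaches the rest."""
    for url in urls:
        return "request:" + url


urls = ["/a", "/b", "/c"]

gen = with_yield(urls)
first = next(gen)    # Python 3 spelling of gen.next()
rest = list(gen)     # consuming the generator yields the remaining values

only = with_return(urls)
```

Whoever iterates over the generator receives every value, so the choice between the two mainly decides how many values the caller can get out of a single call.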

Edit:

Change your def start_requests(self) function to use the follow parameter.

return [Request(self.start_url, callback=self.parse_listings, follow=True)]

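Edit 2:

As of Scrapy v1.4.0 (released 2017-05-18), the recommended approach is to use response.follow instead of creating scrapy.Request objects directly.

From the release notes:

There's a new response.follow method for creating requests; it is now a recommended way to create Requests in Scrapy spiders. This method makes it easier to write correct spiders; response.follow has several advantages over creating scrapy.Request objects directly:

it handles relative URLs;

it works properly with non-ascii URLs on non-UTF8 pages;

in addition to absolute and relative URLs it supports Selectors; for <a> elements it can also extract their href values.

So, for the OP above, change the code from: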

    next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                               'div[@class="next-link"]/a/@href').extract()
    if next_page_url:
        yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                      callback=self.parse_listings)

to:

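    # extract_first() returns the first matching href, or None when there is no next page
    next_page_url = response.xpath('descendant::div[@id="pagination"]/'
                                   'div[@class="next-link"]/a/@href').extract_first()
    if next_page_url is not None:
        yield response.follow(next_page_url, self.parse_listings)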
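
The relative-URL handling mentioned in the release notes comes down to joining the href against the URL of the page it was found on, which is what the earlier code did by hand with urlparse.urljoin. Here is a minimal Python 3 sketch using only the standard library. The page URL and hrefs are made up for illustration, and this is not Scrapy's own implementation.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page the links were found on
page_url = "http://www.mysite.com/listings/page1.html"

# An href relative to the current directory
sibling = urljoin(page_url, "page2.html")

# An href relative to the site root
rooted = urljoin(page_url, "/listings?page=2")

# An already absolute href comes back unchanged
absolute = urljoin(page_url, "http://www.mysite.com/other")
```

With response.follow this join happens for you, so the href can be passed along as it appears in the page.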

Answered 2025-04-16 by Python大师

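There are two ways to do this sequentially:

Define a listing_url list under the class.
Define listing_url inside the parse_listings() method.

The only difference between the two is wording. Also, assuming you need to collect listing URLs from five pages, put page=1 under the class as well.

In the parse_listings method, make only one request. Put all the data you need to keep track of into meta. In other words, use parse_listings only to parse the "front page".

Once you reach the end of the line, return your items. The whole process is sequential.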

class MySpider(CrawlSpider):
    name = 'mysite.com'
    allowed_domains = ['mysite.com']
    start_url = 'http://www.mysite.com/'

    listing_url = []  # queue of (URL, partially loaded item) pairs still to fetch
    page = 1          # number of the first listing page

    def start_requests(self):
        return [Request(self.start_url, meta={'page': self.page}, callback=self.parse_listings)]

    def parse_listings(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select('...')

        for listing in listings:
            il = MySiteLoader(selector=listing)
            il.add_xpath('Title', '...')
            il.add_xpath('Link', '...')

            # populate listing_url with each scraped URL and its partial item
            url = listing.select('...').extract()
            if url:
                self.listing_url.append((url[0], il.load_item()))

        next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                                   'div[@class="next-link"]/a/@href').extract()

        # now that the front page is done, move on to the first queued listing
        # and carry the page number and next_page_url along in the meta data
        meta = {'page': response.meta['page'],
                'next_page_url': next_page_url[0] if next_page_url else None}
        return self.next_request(response, meta)

    def next_request(self, response, meta):
        if self.listing_url:
            # more listings queued for this page: fetch them one at a time
            url, item = self.listing_url.pop(0)
            return Request(urlparse.urljoin(response.url, url),
                           meta={'page': meta['page'], 'item': item,
                                 'next_page_url': meta['next_page_url']},
                           callback=self.parse_listing_details)
        if meta['next_page_url'] and meta['page'] != 5:
            # loop back for more URLs to crawl
            return Request(urlparse.urljoin(response.url, meta['next_page_url']),
                           meta={'page': meta['page'] + 1},
                           callback=self.parse_listings)
        # reached the end of the pages to crawl: nothing more to request

    def parse_listing_details(self, response):
        hxs = HtmlXPathSelector(response)
        details = hxs.select('...')
        il = MySiteLoader(selector=details, item=response.meta['item'])

        il.add_xpath('Posted_on_Date', '...')
        il.add_xpath('Description', '...')
        yield il.load_item()

        # then continue with the next listing, or with the next page
        request = self.next_request(response, response.meta)
        if request is not None:
            yield request
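
The sequential order this answer aims for can be checked without any network access. Below is a minimal Python 3 sketch that assumes a made-up site layout (the SITE dictionary and the crawl_sequentially name are hypothetical). It simulates only the control flow of queueing a page's listings and draining that queue before moving to the next page; it is not Scrapy code.

```python
from collections import deque

# Hypothetical site: page number -> listing URLs found on that page
SITE = {1: ["/item/1", "/item/2"], 2: ["/item/3"], 3: ["/item/4", "/item/5"]}


def crawl_sequentially(site, max_pages):
    """Visit every listing of a page, one at a time, before paging on."""
    visited = []
    pending = deque()
    page = 1
    while page in site and page <= max_pages:
        visited.append("page %d" % page)
        pending.extend(site[page])    # queue this page's listings
        while pending:                # drain the queue before the next page
            visited.append(pending.popleft())
        page += 1
    return visited


order = crawl_sequentially(SITE, max_pages=2)
```

With max_pages=2 the third page is never visited, which mirrors the page limit used in the spider above.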

Answered 2025-04-16 by Python大师

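Scrape instead of spider?

Because your original problem requires repeatedly navigating a consecutive, repeating set of content rather than a tree of content of unknown size, you could use mechanize (http://wwwsearch.sourceforge.net/mechanize/) and beautifulsoup (http://www.crummy.com/software/BeautifulSoup/).

Below is an example of instantiating a browser with mechanize. Also, using br.follow_link(text="foo") means that, unlike the xpath in your example, the link is followed no matter where it sits in the page structure. So if the site updates its HTML, your script does not break, as long as the link text stays the same. This looser coupling saves you maintenance work. Here is an example: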

import cookielib  # Python 2 standard library

import mechanize

br = mechanize.Browser()
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# set all headers in a single assignment; assigning addheaders
# several times in a row would keep only the last list
br.addheaders = [
    ('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1'),
    ('Accept-Language', 'en-US'),
    ('Accept-Encoding', 'gzip, deflate'),
]
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.open("http://amazon.com")
# follow the link by its visible text rather than by its position in the markup
br.follow_link(text="Today's Deals")
print br.response().read()
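
To illustrate why matching on link text is less sensitive to layout changes, here is a minimal Python 3 sketch that uses only the standard library instead of mechanize. The LinkByText class, the find_link helper and both HTML snippets are made up for illustration. The same lookup finds the link in two differently structured pages.

```python
from html.parser import HTMLParser


class LinkByText(HTMLParser):
    """Remember the href of the first <a> whose visible text matches."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.href = None        # href of the first matching link, if any
        self._current = None    # href of the <a> element currently open
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            text = "".join(self._text).strip()
            if self.href is None and text == self.wanted:
                self.href = self._current
            self._current = None


def find_link(html, text):
    """Return the href of the link with the given text, or None."""
    parser = LinkByText(text)
    parser.feed(html)
    parser.close()
    return parser.href


old_layout = '<div id="nav"><a href="/home">Home</a><a href="/deals">Today\'s Deals</a></div>'
new_layout = '<ul><li><span><a class="x" href="/deals">Today\'s Deals</a></span></li></ul>'

old_href = find_link(old_layout, "Today's Deals")
new_href = find_link(new_layout, "Today's Deals")
```

A positional xpath written for the first layout would not match the second, whereas the text-based lookup returns the same href for both.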

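Also, the "next 15" href probably contains something that indicates the pagination, e.g. &index=15. If the total number of items across all pages is available on the first page, then: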

soup = BeautifulSoup(br.response().read())
totalItems = soup.findAll(id="results-count-total")[0].text
startVar =  [x for x in range(int(totalItems)) if x % 15 == 0]

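Then you just iterate over startVar, build each URL by appending the startVar value to it, open it with br.open(), and scrape the data. That way you do not have to programmatically "find" the next link on the page and click it to advance to the next page; you already know all the valid URLs. Keeping code-driven manipulation of the page to a minimum and focusing only on the data you need will speed up your extraction.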
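
The URL-building step described above could look roughly like this in Python 3. This is a sketch under stated assumptions: the base URL, the page_urls name and the index query parameter are hypothetical (the parameter follows the &index=15 example), and the real site may use a different scheme.

```python
def page_urls(base_url, total_items, page_size=15):
    """Return one URL per results page, using a hypothetical index offset."""
    return ["%s&index=%d" % (base_url, start)
            for start in range(0, total_items, page_size)]


# 40 items at 15 per page gives offsets 0, 15 and 30
urls = page_urls("http://www.example.com/search?q=foo", 40)
```

Each URL in the list can then be opened in turn, without looking for a "next" link on any page.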

Answered 2025-04-16 by Python大师
