Getting Scrapy to follow pagination recursively

Asked 2025-04-30 22:42

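I'm trying to use Scrapy to scrape data from this page. I can scrape the data on the page itself, but I also want to scrape the other pages (the ones reached through the "Next" link). Here is the relevant part of my code: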

def parse(self, response):
    item = TimemagItem()
    item['title']= response.xpath('//div[@class="text"]').extract()
    links = response.xpath('//h3/a').extract()
    crawledLinks=[]
    linkPattern = re.compile(r"^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

    for link in links:
        if linkPattern.match(link) and not link in crawledLinks:
            crawledLinks.append(link)
        yield Request(link, self.parse)

    yield item

I get the right information (the titles from the linked pages), but the spider never navigates to the next page. How do I tell Scrapy to navigate?

1 Answer

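Take a look at Scrapy's link extractor documentation. Link extractors are the proper way to tell your spider which links on a page to follow.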

Judging by the pages you want to crawl, I think you need two extraction rules. Below is a simple example spider with rules that fit your TIMES pages:

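import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


# Assumption: TimemagItem lives in the asker's project; a minimal stand-in is defined here.
class TimemagItem(scrapy.Item):
    title = scrapy.Field()


class TIMESpider(CrawlSpider):
    name = "time_spider"
    allowed_domains = ["time.com"]
    start_urls = [
        'http://search.time.com/results.html?N=45&Ns=p_date_range|1&Ntt=&Nf=p_date_range%7cBTWN+19500101+19500130'
    ]

    rules = (
        # Rule 1: follow each article link in the result list and parse it.
        Rule(LinkExtractor(restrict_xpaths=('//div[@class="tout"]/h3/a',)),
             callback='parse_item'),
        # Rule 2: follow the "Next" link so the rules are applied to every results page.
        Rule(LinkExtractor(restrict_xpaths=('//a[@title="Next"]',)),
             follow=True),
    )

    # Not named `parse`: CrawlSpider uses `parse` itself to apply the rules.
    def parse_item(self, response):
        item = TimemagItem()
        # Text of the first <title> element on the article page.
        item['title'] = response.xpath('//title/text()').get()
        return item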
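
A related detail, sketched here with the standard library only so that it runs without a crawl: calling `.extract()` on `//h3/a` returns the HTML of each `<a>` element rather than its URL, so a pattern anchored on `http` cannot match those strings. What a request needs is the `href` value resolved against the page URL. In Scrapy that would roughly correspond to selecting `//h3/a/@href` and passing each value through `response.urljoin()`. The class name, helper name, sample HTML, and `example.com` URL below are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class H3LinkParser(HTMLParser):
    """Collect href values of <a> tags that appear inside <h3> elements."""

    def __init__(self):
        super().__init__()
        self.h3_depth = 0
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.h3_depth += 1
        elif tag == "a" and self.h3_depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "h3" and self.h3_depth > 0:
            self.h3_depth -= 1


def absolute_links(page_url, html):
    """Return de-duplicated absolute URLs for the <h3> links, in page order."""
    parser = H3LinkParser()
    parser.feed(html)
    seen = []
    for href in parser.hrefs:
        url = urljoin(page_url, href)  # resolve relative hrefs against the page
        if url not in seen:
            seen.append(url)
    return seen


PAGE_URL = "http://example.com/results.html"
HTML = (
    '<h3><a href="/article/1">One</a></h3>'
    '<h3><a href="/article/1">Duplicate</a></h3>'
    '<h3><a href="article/2">Two</a></h3>'
    '<a href="/about">About</a>'
)
links = absolute_links(PAGE_URL, HTML)
print(links)
```

The link outside any `<h3>` is skipped, the duplicate is dropped, and both relative forms end up as absolute URLs that could be requested directly.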

Answered 2025-04-30 by Python大师
