Recursively crawling a domain for links with Scrapy

Posted 2024-04-16 11:49:03


Below is the code I use to crawl all the URLs of a domain:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class UrlsSpider(scrapy.Spider):
    name = 'urlsspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (Rule(LxmlLinkExtractor(allow=(), unique=True), callback='parse', follow=True))

    def parse(self, response):
        for link in LxmlLinkExtractor(allow_domains=self.allowed_domains, unique=True).extract_links(response):
            print(link.url)

            yield scrapy.Request(link.url, callback=self.parse)

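As you can see, I used unique=True, but it still prints duplicate URLs in the terminal. I want only unique URLs, with no duplicates.

Any help with this would be appreciated.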


1 answer

User

#1 · Posted 2024-04-16 11:49:03

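Because the code follows links recursively, the same URL turns up again when other pages that link to it are parsed. unique=True only removes duplicates within a single extract_links() call, that is, within one response. In effect you have multiple LxmlLinkExtractor() instances, one per parsed page, and none of them knows what the others have already returned.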
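To make the difference concrete, here is a minimal sketch in plain Python with no Scrapy involved. The SITE dictionary, extract_links(), and both crawl functions are hypothetical stand-ins invented for illustration: extract_links() de-duplicates within one page only, which is how unique=True behaves, and the second crawl adds a set shared across all pages.

```python
from collections import deque

# Hypothetical in-memory "site": each page maps to the links found on it.
SITE = {
    "http://example.com/": [
        "http://example.com/a",
        "http://example.com/b",
        "http://example.com/a",  # repeated on the same page
    ],
    "http://example.com/a": ["http://example.com/", "http://example.com/b"],
    "http://example.com/b": ["http://example.com/", "http://example.com/a"],
}


def extract_links(page_url):
    """Return the links on one page, de-duplicated within that page only."""
    return list(dict.fromkeys(SITE.get(page_url, [])))


def crawl_per_page_unique(start_url):
    """Mimic the question: report every link each page's extraction returns."""
    reported = []
    visited = {start_url}  # stands in for not fetching the same page twice
    queue = deque([start_url])
    while queue:
        page = queue.popleft()
        for url in extract_links(page):
            reported.append(url)  # reported once per page that links to it
            if url not in visited:
                visited.add(url)
                queue.append(url)
    return reported


def crawl_globally_unique(start_url):
    """Report each URL once by checking a set shared across all pages."""
    seen = {start_url}
    reported = [start_url]
    queue = deque([start_url])
    while queue:
        page = queue.popleft()
        for url in extract_links(page):
            if url in seen:
                continue  # already reported from an earlier page
            seen.add(url)
            reported.append(url)
            queue.append(url)
    return reported


if __name__ == "__main__":
    print(crawl_per_page_unique("http://example.com/"))
    print(crawl_globally_unique("http://example.com/"))
```

In the spider from the question, the same idea would be a set kept on the spider instance (for example a hypothetical self.seen_urls) that parse() checks before printing a link and yielding a request for it. Scrapy's scheduler already drops duplicate requests by default, which is why pages are not fetched twice, but the print happens before that filter is applied.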
