How do I stop my crawler from recording duplicates?

Posted 2024-04-26 22:15:59


I would like to know how to stop my crawler from recording the same URL more than once.

Here is my current code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            item = MyItem()
            item['url'] = link.url
            self.f.write(item['url'] + "\n")

Right now it records thousands of duplicates of the same link, for example on a vBulletin forum with 250,000 posts.

Edit: Note that the crawler will collect millions of links, so the duplicate check needs to be very fast.


1 Answer
User
#1 · Posted 2024-04-26 22:15:59

Keep a list of the URLs you have already visited and check every URL against it. After parsing a particular URL, add it to the list. Before visiting the page behind a newly found URL, check whether that URL is already in the list: if not, parse it and add it; otherwise skip it.

That is:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    items = []  # list of URLs that have already been written
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link.url not in self.items:  # skip URLs that were already handled
                self.items.append(link.url)  # remember the URL so it is not written again
                # write it to the file as before
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")

Dictionary version:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class MyItem(Item):
    url = Field()

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    items = {}  # dictionary with the seen URLs as keys (lookup is much faster than a list)
    f = open("items.txt", "w")

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            if link.url not in self.items:  # skip URLs that were already handled
                self.items[link.url] = 1  # the stored value does not matter, only the key
                # write it to the file as before
                item = MyItem()
                item['url'] = link.url
                self.f.write(item['url'] + "\n")

Alternatively, you could collect the items first and only write them to the file afterwards.
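A minimal sketch of that idea, assuming the same spider as above (the seen_urls name is only illustrative): collect the URLs in a Python set, which keeps membership checks fast even with millions of entries, and write them all out once in the spider's closed() callback instead of writing line by line.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class someSpider(CrawlSpider):
    name = "My script"
    domain = raw_input("Enter the domain:\n")
    allowed_domains = [domain]
    starting_url = raw_input("Enter the starting url with protocol:\n")
    start_urls = [starting_url]
    seen_urls = set()  # a set gives O(1) average-time membership checks

    rules = (Rule(LxmlLinkExtractor(allow_domains=(domain)), callback='parse_obj', follow=True),)

    def parse_obj(self, response):
        for link in LxmlLinkExtractor(allow_domains=(self.domain)).extract_links(response):
            # adding to a set silently ignores URLs that are already present
            self.seen_urls.add(link.url)

    def closed(self, reason):
        # called once when the spider finishes; write everything in one go
        # (the order of the URLs is not preserved)
        with open("items.txt", "w") as f:
            f.write("\n".join(self.seen_urls) + "\n")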

There are many other improvements that could be made to this code, but I'll leave those for you to explore.
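One such improvement, sketched here only as a possibility (the DedupPipeline name and the module path are placeholders): move the duplicate check out of the spider into a Scrapy item pipeline that raises scrapy.exceptions.DropItem for URLs it has already seen, and have parse_obj yield the item instead of writing to the file directly.

from scrapy.exceptions import DropItem

class DedupPipeline(object):
    """Item pipeline that drops items whose URL has been seen before."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        if item['url'] in self.seen:
            # discarding the item; Scrapy logs dropped items for you
            raise DropItem("Duplicate url: %s" % item['url'])
        self.seen.add(item['url'])
        return item

It would then be enabled in settings.py with something like ITEM_PIPELINES = {'myproject.pipelines.DedupPipeline': 300}, so the spider callback stays free of bookkeeping.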
