How to avoid duplicate content in a crawler

Question

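I wrote a crawler with Python's Scrapy framework that selects some links and meta tags. It crawls the start URLs and writes the data to a file in JSON format. The problem is that when I run the crawler two or three times with the same start URLs, the data in the file is duplicated. To avoid this, I used a downloader middleware in Scrapy; the code is available here: http://snippets.scrapy.org/snippets/1/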

All I did was copy and paste that code into a file in my Scrapy project, then enable it in settings.py by adding the following line:

SPIDER_MIDDLEWARES = {'a11ypi.removeDuplicates.IgnoreVisitedItems':560}

Here "a11ypi.removeDuplicates.IgnoreVisitedItems" is the class path. Finally, I also modified my items.py file to add the following fields:

visit_id = Field()  
visit_status = Field()

But this had no effect: when the crawler runs twice, it still appends the same results to the file.

My code for writing to the file in pipelines.py is as follows:

import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json", "ab+")

    # This method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Here we are iterating over the scraped items and creating a dictionary of dictionaries.
        try:
            while i < len(item["foruri"]):
                d.setdefault(item["foruri"][i], {}).setdefault(item["rec"][i], {})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i += 1
        except IndexError:
            print("Index out of range")

        json.dump(d, self.file)
        return item
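
For illustration, here is a minimal sketch of one way a pipeline could avoid the duplication, assuming the output file is either absent, empty, or holds a single JSON object. The class name MergingJsonPipeline and its path parameter are hypothetical and not part of the original project. Opening the file in append mode adds a new JSON document on every run, so this sketch instead loads the earlier results in open_spider, merges new items into them, and rewrites the whole file in close_spider:

```python
import json
import os


class MergingJsonPipeline(object):
    """Hypothetical pipeline: merges new results into the existing JSON file."""

    def __init__(self, path="a11ypi_dict.json"):
        self.path = path
        self.data = {}

    def open_spider(self, spider):
        # Load results from earlier runs, if a non-empty file already exists.
        # Assumption: the file holds exactly one JSON object.
        if os.path.exists(self.path) and os.path.getsize(self.path) > 0:
            with open(self.path) as f:
                self.data = json.load(f)

    def process_item(self, item, spider):
        # zip() stops at the shortest list, so a length mismatch is
        # truncated silently instead of raising IndexError.
        rows = zip(item["foruri"], item["rec"], item["foruri_id"], item["thisid"])
        for foruri, rec, foruri_id, thisid in rows:
            value = item["thisurl"] + ":" + thisid
            # Same nested-dictionary layout as the original pipeline.
            self.data.setdefault(foruri, {}).setdefault(rec, {})[foruri_id] = value
        return item

    def close_spider(self, spider):
        # "w" truncates the file, so it always holds one JSON document.
        with open(self.path, "w") as f:
            json.dump(self.data, f)
```

Because results are keyed by foruri, rec, and foruri_id, crawling the same page again overwrites the existing entries rather than adding new ones. A file that already contains several appended JSON documents would need to be cleaned up or deleted first, since json.load would reject it.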

And my spider code is as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from a11ypi.items import AYpiItem

class AYpiSpider(CrawlSpider):
    name = "a11y.in"
    allowed_domains = ["a11y.in"]

    # This is the list of seed URLs to begin crawling with.
    start_urls = ["http://www.a11y.in/a11ypi/idea/fire-hi.html"]

    # This is the callback method, which is used for scraping specific data
    def parse(self, response):
        temp = []
        hxs = HtmlXPathSelector(response)
        item = AYpiItem()
        # XPath to extract the foruri, which contains both the URL and id in foruri
        wholeforuri = hxs.select("//@foruri").extract()
        for i in wholeforuri:
            temp.append(i.rpartition(":"))

        item["foruri"] = [i[0] for i in temp]  # This contains the URL in foruri
        item["foruri_id"] = [i.split(":")[-1] for i in wholeforuri]  # This contains the id in foruri
        item['thisurl'] = response.url
        item["thisid"] = hxs.select("//@foruri/../@id").extract()
        item["rec"] = hxs.select("//@foruri/../@rec").extract()
        return item
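
The foruri handling above can be tried outside Scrapy. The sketch below assumes, as the code comment says, that each foruri value is a URL followed by a colon and an id. The helper name split_foruri and the sample value are made up for illustration:

```python
def split_foruri(value):
    """Split a 'URL:id' string at its last colon into (url, id)."""
    # rpartition splits at the LAST colon, so the colon in "http://"
    # stays inside the URL part.
    url, _sep, ident = value.rpartition(":")
    return url, ident


# Hypothetical attribute value; real ones come from the crawled page.
sample = "http://www.a11y.in/a11ypi/idea/fire-hi.html:para1"
url, ident = split_foruri(sample)
print(url)
print(ident)
```

Splitting at the last colon gives the same id as split(":")[-1] in the spider, so the foruri and foruri_id lists stay aligned.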

Please give me some advice on what to do.

