Duplicate data when the spider runs twice?

1 vote

1 answer

940 views

Asked 2025-04-16 13:46

I am using the Python crawling framework Scrapy, and I use the pipelines.py file to store the scraped items in a JSON file. Here is the code that does this:

import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json","ab+")

    # this method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Here we are iterating over the scraped items and creating a dictionary of dictionaries.
        try:
            while i<len(item["foruri"]):
                d.setdefault(item["foruri"][i],{}).setdefault(item["rec"][i],{})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i+=1
        except IndexError:
            print "Index out of range"
        # Writing it to a file
        json.dump(d,self.file)
        return item

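The problem is that when I run the spider twice (say), the file ends up containing duplicate scraped items. I tried to prevent this by first reading the file and comparing its contents with the newly generated data. Since the data read from the file is JSON, I decoded it with json.loads(), but that did not work: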

import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json","ab+")
        self.temp = json.loads(file.read())

    # this method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Here we are iterating over the scraped items and creating a dictionary of dictionaries.
        try:
            while i<len(item["foruri"]):
                d.setdefault(item["foruri"][i],{}).setdefault(item["rec"][i],{})[item["foruri_id"][i]] = item['thisurl'] + ":" + item["thisid"][i]
                i+=1
        except IndexError:
            print "Index out of range"
        # Writing it to a file
        if d!=self.temp: # check whether the newly generated data doesn't match the one already in the file
            json.dump(d,self.file)
        return item

Please suggest a way to do this.

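Note: I have to open the file in "append" mode because I may crawl a different set of links, but running the spider twice with the same start URL would then write the same data to the file twice.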
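A note on why the read-back attempt above is likely to fail, offered as a sketch rather than a confirmed diagnosis: each run appends another JSON document to the same file, and json.loads() rejects a string that contains more than one top-level document. One common workaround is to write one JSON object per line and parse the file line by line. The helper names append_record, load_records and append_if_new are hypothetical, and the code assumes Python 3.

```python
import json
import os


def append_record(path, record):
    """Append one JSON object as a single line (one document per line)."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def load_records(path):
    """Return every record previously appended to `path`, oldest first."""
    if not os.path.exists(path):
        return []  # nothing stored yet, e.g. on the very first run
    with open(path, "r", encoding="utf-8") as fh:
        # Each non-empty line is parsed on its own, so appended runs never collide.
        return [json.loads(line) for line in fh if line.strip()]


def append_if_new(path, record):
    """Append `record` only if an equal record is not already stored."""
    if record in load_records(path):
        return False
    append_record(path, record)
    return True
```

This rereads the whole file on every call, which is acceptable for small outputs; for larger crawls, keeping a set of IDs in memory or on disk is usually cheaper.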

Tags: file operations, data storage, JSON processing, data comparison, deduplication, web crawlers, Scrapy, crawler optimization

1 Answer

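You can filter out duplicates with some custom middleware, such as this one. To actually use it in your spider, though, you need two more things: a way of assigning IDs to items so that the filter can recognize duplicates, and a way of persisting the set of visited IDs between spider runs. The second is easy: you could use a standard Python tool such as shelve, or one of the many key-value stores that are popular these days. The first is harder and depends on the problem you are trying to solve.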
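The two pieces described above could be combined roughly as follows. This is a minimal sketch under stated assumptions, not the middleware the answer links to: it assumes an item's ID can be derived by hashing its full content, and that a shelve file is an acceptable store for seen IDs. The names item_fingerprint, DedupPipeline and seen_items.db are hypothetical; open_spider, close_spider, process_item and DropItem are standard Scrapy pipeline hooks and exceptions. The code assumes Python 3.

```python
import hashlib
import json
import shelve

try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch also runs without Scrapy installed
    class DropItem(Exception):
        pass


def item_fingerprint(item):
    """Derive a stable ID from the item's content (an assumed ID scheme)."""
    payload = json.dumps(dict(item), sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


class DedupPipeline(object):
    """Hypothetical pipeline that drops items already seen in any earlier run."""

    def __init__(self, seen_path="seen_items.db"):
        self.seen_path = seen_path  # assumed location of the persistent ID store
        self.seen = None

    def open_spider(self, spider):
        # shelve keeps the fingerprints on disk between spider runs.
        self.seen = shelve.open(self.seen_path)

    def close_spider(self, spider):
        self.seen.close()

    def process_item(self, item, spider):
        fp = item_fingerprint(item)
        if fp in self.seen:
            raise DropItem("duplicate item: %s" % fp)
        self.seen[fp] = True
        return item
```

For this to have any effect, the pipeline would need to run before the one that writes the JSON file, for example by giving it a lower order number in the ITEM_PIPELINES setting. Hashing the whole item only catches exact repeats; if two runs can produce slightly different items for the same page, a narrower key such as the page URL plus element ID may fit better.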

Answered 2025-04-16 by Python大师
