How to avoid duplicates in a crawler
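I wrote a crawler with Python's Scrapy framework to select some links and meta tags. It crawls the start URLs and writes the data to a file in JSON format. The problem is that when I run the crawler two or three times with the same start URLs, the data in the file gets duplicated. To avoid this, I used a downloader middleware in Scrapy; the code is here: http://snippets.scrapy.org/snippets/1/
What I did was copy and paste that code into a file inside my Scrapy project, and then enable it in settings.py by adding the following line: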
SPIDER_MIDDLEWARES = {'a11ypi.removeDuplicates.IgnoreVisitedItems':560}
Here "a11ypi.removeDuplicates.IgnoreVisitedItems" is the class path. Finally, I also modified my items.py file to include the following fields:
visit_id = Field()
visit_status = Field()
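But this still doesn't work: when the crawler is run twice, it still appends the same results to the file.
This is the code in my pipelines.py file that writes to the file: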
import json

class AYpiPipeline(object):
    def __init__(self):
        self.file = open("a11ypi_dict.json","ab+")

    # this method is called to process an item after it has been scraped.
    def process_item(self, item, spider):
        d = {}
        i = 0
        # Here we are iterating over the scraped items and creating a dictionary of dictionaries.
        try:
            while i<len(item["foruri"]):
                d.setdefault(item["foruri"][i],{}).setdefault(item["rec"][i],{})[item["foruri_id"][i]] = item['thisurl'] + ":" +item["thisid"][i]
                i+=1
        except IndexError:
            print "Index out of range"
        json.dump(d,self.file)
        return item
And my spider code is as follows:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from a11ypi.items import AYpiItem

class AYpiSpider(CrawlSpider):
    name = "a11y.in"
    allowed_domains = ["a11y.in"]

    # This is the list of seed URLs to begin crawling with.
    start_urls = ["http://www.a11y.in/a11ypi/idea/fire-hi.html"]

    # This is the callback method, which is used for scraping specific data
    def parse(self,response):
        temp = []
        hxs = HtmlXPathSelector(response)
        item = AYpiItem()
        wholeforuri = hxs.select("//@foruri").extract() # XPath to extract the foruri, which contains both the URL and id in foruri
        for i in wholeforuri:
            temp.append(i.rpartition(":"))
        item["foruri"] = [i[0] for i in temp] # This contains the URL in foruri
        item["foruri_id"] = [i.split(":")[-1] for i in wholeforuri] # This contains the id in foruri
        item['thisurl'] = response.url
        item["thisid"] = hxs.select("//@foruri/../@id").extract()
        item["rec"] = hxs.select("//@foruri/../@rec").extract()
        return item
Please suggest what I should do.
1 Answer
Try to understand why the snippet is written the way it is:
if isinstance(x, Request):
    if self.FILTER_VISITED in x.meta:
        visit_id = self._visited_id(x)
        if visit_id in visited_ids:
            log.msg("Ignoring already visited: %s" % x.url,
                    level=log.INFO, spider=spider)
            visited = True
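Notice the second line: the middleware only drops a request if the request's meta contains the key that FILTER_VISITED names ('filter_visited' in the code that follows). This is deliberate, because otherwise every URL you have visited before would be skipped and you would have no URLs left to traverse at all. So FILTER_VISITED actually lets you choose which URL patterns you want to skip. If you want links extracted with a particular rule to be skipped, just do: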
# Define the function before the rule that references it.
def setVisitFilter(request):
    request.meta['filter_visited'] = True
    return request

Rule(SgmlLinkExtractor(allow=('url_regex1', 'url_regex2')), callback='my_callback', process_request=setVisitFilter)
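To make the opt-in behaviour concrete, here is a rough, Scrapy-free sketch of the idea. FakeRequest, visited_id and filter_requests are hypothetical stand-ins I made up for illustration; they are not part of Scrapy or of the snippet. I'm also assuming the id is a hash of the URL, and recording ids as requests pass through, which may differ from where the real snippet records them.

```python
import hashlib

# Meta key that opts a request in to filtering (same string the answer sets).
FILTER_VISITED = "filter_visited"


class FakeRequest(object):
    """Hypothetical stand-in for a Scrapy Request: a URL plus a meta dict."""

    def __init__(self, url, meta=None):
        self.url = url
        self.meta = dict(meta or {})


def visited_id(request):
    """Derive a stable id for a request (assumption: a SHA-1 of its URL)."""
    return hashlib.sha1(request.url.encode("utf-8")).hexdigest()


def filter_requests(requests, visited_ids):
    """Return the requests to keep.

    Only requests carrying the FILTER_VISITED meta key are checked against
    visited_ids; everything else always passes through. visited_ids is a
    set that is updated in place.
    """
    kept = []
    for request in requests:
        if FILTER_VISITED in request.meta:
            vid = visited_id(request)
            if vid in visited_ids:
                continue  # opted in and already visited: drop it
            visited_ids.add(vid)
        kept.append(request)
    return kept


visited = set()
batch = [
    FakeRequest("http://example.com/start"),  # no flag: never filtered
    FakeRequest("http://example.com/page", {FILTER_VISITED: True}),
]
first_run = [r.url for r in filter_requests(batch, visited)]
second_run = [r.url for r in filter_requests(batch, visited)]
```

On the second pass only the unflagged start URL should survive. That is the point of the flag: if the start URL were filtered too, the crawl would have nothing to begin from.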
P.S. I'm not sure whether this works in 0.14 and above, because some of the code for storing the spider context in the sqlite db has changed.
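Whatever version you are on, the visited ids have to survive between runs for this to help with repeated crawls. Below is a minimal sketch of that idea using Python's standard sqlite3 module. VisitedStore and its methods are hypothetical names of mine, not Scrapy API and not how the snippet stores its context, so treat it as one possible approach rather than the actual mechanism.

```python
import os
import sqlite3
import tempfile


class VisitedStore(object):
    """Hypothetical helper that remembers visited ids in a SQLite file."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS visited (visit_id TEXT PRIMARY KEY)"
        )
        self.conn.commit()

    def seen(self, visit_id):
        """Return True if visit_id was stored by this or an earlier run."""
        row = self.conn.execute(
            "SELECT 1 FROM visited WHERE visit_id = ?", (visit_id,)
        ).fetchone()
        return row is not None

    def add(self, visit_id):
        """Record visit_id; INSERT OR IGNORE makes repeated adds harmless."""
        self.conn.execute(
            "INSERT OR IGNORE INTO visited (visit_id) VALUES (?)", (visit_id,)
        )
        self.conn.commit()

    def close(self):
        self.conn.close()


# Simulate two separate crawler runs that share one database file.
db_path = os.path.join(tempfile.mkdtemp(), "visited.db")
url = "http://www.a11y.in/a11ypi/idea/fire-hi.html"

run1 = VisitedStore(db_path)
first_seen = run1.seen(url)   # nothing stored yet
run1.add(url)
run1.close()

run2 = VisitedStore(db_path)
second_seen = run2.seen(url)  # remembered from the first run
run2.close()
```

A pipeline or middleware could consult such a store before writing an item or following a request, so a second run with the same start URL skips what the first run already handled.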