Using Scrapy to create a JSON file of all page URLs that contain specific keywords

Question

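I am using Scrapy to parse multiple URLs/pages. On each page it searches for specific keywords, and if a keyword is found, the URL is added to a dictionary called Attribute_Dictionary.

Attribute_Dictionary is updated every time a URL is parsed. What I want now is to write the contents of Attribute_Dictionary to a JSON file only once, after all URLs have been parsed.

My current code does write the contents to a JSON file, but within a single run it keeps creating a new JSON file that overwrites the previous one.
The result I want is a single JSON file that holds the contents of Attribute_Dictionary for all URLs.

Please help. Should I create a global variable that covers all parsed pages? If so, how should I do that?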

import json

from scrapy.http import Request
from scrapy.selector import Selector

# parsed_urls, domain_urls, tld, Keyword_Dictionary, Url_Dictionary,
# Attribute_Dictionary and Website are not shown in the question; they are
# assumed to be defined elsewhere in the spider module.


def parse(self, response):
    global parsed_urls
    global domain_urls
    global tld
    global sliced_url
    items = []
    global item

    # Skip responses that are not HTML.
    if ('html' not in response.headers['Content-Type']):
        return

    sel = Selector(response)

    # Remember the host part of the current URL; used below for relative links.
    for h3 in sel.xpath('//title/text()').extract():
        #print h3 + "***********" + ' <' + response.url + '>'
        sliced_url = response.url.split('/')[2]

    for url in sel.xpath('//a/@href').extract():

        # Turn root-relative links into absolute ones.
        if (url.startswith('/')):
            url = 'http://' + sliced_url + url

        # Skip links that were already seen and overly long URLs.
        if (url in parsed_urls or len(url) > 250):
            continue

        parsed_urls.append(url)

        # Follow links that stay on the target domain.
        if tld in url:
            domain_urls.append(url)
            yield Request(url, callback=self.parse)

        #print parsed_urls
        # Record the link under every keyword found in the current page's text.
        for keyword in Keyword_Dictionary:

            if (url.startswith('http') and (tld in url)):
                if (self.Search_keyword_in_url(keyword, response)):
                    if keyword not in Url_Dictionary:
                        Url_Dictionary[keyword] = []
                    Url_Dictionary[keyword].append(url)
                    #print keyword + " " + "Detected"

        # Map each keyword's attribute names to the URLs collected for it.
        for keyword in Url_Dictionary:

            Attribute_Key = Keyword_Dictionary.get(keyword)
            Attribute_Key_Value = Url_Dictionary.get(keyword)
            for key in Attribute_Key:
                if key not in Attribute_Dictionary:
                    Attribute_Dictionary[key] = []
                    print key
                    print "\n"
                    for value in Attribute_Key_Value:
                        if value not in Attribute_Dictionary.get(key):
                            Attribute_Dictionary[key].append(value)
                            print key + " " + "Just Appended"
                            item = Website()
                            Modified_Key = key.replace(" ","_")
                            item[Modified_Key] = response.url
                            print item[Modified_Key]

        print Attribute_Dictionary
        # Json Code
        # Note: this block sits inside the link loop, so the file is opened in
        # append mode and the whole dictionary is dumped once per link.
        fileptr = open('keywords_spider.json','a')
        json.dump(Attribute_Dictionary, fileptr, indent=4)
        print "Created keywords_spiders.json.."
        fileptr.close()


def Search_keyword_in_url(self, keyword, response):
    # Return True if the keyword occurs in any paragraph text of the page.
    sel = Selector(response)

    text_list = sel.xpath('//div/p/text()').extract()
    for text in text_list:
        if keyword in text:
            return True
    return False
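
For illustration only, here is a minimal sketch of the "collect during the crawl, write once at the end" idea described in the question. It is not taken from the original post. `KeywordUrlCollector`, `add` and `write_once` are hypothetical names, and the sketch uses only the standard library so it can be run without Scrapy. It assumes the mapping being built is attribute name to list of URLs, as Attribute_Dictionary appears to be. Calling `json.dump` on a file opened with 'a' writes another complete JSON document after whatever is already in the file, so repeated dumps do not yield one valid JSON file; opening with 'w' a single time avoids that.

```python
import json


class KeywordUrlCollector(object):
    """Hypothetical helper: accumulate attribute -> URLs, dump them once."""

    def __init__(self):
        # Plays the role of Attribute_Dictionary, but lives on an instance
        # instead of in a module-level global.
        self.attribute_urls = {}

    def add(self, key, url):
        # Create the list on first use and skip URLs already recorded.
        urls = self.attribute_urls.setdefault(key, [])
        if url not in urls:
            urls.append(url)

    def write_once(self, path):
        # 'w' truncates the file, so it ends up holding exactly one
        # JSON document no matter how many pages were processed.
        with open(path, 'w') as fileptr:
            json.dump(self.attribute_urls, fileptr, indent=4, sort_keys=True)
```

In a Scrapy spider, one possible arrangement is to call `add` from `parse` and call `write_once('keywords_spider.json')` from the spider's `closed(self, reason)` method, which Scrapy invokes when the spider closes. Whether an instance attribute or a global is preferable is a design choice; the instance attribute keeps the state tied to one spider run.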

Tags: global variables, data parsing, json, file writing, scrapy, keyword search, dictionary update, URL collection


1 Answer
