Scraping multiple URLs from a JSON file with Scrapy

Published 2024-05-23 23:09:01

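I have a Scrapy project that reads the URLs to scrape from a JSON file. With this code I can scrape only one URL per entry; when an entry has two URLs, I get an error. How can I scrape these URLs without errors?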

import json
import scrapy
import re
import pkgutil

from scrapy.loader import ItemLoader
from rzc_spider.items import AnnonceItem


class AnnonceSpider(scrapy.Spider):
    name = 'rzc_results'

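    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before loading the input.
        super().__init__(*args, **kwargs)
        data_file = pkgutil.get_data("rzc_spider", "json/input/test_tt.json")
        self.data = json.loads(data_file)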

    def start_requests(self):
        for item in self.data:
            request = scrapy.Request(item['rzc_url'], callback=self.parse)
            request.meta['item'] = item
            yield request

    def parse(self, response):
        item = response.meta['item']
        item['results'] = []
        item["car_number"] = response.css(
            "h2.sub::text").extract_first()

        for caritem in response.css("div.ad > div[itemtype='https://schema.org/Vehicle']"):
            data = AnnonceItem()
            #model
            data["model"] = caritem.css(
                "em.title::text").extract_first()

            item['results'].append(data)
        yield item

    #ban proxies reaction
    def response_is_ban(self, request, response):
        return b'banned' in response.body

    def exception_is_ban(self, request, exception):
        return None

My JSON input:

[{
    "objectID": 10743,
    "sous_modele2": "TT Coupé",
    "marque": "Audi",
    "type": "Coupé",
    "cars_getroute": "audi-tt-coupe-1999-2006",
    "years": [
        "1999",
        "2000",
        "2001",
        "2002",
        "2003",
        "2004",
        "2005",
        "2006"
    ],
    "rzc_url": ["https://www.website.com/results&page=1",
                "https://www.website.com/results&page=2"]

}]

It only works when "rzc_url" is a single URL:

[{
    "objectID": 10743,
    "sous_modele2": "TT Coupé",
    "marque": "Audi",
    "type": "Coupé",
    "cars_getroute": "audi-tt-coupe-1999-2006",
    "years": [
        "1999",
        "2000",
        "2001",
        "2002",
        "2003",
        "2004",
        "2005",
        "2006"
    ],
    "rzc_url": "https://www.website.com/results&page=2"
}]

I know about start_urls, but in practice I have thousands of URLs to scrape, each with a different objectID.
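
One likely cause: in the failing input, "rzc_url" is a list, while scrapy.Request expects a single URL string, so item['rzc_url'] cannot be passed to it directly. A minimal sketch of one way around this, assuming the goal is one request per URL with the entry's other fields (objectID and so on) carried along. The helper names as_url_list and iter_url_jobs are hypothetical, not part of Scrapy, and the block is plain Python so it can be tried without a crawl:

```python
import copy
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def as_url_list(value: Any) -> List[str]:
    """Return `value` as a list of URLs.

    Accepts either a single URL string or a list/tuple of URL strings,
    which are the two shapes "rzc_url" takes in the question's JSON.
    """
    if isinstance(value, str):
        return [value]
    return list(value)


def iter_url_jobs(
    records: Iterable[Dict[str, Any]],
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield one (url, record_copy) pair for every URL of every record.

    Each pair gets its own deep copy of the record, so a callback that
    fills in item['results'] for one page does not overwrite the data
    collected for another page of the same record.
    """
    for record in records:
        for url in as_url_list(record["rzc_url"]):
            job = copy.deepcopy(record)
            job["rzc_url"] = url  # the copy keeps only the URL it belongs to
            yield url, job


if __name__ == "__main__":
    # Shortened version of the input from the question.
    sample = [
        {
            "objectID": 10743,
            "marque": "Audi",
            "rzc_url": [
                "https://www.website.com/results&page=1",
                "https://www.website.com/results&page=2",
            ],
        }
    ]
    for url, job in iter_url_jobs(sample):
        print(job["objectID"], url)
```

Inside the spider, start_requests could then loop over iter_url_jobs(self.data) and yield scrapy.Request(url, callback=self.parse, meta={'item': job}) for each pair. With this approach every page produces its own item. If all pages of one objectID must end up in a single item, the results would need to be merged afterwards, for example in an item pipeline; that part is not covered by this sketch.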

