Python Scrapy: a function that should always work

0 votes
1 answer
161 views
Asked 2025-04-13 02:36

The script below succeeds about 90% of the time when collecting weather data. However, in some cases it inexplicably fails, even though the failing pages have the same page code as the other requests. Sometimes the code is the same and the request is the same, yet it still does not succeed.

import scrapy


class NflweatherdataSpider(scrapy.Spider):
    name = 'NFLWeatherData'
    allowed_domains = ['nflweather.com']

    def __init__(self, Week='', Year='', Game='', **kwargs):
        self.start_urls = [f'https://nflweather.com/{Week}/{Year}/{Game}']
        self.Year = Year
        self.Game = Game
        super().__init__(**kwargs)
        print(self.start_urls)

    def parse(self, response):
        self.log(self.start_urls)
        # Extracting the content using css selectors
        game_boxes = response.css('div.game-box')

        for game_box in game_boxes:
            # Extracting date and time information
            Datetimes = game_box.css('.col-12 .fw-bold::text').get()

            # Extracting team information
            team_game_boxes = game_box.css('.team-game-box')
            awayTeams = team_game_boxes.css('.fw-bold::text').get()
            homeTeams = team_game_boxes.css('.fw-bold.ms-1::text').get()

            # Extracting temperature and probability information
            TempProbs = game_box.css('.col-md-4 .mx-2 span::text').get()

            # Extracting wind speed and direction information
            windspeeds = game_box.css('.icon-weather + span::text').get()
            winddirection = game_box.css('.md-18 ::text').get()

            # Create a dictionary to store the scraped info
            scraped_info = {
                'Year': self.Year,
                'Game': self.Game,
                'Datetime': Datetimes.strip(),
                'awayTeam': awayTeams,
                'homeTeam': homeTeams,
                'TempProb': TempProbs,
                'windspeeds': windspeeds.strip(),
                'winddirection': winddirection.strip()
            }

            # Yield the scraped info to Scrapy
            yield scraped_info

These are the commands used to run the spider:

scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-6 -o NFLWeather_2012_week_6.json   
scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-7 -o NFLWeather_2012_week_7.json
scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-8 -o NFLWeather_2012_week_8.json

The week-6 crawl runs fine with no issues.

The week-7 crawl returns nothing:

ERROR: Spider error processing <GET https://nflweather.com/week/2012/week-7> (referer: None)
Traceback (most recent call last):
  File "G:\ProgramFiles\MiniConda3\envs\WrkEnv\lib\site-packages\scrapy\utils\defer.py", line 279, in iter_errback
    yield next(it)

The week-8 crawl fetches only two rows of data and errors out on the rest:

ERROR: Spider error processing <GET https://nflweather.com/week/2012/week-8> (referer: None)
Traceback (most recent call last):
  File "G:\ProgramFiles\MiniConda3\envs\WrkEnv\lib\site-packages\scrapy\utils\defer.py", line 279, in iter_errback
    yield next(it)

Do you know why these runs fail while the others are fine?

1 Answer

1

The error is in the windspeeds variable. Sometimes the weather data is missing, so windspeeds ends up as None. When you build the dictionary and call windspeeds.strip(), that raises an error (AttributeError: 'NoneType' object has no attribute 'strip').

You can fix this with a simple None check when creating the dictionary, or you can check earlier; how you handle it is up to you. Here is a working example:

scraped_info = {
    'Year': self.Year,
    'Game': self.Game,
    'Datetime': Datetimes.strip(),
    'awayTeam': awayTeams,
    'homeTeam': homeTeams,
    'TempProb': TempProbs,
    'windspeeds': windspeeds.strip() if windspeeds is not None else "TBD",
    'winddirection': winddirection.strip() if winddirection is not None else "TBD"
}

You will also notice that the week-6 run you cited as "working" now yields more results than before.
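As a side note, the same guard generalizes: every .get() call on a selector returns None when nothing matches, so Datetimes.strip() has the same latent problem. A minimal sketch of a reusable helper (the "TBD" fallback mirrors the example above; the function name clean is just an illustration, not part of Scrapy):

def clean(value, default="TBD"):
    """Strip a scraped string, or return a fallback when .get() matched nothing."""
    return value.strip() if value is not None else default

# Example usage with values a selector might return:
print(clean("  8 mph  "))  # -> 8 mph
print(clean(None))         # -> TBD

You would then write 'windspeeds': clean(windspeeds) for each field, instead of repeating the conditional expression.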
