Python Scrapy function that works consistently
The script below succeeds roughly 90% of the time when collecting weather data. However, it occasionally fails inexplicably on pages whose markup and requests are identical to the ones that succeed. Sometimes the markup is the same and the request is the same, yet it still fails.
import scrapy


class NflweatherdataSpider(scrapy.Spider):
    name = 'NFLWeatherData'
    allowed_domains = ['nflweather.com']
    # start_urls = ['http://nflweather.com/']

    def __init__(self, Week='', Year='', Game='', **kwargs):
        self.start_urls = [f'https://nflweather.com/{Week}/{Year}/{Game}']
        self.Year = Year
        self.Game = Game
        super().__init__(**kwargs)
        print(self.start_urls)

    def parse(self, response):
        self.log(self.start_urls)
        # Extracting the content using css selectors
        game_boxes = response.css('div.game-box')
        for game_box in game_boxes:
            # Extracting date and time information
            Datetimes = game_box.css('.col-12 .fw-bold::text').get()
            # Extracting team information
            team_game_boxes = game_box.css('.team-game-box')
            awayTeams = team_game_boxes.css('.fw-bold::text').get()
            homeTeams = team_game_boxes.css('.fw-bold.ms-1::text').get()
            # Extracting temperature and probability information
            TempProbs = game_box.css('.col-md-4 .mx-2 span::text').get()
            # Extracting wind speed and direction information
            windspeeds = game_box.css('.icon-weather + span::text').get()
            winddirection = game_box.css('.md-18 ::text').get()
            # Create a dictionary to store the scraped info
            scraped_info = {
                'Year': self.Year,
                'Game': self.Game,
                'Datetime': Datetimes.strip(),
                'awayTeam': awayTeams,
                'homeTeam': homeTeams,
                'TempProb': TempProbs,
                'windspeeds': windspeeds.strip(),
                'winddirection': winddirection.strip()
            }
            # Yield or give the scraped info to Scrapy
            yield scraped_info
These are the commands used to run the spider:
scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-6 -o NFLWeather_2012_week_6.json
scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-7 -o NFLWeather_2012_week_7.json
scrapy crawl NFLWeatherData -a Week=week -a Year=2012 -a Game=week-8 -o NFLWeather_2012_week_8.json
The week-6 run works fine with no issues.
The week-7 run returns nothing:
ERROR: Spider error processing <GET https://nflweather.com/week/2012/week-7> (referer: None)
Traceback (most recent call last):
File "G:\ProgramFiles\MiniConda3\envs\WrkEnv\lib\site-packages\scrapy\utils\defer.py", line 279, in iter_errback
yield next(it)
The week-8 run only scrapes two rows of data before erroring out:
ERROR: Spider error processing <GET https://nflweather.com/week/2012/week-8> (referer: None)
Traceback (most recent call last):
File "G:\ProgramFiles\MiniConda3\envs\WrkEnv\lib\site-packages\scrapy\utils\defer.py", line 279, in iter_errback
yield next(it)
Do you know why these runs fail while the others work fine?
1 Answer
The error is in the windspeeds variable. The weather data is sometimes missing, so windspeeds ends up as None, and calling windspeeds.strip() while building the dictionary then raises an error.
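For illustration, here is the failure mode in isolation; the AttributeError below is standard Python behavior whenever .strip() is called on None:

>>> windspeeds = None
>>> windspeeds.strip()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'strip'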
You can fix this with a simple None check when creating the dictionary, or you can check earlier; how you handle it depends on your needs. Here is a working example:
scraped_info = {
    'Year': self.Year,
    'Game': self.Game,
    'Datetime': Datetimes.strip(),
    'awayTeam': awayTeams,
    'homeTeam': homeTeams,
    'TempProb': TempProbs,
    'windspeeds': windspeeds.strip() if windspeeds is not None else "TBD",
    'winddirection': winddirection.strip() if winddirection is not None else "TBD"
}
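If you would rather check earlier, one option is to pass a default to the selector's .get(): it returns the fallback string whenever the selector matches nothing, so the later .strip() is always safe. This is a minimal sketch reusing the selectors from the spider above; the "TBD" fallback is just the placeholder carried over from the example:

# Inside the for loop of parse(); .get(default=...) never returns None here
windspeeds = game_box.css('.icon-weather + span::text').get(default='TBD').strip()
winddirection = game_box.css('.md-18 ::text').get(default='TBD').strip()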
You will also notice that the week-6 example you described as working fine now contains more results than before.