I'm trying to scrape product details (name, etc.) from a product site with over 2000 products in one category, spread across many pages. But as the crawl runs, the following error appears at random on different links. Here is the traceback:
Traceback (most recent call last):
  File "crawler1.py", line 103, in <module>
    crawler(25)
  File "crawler1.py", line 35, in crawler
    get_single_data(href)
  File "crawler1.py", line 57, in get_single_data
    source_code = requests.get(item_url, timeout=335)
  File "/Library/Python/2.7/site-packages/requests/api.py", line 71, in get
    return request('get', url, params=params, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/api.py", line 57, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 585, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/adapters.py", line 467, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.therealreal.com', port=443): Max retries exceeded with url: /products/women/handbags/handle-bags/chanel-lax-handle-bag-4 (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x10d8de190>: Failed to establish a new connection: [Errno 60] Operation timed out',))
I catch every error I can think of, and I've added sleep() delays everywhere I could. Is there a way to avoid this so I can pull the data for all 2000 products in one run? Or can someone suggest a fix? Please help.
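One alternative to sleeping at every call site is to let requests retry at the transport level through urllib3's Retry. This is a minimal sketch, not code from the question; the retry counts, backoff factor, and status list are illustrative values you would tune:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# A single Session reuses TCP connections and applies the retry
# policy to every request made through it.
session = requests.Session()
retry = Retry(
    total=5,                 # up to 5 attempts per URL
    backoff_factor=1,        # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # also retry these statuses
)
session.mount('https://', HTTPAdapter(max_retries=retry))

# Then, inside the crawl loop:
# source_code = session.get(item_url, timeout=30)
```

Reusing one Session also avoids opening a fresh connection per product page, which by itself reduces the chance of hitting connection errors on a 2000-item crawl.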
Here is the code:
import socket
from time import sleep

import requests
from bs4 import BeautifulSoup

try:
    source_code = requests.get(item_url, timeout=335)
    sleep(.3)
# requests.exceptions.Timeout already covers ReadTimeout, and the bare
# ConnectionError must be qualified (Python 2 has no such builtin, so
# the original except clause raised a NameError when it fired).
except (requests.exceptions.Timeout,
        requests.exceptions.ConnectionError,
        socket.error):
    sleep(30)
    # Note: this single blind retry can itself raise and is not caught.
    source_code = requests.get(item_url, timeout=335)

plain_text = source_code.text
temp = BeautifulSoup(plain_text, "html.parser")  # name the parser explicitly
Also, ignore the timeout value; I've tried many different values and none of them helped. What's going wrong?
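The stacked except blocks above are all reaching for the same pattern: retry a flaky call a few times with growing delays. A generic sketch of that pattern (the helper name and parameters are mine, not from the question):

```python
import time

def fetch_with_retries(fetch, attempts=4, base_delay=1.0,
                       retriable=(IOError,)):
    """Call fetch() up to `attempts` times, sleeping with exponential
    backoff between failures; re-raise the last error if all attempts fail."""
    for attempt in range(attempts):
        try:
            return fetch()
        except retriable:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

With requests this would be called as, e.g., `fetch_with_retries(lambda: requests.get(item_url, timeout=30), retriable=(requests.exceptions.RequestException,))`; `RequestException` is the base class of requests' Timeout and ConnectionError, so one except clause covers them all.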