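I'm developing a simple web scraper and trying to add some multithreading. My code works as intended with some servers (execution time drops dramatically), but my main goal is to make it work with a few specific ones. When I try it with the servers in the sites list, I get the same performance as if I were still running sequential code. Any guesses as to what causes this?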
import requests, time
from bs4 import BeautifulSoup
from threading import Thread
from random import choice
# Enable to get some logging info
#---------------------------------
# import logging
# import http.client
# http.client.HTTPConnection.debuglevel = 1
# logging.basicConfig()
# logging.getLogger().setLevel(logging.DEBUG)
# requests_log = logging.getLogger("requests.packages.urllib3")
# requests_log.setLevel(logging.DEBUG)
# requests_log.propagate = True
sites = [
    "https://pikabu.ru/community/blackhumour",
    "https://www.pikabu.ru/tag/%D0%9C%D0%B5%D0%BC%D1%8B/hot"
]
class Pikabu_Downloader(Thread):
    def __init__(self, url, name, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.url = url
        self.name = name
        self.begin = time.time()

    def run(self):
        print("Beginning with thread number",self.name, ",", round(time.time()-self.begin, 4), " seconds has passed")
        html_data = self._get_html()
        print("After requests.get with thread number", self.name, ",", round(time.time()-self.begin, 4), " seconds has passed")
        if html_data is None:
            return
        self.soup = BeautifulSoup(html_data, "html.parser")
        print("After making soup with thread number", self.name, ",", round(time.time() - self.begin, 4), " seconds has passed")

    def _get_html(self):
        try:
            user_agents = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64)', 'AppleWebKit/537.36 (KHTML, like Gecko)', 'Chrome/74.0.3729.169', 'Safari/537.36')
            print(f"Go {self.url}...")
            res = requests.get(self.url, headers={'User-Agent': choice(user_agents)}, stream = True)#, allow_redirects=False)
        except Exception as exc:
            print(exc)
        else:
            return res.text
test = "https://readingbooks.site/read/?name=1984&"
def download():
    pikabu_urls = []
    for url in sites:
        pikabu = [url + "?page=" + str(x) for x in range(1, 10)]
        pikabu_urls = pikabu_urls + pikabu
    pikabu_dls = [Pikabu_Downloader(url=page, name=str(i)) for i, page in enumerate(pikabu_urls)]
    # Comment the string above and enable 2 underlying strings to get result from test server
    # tests = [test + "page=" + str(x) for x in range(1, pages)]
    # pikabu_dls = [Pikabu_Downloader(url=page, name=str(i)) for i, page in enumerate(tests)]
    for pikabu_dl in pikabu_dls:
        pikabu_dl.start()
    for pikabu_dl in pikabu_dls:
        pikabu_dl.join()
download()
The result is:
...
After requests.get with thread number 1 , 1.6904 seconds has passed
After making soup with thread number 1 , 1.7554 seconds has passed
After requests.get with thread number 2 , 2.9805 seconds has passed
After making soup with thread number 2 , 3.0455 seconds has passed
After requests.get with thread number 3 , 4.3225 seconds has passed
After making soup with thread number 3 , 4.3895 seconds has passed
...
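What could cause the delay between thread executions? I expected every thread to finish at almost the same time and to get more asynchronous-looking output, as I do with the test server. If I set a 5-second timeout in requests.get, most of the requests fail.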
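After looking into your case, I'd point out some problems you're running into:
... a large number of asynchronous requests on coroutines, while asyncio provides the syntax and runs on the main thread.
I ran a small experiment on Colab. Note that I did not use asyncio and aiohttp on Colab because it got stuck, but I have used them on several projects before, and they ran faster than the fastest method below.
The second function is your implementation.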
In short, I suggest using ThreadPool (preferable on Colab) or asyncio with aiohttp (outside Colab) to gain speed.
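The ThreadPool suggestion could look roughly like the sketch below. It is only an illustration, not the experiment code referred to above: `fetch_all` and `fake_fetch` are hypothetical names, the `example.invalid` URLs are placeholders, and `fake_fetch` sleeps instead of calling `requests.get` so the example runs without network access. Whether this helps against a server that handles requests from one client one at a time is not something the sketch can show.

```python
import time
from multiprocessing.pool import ThreadPool


def fetch_all(urls, fetch, workers=8):
    """Call fetch(url) for every URL on a pool of worker threads.

    Returns a list of (url, result) pairs in the same order as urls.
    """
    with ThreadPool(processes=workers) as pool:
        # map() blocks until every call has returned and preserves input order.
        results = pool.map(fetch, urls)
    return list(zip(urls, results))


def fake_fetch(url):
    # Hypothetical stand-in for requests.get(url).text:
    # sleeps to mimic network latency instead of hitting a real server.
    time.sleep(0.2)
    return f"<html>{url}</html>"


if __name__ == "__main__":
    # Placeholder URLs; swap in the real page list and a real fetch function.
    urls = [f"https://example.invalid/?page={n}" for n in range(1, 9)]
    start = time.time()
    pages = fetch_all(urls, fake_fetch, workers=8)
    print(len(pages), "pages in", round(time.time() - start, 2), "seconds")
```

To try it against the real sites, pass a function that wraps `requests.get(url, timeout=...)` in place of `fake_fetch`. Keeping the fetch function as a parameter makes it easy to time the pooled version against a plain loop over the same URLs.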