Gevent pool with nested web requests
I want to set up a download pool with at most 10 concurrent downloads. It should download a base URL, then parse all the URLs on that page and download each of them, but the total number of simultaneous downloads must never exceed 10.
from lxml import etree
import gevent
from gevent import monkey, pool
import requests

monkey.patch_all()

urls = [
    'http://www.google.com',
    'http://www.yandex.ru',
    'http://www.python.org',
    'http://stackoverflow.com',
    # ... another 100 urls
]

LINKS_ON_PAGE = []
POOL = pool.Pool(10)

def parse_urls(page):
    html = etree.HTML(page)
    if html:
        links = [link for link in html.xpath("//a/@href") if 'http' in link]
        # Download each url that appears in the main URL
        for link in links:
            data = requests.get(link)
            LINKS_ON_PAGE.append('%s: %s bytes: %r' % (link, len(data.content), data.status_code))

def get_base_urls(url):
    # Download the main URL
    data = requests.get(url)
    parse_urls(data.content)
What can I do so that these downloads run concurrently, while the total number of in-flight requests stays capped at 10?
3 Answers
0
You should use gevent.queue to do this the right way.
Also, this (an eventlet example) will be helpful for understanding the basic idea.
The gevent solution is similar to the eventlet one.
Bear in mind that you need somewhere to store the URLs you have already visited, so you don't fetch the same pages in a cycle; and to avoid running out of memory you need to introduce some limits.
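This answer doesn't include code, so here is a minimal sketch of the idea it describes, under my own assumptions: the worker function, the MAX_URLS cap, and the seed list are made-up names for illustration. A Pool of 10 worker greenlets pulls URLs from a gevent.queue.JoinableQueue, and a shared visited set plus a hard cap keeps the crawl from revisiting pages or growing without bound.

from gevent import monkey, pool, queue
monkey.patch_all()

import requests
from lxml import etree

POOL = pool.Pool(10)            # at most 10 greenlets, i.e. 10 downloads at once
tasks = queue.JoinableQueue()   # URLs waiting to be fetched
visited = set()                 # URLs already scheduled, never fetched twice
MAX_URLS = 200                  # hard cap so the crawl cannot grow without bound

def worker():
    while True:
        url = tasks.get()
        try:
            resp = requests.get(url)
            html = etree.HTML(resp.content)
            if html is not None:
                for link in html.xpath('//a/@href'):
                    if (link.startswith('http') and link not in visited
                            and len(visited) < MAX_URLS):
                        visited.add(link)
                        tasks.put(link)
        except Exception:
            pass                # a failed download should not kill the worker
        finally:
            tasks.task_done()

for url in ['http://www.python.org']:   # seed with the base URLs
    visited.add(url)
    tasks.put(url)

for _ in range(10):             # 10 workers == at most 10 concurrent requests
    POOL.spawn(worker)

tasks.join()                    # block until every queued URL has been handled
POOL.kill()                     # the workers loop forever; stop them explicitly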
4
gevent.pool limits the number of concurrent greenlets, not the number of connections.
import requests

connection_limit = 10
adapter = requests.adapters.HTTPAdapter(pool_connections=connection_limit,
                                        pool_maxsize=connection_limit)
session = requests.session()
session.mount('http://', adapter)
session.get('some url')

# or do your work with gevent
from gevent.pool import Pool

# The pool size should be bigger than the connection limit if processing the
# downloaded data takes longer than downloading it, so that the processing
# gets a chance to run.
pool_size = 15
pool = Pool(pool_size)
for url in urls:
    pool.spawn(session.get, url)
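Applied to the nested setup from the question, the same session could be reused for the links found on each page. The two-pass sketch below is my own extrapolation, not part of this answer, and crawl_base is a made-up helper: the base pages are fetched first, then every link found on them is downloaded through the same pool and session, so the HTTPAdapter keeps the real connection count at connection_limit while the pool only bounds greenlets.

import gevent
from lxml import etree

def crawl_base(url):
    # Fetch one base URL through the shared session and return the links on it.
    resp = session.get(url)
    html = etree.HTML(resp.content)
    if html is None:
        return []
    return [link for link in html.xpath('//a/@href') if link.startswith('http')]

# First pass: fetch the base pages, at most pool_size greenlets at a time.
base_jobs = [pool.spawn(crawl_base, url) for url in urls]
gevent.joinall(base_jobs)

# Second pass: download every discovered link through the same session and pool,
# so the adapter still caps the real connections at connection_limit.
for job in base_jobs:
    for link in (job.value or []):
        pool.spawn(session.get, link)
pool.join()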
10
I think the code below will do what you want. In my example I use BeautifulSoup instead of the link-stripping approach you had.
from bs4 import BeautifulSoup
import requests
import gevent
from gevent import monkey, pool
monkey.patch_all()
jobs = []
links = []
p = pool.Pool(10)
urls = [
    'http://www.google.com',
    # ... another 100 urls
]

def get_links(url):
    r = requests.get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text)
        links.extend(soup.find_all('a'))

for url in urls:
    jobs.append(p.spawn(get_links, url))

gevent.joinall(jobs)
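The code above only collects the anchor tags from the base pages; it does not yet download the links themselves, which was the nested part of the question. One way to add that second step while reusing the same 10-greenlet pool is sketched below (my addition, fetch is a made-up name):

def fetch(url):
    # Download one discovered link; the pool keeps at most 10 of these running.
    r = requests.get(url)
    print('%s: %s bytes: %r' % (url, len(r.content), r.status_code))

link_jobs = []
for a in links:
    href = a.get('href')
    if href and href.startswith('http'):
        link_jobs.append(p.spawn(fetch, href))

gevent.joinall(link_jobs)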