<p>I wrote an article about this. Pasting it here:</p>
<h2>Requests' secret: pool_connections and pool_maxsize</h2>
<p><a href="http://docs.python-requests.org/en/latest/" rel="noreferrer">Requests</a> is one of, if not the, most well-known Python third-party libraries among Python programmers. Thanks to its simple API and high performance, people tend to use requests rather than urllib2, which the standard library provides for HTTP requests. However, people who use requests every day may not know its internals, and today I want to introduce two of them: <code>pool_connections</code> and <code>pool_maxsize</code>.</p>
<p>Let's start with <code>Session</code>:</p>
<pre><code>import requests
s = requests.Session()
s.get('https://www.google.com')
</code></pre>
<p>Simple enough. You probably know requests' <code>Session</code> can persist cookies. Cool. But do you know <code>Session</code> has a <a href="http://docs.python-requests.org/en/latest/api/#requests.Session.mount" rel="noreferrer"><code>mount</code></a> method?</p>
<blockquote>
<p><code>mount(prefix, adapter)</code><br/>
Registers a connection adapter to a prefix.<br/>
Adapters are sorted in descending order by key length.</p>
</blockquote>
<p>No? Well, in fact this method is used when you <a href="https://github.com/kennethreitz/requests/blob/master/requests/sessions.py#L340-L341" rel="noreferrer">initialize a <code>Session</code> object</a>:</p>
<pre><code>class Session(SessionRedirectMixin):

    def __init__(self):
        ...
        # Default connection adapters.
        self.adapters = OrderedDict()
        self.mount('https://', HTTPAdapter())
        self.mount('http://', HTTPAdapter())
</code></pre>
<p>Here comes the interesting part. If you've read Ian Cordasco's article <a href="http://www.coglib.com/~icordasc/blog/2014/12/retries-in-requests.html" rel="noreferrer">Retries in Requests</a>, you should know that <code>HTTPAdapter</code> can be used to provide retry functionality. But what is an <code>HTTPAdapter</code> really? Quoting from the <a href="http://docs.python-requests.org/en/latest/api/#requests.adapters.HTTPAdapter" rel="noreferrer">doc</a>:</p>
<blockquote>
<p><code>class requests.adapters.HTTPAdapter(pool_connections=10, pool_maxsize=10, max_retries=0, pool_block=False)</code></p>
<p>The built-in HTTP Adapter for urllib3.</p>
<p>Provides a general-case interface for Requests sessions to contact HTTP and HTTPS urls by implementing the Transport Adapter interface. This class will usually be created by the Session class under the covers.</p>
<p>Parameters:</p>
<ul>
<li><code>pool_connections</code> – The number of urllib3 connection pools to cache.</li>
<li><code>pool_maxsize</code> – The maximum number of connections to save in the pool.</li>
<li><code>max_retries (int)</code> – The maximum number of retries each connection should attempt. Note, this applies only to failed DNS lookups, socket connections and connection timeouts, never to requests where data has made it to the server. By default, Requests does not retry failed connections. If you need granular control over the conditions under which we retry a request, import urllib3's Retry class and pass that instead.</li>
<li><code>pool_block</code> – Whether the connection pool should block for connections.</li>
</ul>
<p>Usage:</p>
</blockquote>
<pre><code>>>> import requests
>>> s = requests.Session()
>>> a = requests.adapters.HTTPAdapter(max_retries=3)
>>> s.mount('http://', a)
</code></pre>
<p>If the documentation above confuses you, here's my explanation: what an HTTP Adapter does is simply provide different configurations for different requests according to the target url. Remember the code above?</p>
<pre><code>self.mount('https://', HTTPAdapter())
self.mount('http://', HTTPAdapter())
</code></pre>
<p>It creates two <code>HTTPAdapter</code> objects with the default arguments <code>pool_connections=10, pool_maxsize=10, max_retries=0, pool_block=False</code> and mounts them to <code>https://</code> and <code>http://</code> respectively, which means the adapter mounted to <code>http://</code> will be used if you try to send a request to <code>http://xxx</code>, and the one mounted to <code>https://</code> will be used for requests to <code>https://xxx</code>. Though in this case the two configurations are the same, requests to <code>http</code> and <code>https</code> are still handled separately. We'll see what that means later.</p>
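<p>As a quick sketch of what mounting means in practice (the urls here are just placeholders), you can replace a default adapter with your own, and <code>Session.get_adapter</code> tells you which adapter a given url would use:</p>
<pre><code>import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
# Replace the default http:// adapter with a custom-configured one
http_adapter = HTTPAdapter(max_retries=3)
s.mount('http://', http_adapter)

# get_adapter returns the adapter whose prefix matches the url
assert s.get_adapter('http://example.com/') is http_adapter
assert s.get_adapter('https://example.com/') is not http_adapter
</code></pre>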
<p>As I said, the main purpose of this article is to explain <code>pool_connections</code> and <code>pool_maxsize</code>.</p>
<p>First let's look at <code>pool_connections</code>. Yesterday I asked a question about it on Stack Overflow because I wasn't sure whether my understanding was correct; the answer eliminated my uncertainty. As we all know, HTTP is based on the TCP protocol. An HTTP connection is also a TCP connection, which is identified by <strong>a tuple of five values</strong>:</p>
<pre><code>(<protocol>, <src addr>, <src port>, <dest addr>, <dest port>)
</code></pre>
<p>Say you've established an HTTP/TCP connection with <code>www.example.com</code>. Assuming the server supports <code>Keep-Alive</code>, the next time you send a request to <code>www.example.com/a</code> or <code>www.example.com/b</code>, you could just use the same connection, for none of the five values has changed. In fact, <a href="http://docs.python-requests.org/en/latest/user/advanced/#keep-alive" rel="noreferrer">requests' Session automatically does this for you</a> and will reuse connections as long as it can.</p>
<p>The question is, what determines whether you can reuse an old connection or not? Yes, <code>pool_connections</code>!</p>
<blockquote>
<p>pool_connections – The number of urllib3 connection pools to cache.</p>
</blockquote>
<p>I know, I know, I don't want to introduce a bunch of terminology either; this is the last one, I promise. For easy understanding: one connection pool corresponds to one host.</p>
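<p>Under the hood this pool-per-host bookkeeping is done by urllib3's <code>PoolManager</code>. Here's a minimal sketch using urllib3 directly; no network I/O happens, since <code>connection_from_host</code> only creates and caches pool objects:</p>
<pre><code>from urllib3 import PoolManager

# num_pools plays the same role as pool_connections does in requests
pm = PoolManager(num_pools=10)
baidu_pool = pm.connection_from_host('www.baidu.com', 443, scheme='https')
zhihu_pool = pm.connection_from_host('www.zhihu.com', 443, scheme='https')
cached_pool = pm.connection_from_host('www.baidu.com', 443, scheme='https')

assert baidu_pool is not zhihu_pool  # different host, different pool
assert baidu_pool is cached_pool     # same host, the cached pool is reused
</code></pre>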
<p>Here's an example (unrelated lines are ignored):</p>
<pre><code>s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1))
s.get('https://www.baidu.com')
s.get('https://www.zhihu.com')
s.get('https://www.baidu.com')
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.baidu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2621
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.baidu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
"""
</code></pre>
<p><code>HTTPAdapter(pool_connections=1)</code> is mounted to <code>https://</code>, which means only one connection pool exists at a time. After calling <code>s.get('https://www.baidu.com')</code>, the cached connection pool is <code>connectionpool('https://www.baidu.com')</code>. Now <code>s.get('https://www.zhihu.com')</code> comes along, and the session finds that it cannot use the previously cached connection because it's not the same host (one connection pool corresponds to one host, remember?). Therefore the session has to create a new connection pool, or connection if you like. Since <code>pool_connections=1</code>, the session cannot hold two connection pools at the same time, so it abandons the old one, <code>connectionpool('https://www.baidu.com')</code>, and keeps the new one, <code>connectionpool('https://www.zhihu.com')</code>. The next <code>get</code> is the same. That's why we see three <code>Starting new HTTPS connection</code> lines in the log.</p>
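<p>In case you want to reproduce log output like the above yourself, these examples assume debug logging was turned on beforehand, along these lines (on newer requests versions that no longer vendor urllib3, use the <code>'urllib3'</code> logger name instead):</p>
<pre><code>import logging

# Print urllib3's connection pool activity to stderr
logging.basicConfig(level=logging.DEBUG)
logging.getLogger('requests.packages.urllib3').setLevel(logging.DEBUG)
</code></pre>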
<p>What if we set <code>pool_connections</code> to 2:</p>
<pre><code>s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=2))
s.get('https://www.baidu.com')
s.get('https://www.zhihu.com')
s.get('https://www.baidu.com')
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.baidu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2623
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
"""
</code></pre>
<p>Great, now we only created connections twice and saved one connection-establishment time.</p>
<p>Finally, <code>pool_maxsize</code>.</p>
<p>First and foremost, you should care about <code>pool_maxsize</code> only when you use <code>Session</code> in a <strong>multithreaded</strong> environment, like making concurrent requests from multiple threads using the <strong>same</strong> <code>Session</code>.</p>
<p>Actually, <code>pool_maxsize</code> is an argument for initializing urllib3's <a href="http://urllib3.readthedocs.org/en/latest/pools.html#module-urllib3.connectionpool" rel="noreferrer"><code>HTTPConnectionPool</code></a>, which is exactly the connection pool we mentioned above.
<code>HTTPConnectionPool</code> is a container for a collection of connections to a specific host, and <code>pool_maxsize</code> is the number of connections to save that can be reused. If you're running your code in one thread, it's neither possible nor needed to create multiple connections to the same host, because the requests library is blocking, so HTTP requests are always sent one after another.</p>
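<p>You can see the hand-off in current requests versions, where the adapter stores both values (as private attributes, so this is implementation-dependent and may change between versions) and forwards them to urllib3's <code>PoolManager</code> as <code>num_pools</code> and <code>maxsize</code>:</p>
<pre><code>from requests.adapters import HTTPAdapter

adapter = HTTPAdapter(pool_connections=1, pool_maxsize=2)
# Private attributes; a sketch only, names are not part of the public API
assert adapter._pool_connections == 1
assert adapter._pool_maxsize == 2
</code></pre>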
<p>Things are different if there are multiple threads.</p>
<pre><code>from threading import Thread

def thread_get(url):
    s.get(url)

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=2))
t1 = Thread(target=thread_get, args=('https://www.zhihu.com',))
t2 = Thread(target=thread_get, args=('https://www.zhihu.com/question/36612174',))
t1.start();t2.start()
t1.join();t2.join()
t3 = Thread(target=thread_get, args=('https://www.zhihu.com/question/39420364',))
t4 = Thread(target=thread_get, args=('https://www.zhihu.com/question/21362402',))
t3.start();t4.start()
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (2): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/36612174 HTTP/1.1" 200 21906
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2606
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/21362402 HTTP/1.1" 200 57556
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/39420364 HTTP/1.1" 200 28739
"""
</code></pre>
<p>See? It established two connections for the same host; like I said, this can only happen in a multithreaded environment.
In this case, we created a connection pool with <code>pool_maxsize=2</code>, and there were no more than two connections at the same time, so that's enough.
We can also see that the requests from <code>t3</code> and <code>t4</code> did not create new connections; they reused the old ones.</p>
<p>What if the size isn't big enough?</p>
<pre><code>s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=1))
t1 = Thread(target=thread_get, args=('https://www.zhihu.com',))
t2 = Thread(target=thread_get, args=('https://www.zhihu.com/question/36612174',))
t1.start()
t2.start()
t1.join();t2.join()
t3 = Thread(target=thread_get, args=('https://www.zhihu.com/question/39420364',))
t4 = Thread(target=thread_get, args=('https://www.zhihu.com/question/21362402',))
t3.start();t4.start()
t3.join();t4.join()
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (2): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/36612174 HTTP/1.1" 200 21906
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2606
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (3): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/39420364 HTTP/1.1" 200 28739
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/21362402 HTTP/1.1" 200 57556
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: www.zhihu.com
"""
</code></pre>
<p>Now, <code>pool_maxsize=1</code>, and the warning came as expected:</p>
<pre><code>Connection pool is full, discarding connection: www.zhihu.com
</code></pre>
<p>We can also notice that since only one connection can be saved in this pool, a new connection is created again for either <code>t3</code> or <code>t4</code>. Obviously this is very inefficient. That's why urllib3's documentation says:</p>
<blockquote>
<p>If you’re planning on using such a pool in a multithreaded environment, you should set the maxsize of the pool to a higher number, such as the number of threads.</p>
</blockquote>
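<p>Following that advice, a session meant to be shared by a fixed number of worker threads could be set up like this (the thread count and mounted prefixes are just an example):</p>
<pre><code>import requests
from requests.adapters import HTTPAdapter

NUM_THREADS = 8  # match the number of threads sharing the session

s = requests.Session()
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=NUM_THREADS)
s.mount('https://', adapter)
s.mount('http://', adapter)

# Every url now resolves to the thread-sized adapter
assert s.get_adapter('https://www.zhihu.com/') is adapter
</code></pre>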
<p>Last but not least, <code>HTTPAdapter</code> instances mounted to different prefixes are independent.</p>
<pre><code>s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=2))
s.mount('https://baidu.com', HTTPAdapter(pool_connections=1, pool_maxsize=1))
t1 = Thread(target=thread_get, args=('https://www.zhihu.com',))
t2 = Thread(target=thread_get, args=('https://www.zhihu.com/question/36612174',))
t1.start();t2.start()
t1.join();t2.join()
t3 = Thread(target=thread_get, args=('https://www.zhihu.com/question/39420364',))
t4 = Thread(target=thread_get, args=('https://www.zhihu.com/question/21362402',))
t3.start();t4.start()
t3.join();t4.join()
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (2): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/36612174 HTTP/1.1" 200 21906
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2623
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/39420364 HTTP/1.1" 200 28739
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/21362402 HTTP/1.1" 200 57669
"""
</code></pre>
<p>The code above is easy to understand, so I won't explain it.</p>
<p>I guess that's all. Hope this article helps you understand requests better. BTW, I created a gist <a href="https://gist.github.com/laike9m/ead19c65a416c7022c00" rel="noreferrer">here</a> which contains all of the testing code used in this article. Just download it and play with it :)</p>
<h2>Appendix</h2>
<ol>
<li>For https, requests uses urllib3's <a href="http://urllib3.readthedocs.org/en/latest/pools.html#urllib3.connectionpool.HTTPSConnectionPool" rel="noreferrer">HTTPSConnectionPool</a>, but it's pretty much the same as HTTPConnectionPool, so I don't differentiate them in this article.</li>
<li><p><code>Session</code>'s <code>mount</code> method ensures the longest prefix gets matched first. Its implementation is pretty interesting, so I've posted it here.</p>
<pre><code>def mount(self, prefix, adapter):
    """Registers a connection adapter to a prefix.

    Adapters are sorted in descending order by key length."""
    self.adapters[prefix] = adapter
    keys_to_move = [k for k in self.adapters if len(k) < len(prefix)]
    for key in keys_to_move:
        self.adapters[key] = self.adapters.pop(key)
</code></pre>
<p>Note that <code>self.adapters</code> is an <code>OrderedDict</code>.</p></li>
</ol>
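<p>To see the longest-prefix matching in action (the host names are placeholders), mount an extra adapter on a longer prefix and check which one wins:</p>
<pre><code>import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
special = HTTPAdapter(pool_connections=1, pool_maxsize=1)
s.mount('https://example.com', special)

# The longer prefix beats the default 'https://' one
assert s.get_adapter('https://example.com/page') is special
# Other https urls still fall through to the default adapter
assert s.get_adapter('https://www.zhihu.com/') is not special
</code></pre>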