What does pool_connections mean in requests.adapters.HTTPAdapter?


When initializing a requests Session, two HTTPAdapter objects will be created and mounted to http:// and https://.

This is the definition of HTTPAdapter:

class requests.adapters.HTTPAdapter(pool_connections=10, pool_maxsize=10,
                                    max_retries=0, pool_block=False)

While I understand the meaning of pool_maxsize (the number of sessions a pool can save), I don't understand what pool_connections means or what it does. The docs say:

Parameters: 
pool_connections – The number of urllib3 connection pools to cache.

But what does "cache" mean here? And what's the point of using multiple connection pools?


3 Answers

Requests uses urllib3 to manage its connections and other features.

Reusing connections is an important factor in keeping performance up for repeated HTTP requests. The urllib3 README explains:

Why do I want to reuse connections?

Performance. When you normally do a urllib call, a separate socket connection is created with each request. By reusing existing sockets (supported since HTTP 1.1), the requests will take up less resources on the server's end, and also provide a faster response time at the client's end. [...]

To answer your question, pool_maxsize is the number of connections to keep around per host (this is useful for multi-threaded applications), whereas pool_connections is the number of host-pools to keep. For example, if you're connecting to 100 different hosts and pool_connections=10, then only the latest 10 hosts' connections will be re-used.
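
For instance, a minimal sketch (the numbers are illustrative) that raises the number of cached host pools for a session that talks to many hosts:

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
# Cache pools for up to 100 distinct hosts, each saving up to
# 10 reusable connections (the per-host number matters with threads).
adapter = HTTPAdapter(pool_connections=100, pool_maxsize=10)
s.mount('https://', adapter)
s.mount('http://', adapter)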

Thanks to @laike9m for the existing Q&A and article, but the existing answers fail to mention the subtleties of pool_maxsize and its relation to multithreaded code.

Summary

  • pool_connections is the number of endpoint (host, port, scheme) connections that can be kept alive in the pool at a given time. If you want to keep a maximum of n open TCP connections around in a pool for reuse with a Session, you want pool_connections=n.
  • pool_maxsize is effectively irrelevant to users of requests, because the default value of pool_block (in requests.adapters.HTTPAdapter) is False rather than True; see the sketch after this list.
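
If you do want a hard cap on open connections, pool_block=True makes the pool wait for a free connection rather than opening an extra, throwaway one. A minimal sketch (the numbers are illustrative):

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
# With pool_block=True, no more than pool_maxsize connections per host
# will be open at once; extra threads wait instead of creating more.
s.mount('https://', HTTPAdapter(pool_connections=10, pool_maxsize=10,
                                pool_block=True))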

Details

As correctly pointed out here, pool_connections is the maximum number of open connections for a given adapter prefix. It's best illustrated through example:

>>> import requests
>>> from requests.adapters import HTTPAdapter
>>> 
>>> from urllib3 import add_stderr_logger
>>> 
>>> add_stderr_logger()  # Turn on requests.packages.urllib3 logging
2018-12-21 20:44:03,979 DEBUG Added a stderr logging handler to logger: urllib3
<StreamHandler <stderr> (NOTSET)>
>>> 
>>> s = requests.Session()
>>> s.mount('https://', HTTPAdapter(pool_connections=1))
>>> 
>>> # 4 consecutive requests to (github.com, 443, https)
... # A new HTTPS (TCP) connection will be established only on the first conn.
... s.get('https://github.com/requests/requests/blob/master/requests/adapters.py')
2018-12-21 20:44:03,982 DEBUG Starting new HTTPS connection (1): github.com:443
2018-12-21 20:44:04,381 DEBUG https://github.com:443 "GET /requests/requests/blob/master/requests/adapters.py HTTP/1.1" 200 None
<Response [200]>
>>> s.get('https://github.com/requests/requests/blob/master/requests/packages.py')
2018-12-21 20:44:04,548 DEBUG https://github.com:443 "GET /requests/requests/blob/master/requests/packages.py HTTP/1.1" 200 None
<Response [200]>
>>> s.get('https://github.com/urllib3/urllib3/blob/master/src/urllib3/__init__.py')
2018-12-21 20:44:04,881 DEBUG https://github.com:443 "GET /urllib3/urllib3/blob/master/src/urllib3/__init__.py HTTP/1.1" 200 None
<Response [200]>
>>> s.get('https://github.com/python/cpython/blob/master/Lib/logging/__init__.py')
2018-12-21 20:44:06,533 DEBUG https://github.com:443 "GET /python/cpython/blob/master/Lib/logging/__init__.py HTTP/1.1" 200 None
<Response [200]>

Above, the maximum number of connections is 1, and it's to (github.com, 443, https). If you were to request a resource from a new (host, port, scheme) triple, the Session would internally dump the existing connection to make room for the new one:

>>> s.get('https://www.rfc-editor.org/info/rfc4045')
2018-12-21 20:46:11,340 DEBUG Starting new HTTPS connection (1): www.rfc-editor.org:443
2018-12-21 20:46:12,185 DEBUG https://www.rfc-editor.org:443 "GET /info/rfc4045 HTTP/1.1" 200 6707
<Response [200]>
>>> s.get('https://www.rfc-editor.org/info/rfc4046')
2018-12-21 20:46:12,667 DEBUG https://www.rfc-editor.org:443 "GET /info/rfc4046 HTTP/1.1" 200 6862
<Response [200]>
>>> s.get('https://www.rfc-editor.org/info/rfc4047')
2018-12-21 20:46:13,837 DEBUG https://www.rfc-editor.org:443 "GET /info/rfc4047 HTTP/1.1" 200 6762
<Response [200]>

You could crank this number up to pool_connections=2, then cycle between 3 unique host combinations, and you'd see the same thing at play. (One other thing to note: the session will retain and send cookies in this same manner.)

Now for pool_maxsize, which gets passed to urllib3.poolmanager.PoolManager and ultimately on to urllib3.connectionpool.HTTPSConnectionPool. The docstring for maxsize is:

Number of connections to save that can be reused. More than 1 is useful in multithreaded situations. If block is set to False, more connections will be created but they will not be saved once they've been used.

As an aside, block=False is the default for HTTPAdapter, even though the default is True for HTTPConnectionPool. This implies that pool_maxsize has next to no effect for HTTPAdapter.

Furthermore, requests.Session() is not thread-safe; you shouldn't use the same session instance from multiple threads (see here and here). If you really must, the safer route is to lend each thread its own localized session instance via threading.local(), then use that session to make requests over multiple URLs:

import threading
import requests

local = threading.local()  # values will be different for separate threads.

vars(local)  # initially empty: no attributes have been set in this thread.


def get_or_make_session(**adapter_kwargs):
    # `local` will effectively vary based on the thread that is calling it
    print('get_or_make_session() called from id:', threading.get_ident())

    if not hasattr(local, 'session'):
        session = requests.Session()
        adapter = requests.adapters.HTTPAdapter(**adapter_kwargs)
        session.mount('http://', adapter)
        session.mount('https://', adapter)
        local.session = session
    return local.session
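
A hedged usage sketch of the helper above (the URL and adapter settings are illustrative): each thread lazily builds its own Session on first call and reuses it afterwards.

import threading

def worker(url):
    session = get_or_make_session(pool_connections=1, pool_maxsize=1)
    resp = session.get(url)
    print(threading.get_ident(), resp.status_code)

threads = [threading.Thread(target=worker,
                            args=('https://www.rfc-editor.org/info/rfc4045',))
           for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()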

I wrote an article about this. Pasting it here:

Requests' secret: pool_connections and pool_maxsize

Requests is one of the most well-known Python third-party libraries among Python programmers, if not the most. Thanks to its simple API and high performance, people tend to use requests rather than urllib2, which the standard library provides for HTTP requests. However, people who use requests every day may not know its internals, and today I want to introduce two of them: pool_connections and pool_maxsize.

Let's start with Session:

import requests

s = requests.Session()
s.get('https://www.google.com')

Pretty simple. And you probably know that requests' Session can persist cookies. Cool. But do you know that Session has a mount method?

mount(prefix, adapter)
Registers a connection adapter to a prefix.
Adapters are sorted in descending order by key length.

No? Well, in fact this method is used when you initialize a Session object:

class Session(SessionRedirectMixin):

    def __init__(self):
        ...
        # Default connection adapters.
        self.adapters = OrderedDict()
        self.mount('https://', HTTPAdapter())
        self.mount('http://', HTTPAdapter())

Here comes the interesting part. If you've read Ian Cordasco's article Retries in Requests, you should know that HTTPAdapter can be used to provide retry functionality. But what is an HTTPAdapter really? Quoting from the doc:

class requests.adapters.HTTPAdapter(pool_connections=10, pool_maxsize=10, max_retries=0, pool_block=False)

The built-in HTTP Adapter for urllib3.

Provides a general-case interface for Requests sessions to contact HTTP and HTTPS urls by implementing the Transport Adapter interface. This class will usually be created by the Session class under the covers.

Parameters:
  • pool_connections – The number of urllib3 connection pools to cache.
  • pool_maxsize – The maximum number of connections to save in the pool.
  • max_retries (int) – The maximum number of retries each connection should attempt. Note, this applies only to failed DNS lookups, socket connections and connection timeouts, never to requests where data has made it to the server. By default, Requests does not retry failed connections. If you need granular control over the conditions under which we retry a request, import urllib3's Retry class and pass that instead.
  • pool_block – Whether the connection pool should block for connections.

Usage:

>>> import requests
>>> s = requests.Session()
>>> a = requests.adapters.HTTPAdapter(max_retries=3)
>>> s.mount('http://', a)

If the documentation above confuses you, here's my explanation: what an HTTP adapter does is simply provide different configurations for different requests according to the target URL. Remember the code above?

self.mount('https://', HTTPAdapter())
self.mount('http://', HTTPAdapter())

It creates two HTTPAdapter objects with the default arguments pool_connections=10, pool_maxsize=10, max_retries=0, pool_block=False and mounts them to https:// and http:// respectively, which means the configuration of the first HTTPAdapter() will be used if you try to send a request to https://xxx, and the second HTTPAdapter() will be used for requests to http://xxx. Though in this case the two configurations are identical, requests to http and https are still handled separately. We'll see what that means later.
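
To make the per-prefix idea concrete, here's a small sketch (the retry counts are arbitrary) that gives plain-HTTP and HTTPS requests different configurations:

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
# Requests to http://... get 5 connect retries; https://... gets none.
s.mount('http://', HTTPAdapter(max_retries=5))
s.mount('https://', HTTPAdapter(max_retries=0))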

As I said, the main purpose of this article is to explain pool_connections and pool_maxsize.

First let's look at pool_connections. Yesterday I raised a question on Stack Overflow because I wasn't sure whether my understanding was correct; the answer removed my uncertainty. As we all know, HTTP is based on the TCP protocol. An HTTP connection is also a TCP connection, which is identified by a tuple of five values:

(<protocol>, <src addr>, <src port>, <dest addr>, <dest port>)

Say you've established an HTTP/TCP connection with www.example.com. Assuming the server supports Keep-Alive, the next time you send a request to www.example.com/a or www.example.com/b, you could just use the same connection, because none of the five values has changed. In fact, requests' Session automatically does this for you and will reuse connections as long as it can.

The question is, what determines whether you can reuse an old connection? Yes, pool_connections!

pool_connections – The number of urllib3 connection pools to cache.

I know, I know, I don't want to introduce a pile of terminology either; this is the last one, I promise. For easy understanding: one connection pool corresponds to one host.

Here's an example (unrelated lines omitted):

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1))
s.get('https://www.baidu.com')
s.get('https://www.zhihu.com')
s.get('https://www.baidu.com')

"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.baidu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2621
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.baidu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
"""

HTTPAdapter(pool_connections=1) is mounted to https://, which means only one connection pool exists at a time. After calling s.get('https://www.baidu.com'), the cached connection pool is connectionpool('https://www.baidu.com'). Now s.get('https://www.zhihu.com') comes along, and the session finds that it cannot use the previously cached connection because it's not the same host (one connection pool corresponds to one host, remember?). Therefore the session has to create a new connection pool, or connection if you like. Since pool_connections=1, the session cannot hold two connection pools at the same time, so it abandons the old connectionpool('https://www.baidu.com') and keeps the new connectionpool('https://www.zhihu.com'). The next get is the same. This is why we see three Starting new HTTPS connection lines in the log.

What if we set pool_connections to 2?

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=2))
s.get('https://www.baidu.com')
s.get('https://www.zhihu.com')
s.get('https://www.baidu.com')
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.baidu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2623
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 None
"""

Great. Now we only created connections twice, saving one connection-establishment round.

Finally, pool_maxsize.

First and foremost, you should only care about pool_maxsize if you use a Session in a multithreaded environment, e.g. making concurrent requests from multiple threads with the same Session.

Actually, pool_maxsize is an argument used to initialize urllib3's HTTPConnectionPool, which is exactly the connection pool we mentioned above. HTTPConnectionPool is a container for a collection of connections to a specific host, and pool_maxsize is the number of connections to save that can be reused. If you're running your code in one thread, it's neither possible nor needed to create multiple connections to the same host, because the requests library is blocking, so HTTP requests are always sent one after another.
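
To see that object without requests in the way, here's a minimal urllib3 sketch (the host is illustrative): one pool serves one host, and maxsize bounds how many finished connections it will save for reuse.

from urllib3 import HTTPSConnectionPool

pool = HTTPSConnectionPool('www.zhihu.com', port=443, maxsize=2)
resp = pool.request('GET', '/')  # checks a connection out of the pool
print(resp.status)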

Things are different if there are multiple threads.

import requests
from requests.adapters import HTTPAdapter
from threading import Thread


def thread_get(url):
    # Each thread shares the module-level session `s`.
    s.get(url)

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=2))
t1 = Thread(target=thread_get, args=('https://www.zhihu.com',))
t2 = Thread(target=thread_get, args=('https://www.zhihu.com/question/36612174',))
t1.start(); t2.start()
t1.join(); t2.join()
t3 = Thread(target=thread_get, args=('https://www.zhihu.com/question/39420364',))
t4 = Thread(target=thread_get, args=('https://www.zhihu.com/question/21362402',))
t3.start(); t4.start()
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (2): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/36612174 HTTP/1.1" 200 21906
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2606
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/21362402 HTTP/1.1" 200 57556
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/39420364 HTTP/1.1" 200 28739
"""

See that? It established two connections for the same host; like I said, this can only happen in a multithreaded environment. In this case we created a connection pool with pool_maxsize=2, and there were never more than two connections at the same time, so that's enough. We can see that the requests from t3 and t4 did not create new connections; they reused the old ones.

What if the size isn't big enough?

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=1))
t1 = Thread(target=thread_get, args=('https://www.zhihu.com',))
t2 = Thread(target=thread_get, args=('https://www.zhihu.com/question/36612174',))
t1.start()
t2.start()
t1.join(); t2.join()
t3 = Thread(target=thread_get, args=('https://www.zhihu.com/question/39420364',))
t4 = Thread(target=thread_get, args=('https://www.zhihu.com/question/21362402',))
t3.start(); t4.start()
t3.join(); t4.join()
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (2): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/36612174 HTTP/1.1" 200 21906
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2606
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (3): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/39420364 HTTP/1.1" 200 28739
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/21362402 HTTP/1.1" 200 57556
WARNING:requests.packages.urllib3.connectionpool:Connection pool is full, discarding connection: www.zhihu.com
"""

Now pool_maxsize=1, and the warning arrives as expected:

Connection pool is full, discarding connection: www.zhihu.com

We can also notice that, since only one connection can be saved in this pool, a new connection is created again for t3 or t4. Obviously this is very inefficient. That's why urllib3's documentation says:

If you’re planning on using such a pool in a multithreaded environment, you should set the maxsize of the pool to a higher number, such as the number of threads.
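
Applied to requests, that advice means sizing pool_maxsize to match your worker count. A sketch assuming 10 worker threads all hitting one host:

import requests
from requests.adapters import HTTPAdapter

NUM_THREADS = 10  # assumed number of worker threads

s = requests.Session()
# Save enough connections for every thread to get its own.
s.mount('https://', HTTPAdapter(pool_connections=1,
                                pool_maxsize=NUM_THREADS))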

Last but not least, HTTPAdapter instances mounted to different prefixes are independent.

s = requests.Session()
s.mount('https://', HTTPAdapter(pool_connections=1, pool_maxsize=2))
s.mount('https://baidu.com', HTTPAdapter(pool_connections=1, pool_maxsize=1))
t1 = Thread(target=thread_get, args=('https://www.zhihu.com',))
t2 = Thread(target=thread_get, args=('https://www.zhihu.com/question/36612174',))
t1.start(); t2.start()
t1.join(); t2.join()
t3 = Thread(target=thread_get, args=('https://www.zhihu.com/question/39420364',))
t4 = Thread(target=thread_get, args=('https://www.zhihu.com/question/21362402',))
t3.start(); t4.start()
t3.join(); t4.join()
"""output
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): www.zhihu.com
INFO:requests.packages.urllib3.connectionpool:Starting new HTTPS connection (2): www.zhihu.com
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/36612174 HTTP/1.1" 200 21906
DEBUG:requests.packages.urllib3.connectionpool:"GET / HTTP/1.1" 200 2623
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/39420364 HTTP/1.1" 200 28739
DEBUG:requests.packages.urllib3.connectionpool:"GET /question/21362402 HTTP/1.1" 200 57669
"""

The code above is easy to understand, so I won't explain it.

That's about it. I hope this article helps you understand requests better. By the way, I created a gist here which contains all of the test code used in this article. Just download it and play with it :)

Appendix

  1. For https, requests uses urllib3's HTTPSConnectionPool, but it's pretty much the same as HTTPConnectionPool, so I don't differentiate between them in this article.
  2. Session's mount method ensures the longest prefix gets matched first. Its implementation is pretty interesting, so I've posted it here:

    def mount(self, prefix, adapter):
        """Registers a connection adapter to a prefix.
        Adapters are sorted in descending order by key length."""
        self.adapters[prefix] = adapter
        keys_to_move = [k for k in self.adapters if len(k) < len(prefix)]
        for key in keys_to_move:
            self.adapters[key] = self.adapters.pop(key)
    

    Note that self.adapters is an OrderedDict.
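
You can verify the longest-prefix-wins behavior with Session.get_adapter. A quick sketch (the prefixes are illustrative):

import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
special = HTTPAdapter(max_retries=3)
s.mount('https://github.com', special)

# The longer prefix wins for matching URLs...
assert s.get_adapter('https://github.com/psf/requests') is special
# ...while other https URLs fall back to the default adapter.
assert s.get_adapter('https://www.zhihu.com') is not special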
