Example of urllib3 and threading in Python
I am trying to use urllib3 inside simple threads to fetch a couple of wiki pages.
The script creates one connection per thread (I don't understand why) and then hangs forever.
Any tips, advice, or a simple example of urllib3 with threading?
import threadpool
from urllib3 import connection_from_url

HTTP_POOL = connection_from_url(url, timeout=10.0, maxsize=10, block=True)

def fetch(url, fields):
    kwargs = {'retries': 6}
    return HTTP_POOL.get_url(url, fields, **kwargs)

pool = threadpool.ThreadPool(5)
requests = threadpool.makeRequests(fetch, iterable)
[pool.putRequest(req) for req in requests]
@Lennart's script gave this error (the URLs printed by the worker threads are interleaved with the tracebacks, and the same traceback is repeated once per URL):

http://en.wikipedia.org/wiki/2010-11_Premier_League
http://en.wikipedia.org/wiki/List_of_MythBusters_episodes
http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes
http://en.wikipedia.org/wiki/List_of_Unicode_characters
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/threadpool.py", line 156, in run
    result = request.callable(*request.args, **request.kwds)
  File "crawler.py", line 9, in fetch
    print url, conn.get_url(url)
AttributeError: 'HTTPConnectionPool' object has no attribute 'get_url'
After adding import threadpool; import urllib3 and tpool = threadpool.ThreadPool(4), @user318904's code gave this error:
Traceback (most recent call last):
  File "crawler.py", line 21, in <module>
    tpool.map_async(fetch, urls)
AttributeError: ThreadPool instance has no attribute 'map_async'
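For context, map_async is a method of multiprocessing.pool.ThreadPool (also available as multiprocessing.dummy.Pool), not of the threadpool package, which is why mixing the two raises exactly this AttributeError. Below is only a minimal sketch of the approach presumably intended, using the multiprocessing flavour of the pool together with urllib3; the fetch helper and the two example URLs are my own placeholders, not the original code.

import urllib3
from multiprocessing.pool import ThreadPool  # this ThreadPool does provide map/map_async

# One pool manager shared by all worker threads (urllib3's pools are thread-safe).
http = urllib3.PoolManager(maxsize=4)

def fetch(url):
    # Placeholder worker: fetch the page and return its status and size.
    r = http.request('GET', url, retries=6)
    return url, r.status, len(r.data)

urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters']

tpool = ThreadPool(4)
result = tpool.map_async(fetch, urls)    # returns an AsyncResult immediately
for url, status, size in result.get():   # block until all fetches have finished
    print(url, status, size)
tpool.close()
tpool.join()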
4 Answers
Thread programming is hard, so I wrote workerpool to make exactly what you are doing easier.
More specifically, see the Mass Downloader example.
To do the same thing with urllib3, it looks something like this:
import urllib3
import workerpool

# One connection pool shared by all worker threads ("foo" stands in for the
# base URL of the site you are downloading from).
http = urllib3.connection_from_url("foo", maxsize=3)

def download(url):
    r = http.request('GET', url)  # older urllib3 releases spelled this get_url(url)
    # TODO: Do something with r.data
    print("Downloaded %s" % url)

# Initialize a pool, 5 threads in this case
pool = workerpool.WorkerPool(size=5)

# The ``download`` method will be called with a line from the second
# parameter for each job.
pool.map(download, open("urls.txt").readlines())

# Send shutdown jobs to all threads, and wait until all the jobs have been completed
pool.shutdown()
pool.wait()
If you want to see more sophisticated code, have a look at workerpool.EquippedWorker (and the tests there for examples of how to use it). You can make the pool be the toolbox you pass in.
Obviously it creates one connection per thread; how else would each thread be able to fetch a page? And you are trying to use the same connection, made from one single URL, for all URLs. That can hardly be what you intended.
This code worked just fine:
import threadpool
from urllib3 import connection_from_url

def fetch(url):
    kwargs = {'retries': 6}
    conn = connection_from_url(url, timeout=10.0, maxsize=10, block=True)
    print url, conn.get_url(url)
    print "Done!"

pool = threadpool.ThreadPool(4)

urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
        'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes',
        'http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters',
        ]

requests = threadpool.makeRequests(fetch, urls)
[pool.putRequest(req) for req in requests]
pool.wait()
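If the goal is to actually reuse connections between threads rather than to build a fresh pool for every URL, one alternative is to share a single PoolManager (urllib3's connection pools are thread-safe). This is only a sketch of that variation, not part of the answer above; it keeps the same threadpool usage and a shortened URL list.

import threadpool
import urllib3

# One manager for all threads; urllib3 keeps one connection pool per host, so
# the Wikipedia URLs end up sharing connections to en.wikipedia.org.
http = urllib3.PoolManager(maxsize=4, timeout=10.0, retries=6)

def fetch(url):
    r = http.request('GET', url)
    print url, r.status

pool = threadpool.ThreadPool(4)
urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters']
requests = threadpool.makeRequests(fetch, urls)
[pool.putRequest(req) for req in requests]
pool.wait()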
Here is my take on the problem, a more modern solution using Python 3 and concurrent.futures.ThreadPoolExecutor.
import urllib3
from concurrent.futures import ThreadPoolExecutor

urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
        'http://en.wikipedia.org/wiki/List_of_MythBusters_episodes',
        'http://en.wikipedia.org/wiki/List_of_Top_Gear_episodes',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters',
        ]

def download(url, cmanager):
    response = cmanager.request('GET', url)
    if response and response.status == 200:
        print("+++++++++ url: " + url)
        print(response.data[:1024])

connection_mgr = urllib3.PoolManager(maxsize=5)
thread_pool = ThreadPoolExecutor(5)
for url in urls:
    thread_pool.submit(download, url, connection_mgr)
Some remarks:
- My code is based on a similar example from the Python Cookbook by Beazley and Jones.
- I particularly like the fact that, apart from urllib3, only a standard module is needed.
- The setup is extremely straightforward, and if you only go for side effects in download (such as printing or saving to a file), there is no extra effort needed to join the threads.
- If you want something different, ThreadPoolExecutor.submit actually returns whatever download would return, wrapped in a Future object (see the sketch after this list).
- I found it helpful to align the number of threads in the thread pool with the number of HTTPConnection objects in the connection pool (set via maxsize). Otherwise you may run into some (harmless) warnings when all threads try to access the same server (as in the example).
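To illustrate that last point about Future objects, here is a small variation of the code above (the return value in download and the use of as_completed are additions of mine, not part of the original answer): download returns the page size, and the main thread collects the results as the futures complete.

import urllib3
from concurrent.futures import ThreadPoolExecutor, as_completed

urls = ['http://en.wikipedia.org/wiki/2010-11_Premier_League',
        'http://en.wikipedia.org/wiki/List_of_Unicode_characters']

def download(url, cmanager):
    # Return a value instead of printing, so the caller can use the Future.
    response = cmanager.request('GET', url)
    return url, response.status, len(response.data)

# Keep the number of threads aligned with maxsize, as noted above.
connection_mgr = urllib3.PoolManager(maxsize=2)
with ThreadPoolExecutor(max_workers=2) as thread_pool:
    futures = [thread_pool.submit(download, url, connection_mgr) for url in urls]
    for future in as_completed(futures):
        url, status, size = future.result()  # re-raises any exception from download
        print("%s -> %d (%d bytes)" % (url, status, size))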