urllib2 and threading

Asked 2025-04-17 11:06

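I am trying to follow the multithreading example from this question: "Python urllib2.urlopen() is slow, need a better way to read several urls". However, I seem to be getting a "thread error", and I don't understand what it means.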

import Queue
import threading
import urllib2
from urllib2 import HTTPError

urlList=[list of urls to be fetched]*100
def read_url(url, queue):
    my_data=[]
    try:
        data = urllib2.urlopen(url,None,15).read()
        print('Fetched %s from %s' % (len(data), url))
        my_data.append(data)
        queue.put(data)
    except HTTPError, e:
        # retry once, without a timeout
        data = urllib2.urlopen(url).read()
        print('Fetched %s from %s' % (len(data), url))
        my_data.append(data)
        queue.put(data)

def fetch_parallel():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args = (url,result)) for url in urlList]
    for t in threads:
      t.start()
    for t in threads:
      t.join()
    return result

res=[]  
res=fetch_parallel()
reslist = []
while not res.empty: reslist.append(res.get())
print (reslist)

At first I got the following error:

Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
File "demo.py", line 76, in read_url
print('Fetched %s from %s' % (len(data), url))
TypeError: object of type 'instancemethod' has no len()

On the other hand, sometimes the data does get fetched, but then a second error appears:

Traceback (most recent call last):
File "demo.py", line 89, in <module>
print str(res[0])
AttributeError: Queue instance has no attribute '__getitem__'

When data is fetched, why do the results not show up in res[]? Thanks for your time.

Update: After changing read to read(), things improved (I can now fetch many pages), but I still get an error:

Exception in thread Thread-86:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
File "demo.py", line 75, in read_url
data = urllib2.urlopen(url).read()
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 397, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 510, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 429, in error
result = self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 369, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 605, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "/usr/lib/python2.7/urllib2.py", line 397, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 510, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 435, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 369, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 518, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 502: Bad Gateway

1 Answer

Note that urllib2 is not thread-safe, so using it from multiple threads can cause problems. Consider using urllib3 instead.

Some of your problems have nothing to do with threading. Threads only make the error reporting more complicated. For example, instead of

data = urllib2.urlopen(url).read

what you want is

data = urllib2.urlopen(url).read()
#                               ^^
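
The first traceback in the question follows directly from the missing parentheses: without them, `data` is the method object itself rather than the bytes it would return. A minimal sketch of the difference, written for Python 3 and using an in-memory `io.BytesIO` as a stand-in for the response (an assumption for illustration; `urlopen` returns a different file-like object):

```python
import io

# Stand-in for the file-like object that urlopen() returns (illustrative only).
response = io.BytesIO(b"<html>hello</html>")

method = response.read      # no call: this is the method object itself
try:
    len(method)
    length_error = None
except TypeError as exc:
    length_error = exc      # a method has no length, hence the TypeError

data = response.read()      # call: this is the actual payload
payload_size = len(data)
print(type(method).__name__, payload_size)
```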

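A 502 Bad Gateway error indicates a server misconfiguration (most likely, an internal server behind the web service you are connecting to is restarting or unavailable). There is nothing you can do about it: the URL is simply unreachable right now. Use try..except to handle such errors, for example by printing a diagnostic message, by scheduling another attempt after an appropriate waiting period, or by leaving the failed item out of the results.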
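
One way the "print a diagnostic, wait, then try again" option might look is sketched below. It is written for Python 3, where `urllib2` became `urllib.request` and `urllib.error`. The function name `fetch_with_retry` and its `attempts`, `delay`, `opener` and `sleep` parameters are hypothetical, introduced only for illustration; the 15 second timeout is taken from the question.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, attempts=3, delay=1.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the body of `url`, or None if every attempt fails.

    All parameters except `url` are illustrative, not part of the
    original code.
    """
    for attempt in range(1, attempts + 1):
        try:
            return opener(url, timeout=15).read()
        except urllib.error.HTTPError as exc:
            # e.g. 502 Bad Gateway: report it, then wait and try again
            print('Attempt %d for %s failed: HTTP %d'
                  % (attempt, url, exc.code))
            if attempt < attempts:
                sleep(delay * attempt)  # wait a little longer each time
    return None  # the caller decides what to do with an unreachable URL
```

Returning `None` instead of raising keeps a worker thread alive when a URL stays unreachable; whether that is the right choice depends on the application.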

To get the values out of the queue, you can do this:

res = fetch_parallel()
reslist = []
while not res.empty():
  reslist.append(res.get_nowait()) # or get, doesn't matter here
print (reslist)
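
The loop in the question tested `res.empty` without calling it. A method object is always truthy, so `not res.empty` is always false and the loop body never runs, no matter how many items are queued. A short sketch of both versions, written for Python 3 (where the module is named `queue` rather than `Queue`):

```python
import queue

res = queue.Queue()
for item in ('a', 'b', 'c'):
    res.put(item)

# Bug: res.empty is a method object, which is always truthy,
# so this loop body is never executed.
skipped = []
while not res.empty:
    skipped.append(res.get())

# Fix: call the method so the loop sees the queue's real state.
reslist = []
while not res.empty():
    reslist.append(res.get_nowait())

print(skipped)
print(reslist)
```

Draining into a list also addresses the second traceback: a queue does not support indexing such as `res[0]`, but the resulting list does.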

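If a URL is really unreachable, error handling is a must. Simply re-requesting may work in some cases, but you have to be able to handle the case where the remote host genuinely cannot be reached at that moment. How you handle that depends on your application's logic.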
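
As one possible shape for this, the worker can record failures next to successes so that an unreachable URL neither kills its thread nor disappears silently. The sketch below is written for Python 3 and is only one option among many. The `(url, data, error)` tuple layout, the `opener` parameter, and the two returned lists are assumptions made for illustration, not part of the original code.

```python
import queue
import threading
import urllib.error
import urllib.request


def read_url(url, results, opener=urllib.request.urlopen):
    """Put a (url, data, error) tuple on `results`; error is None on success."""
    try:
        data = opener(url, timeout=15).read()
        results.put((url, data, None))
    except (urllib.error.URLError, OSError) as exc:
        # HTTPError (e.g. 502) is a subclass of URLError;
        # OSError also covers socket timeouts.
        results.put((url, None, exc))


def fetch_parallel(urls, opener=urllib.request.urlopen):
    """Fetch all URLs in parallel; return (fetched, failed) lists."""
    results = queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, results, opener))
               for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    fetched, failed = [], []
    while not results.empty():
        url, data, error = results.get_nowait()
        if error is None:
            fetched.append((url, data))
        else:
            failed.append((url, error))
    return fetched, failed
```

The caller can then decide what to do with `failed`: log it, retry later, or ignore it.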
