Fetching a web page through a proxy in Python

Asked 2025-04-15 20:55

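I'm trying to write a Python function that fetches a web page through a public anonymous proxy, but I'm running into a rather strange error.
Here is my code (I'm using Python 2.4):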

import urllib2    
def get_source_html_proxy(url, pip, timeout):
# timeout in seconds (maximum number of seconds willing for the code to wait in
# case there is a proxy that is not working, then it gives up) 
    proxy_handler = urllib2.ProxyHandler({'http': pip})
    opener = urllib2.build_opener(proxy_handler)
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    urllib2.install_opener(opener)
    req=urllib2.Request(url)
    sock=urllib2.urlopen(req)
    timp=0 # a counter that is going to measure the time until the result (webpage) is
           # returned
    timpLimita=50000000 * timeout # set before the loop so it is always defined
    while 1:
        data = sock.read(1024)
        timp=timp+1
        if len(data) < 1024: break
        if timp==timpLimita: # 5 millions is about 1 second
            break
    if timp==timpLimita:
        print pip + ": Connection is working, but the webpage is fetched in more than " + str(timeout) + " seconds. This proxy returns the following IP: " + str(data)
        return str(data)
    else:
        print "This proxy " + pip + "= good proxy. " + "It returns the following IP: " + str(data)
        return str(data)
# Now, I call the function to test it for one single proxy (IP:port) that does not support user and password (a public high anonymity proxy)
#(I put a proxy that I know is working - slow, but is working)
rez=get_source_html_proxy("http://www.whatismyip.com/automation/n09230945.asp", "93.84.221.248:3128", 50)
print rez

Error message:

Traceback (most recent call last):
  File "./public_html/cgi-bin/teste5.py", line 43, in ?
    rez=get_source_html_proxy("http://www.whatismyip.com/automation/n09230945.asp", "xx.yy.zzz.ww:3128", 50)
  File "./public_html/cgi-bin/teste5.py", line 18, in get_source_html_proxy
    sock=urllib2.urlopen(req)
  File "/usr/lib64/python2.4/urllib2.py", line 130, in urlopen
    return _opener.open(url, data)
  File "/usr/lib64/python2.4/urllib2.py", line 358, in open
    response = self._open(req, data)
  File "/usr/lib64/python2.4/urllib2.py", line 376, in _open
    '_open', req)
  File "/usr/lib64/python2.4/urllib2.py", line 337, in _call_chain
    result = func(*args)
  File "/usr/lib64/python2.4/urllib2.py", line 573, in <lambda>
    lambda r, proxy=url, type=type, meth=self.proxy_open:
  File "/usr/lib64/python2.4/urllib2.py", line 580, in proxy_open
    if '@' in host:
TypeError: iterable argument required

I don't understand why the "@" character is a problem (there is no such character in my code. Should I add one?)
Thanks in advance for your help.

2 Answers

0

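The @ mentioned here is not really the point. The real problem is that the code evaluates '@' in host, and for that test to work host must be something iterable, such as a string. Check what host actually holds at that point: it is probably None or a number, which is not what you want.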
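To make that concrete, here is a minimal sketch of how host can end up as None. It uses Python 3's urllib.parse rather than the urllib2 internals of Python 2.4, so it only illustrates the idea and is not the library's actual code path; the helper name proxy_host is made up for this example.

```python
from urllib.parse import urlsplit


def proxy_host(proxy):
    """Return the host parsed from a proxy string, or None if none is found."""
    return urlsplit(proxy).hostname


# Without a scheme there is no "//host" part, so no host is found.
bare = proxy_host("93.84.221.248:3128")
# With the scheme in place, the host parses as expected.
full = proxy_host("http://93.84.221.248:3128")
print(repr(bare), repr(full))

# A membership test against None raises TypeError, as in the traceback.
try:
    "@" in bare
except TypeError as exc:
    print("TypeError:", exc)
```

The exact wording of the TypeError differs between Python versions, but the cause is the same: the right-hand side of in is None instead of a string.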

3

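urllib2.build_opener takes its handlers as separate positional arguments, not as a list, so the handler is passed directly:

opener = urllib2.build_opener(proxy_handler)

What leaves host as None is the proxy string: ProxyHandler expects a full URL including its scheme, such as 'http://93.84.221.248:3128', rather than a bare 'IP:port'.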
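For anyone on Python 3, a rough equivalent of the whole setup might look like the sketch below. It assumes urllib.request in place of urllib2, and the helper names make_proxy_opener and fetch_via_proxy are invented for illustration. It also assumes that a bare "IP:port" string means a plain HTTP proxy and prepends the scheme accordingly.

```python
import urllib.request


def make_proxy_opener(proxy, user_agent="Mozilla/5.0"):
    """Build an opener that routes http:// requests through `proxy`."""
    # Assumption: a bare "host:port" string refers to a plain HTTP proxy.
    if "://" not in proxy:
        proxy = "http://" + proxy
    handler = urllib.request.ProxyHandler({"http": proxy})
    # Handlers are passed as positional arguments, not wrapped in a list.
    opener = urllib.request.build_opener(handler)
    opener.addheaders = [("User-agent", user_agent)]
    return opener


def fetch_via_proxy(url, proxy, timeout):
    """Fetch `url` through `proxy` and return the response body as bytes."""
    opener = make_proxy_opener(proxy)
    # `timeout` (seconds) applies to each blocking socket operation,
    # which replaces counting loop iterations by hand.
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

A call such as fetch_via_proxy("http://www.whatismyip.com/automation/n09230945.asp", "93.84.221.248:3128", 50) would then raise an exception (for example urllib.error.URLError) if the proxy is dead or too slow, instead of hanging.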
