Why does my Python script fail to download a web page through a proxy?
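I have just started learning Python and am experimenting with sockets. I wrote a simple HTTP client, but to my surprise it cannot fetch a page that Firefox can, even though both send the same request headers.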
import socket
clientsocket= socket.socket(socket.AF_INET, socket.SOCK_STREAM)
clientsocket.connect(("213.229.83.205",80))#connect to proxy at given address
print "connected to 213.229.83.205"
sdata= """GET http://google.co.ug/ HTTP/1.1
Host: google.co.ug
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20100101 Firefox/6.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Proxy-Connection: keep-alive
Cookie: cookie <-- Real cookie deleted
"""
print "sending request"
clientsocket.send(sdata);
rdata=clientsocket.recv(10240)
if not rdata: print "no data found"
else:
    print "receiving data !"
    myfile=open("c:/users/markdenis/desktop/google.html","w")
    myfile.write(str(rdata))
    myfile.close()
    print "data written to file on desktop"
clientsocket.close()
raw_input()#system(pause)
When I run it, it prints:
connected to 213.229.83.205
sending request
no data found
1 Answer
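HTTP requires every header line to end with \r\n, and the empty line that closes the header block must be terminated with \r\n as well. Your sdata buffer never specifies the line terminators explicitly, so each line in it ends with a bare \n.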
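As a rough illustration of that rule, the sketch below checks whether a request buffer is framed with \r\n. It is written for Python 3 (the question's code is Python 2), and has_crlf_framing is a made-up helper name, not part of the socket module or any library.

```python
def has_crlf_framing(raw: bytes) -> bool:
    """Return True if every line ends in CRLF and the block ends with a blank line."""
    # The header block must be closed by an empty line, i.e. CRLF CRLF.
    if not raw.endswith(b"\r\n\r\n"):
        return False
    # After removing every CRLF pair, no stray CR or LF may remain.
    stripped = raw.replace(b"\r\n", b"")
    return b"\n" not in stripped and b"\r" not in stripped


# A triple-quoted literal, as in the question: the line breaks are bare "\n".
bare_lf = b"""GET http://google.co.ug/ HTTP/1.1
Host: google.co.ug

"""

# The same request with explicit CRLF terminators.
crlf = b"GET http://google.co.ug/ HTTP/1.1\r\nHost: google.co.ug\r\n\r\n"

print(has_crlf_framing(bare_lf))  # False
print(has_crlf_framing(crlf))     # True
```

A check like this could be run on the buffer just before it is sent, which makes a missing terminator visible instead of leaving the proxy to wait silently for the end of the headers.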
To confirm that a triple-quoted string contains only \n line endings, I tested on Windows, Linux, and OS X:
>>> x = """a
b
c"""
>>> x
'a\nb\nc'
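You need to write the \r\n terminators explicitly:
>>> x = "a\r\nb\r\nc\r\n"
>>> x
'a\r\nb\r\nc\r\n'
Then try the request again. Adding \r\n inside the triple-quoted buffer would still leave the literal newline after each one, giving an extra \n per line, so build the string piece by piece instead: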
sdata = "GET http://google.co.ug/ HTTP/1.1\r\n"
sdata += "Host: google.co.ug\r\n"
sdata += "User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:6.0) Gecko/20100101 Firefox/6.0\r\n"
sdata += "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
sdata += "Accept-Language: en-us,en;q=0.5\r\n"
sdata += "Accept-Encoding: gzip, deflate\r\n"
sdata += "Proxy-Connection: keep-alive\r\n"
sdata += "\r\n"
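The same request can also be assembled by joining the lines with \r\n, which avoids repeating the terminator on every line. This is only a sketch for Python 3, under the assumption that the headers are plain ASCII. build_request is a hypothetical helper introduced here for illustration, and only some of the headers from the question are shown.

```python
def build_request(method, url, headers):
    """Join the request line and headers with CRLF and add the closing blank line."""
    lines = ["%s %s HTTP/1.1" % (method, url)]
    lines += ["%s: %s" % (name, value) for name, value in headers]
    # The two trailing empty strings make the join end in "\r\n\r\n".
    return "\r\n".join(lines + ["", ""]).encode("ascii")


request = build_request("GET", "http://google.co.ug/", [
    ("Host", "google.co.ug"),
    ("Accept-Encoding", "gzip, deflate"),
    ("Proxy-Connection", "keep-alive"),
])

print(repr(request))
```

The resulting bytes could then be passed to clientsocket.sendall(request) in place of the hand-built sdata string.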