使用Python请求获取html？

>>> r.text '\x1f\ufffd\x08\x00\x00\x00\x00\x00\x00\x03\ufffd]o\u06f8\x12\ufffd\ufffd\ufffd+\ufffd]... >>> r.content b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\x9d]o\xdb\xb8\x12\x86\xef\xfb+\x88]\x14h...

2条回答

网友

1楼 · 编辑于 2024-05-16 13:18:55

有问题的服务器会给您一个gzip响应。服务器也很坏；它发送以下头：

$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html xmlns="http: //www.w3.org/1999/xhtml" lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html

那里的<!DOCTYPE..>行不是有效的HTTP头。因此，经过Server的其余头将被忽略。为什么服务器会插入不清楚的内容；在所有可能的hood中，WRCCWrappers.py是一个CGI脚本，它不输出头，但在doctype行后面包含一个双换行符，欺骗Apache服务器在那里插入额外的头。

因此，requests也没有检测到数据是gzip编码的。所有的数据都在那里，你只要解码就行了。如果不是很不完整的话你也可以。

解决方法是告诉服务器不要为压缩而烦恼：

headers = {'Accept-Encoding': 'identity'}
r = requests.get(url, headers=headers)

并返回未压缩的响应。

顺便说一句，在Python 2上，HTTP头解析器并不那么严格，它设法将doctype声明为头：

>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html xmlns="http': '//www.w3.org/1999/xhtml" lang="en-US">',
 'connection': 'Keep-Alive',
 'content-encoding': 'gzip',
 'content-length': '3659',
 'content-type': 'text/html',
 'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
 'keep-alive': 'timeout=5, max=100',
 'server': 'Apache',
 'vary': 'Accept-Encoding'}

而content-encoding信息仍然存在，因此requests将按预期为您解码内容。

网友

2楼 · 编辑于 2024-05-16 13:18:55

此URL的HTTP头现在已修复。

>>> import requests
>>> print requests.__version__
2.5.1
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> r.text[:100]
u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
>>> r.headers
{'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}

相关问题更多 >

编程相关推荐

热门问题

热门文章