Basic fetching of a URL's HTML body with Python 3.x

3 votes

1 answer

2005 views

Asked on 2025-04-16 16:00

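I'm new to Python and a bit confused about the differences between the old urllib and urllib2 in Python 2.x and the new urllib in Python 3. I'm also not sure when data needs to be encoded before it is sent with urlopen.

I've been trying to fetch the HTML content of a URL with a POST request so that I can send some parameters. The page shows sunshine data for a country at a particular hour on a particular date. Without any encoding or decoding, what gets printed is a byte string starting with b. The code I then tried is: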

import urllib.request, urllib.parse, urllib.error

def scrape(someurl):

    try:

        values = {'LANG': 'en',
                  'DATE' : '1303160400',
                  'CONT' : 'euro',
                  'LAND' : 'UK',
                  'KEY' : 'UK',
                  'SORT': '2',
                  'INT' : '06',
                  'TYPE' : 'sonnestd',
                  'ART' : 'karte',
                  'RUBRIK' : 'akt',
                  'R': '310',
                  'CEL': 'C'}

        data = urllib.parse.urlencode(values)
        data = data.encode("utf-8")
        response = urllib.request.urlopen(someurl, data)
        html = response.read().decode("utf-8")
        print(html)

    except urllib.error.HTTPError as e:
        print(e.code)
        print(e.read())

myscrape = scrape("http://www.weatheronline.co.uk/weather/maps/current")

The error I get is:

Traceback (most recent call last):
  File "/Users/Me/Desktop/weather.py", line 57, in <module>
    myscrape = scrape("http://www.weatheronline.co.uk/weather/maps/current")
  File "/Users/Me/Desktop/weather.py", line 37, in scrape
    html = response.read().decode("utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 10: invalid start byte

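Even without encoding and decoding, the byte string I get back is unusually short, so I'm wondering whether the request is failing in some other way: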

b'GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02D\x01\x00;'


1 Answer

The leading GIF89a is the signature of a GIF file, which means the server is sending you an image rather than HTML.
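One way to confirm this is sketched below: compare the first six bytes against the two GIF signatures, then read the next four bytes, which hold the width and height as little-endian 16-bit integers. The helper names `looks_like_gif` and `gif_size` are made up for illustration; `SAMPLE` is the byte string printed in the question.

```python
import struct

# The body printed in the question, copied verbatim.
SAMPLE = (
    b'GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00'
    b'!\xf9\x04\x01\x00\x00\x00\x00'
    b',\x00\x00\x00\x00\x01\x00\x01\x00\x00'
    b'\x02\x02D\x01\x00;'
)

# Every GIF file starts with one of these two 6-byte signatures.
GIF_SIGNATURES = (b"GIF87a", b"GIF89a")


def looks_like_gif(body):
    """Return True if the bytes start with a GIF signature."""
    return body[:6] in GIF_SIGNATURES


def gif_size(body):
    """Return (width, height) from the GIF logical screen descriptor."""
    return struct.unpack("<HH", body[6:10])


if __name__ == "__main__":
    print(looks_like_gif(SAMPLE))  # True
    print(gif_size(SAMPLE))        # (1, 1)
    print(hex(SAMPLE[10]))         # 0x80
```

For the bytes in the question this reports a 1x1 GIF. Byte 10 is 0x80, which is the byte and position named in the UnicodeDecodeError, so the decode failed because the body is image data, not because of a mistake in how the POST data was encoded.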

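Also, you shouldn't blindly decode the body as UTF-8; check the response headers (the charset parameter of Content-Type) to find out which encoding to use.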
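A minimal sketch of that advice, assuming the server declares its charset in the Content-Type header. The function names `charset_from_content_type` and `fetch_text`, the UTF-8 fallback, and the choice to reject non-text responses are assumptions for illustration, not part of the original answer. In Python 3, `response.headers` is an `email.message.Message` subclass, so `get_content_type()` and `get_content_charset()` are available on it.

```python
import urllib.parse
import urllib.request
from email.message import Message


def charset_from_content_type(value, default="utf-8"):
    """Extract the charset from a Content-Type header value.

    Falls back to `default` (an assumption) when no charset is declared.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset() or default


def fetch_text(url, params):
    """POST `params` to `url` and return the body decoded as text."""
    # urlencode produces percent-encoded ASCII; urlopen needs bytes for POST.
    data = urllib.parse.urlencode(params).encode("ascii")
    with urllib.request.urlopen(url, data) as response:
        content_type = response.headers.get_content_type()
        # Refuse to decode images and other binary bodies as text.
        if not content_type.startswith("text/"):
            raise ValueError("expected text, got " + content_type)
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=ISO-8859-1"))
    print(charset_from_content_type("image/gif"))
```

With a check like this, a response such as the one in the question would raise a ValueError naming the image content type instead of failing inside decode().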

Answered on 2025-04-16 by Python大师
