Unable to display characters correctly with BeautifulSoup

# imports
import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString

# create beautifulsoup object
obce_url = 'http://www.e-obce.sk/zoznam_vsetkych_obci.html?strana=2500'
source_code = requests.get(obce_url)
plain_text = source_code.text
obce_soup = BeautifulSoup(plain_text, 'html.parser')

# define bs filter
def soup_filter_1(tag):
    return tag.has_attr('href') and len(tag.attrs) == 1 and isinstance(tag.next_element, NavigableString)

# print settlement names
for tag in obce_soup.find_all(soup_filter_1):
    print(tag.string)

3 Answers

User

#1 · edited 2024-06-08 19:26:57

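The server probably sends an HTTP header that declares UTF-8, while the HTML itself is encoded in Windows-1250. As a result, requests decodes the data as UTF-8.

You can, however, take the raw bytes from source_code.content and call decode('cp1250') on them to get the correct characters.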

plain_text = source_code.content.decode('cp1250')
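
To see why the codec matters, here is a minimal offline sketch. The sample string is made up for illustration and no request is sent; it only assumes that the page bytes are Windows-1250, as described above.

```python
# Hypothetical sample text standing in for the page body (not fetched from the site).
sample = "Žilina, Košice, Prešov"

# Bytes as a Windows-1250 server would send them.
raw = sample.encode("cp1250")

# Decoding with the wrong codec: invalid UTF-8 bytes become U+FFFD placeholders.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the codec the page was actually written in.
right = raw.decode("cp1250")

print(wrong)
print(right)
```

The accented letters survive only in the second result, which is the effect of calling decode('cp1250') on source_code.content.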

Alternatively, you can set encoding manually before reading text:

source_code.encoding = 'cp1250'

plain_text = source_code.text

You can also pass the raw bytes in source_code.content to BeautifulSoup, so that it should use the encoding information declared in the HTML:

 obce_soup = BeautifulSoup(source_code.content, 'html.parser')

You can check which encoding it found with:

 print(obce_soup.declared_html_encoding)
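
The same idea can be tried without network access. This is only a sketch: the HTML snippet and the place name are invented, and it assumes the real page carries a similar meta charset declaration.

```python
from bs4 import BeautifulSoup

# Hypothetical page that declares its own encoding in a <meta> tag.
html = (
    '<html><head><meta charset="windows-1250"></head>'
    '<body><a href="/obec">Žilina</a></body></html>'
)
raw = html.encode("cp1250")  # bytes, like source_code.content

# Given bytes, BeautifulSoup can read the declaration and decode accordingly.
soup = BeautifulSoup(raw, "html.parser")

print(soup.declared_html_encoding)
print(soup.a.get_text())
```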

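User

#2 · edited 2024-06-08 19:26:57

Since you know the site's encoding, simply pass it explicitly to the BeautifulSoup constructor, together with the response content rather than the text: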

source_code = requests.get(obce_url)
content = source_code.content
obce_soup = BeautifulSoup(content, 'html.parser', from_encoding='windows-1250')
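
A small offline sketch of from_encoding follows. The markup is made up and deliberately has no charset declaration, so the explicit argument is the only hint BeautifulSoup gets.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with no <meta charset>, encoded as Windows-1250 bytes.
html = '<ul><li><a href="/a">Košice</a></li><li><a href="/b">Prešov</a></li></ul>'
raw = html.encode("cp1250")

# Tell BeautifulSoup the encoding instead of letting it guess.
soup = BeautifulSoup(raw, "html.parser", from_encoding="windows-1250")

names = [a.get_text() for a in soup.find_all("a")]
print(names)
```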

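User

#3 · edited 2024-06-08 19:26:57

The problem is not BeautifulSoup itself. It simply cannot determine which encoding you are using (try print('encoding', obce_soup.original_encoding)), and that happens because you hand it Unicode text rather than bytes.

If you use this: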

obce_url = 'http://www.e-obce.sk/zoznam_vsetkych_obci.html?strana=2500'
source_code = requests.get(obce_url)
data_bytes = source_code.content  # don't use .text it will try to make Unicode
obce_soup = BeautifulSoup(data_bytes, 'html.parser')
print('encoding', obce_soup.original_encoding)

to create the BeautifulSoup object, you will see that the encoding is now detected correctly and the output is fine.
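
The difference between passing text and passing bytes can be sketched offline as follows. The markup is invented for illustration and assumes the page declares its charset in a meta tag.

```python
from bs4 import BeautifulSoup

# Hypothetical page content; the real site is not contacted.
html = (
    '<html><head><meta charset="windows-1250"></head>'
    '<body><a href="/obec">Žilina</a></body></html>'
)

# Already-decoded text: BeautifulSoup has nothing left to detect.
soup_from_text = BeautifulSoup(html, "html.parser")

# Raw bytes: BeautifulSoup performs the decoding and records what it used.
soup_from_bytes = BeautifulSoup(html.encode("cp1250"), "html.parser")

print("encoding", soup_from_text.original_encoding)
print("encoding", soup_from_bytes.original_encoding)
```

With text input original_encoding is None, so any wrong decoding done earlier by .text cannot be repaired at this stage.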
