Plain string containing "xe2x80x93" characters

content = str(urllib.request.urlopen(site, timeout=10).read())
g = content.split('<h1 itemprop="name"')[1].split('</span></h1>')[0].split('<span>')[1].replace("\\", "")

print(type(g))                                      # --> string
print(g)                                            # --> "Flash xe2x80x93 der rote Blitz"
print(g.encode('latin-1').decode('utf-8'))          # --> AttributeError: 'str' object has no attribute 'decode'
print(repr(g.decode('unicode-escape')))             # --> AttributeError: 'str' object has no attribute 'decode'
print(g.encode('ascii','replace'))                  # --> b'Flash xe2x80x93 der rote Blitz'
print(bytes(g, "utf-8").decode())                   # --> "Flash xe2x80x93 der rote Blitz"
print(bytes(g, "utf-8").decode("unicode_escape"))   # --> "Flash â der rote Blitz"

1 answer

User

#1 · Posted on 2024-05-17 18:07:53

Your idea of using decode is correct.

By wrapping the output in str(...) in this line:

content = str(urllib.request.urlopen(site, timeout=10).read())

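you are either converting a bytes object to its string representation (which will be evident from the leading b' and trailing ' in content) or, if the data has already been decoded as ISO-8859-1, doing nothing.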
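To make the first case concrete, here is a small sketch of what str() does to a bytes object. The sample bytes are made up for illustration and stand in for the response body; they are not taken from the real page.

```python
# Hypothetical sample standing in for the bytes returned by read().
raw = b"Flash \xe2\x80\x93 der rote Blitz"

# str() without an encoding argument returns the repr of the bytes object,
# including the b'...' wrapper and literal backslash escapes.
as_repr = str(raw)
print(as_repr)

# Removing the backslashes, as the question's code does, leaves the bare hex text.
stripped = as_repr.replace("\\", "")
print(stripped)

# decode() instead interprets the three bytes E2 80 93 as one character (U+2013).
as_text = raw.decode("utf-8")
print(as_text)
```

This would explain why the string in the question contains the literal text xe2x80x93: the escapes were already plain characters by the time the backslashes were removed.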

In either case, don't do this: remove the wrapping str call.

Now content will be either a bytes object or a str object.

So if it is a str, it has been decoded (incorrectly) as ISO-8859-1. You need to encode it back into a bytes object and then decode it correctly:

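import urllib.request

# Fetch the raw response body without wrapping it in str().
content = urllib.request.urlopen(site, timeout=10).read()
# If the body arrived as a str, undo the wrong ISO-8859-1 decoding first.
if isinstance(content, str):
    content = content.encode('iso-8859-1')
# Decode the bytes as UTF-8.
content = content.decode('utf8')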

Now your \xe2\x80\x93 bytes should display correctly as: –
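The encode-then-decode round trip can be tried without any network access. The following is only a sketch with a made-up sample string; it simulates the wrong ISO-8859-1 decoding and then reverses it.

```python
# Hypothetical sample text containing an en dash (U+2013).
original = "Flash \u2013 der rote Blitz"
utf8_bytes = original.encode("utf-8")

# Wrong decoding: ISO-8859-1 maps each byte to exactly one character,
# so the three bytes of the en dash turn into three separate characters.
mojibake = utf8_bytes.decode("iso-8859-1")

# Repair: encoding as ISO-8859-1 restores the original bytes exactly,
# and decoding those bytes as UTF-8 gives the intended text.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")

print(repr(mojibake))
print(repaired)
```

The repair works because ISO-8859-1 assigns a character to every byte value from 0 to 255, so no information is lost by the wrong decoding.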

Update:

Judging from your comment, all you need to do is:

content = urllib.request.urlopen(site, timeout=10).read().decode('utf8')
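For completeness, here is a rough sketch of decoding first and only then extracting the title with the same split chain as in the question. The function name extract_title and the sample HTML are hypothetical and stand in for the real page, which I have not seen.

```python
def extract_title(page: bytes) -> str:
    """Decode the raw page as UTF-8, then cut out the <h1> title text."""
    content = page.decode("utf8")
    return (
        content.split('<h1 itemprop="name"')[1]
        .split("</span></h1>")[0]
        .split("<span>")[1]
    )


# Hypothetical page body; in real use this would come from
# urllib.request.urlopen(site, timeout=10).read().
sample_page = (
    b'<html><h1 itemprop="name"><span>'
    b"Flash \xe2\x80\x93 der rote Blitz"
    b"</span></h1></html>"
)

title = extract_title(sample_page)
print(title)
```

With properly decoded text there are no backslash escapes in the string, so the .replace("\\", "") step from the question should no longer be needed.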
