ASCII encoding error when scraping data

Asked 2025-04-17 22:25

I am trying to scrape phone numbers from a website. A phone number on the page looks like this:

+971553453301‪

This is the code I wrote for the task:

try:
    phone=soup.find("div", "phone-content")
    for a in phone:
        phone_result= str(a).get_text().strip().encode("utf-8")
    print "Phone information:", phone_result
except StandardError as e:
    phone_result="Error was {0}".format(e)
    print phone_result

The error I get is:

'ascii' codec can't encode character u'\u202a' in position 54: ordinal not in range(128)

Can anyone help me with this?

Tags: data processing, web scraping, encoding errors, data crawling

2 Answers

Try replacing str(a) with unicode(a) and dropping the .encode() part.

Answered 2025-04-17 by Python大师

Several things about this line are awkward:

phone_result= str(a).get_text().strip().encode("utf-8")

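First, BeautifulSoup works with Unicode, so converting its text to str in Python 2 is error-prone: str() on Unicode text implicitly encodes it with the ASCII codec, which is very likely where the error above comes from. I think the call is a mistake in any case, because even if the conversion succeeded, calling get_text() on a str object would raise an AttributeError.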

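Finally, you call encode on a str, which in Python 2 is already an encoded byte string. That can go wrong, because Python 2 first decodes it implicitly (with the default encoding, ASCII) and only then encodes it again.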
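To make the two failure modes concrete, here is a minimal sketch. It is written in Python 3 (an assumption on my part, since the question uses Python 2), so the steps that Python 2 performs implicitly are spelled out explicitly. The sample text is the number from the question followed by the invisible U+202A character.

```python
# Python 3 sketch: the number from the question, followed by the
# invisible U+202A (LEFT-TO-RIGHT EMBEDDING) character.
text = "+971553453301\u202a"

# Encoding to ASCII fails with the same kind of message as in the question.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = str(exc)

# Encoding to UTF-8 works: U+202A becomes the three bytes E2 80 AA.
utf8_bytes = text.encode("utf-8")

# What Python 2 does implicitly when .encode() is called on a byte string:
# it first decodes with the ASCII default, which fails on non-ASCII bytes.
try:
    utf8_bytes.decode("ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = str(exc)

print(ascii_error)
print(utf8_bytes)
print(decode_error)
```

The point is that both the str() call and the encode() call on a byte string go through the ASCII codec behind your back in Python 2, and either one can fail on a character like U+202A.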

So you could try this fix, assuming the page is encoded in UTF-8:

phone_result= a.get_text().strip().encode("utf-8")
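Note that strip() only removes whitespace, so the invisible U+202A character would still be part of the result. If you want a clean number, one possible approach is to drop Unicode format characters before encoding. The sketch below is written for Python 3, and the helper name strip_format_chars is hypothetical, not something BeautifulSoup provides.

```python
import unicodedata


def strip_format_chars(text):
    """Drop Unicode 'Cf' (format) characters such as U+202A, then trim whitespace."""
    kept = (ch for ch in text if unicodedata.category(ch) != "Cf")
    return "".join(kept).strip()


# The number from the question, with surrounding spaces and the trailing
# invisible U+202A character.
raw = "  +971553453301\u202a "
clean = strip_format_chars(raw)

print(clean)
print(clean.encode("ascii"))  # now safe, since only ASCII characters remain
```

Filtering on the "Cf" category is a design choice: it also removes other invisible directional marks, which is usually what you want for phone numbers but may be too aggressive for general text.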

This line is also problematic:

phone=soup.find("div", "phone-content")

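find returns only one result, a single Tag object. I suggest using find_all instead, which returns a list of Tag objects. The difference matters when you iterate: looping over a single Tag gives you its children, which can be NavigableString objects that have no get_text method, whereas looping over a list of Tag objects gives you Tag objects, which do have get_text.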
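The difference between find and find_all can be checked with a small sketch. It is written for Python 3 and assumes the bs4 package is installed; the HTML snippet and the second phone number are made up for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical page fragment; only the first number comes from the question.
html = """
<div class="phone-content">+971553453301\u202a</div>
<div class="phone-content"><span>+971500000000</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag; iterating over it yields its children.
first = soup.find("div", "phone-content")
children = list(first)
print(type(children[0]).__name__)  # the text node, not a Tag

# find_all() returns a list of Tags, each of which has get_text().
divs = soup.find_all("div", "phone-content")
phones = [d.get_text().strip() for d in divs]
print(phones)
```

Looping over the find_all result is what the original code seems to intend, so switching to it keeps the for loop and makes get_text() available on every item.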

Hope this helps!

Answered 2025-04-17 by Python大师
