UnicodeEncodeError when using BeautifulSoup findAll with a class attribute

6 votes

3 answers

18580 views

Asked 2025-04-16 16:11

I'm using BeautifulSoup to extract the titles of news stories from Hacker News. This is the code I have so far:

import urllib2
from BeautifulSoup import BeautifulSoup

HN_url = "http://news.ycombinator.com"

def get_page():
    page_html = urllib2.urlopen(HN_url) 
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles_html =[]

    for td in soup.findAll("td", { "class":"title" }):
        titles_html += td.findAll("a")

    return titles_html

print get_stories(get_page())

But when I run it, I get this error:

Traceback (most recent call last):
  File "terminalHN.py", line 19, in <module>
    print get_stories(get_page())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe2' in position 131: ordinal not in range(128)

What do I need to do to make this work?

Tags: error handling, data extraction, web scraping, beautifulsoup, html parsing, unicode, encoding

3 Answers

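The program itself runs fine; the problem is the output. You can either encode the output to a character encoding your console understands, or run the code some other way, for example inside IDLE.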
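
As a minimal sketch of the first option, the helper below encodes text to a target encoding and substitutes `?` for any character that encoding cannot represent. The name `to_console_bytes` and the sample title are illustrative assumptions, not part of the question's code, and the snippet is written so that it should run on both Python 2 and Python 3.

```python
import sys


def to_console_bytes(text, encoding=None):
    """Encode text for the console, replacing unencodable characters with '?'."""
    # Hypothetical helper: prefer the caller's encoding, then the one reported
    # by stdout, and fall back to ASCII when neither is available.
    target = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(target, "replace")


# Made-up sample title containing the character from the traceback
sample_title = u"Caf\xe2 (sample title, not real data)"
ascii_bytes = to_console_bytes(sample_title, "ascii")  # non-ASCII char becomes '?'
utf8_bytes = to_console_bytes(sample_title, "utf-8")   # non-ASCII char becomes two bytes
```

Using the `"replace"` error handler trades fidelity for robustness: the output is lossy on an ASCII console, but printing no longer raises.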

Answered 2025-04-16 by Python大师


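One thing to note about your code: findAll returns a list (here, a list of BeautifulSoup objects), whereas what you actually want is the titles. Consider using find for the link inside each cell instead. That way, rather than printing a pile of BeautifulSoup objects, you get the title text directly. For example: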

import urllib2
from BeautifulSoup import BeautifulSoup

HN_url = "http://news.ycombinator.com"

def get_page():
    page_html = urllib2.urlopen(HN_url) 
    return page_html

def get_stories(content):
    soup = BeautifulSoup(content)
    titles = []

    for td in soup.findAll("td", { "class":"title" }):
        a_element = td.find("a")
        if a_element:
            titles.append(a_element.string)

    return titles

print get_stories(get_page())

Now get_stories() returns a list of unicode objects, and the printed output is what you would expect.
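
One caveat, as far as I know: `.string` is None when a tag has more than one child, so the returned list may contain None entries. The sketch below shows one way to print the titles one per line instead of as a list. `format_titles` is a hypothetical helper and the sample list is made up to stand in for the result of get_stories(); it should run on both Python 2 and Python 3.

```python
def format_titles(titles):
    """Join titles one per line and encode the result as UTF-8 bytes."""
    # Drop None entries (a tag with several children has no single .string)
    cleaned = [title.strip() for title in titles if title is not None]
    return u"\n".join(cleaned).encode("utf-8")


# Made-up sample standing in for the list returned by get_stories()
sample_titles = [u"First story \xe2", None, u"  Second story "]
encoded = format_titles(sample_titles)
```

Encoding once, after joining, keeps everything as unicode until the last moment, which is usually the easier habit to follow.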

Answered 2025-04-16 by Python大师


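This happens because BeautifulSoup uses unicode strings internally. When you print a unicode string to the console, Python tries to encode it with its default encoding, which is usually ascii. For sites with non-ASCII content, that conversion generally fails. Searching for "python + unicode" will turn up the basics of how Python handles Unicode. In the meantime, you can encode your unicode string as UTF-8 like this:

print some_unicode_string.encode('utf-8')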
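
To make the direction of the conversion concrete, here is a small sketch that reproduces the failure with the character from the traceback and then converts it successfully. The variable names are illustrative, and the snippet is written so that it should run on both Python 2 and Python 3.

```python
text = u"\xe2"  # the character named in the traceback

# Encoding to ASCII fails; this is what happens implicitly on an ASCII console
try:
    text.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# encode() goes from unicode text to bytes; decode() goes back again
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```

The point to remember is that unicode text is encoded into bytes for output, and bytes are decoded back into unicode text on input.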

Answered 2025-04-16 by Python大师

