Can't decode BeautifulSoup output in Python

Asked 2025-04-17 07:35

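I've been writing a small scraper with Python and BeautifulSoup. Everything went smoothly until I tried to print (or write to a file) the strings inside various HTML elements. The site I'm scraping is http://www.yellowpages.ca/search/si/1/Boots/Montreal+QC, which contains a lot of French characters. Strangely, when I print the content to the terminal or write it to a file, the strings aren't decoded the way they should be; I get the raw unicode output instead.

Here is my code: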

from BeautifulSoup import BeautifulSoup as bs
import urllib as ul
##import re

base_url = 'http://www.yellowpages.ca'
data_file = open('yellow_file.txt', 'a')

data = ul.urlopen(base_url + '/locations/Quebec/Montreal/90014002.html').readlines()

bt = bs(str(data))

result = bt.findAll('div', 'ypgCategory')

bt = bs(str(result))

result = bt.findAll('a')

for tag in result:
    link = base_url + tag['href']
    ##print str(link)
    data = ul.urlopen(link).readlines()

    #data = str(data).decode('latin-1')
    bt = bs(str(data), convertEntities=bs.HTML_ENTITIES, fromEncoding='latin-1')
    titles = bt.findAll('span', 'listingTitle')
    phones = bt.findAll('a', 'phoneNumber')

    entries = zip(titles, phones)

    for title, phone in entries:
        #print title.prettify(encoding='latin-1')
        #data_file.write(title.text.decode('utf-8') + "   " + phone.text.decode('utf-8') + "\n")
        print title.text

data_file.close()

/************/

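The output of this code is: Projets Autochtones Du Qu\xc3\xa9bec

As you can see, the accented "e" that should appear in "Québec" isn't displayed. I've tried everything I've seen on StackOverflow, such as calling unicode(), passing fromEncoding to the soup, and using .decode('latin-1'), but nothing worked.

Any suggestions?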

Tags: unicode, character encoding, data extraction, HTML parsing, beautifulsoup, encoding issues, scraper, French characters

2 Answers

Who told you to decode something that is UTF-8 as latin-1? (The encoding is stated explicitly in the page's meta tag.)
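
To make the point concrete, here is a minimal Python 3 sketch of what happens when UTF-8 bytes are decoded with the wrong codec. The byte string `raw_bytes` is a made-up sample modelled on the output in the question, not data fetched from the site.

```python
# Python 3 sketch. raw_bytes is a hypothetical sample, not real site data.
raw_bytes = b"Projets Autochtones Du Qu\xc3\xa9bec"  # "é" is the two bytes C3 A9 in UTF-8

# Decoding with the encoding the page declares gives the intended text.
as_utf8 = raw_bytes.decode("utf-8")

# latin-1 maps every byte to exactly one character, so the two-byte "é"
# turns into two unrelated characters (U+00C3 and U+00A9).
as_latin1 = raw_bytes.decode("latin-1")

# Comparisons are printed instead of the strings themselves so the sketch
# does not depend on the console's encoding.
print(as_utf8 == "Projets Autochtones Du Qu\u00e9bec")
print(as_latin1 == "Projets Autochtones Du Qu\u00c3\u00a9bec")
print(len(as_utf8), len(as_latin1))
```

The latin-1 result is one character longer, because the single accented letter was split in two.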

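If you're on Windows, you may have trouble printing Unicode characters to the console, so test by writing to a text file first.
If you open a file in text mode, don't write binary data to it. Keep the file mode and the data type matched:
- codecs.open(...,"w","utf-8").write(unicode_str)
- open(...,"wb").write(unicode_str.encode("utf_8"))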
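
The two options above can be sketched as follows. This is an illustration only: it uses `io.open` with an `encoding` argument as a stand-in for `codecs.open`, and the file names and temporary directory are hypothetical.

```python
import io
import os
import tempfile

text = u"Qu\u00e9bec"  # a unicode string containing "é"

# Hypothetical scratch location for the two output files.
tmp_dir = tempfile.mkdtemp()
text_path = os.path.join(tmp_dir, "text_mode.txt")
bin_path = os.path.join(tmp_dir, "binary_mode.txt")

# Option 1: text mode with an explicit encoding; the file object encodes.
# newline="\n" disables newline translation so both files match byte for byte.
with io.open(text_path, "w", encoding="utf-8", newline="\n") as f:
    f.write(text + u"\n")

# Option 2: binary mode; encode the string yourself before writing.
with open(bin_path, "wb") as f:
    f.write((text + u"\n").encode("utf-8"))

# Read both files back as raw bytes to compare what landed on disk.
with open(text_path, "rb") as f:
    text_bytes = f.read()
with open(bin_path, "rb") as f:
    bin_bytes = f.read()

print(text_bytes == bin_bytes)
```

Either route puts the same UTF-8 bytes on disk; what matters is not mixing them, for example by writing already-encoded bytes to a file that also encodes.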

Answered 2025-04-17 by Python大师

This should be what you want:

from BeautifulSoup import BeautifulSoup as bs
import urllib as ul

base_url = 'http://www.yellowpages.ca'
data_file = open('yellow_file.txt', 'a')

bt = bs(ul.urlopen(base_url + '/locations/Quebec/Montreal/90014002.html'))

for div in bt.findAll('div', 'ypgCategory'):
    for a in div.findAll('a'):
        link = base_url + a['href']

        bt = bs(ul.urlopen(link), convertEntities=bs.HTML_ENTITIES)

        titles = bt.findAll('span', 'listingTitle')
        phones = bt.findAll('a', 'phoneNumber')

        for title, phone in zip(titles, phones):
            line = '%s   %s\n' % (title.text, phone.text)
            data_file.write(line.encode('utf-8'))
            print line.rstrip()

data_file.close()
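
One likely reason the question's version printed `\xc3\xa9` literally is that it passed `str(data)` to the parser, where `data` is the list returned by `readlines()`. Converting a list to a string uses the repr of each element, so non-ASCII bytes become backslash escapes in the text itself. The sketch below shows the effect in Python 3 with a made-up `lines` list standing in for the downloaded page; Python 2 treats a list of byte strings the same way.

```python
# Python 3 sketch. `lines` is a hypothetical stand-in for urlopen(...).readlines().
lines = [b"<span class='listingTitle'>Qu\xc3\xa9bec</span>\n"]

# str() on a list embeds the repr of each element, so the two bytes of "é"
# become the eight literal characters \xc3\xa9 in the resulting text.
as_list_repr = str(lines)

# Joining the raw lines and decoding them once keeps the real character.
joined = b"".join(lines).decode("utf-8")

print("\\xc3\\xa9" in as_list_repr)
print("\u00e9" in joined)
```

Passing the response object straight to the parser, as the code above does, avoids the intermediate `str()` of a list altogether.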

Answered 2025-04-17 by Python大师
