Escaping with BeautifulSoup…

1 vote

2 answers

2462 views

Asked 2025-04-16 00:40

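I'm using BeautifulSoup to scrape some websites, but I'm running into trouble with certain characters. The code in UnicodeDammit seems to mention these same characters, and they appear to be Microsoft inventions.

I'm using the latest version of BeautifulSoup (3.0.8.1), because I'm still on Python 2.5.

The following code shows my problem: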

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('...Baby One More Time (Digital Deluxe Version&hellip;')
print soup

'...Baby One More Time (Digital Deluxe Version&hellip;'

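As you can see, the problem is the final '…' (&hellip;) character (your browser may already have rendered it correctly). Clearly, this is not the output I want.

Getting the Unicode representation of this character, or some other workaround, would be fine. Simply ignoring the character would also solve my problem.

How can I do this with BeautifulSoup?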

Tags: software development, unicode, data extraction, character encoding, web scraping, beautifulsoup, text processing, html parsing

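2 Answers

Microsoft may have been the first to come up with it, but the &hellip; entity (…) is actually part of HTML 4: http://www.w3.org/TR/REC-html40/sgml/entities.html

Your Lib/htmlentitydefs.py file may be missing or outdated, since that is the file BeautifulSoup uses to convert entities.

If you look at the Python 2.5 source, you'll see it defined on line 126.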
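
A quick way to check whether your entity table knows about hellip is to look it up directly. This is only a minimal sketch: it assumes the standard name2codepoint mapping, and the fallback import is there because Python 3 renamed the module to html.entities.

```python
# Look up the code point that the standard entity table assigns to &hellip;.
try:
    from htmlentitydefs import name2codepoint  # Python 2 (e.g. 2.5)
except ImportError:
    from html.entities import name2codepoint   # Python 3 renamed the module

codepoint = name2codepoint["hellip"]
print(hex(codepoint))  # 0x2026, HORIZONTAL ELLIPSIS
```

If this lookup raises KeyError or the import fails on Python 2, that would point to the missing or outdated file described above.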

Answered 2025-04-16 by Python大师

I found the solution myself:

soup = BeautifulSoup('...Baby One More Time (Digital Deluxe Version&hellip;', convertEntities="html")
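
For anyone curious what that conversion roughly amounts to, here is a sketch using only the standard library. It is not BeautifulSoup's actual implementation: convert_named_entities and ENTITY_RE are hypothetical names made up for illustration, and numeric references such as &#8230; are deliberately not handled.

```python
import re

try:
    from htmlentitydefs import name2codepoint  # Python 2
    to_char = unichr
except ImportError:
    from html.entities import name2codepoint   # Python 3
    to_char = chr

# Matches named entities such as &hellip; or &amp;
ENTITY_RE = re.compile(r"&(\w+);")

def convert_named_entities(text):
    """Hypothetical helper: replace known named entities, keep unknown ones."""
    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            return to_char(name2codepoint[name])
        return match.group(0)  # unknown entity: leave it untouched
    return ENTITY_RE.sub(replace, text)

converted = convert_named_entities(
    "...Baby One More Time (Digital Deluxe Version&hellip;"
)
```

Here converted ends with the single character U+2026 instead of the literal text &hellip;, which is the Unicode representation the question asked for.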

Answered 2025-04-16 by Python大师
