Beautiful Soup and uTidy

4 votes
2 answers
3582 views
Asked 2025-04-15 11:43

I want to pass the output of utidy to Beautiful Soup, like this:

page = urllib2.urlopen(url)
options = dict(output_xhtml=1,add_xml_decl=0,indent=1,tidy_mark=0)
cleaned_html = tidy.parseString(page.read(), **options)
soup = BeautifulSoup(cleaned_html)

When I run it, I get the following error:

Traceback (most recent call last):
  File "soup.py", line 34, in <module>
    soup = BeautifulSoup(cleaned_html)
  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1499, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1230, in __init__
    self._feed(isHTML=isHTML)
  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1245, in _feed
    smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1751, in __init__
    self._detectEncoding(markup, isHTML)
  File "/var/lib/python-support/python2.6/BeautifulSoup.py", line 1899, in _detectEncoding
    xml_encoding_match = re.compile(xml_encoding_re).match(xml_data)
TypeError: expected string or buffer

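As I understand it, utidy returns an XML document object, while BeautifulSoup expects a string. Is there a way to convert cleaned_html to a string? Or am I going about this the wrong way and should use a different approach?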

2 Answers

2

Convert the value you pass to BeautifulSoup into a string.
In your case, change the last line as follows:

soup = BeautifulSoup(str(cleaned_html))

11

Just wrap cleaned_html in str() when you pass it to BeautifulSoup.
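
To see why the str() wrapper helps, here is a minimal sketch that runs without uTidy or BeautifulSoup installed. The class FakeTidyDocument is a hypothetical stand-in for whatever tidy.parseString() returns, and the regex is only a simplified substitute for the pattern matching that fails in _detectEncoding in the traceback; neither is the real library code. The point it illustrates is that the re module rejects an arbitrary object with a TypeError, but accepts the same object once it has been converted with str().

```python
import re

# Hypothetical stand-in for the object returned by tidy.parseString():
# it is not a string, but it can render itself as one via __str__.
class FakeTidyDocument(object):
    def __init__(self, markup):
        self._markup = markup

    def __str__(self):
        return self._markup

# Simplified stand-in for the regex check that fails in the traceback.
html_re = re.compile(r'<html')

doc = FakeTidyDocument('<html><body><p>hi</p></body></html>')

try:
    html_re.match(doc)        # passing the object itself
    raised = False
except TypeError:
    raised = True             # re only accepts strings or buffers

match = html_re.match(str(doc))   # passing its string form works
```

Assuming the real uTidy document object renders its cleaned markup through str(), as both answers suggest, BeautifulSoup(str(cleaned_html)) avoids the error for the same reason.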
