BeautifulSoup fails on a meta tag

def get_doc_ondrive(self,mypath):
    the_file = open(mypath,"r")
    line = the_file.readline()
    if(line != "")and (line!=None):
        self.soup = BeautifulSoup(line)
    else:
        print "Something is wrong with line:\n\n%r\n\n" % line
        quit()
    print "\t\t------------ line: %r ---------------\n" % line
    while line != "":
        line = the_file.readline()
        print "\t\t------------ line: %r ---------------\n" % line
        if(line != "")and (line!=None):
            print "\t\t\tinner if executes: line: %r\n" % line
            self.soup.feed(line)
    self.get_word_vector()
    self.has_doc = True

Traceback (most recent call last):
  File "test_docs.py", line 28, in <module>
    newdoc.get_doc_ondrive(testeee)
  File "/home/jddancks/Capstone/Python/code/pkg/vectors/DOCUMENT.py", line 117, in get_doc_ondrive
    self.soup.feed(line)
  File "/usr/lib/python2.7/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/sgmllib.py", line 139, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.7/sgmllib.py", line 298, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.7/sgmllib.py", line 348, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.7/sgmllib.py", line 385, in handle_starttag
    method(attrs)
  File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1618, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1172, in _feed
    smartQuotesTo=self.smartQuotesTo, isHTML=isHTML)
  File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1776, in __init__
    self._detectEncoding(markup, isHTML)
  File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 1922, in _detectEncoding
    '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
TypeError: expected string or buffer

1 Answer

User

#1 · Posted 2024-04-16 04:39:11

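I just read through the source, and I think I understand the problem. Basically, here is how BeautifulSoup expects things to go:

1. You call BeautifulSoup with the entire markup.
2. It sets self.markup to that markup.
3. It calls _feed on itself, which resets the document and parses it in the originally detected encoding.
4. While feeding itself, it encounters a meta tag that declares a different encoding.
5. To use this new encoding, it calls _feed on itself again, which re-parses self.markup.
6. After the first _feed and the _feed it recursed into have both finished, it sets self.markup to None. (After all, we have parsed everything now; <sarcasm>who could ever need the original markup again?</sarcasm>)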

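But here is how you are using it:

1. You call BeautifulSoup with the first line of the markup.
2. It sets self.markup to that first line and calls _feed.
3. _feed sees no interesting meta tag on the first line, so it finishes successfully.
4. The constructor thinks parsing is done, so it sets self.markup back to None and returns.
5. You call feed on the BeautifulSoup object, which goes straight to the SGMLParser.feed implementation, because BeautifulSoup does not override it.
6. It sees an interesting meta tag and calls _feed to parse the document in this new encoding.
7. _feed tries to construct a UnicodeDammit object from self.markup.
8. It blows up, because self.markup is None: _feed assumes it will only ever be called from within BeautifulSoup's constructor.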
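
The two sequences above can be modelled with a small toy class. This is only a sketch of the lifecycle described in this answer, not BeautifulSoup's actual source: ToySoup, META_TAG, XML_ENCODING and the reparsed flag are all hypothetical names invented for illustration.

```python
import re

# Hypothetical stand-ins; this is NOT BeautifulSoup's real code.
XML_ENCODING = re.compile(r'^<\?.*encoding=[\'"](.*?)[\'"].*\?>')
META_TAG = re.compile(r'<meta[^>]*charset=', re.IGNORECASE)


class ToySoup(object):
    def __init__(self, markup):
        self.markup = markup    # kept around so the document can be re-parsed
        self.reparsed = False
        self._feed()            # parse everything we were given
        self.markup = None      # the constructor assumes parsing is finished

    def _feed(self):
        # Encoding detection runs a regex over self.markup, so it needs a string.
        XML_ENCODING.match(self.markup)
        self.feed(self.markup)

    def feed(self, data):
        # A meta tag declaring a charset triggers one re-parse from self.markup.
        if META_TAG.search(data) and not self.reparsed:
            self.reparsed = True
            self._feed()


# Whole document at once: the re-parse happens while self.markup is still set.
whole = ToySoup('<html><meta charset="utf-8"></html>')

# First line only, then feed(): the re-parse finds self.markup already None.
partial = ToySoup('<html>\n')
try:
    partial.feed('<meta charset="utf-8">\n')
except TypeError as exc:
    print("feed() after construction failed: %s" % exc)
```

The failing call ends in the same kind of TypeError as the traceback in the question, because the regex is handed None instead of a string.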

The moral of the story is that feed is not a supported way of sending input to BeautifulSoup. You have to pass it all of the input at once.
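
A minimal sketch of what "all at once" could look like for the method in the question. The helper name load_document and its make_soup parameter are assumptions made for illustration; the constructor is passed in so the sketch does not depend on which BeautifulSoup version is installed.

```python
def load_document(path, make_soup):
    """Read the whole file, then build the parser from it in a single call.

    load_document and make_soup are hypothetical names; make_soup is expected
    to be a constructor such as the BeautifulSoup class.
    """
    # Read raw bytes so the parser can do its own encoding detection,
    # including any encoding declared in a meta tag.
    with open(path, "rb") as handle:
        markup = handle.read()
    if not markup.strip():
        raise ValueError("no markup found in %r" % (path,))
    # One call with the complete markup; feed() is never used.
    return make_soup(markup)
```

Inside get_doc_ondrive this would replace the readline loop with something like self.soup = load_document(mypath, BeautifulSoup), followed by the existing calls to self.get_word_vector() and self.has_doc = True.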

As for why BeautifulSoup(open(mypath, "r")) returns None, I have no idea; I don't see a __new__ defined on BeautifulSoup, so it seems like it has to return a BeautifulSoup object.

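All that said, you might want to look into using BeautifulSoup 4 instead of 3. Here's the porting guide. To support Python 3, it had to drop its dependency on SGMLParser, and I wouldn't be surprised if whatever bug you are running into was fixed during that part of the rewrite.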
