Subclassing the BeautifulSoup HTML parser raises a TypeError

Question

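I wrote a small utility that uses the excellent HTML parsing library BeautifulSoup.

Recently I wanted to improve the code so that all BeautifulSoup methods are available directly on the utility class (rather than through an attribute of the class), and I figured that subclassing the BeautifulSoup parser was the best way to achieve that.

Here is my class: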

import re
import urllib2

from BeautifulSoup import BeautifulSoup


class ScrapeInputError(Exception):
    pass


class Scrape(BeautifulSoup):
    """Base class to be subclassed.

    Basically a subclassed BeautifulSoup wrapper that provides
    basic URL fetching with urllib2,
    basic HTML parsing with BeautifulSoup,
    and some basic cleaning of head, scripts, etc.
    """

    def __init__(self, file):
        self._file = file
        # very basic input validation
        if not re.search(r"^http://", self._file):
            raise ScrapeInputError("please enter a url that starts with http://")

        self._page = urllib2.urlopen(self._file)  # fetch the page
        BeautifulSoup.__init__(self, self._page)  # run the HTML parser on it
        # previous approach: self._soup = BeautifulSoup(self._page)

This lets me initialize the class like this:

x = Scrape("http://someurl.com")

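and then traverse the tree with x.elem or x.find.

This works well for some BeautifulSoup methods (see above), but it fails wherever an iterator is involved, for example with "for e in x:".

The error message is: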

Traceback (most recent call last):
  File "<pyshell#86>", line 2, in <module>
    print e
  File "C:\Python27\lib\idlelib\rpc.py", line 595, in __call__
    value = self.sockio.remotecall(self.oid, self.name, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 210, in remotecall
    seq = self.asynccall(oid, methodname, args, kwargs)
  File "C:\Python27\lib\idlelib\rpc.py", line 225, in asynccall
    self.putmessage((seq, request))
  File "C:\Python27\lib\idlelib\rpc.py", line 324, in putmessage
    s = pickle.dumps(message)
  File "C:\Python27\lib\copy_reg.py", line 77, in _reduce_ex
    raise TypeError("a class that defines __slots__ without "
TypeError: a class that defines __slots__ without defining __getstate__ cannot be pickled

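I looked up the error message but couldn't find a solution I could use, because I don't want to tinker with BeautifulSoup's internals (and honestly, I don't understand __slots__ or __getstate__...). I just want to use its functionality.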
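For context, the TypeError at the bottom of the traceback is raised in copy_reg.py, which pickle uses, and the trace shows idlelib's rpc layer calling pickle.dumps(message). Python 2's pickle.dumps defaults to protocol 0. The snippet below is only a minimal sketch of that mechanism in isolation: Slotted and try_pickle are hypothetical names, not part of BeautifulSoup or IDLE, and the sketch does not identify which object in the session above was being pickled.

```python
import pickle


class Slotted(object):
    """Hypothetical class: defines __slots__ but no __getstate__."""
    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value


def try_pickle(obj, protocol):
    """Return None if pickling succeeds, else the TypeError message."""
    try:
        pickle.dumps(obj, protocol)
    except TypeError as exc:
        return str(exc)
    return None


# Protocol 0 is the old text protocol (the Python 2 default).
old_protocol_error = try_pickle(Slotted(1), 0)
print(old_protocol_error)
```

If this reasoning applies, the failure is triggered by how the value is sent to the IDLE shell rather than by the iteration itself, which would be worth checking by running the same loop outside IDLE.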

I also tried not subclassing and instead returning a BeautifulSoup object from the class's __init__ method, but __init__ returns None.
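To illustrate the point about __init__: Python requires __init__ to return None, so a parser object cannot be handed back from it. One possible alternative to subclassing, sketched here under assumptions, is to keep the parser as an attribute and forward attribute access and iteration to it. TinyTree is a hypothetical stand-in for a parser class such as BeautifulSoup, used only so the example runs on its own; BadScrape and this version of Scrape are likewise illustrative.

```python
class TinyTree(object):
    """Hypothetical stand-in for a parser class such as BeautifulSoup."""

    def __init__(self, markup):
        self.children = markup.split()

    def find(self, name):
        # Return the first child equal to name, or None if there is none.
        for child in self.children:
            if child == name:
                return child
        return None

    def __iter__(self):
        return iter(self.children)


class BadScrape(object):
    def __init__(self, markup):
        # __init__ must return None; returning an object raises TypeError.
        return TinyTree(markup)


class Scrape(object):
    """Wrapper that delegates to a parser object instead of inheriting."""

    def __init__(self, markup):
        self._soup = TinyTree(markup)

    def __getattr__(self, name):
        # Only called when normal lookup fails, so x.find reaches _soup.find.
        if name == "_soup":
            # Avoid infinite recursion if _soup has not been set yet.
            raise AttributeError(name)
        return getattr(self._soup, name)

    def __iter__(self):
        # Special methods are looked up on the type, not through
        # __getattr__, so iteration is forwarded explicitly.
        return iter(self._soup)


x = Scrape("html head body")
print(x.find("body"))
for e in x:
    print(e)

try:
    BadScrape("html head body")
except TypeError as exc:
    print("TypeError: %s" % exc)
```

The trade-off is that each special method the wrapper needs (such as __iter__ or __len__) has to be forwarded by hand, while ordinary methods are covered by __getattr__.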

Any help would be appreciated.
