Scrapy: custom link extractor to limit the number of followed links

Published 2024-06-16 09:45:35


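I am trying to write a custom link extractor based on Scrapy's LxmlLinkExtractor. The goal is to add a maxpages parameter so that, once the limit is reached, the crawler stops following links for that domain (and moves on to the next one). However, I cannot get the custom link extractor to work: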

from scrapy.linkextractors.lxmlhtml import *

class LimitedLinkExtractor(FilteringLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=False,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=(),
                 strip=True, maxpages=10): #added maxpages 

        tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
        tag_func = lambda x: x in tags
        attr_func = lambda x: x in attrs

        lx = LxmlParserLinkExtractor(
            tag=tag_func,
            attr=attr_func,
            unique=unique,
            process=process_value,
            strip=strip,
            canonicalized=canonicalize,
        )

        super(FilteringLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
            allow_domains=allow_domains, deny_domains=deny_domains,
            restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
            canonicalize=canonicalize, deny_extensions=deny_extensions,maxpages=maxpages) #added maxpages 

    def extract_links(self, response):
        base_url = get_base_url(response)
        if self.restrict_xpaths:
            docs = [subdoc
                    for x in self.restrict_xpaths
                    for subdoc in response.xpath(x)]
        else:
            docs = [response.selector]

        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            links = links[0:self.max_pages] #added maxpages 
            all_links.extend(self._process_links(links))
        return unique_list(all_links)

Apart from the lines I marked with the #added maxpages comment, everything else is the same as the default LxmlLinkExtractor that Scrapy provides in lxmlhtml.py. The error I get is:

"TypeError: object.__init__() takes exactly one argument (the instance to initialize)"

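A likely cause, sketched here without Scrapy so it runs on its own: super(FilteringLinkExtractor, self) starts the method lookup after FilteringLinkExtractor, so if that class has no further parent defining __init__, the call lands on object.__init__, which accepts no extra arguments. The classes FilteringBase, BrokenExtractor and LimitedExtractor below are hypothetical stand-ins, not Scrapy's API. The sketch also keeps maxpages on the instance rather than forwarding it to the parent, on the assumption that the parent does not accept it, and uses one attribute name consistently (the code above sets nothing on self yet reads self.max_pages).

```python
class FilteringBase:
    """Hypothetical stand-in for Scrapy's FilteringLinkExtractor."""

    def __init__(self, link_extractor, allow=()):
        self.link_extractor = link_extractor
        self.allow = allow

    def _extract_links(self, response):
        # Stand-in behaviour: the "response" is already a list of URLs.
        return list(response)


class BrokenExtractor(FilteringBase):
    def __init__(self, allow=(), maxpages=10):
        # Naming the parent in super() starts the lookup *after* the parent,
        # so this reaches object.__init__, which takes no extra arguments.
        super(FilteringBase, self).__init__(None, allow=allow, maxpages=maxpages)


class LimitedExtractor(FilteringBase):
    def __init__(self, allow=(), maxpages=10):
        # Start the lookup after the current class, and store maxpages
        # on the instance instead of passing it to the parent.
        super().__init__(None, allow=allow)
        self.maxpages = maxpages

    def extract_links(self, response):
        # Keep at most maxpages links from this call.
        return self._extract_links(response)[:self.maxpages]


def demo():
    """Return the broken class's error text and the limited link list."""
    try:
        BrokenExtractor()
        error = None
    except TypeError as exc:
        error = str(exc)
    links = LimitedExtractor(maxpages=2).extract_links(["a", "b", "c"])
    return error, links
```

Note that a slice like this caps the links returned per call, not per domain; a per-domain limit would need a counter that persists across responses.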
