<p>You can create a reverse-offsite middleware by simply inverting the <code>should_follow()</code> method:</p>
<pre><code># mycrawler/middlewares.py
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware
from scrapy.utils.httpobj import urlparse_cached


class ReverseOffsiteMiddleware(OffsiteMiddleware):
    seen = set()

    def should_follow(self, request, spider):
        # Invert the stock check: follow only requests that the
        # standard OffsiteMiddleware would have filtered out.
        is_offsite = not super().should_follow(request, spider)
        if not is_offsite:
            return False
        # Do not schedule a request for a domain we have already visited.
        domain = urlparse_cached(request).hostname
        if domain in self.seen:
            return False
        # Otherwise remember the domain and let the request through.
        self.seen.add(domain)
        return True
</code></pre>
<p>Then activate it in your <code>settings.py</code>:</p>
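<p>A minimal sketch of that activation, assuming the project package is <code>mycrawler</code> (as in the file comment above): disable the stock <code>OffsiteMiddleware</code> and register the subclass at the same priority so the inverted check replaces the default one.</p>

```python
# settings.py (sketch; adjust the module path to your project)
SPIDER_MIDDLEWARES = {
    # Turn off the default offsite filter so it does not drop
    # the offsite requests we now want to keep.
    "scrapy.spidermiddlewares.offsite.OffsiteMiddleware": None,
    # Enable the inverted version at the default priority (500).
    "mycrawler.middlewares.ReverseOffsiteMiddleware": 500,
}
```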
<p>Now every domain listed in <code>spider.allowed_domains</code> will be ignored :)</p>