<p>The solution is to pass a <code>process_links</code> function to the <code>Rule</code> that wraps the SgmlLinkExtractor.
The documentation is here: <a href="http://doc.scrapy.org/en/latest/topics/link-extractors.html" rel="nofollow noreferrer">http://doc.scrapy.org/en/latest/topics/link-extractors.html</a></p>
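<pre><code># Import paths for the older Scrapy releases that ship SgmlLinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class testSpider(CrawlSpider):
    name = "test"
    bot_name = 'test'
    allowed_domains = ["news.google.com"]
    start_urls = ["https://news.google.com/"]
    rules = (
        # process_links names the spider method that receives each list of extracted links
        Rule(SgmlLinkExtractor(allow_domains=()), callback='parse_items',
             process_links="filter_links", follow=True),
    )

    def filter_links(self, links):
        # Print links outside the allowed domain; every link is still returned
        for link in links:
            if self.allowed_domains[0] not in link.url:
                print(link.url)
        return links

    def parse_items(self, response):
        ### ...
        pass
</code></pre>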
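<p>The <code>filter_links</code> method above only prints off-domain links and returns the list unchanged. If the goal is to actually drop them, the filtering step could look roughly like the sketch below. It is written as plain Python 3 so it runs without Scrapy: <code>Link</code> is a hypothetical stand-in for Scrapy's link object (only <code>.url</code> is used), and <code>is_allowed</code> is an illustrative helper, not part of Scrapy. It compares hostnames rather than testing for a substring, since a substring check would also match a URL such as <code>https://example.com/?ref=news.google.com</code>.</p>

```python
from collections import namedtuple
from urllib.parse import urlparse

# Hypothetical stand-in for Scrapy's Link object; only the .url attribute is used.
Link = namedtuple("Link", ["url"])


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def filter_links(links, allowed_domains):
    """Keep only the links whose host falls under allowed_domains."""
    return [link for link in links if is_allowed(link.url, allowed_domains)]


if __name__ == "__main__":
    links = [
        Link("https://news.google.com/topstories"),
        Link("https://example.com/?ref=news.google.com"),
    ]
    # Only the first link survives the filter.
    for link in filter_links(links, ["news.google.com"]):
        print(link.url)
```

<p>Inside a spider, the same list comprehension could serve as the body of the method named in <code>process_links</code>, with <code>self.allowed_domains</code> passed as the second argument.</p>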