Answers to the question: Scraping multiple websites with a single Scrapy spider

I am using Scrapy to fetch data from <a href="http://jobs.placementindia.com/lucknow" rel="nofollow">this website</a>. Here is the spider code:

<pre><code>import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class StackItem(scrapy.Item):
    # Declare fields on the fly so that any key can be assigned.
    def __setitem__(self, key, value):
        if key not in self.fields:
            self.fields[key] = scrapy.Field()
        self._values[key] = value


class betaSpider(CrawlSpider):
    name = "betaSpider"

    def __init__(self, *args, **kwargs):
        super(betaSpider, self).__init__(*args, **kwargs)
        # The start URL is supplied as a spider argument.
        self.start_urls = [kwargs.get('start_url')]

    # Follow the "next" pagination links and parse every page reached.
    rules = (
        Rule(
            LinkExtractor(
                unique=True,
                allow=(r'.*\?id1=.*',),
                restrict_xpaths=('//a[@class="prevNext next"]',),
            ),
            callback="parse_items",
            follow=True,
        ),
    )

    def parse_items(self, response):
        # Each job listing is an article element with class "classified".
        for post in response.xpath("//article[@class='classified']"):
            item = StackItem()
            item["job_role"] = post.xpath("div[@class='uu mb2px']/a/strong/text()").extract()
            item["company"] = post.xpath("p[1]/text()").extract()
            item["location"] = post.xpath("p[@class='mb5px b red']/text()").extract()
            item["desc"] = post.xpath("details[@class='aj mb10px']/text()").extract()
            item["read_more"] = post.xpath("div[@class='uu mb2px']/a/@href").extract()
            yield item
</code></pre>

Here is the item pipeline code: ^{pr2}$

This works fine. Now I have to scrape the following websites (for example) with the same spider:

<ol>
<li><a href="http://www.freejobalert.com/government-jobs/" rel="nofollow">http://www.freejobalert.com/government-jobs/</a></li>
<li><a href="https://www.sarkariexaam.com/" rel="nofollow">https://www.sarkariexaam.com/</a></li>
</ol>

I need to scrape all the tags from the websites above and store them in a CSV file using an item pipeline.

In fact, the list of websites to be scraped is endless. In this project, the user enters a URL and the scraped results are returned to that user. So I want a generic spider that can scrape any website.

For a single website it works fine. But how can this be done for multiple sites with different structures? Is Scrapy enough to solve this?
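The "scrape all tags into CSV" requirement does not depend on any one site's layout, so it can be illustrated without site-specific XPath expressions. The following is only a minimal sketch under stated assumptions, using Python's standard library rather than Scrapy's own API: the names `TagTextCollector`, `extract_rows`, and `rows_to_csv` are hypothetical, and the `url`, `tag`, `text` columns are an assumed reading of what "all tags" should mean here.

```python
import csv
import io
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagTextCollector(HTMLParser):
    """Collect (tag, text) pairs for every element that directly contains text."""

    def __init__(self):
        super().__init__()
        self._stack = []  # names of the currently open tags
        self.rows = []    # collected (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        if tag in self._stack:
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        # Skip whitespace-only text and the contents of script/style blocks.
        if text and self._stack and self._stack[-1] not in ("script", "style"):
            self.rows.append((self._stack[-1], text))


def extract_rows(url, html):
    """Return one dict per text-bearing tag, whatever the page structure is."""
    parser = TagTextCollector()
    parser.feed(html)
    parser.close()
    return [{"url": url, "tag": tag, "text": text} for tag, text in parser.rows]


def rows_to_csv(rows):
    """Serialise the rows to CSV text with a fixed header."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["url", "tag", "text"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Inside a Scrapy callback, `extract_rows(response.url, response.text)` could supply the rows, each yielded as an item for the CSV pipeline. Whether such coarse, structure-agnostic output is useful for every site remains an open question; sites that need specific fields would still call for their own selectors.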

