Scrapy Python spider cannot find links using LinkExtractor or a manual Request()

Question

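I am trying to write a Scrapy spider that crawls all the results pages on this site: https://www.ghcjobs.apply2jobs.com... The code needs to do three things:

(1) Crawl every page from 1 to 1000. The pages are essentially identical and differ only in the last part of the URL: &CurrentPage=#.

(2) Follow every link in the results table that leads to a job posting; these links have the class SearchResult. They are the only links in the table, so I have no trouble with this part.

(3) Store the information shown on the job description page as key-value pairs in JSON format. (This part mostly works.)

I have used Scrapy and CrawlSpider before, using the 'rule = [Rule(LinkExtractor(allow=' approach to parse pages recursively and find every link that matches a given regex pattern. Right now I am stuck on step 1, crawling the thousand results pages.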

Below is my spider code:

# Import paths for Scrapy 1.0+; older releases used scrapy.contrib.spiders
# and scrapy.contrib.linkextractors instead.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from genesisSpider.items import GenesisJob

class genesis_crawl_spider(CrawlSpider):
    name = "genesis"
    #allowed_domains = ['http://www.ghcjobs.apply2jobs.com']
    start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1']

    # allow &CurrentPage= up to 1000, currently ~ 512
    # Note: [1-1000] is a character class that matches a single '0' or '1',
    # so \d{1,4} is used here to accept any page number of 1 to 4 digits.
    rules = [Rule(LinkExtractor(allow=r"^https://www\.ghcjobs\.apply2jobs\.com/ProfExt/index\.cfm\?fuseaction=mExternal\.returnToResults&CurrentPage=\d{1,4}$"), callback='parse_inner_page')]

    # The callback must be a method of the spider class so the Rule can find it by name.
    def parse_inner_page(self, response):
        self.log('===========Entrered Inner Page============')
        self.log(response.url)
        item = GenesisJob()
        item['url'] = response.url

        yield item

Here is the spider's output, with part of the startup output at the top trimmed:

2014-09-02 16:02:48-0400 [genesis] DEBUG: Crawled (200) <GET https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1> (referer: None) ['partial']
2014-09-02 16:02:48-0400 [genesis] DEBUG: Crawled (200) <GET https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults> (referer: https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1) ['partial']
2014-09-02 16:02:48-0400 [genesis] DEBUG: ===========Entrered Inner Page============
2014-09-02 16:02:48-0400 [genesis] DEBUG: https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults
2014-09-02 16:02:48-0400 [genesis] DEBUG: Scraped from <200 https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults>
        {'url': 'https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?CurrentPage=1&fuseaction=mExternal.returnToResults'}
2014-09-02 16:02:48-0400 [genesis] INFO: Closing spider (finished)
2014-09-02 16:02:48-0400 [genesis] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 930,
         'downloader/request_count': 2,
         'downloader/request_method_count/GET': 2,
         'downloader/response_bytes': 92680,
         'downloader/response_count': 2,
         'downloader/response_status_count/200': 2,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2014, 9, 2, 20, 2, 48, 611000),
         'item_scraped_count': 1,
         'log_count/DEBUG': 7,
         'log_count/INFO': 7,
         'request_depth_max': 1,
         'response_received_count': 2,
         'scheduler/dequeued': 2,
         'scheduler/dequeued/memory': 2,
         'scheduler/enqueued': 2,
         'scheduler/enqueued/memory': 2,
         'start_time': datetime.datetime(2014, 9, 2, 20, 2, 48, 67000)}
2014-09-02 16:02:48-0400 [genesis] INFO: Spider closed (finished)

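Right now I am stuck on goal (1) of this project. As you can see, my spider only crawls the start_url page. My regex should be correct, since I have tested it. My callback, parse_inner_page, also works, as the debugging messages I inserted show, but only on the first page. Am I using 'Rule' incorrectly? I was wondering whether the problem might be caused by the pages being served over HTTPS...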
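One detail worth checking separately from the HTTPS question: in a regular expression, `[1-1000]` is a character class, not a numeric range. It matches exactly one character that is `0` or `1`, so together with the `$` anchor only `CurrentPage=0` and `CurrentPage=1` can pass. That would be consistent with the log above, where the only followed link is a `CurrentPage=1` one. The sketch below compares that class with `\d{1,4}`, which is one possible replacement (assuming the intent is "any page number of one to four digits"); `PREFIX` and `page_url` are illustrative names, not part of the spider.

```python
import re

PREFIX = (
    r"^https://www\.ghcjobs\.apply2jobs\.com/ProfExt/index\.cfm"
    r"\?fuseaction=mExternal\.returnToResults&CurrentPage="
)

# The character class from the question: contains only '0' and '1'.
char_class = re.compile(PREFIX + r"[1-1000]$")
# A possible replacement: one to four digits.
digits = re.compile(PREFIX + r"\d{1,4}$")


def page_url(page):
    """Build a results-page URL for the given page number (hypothetical helper)."""
    return (
        "https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm"
        "?fuseaction=mExternal.returnToResults&CurrentPage=%d" % page
    )


# Print, for each page, whether each pattern accepts its URL.
for page in (1, 2, 10, 512):
    url = page_url(page)
    print(page, bool(char_class.search(url)), bool(digits.search(url)))
```

Only page 1 is accepted by the character-class pattern, while the digit pattern accepts all four pages.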

To try to work around this, I manually requested the second page of results, but that did not work either. Here is that part of the code.

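# Note: a Request is only scheduled when yielded from a spider callback, and callback must be a callable (a string name works only in Rule).
yield Request("https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=2", callback=self.parse_inner_page)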

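Can anyone give me some guidance? Is there a better way to do this? I have been researching this on StackOverflow and in the Scrapy documentation since last Friday. Thank you very much.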
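As for another way to cover step (1): since the results pages differ only in the trailing `CurrentPage` value, the page URLs could be generated up front instead of being discovered through link extraction. This is a minimal sketch, assuming the URL pattern holds for every page; `BASE_URL` and `result_page_urls` are hypothetical names, not part of Scrapy.

```python
BASE_URL = (
    "https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm"
    "?fuseaction=mExternal.returnToResults&CurrentPage={page}"
)


def result_page_urls(last_page):
    """Return the results-page URLs for pages 1..last_page (inclusive)."""
    return [BASE_URL.format(page=page) for page in range(1, last_page + 1)]


# Show the first few generated URLs.
for url in result_page_urls(3):
    print(url)
```

A list like this could be assigned to `start_urls`, provided the server actually serves those URLs directly, without a prior form submission.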

Update: I have solved the problem. The issue was the start URL I was using.

start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.returnToResults&CurrentPage=1']

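This URL leads to a post-form-submission page, the result you get after clicking the "Search" button on this page. That step runs JavaScript on the client side to submit the form to the server, which then returns the full job listing, pages 1 through 512. However, there is another hard-coded URL that apparently calls the server directly, without needing any client-side JavaScript. So now my start URL is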

start_urls = ['https://www.ghcjobs.apply2jobs.com/ProfExt/index.cfm?fuseaction=mExternal.searchJobs']

Everything is back on track! In the future, remember to check whether there is a URL that calls the server resource without relying on JavaScript.
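One way to act on that advice is to look at the HTML the server returns before any JavaScript runs, which is all Scrapy sees, and list the links it actually contains. The following is a rough sketch using only the standard library; `SAMPLE_HTML` is made-up markup for illustration, not the real page, and `LinkCollector` and `pagination_links` are hypothetical helper names.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def pagination_links(html):
    """Return the hrefs that carry a CurrentPage=<number> query parameter."""
    collector = LinkCollector()
    collector.feed(html)
    return [h for h in collector.hrefs if re.search(r"[?&]CurrentPage=\d+", h)]


# Made-up markup standing in for a response fetched without JavaScript.
SAMPLE_HTML = """
<a class="SearchResult" href="/job/123">Job posting</a>
<a href="index.cfm?fuseaction=mExternal.returnToResults&amp;CurrentPage=2">2</a>
<a href="#" onclick="submitForm()">Search</a>
"""

print(pagination_links(SAMPLE_HTML))
```

If the list comes back empty, or much shorter than what the browser shows, the pagination links are probably produced by a client-side form submission, and a different entry URL is worth looking for.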

