Dynamic start_urls value
I'm new to Scrapy and Python. I wrote a spider and it runs fine with the initialized start_urls value.
It also works if I hard-code a URL in the __init__ part of the code, e.g.:

self.start_urls = 'http://something.com'

However, when I read the value from a JSON file and build a list from it, I get a "Missing scheme in request url" error whose URL reads Missing%20value.
I figure I'm missing something obvious in Scrapy or Python, since I'm still a newbie.
class SiteFeedConstructor(CrawlSpider, FeedConstructor):
    name = "Data_Feed"
    start_urls = ['http://www.cnn.com/']

    def __init__(self, *args, **kwargs):
        FeedConstructor.__init__(self, **kwargs)
        kwargs = {}
        super(SiteFeedConstructor, self).__init__(*args, **kwargs)
        self.name = str(self.config_json.get('name', 'Missing value'))
        self.start_urls = str(self.config_json.get('start_urls', 'Missing value'))
        self.start_urls = self.start_urls.split(",")
Error message:
Traceback (most recent call last):
  File "/usr/bin/scrapy", line 4, in <module>
    execute()
  File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 132, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 97, in _run_print_help
    func(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 139, in _run_command
    cmd.run(args, opts)
  File "/usr/lib/python2.7/dist-packages/scrapy/commands/runspider.py", line 64, in run
    self.crawler.crawl(spider)
  File "/usr/lib/python2.7/dist-packages/scrapy/crawler.py", line 42, in crawl
    requests = spider.start_requests()
  File "/usr/lib/python2.7/dist-packages/scrapy/spider.py", line 55, in start_requests
    reqs.extend(arg_to_iter(self.make_requests_from_url(url)))
  File "/usr/lib/python2.7/dist-packages/scrapy/spider.py", line 59, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 26, in __init__
    self._set_url(url)
  File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 61, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: Missing%20value
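(For context: the %20 is just a URL-encoded space. The lookup self.config_json.get('start_urls', 'Missing value') fell back to its default string 'Missing value', and Request() rejected it because it has no http:// scheme. A minimal reproduction, assuming only that Scrapy is installed:

from scrapy.http import Request

# 'Missing value' contains no scheme; Scrapy percent-encodes the
# space and then raises, exactly as in the traceback above.
Request('Missing value')  # ValueError: Missing scheme in request url: Missing%20value

)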
1 Answer
Instead of defining an __init__() method, override the start_requests() method:
This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, make_requests_from_url() is used instead to create the Requests. This method is also called only once by Scrapy, so it's safe to implement it as a generator.
class SiteFeedConstructor(CrawlSpider, FeedConstructor):
    name = "Data_Feed"

    def start_requests(self):
        self.name = str(self.config_json.get('name', 'Missing value'))
        for url in str(self.config_json.get('start_urls', 'Missing value')).split(","):
            yield self.make_requests_from_url(url)
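As the traceback shows, make_requests_from_url() simply wraps the URL in Request(url, dont_filter=True), so start_requests() can also yield Request objects directly. A sketch along those lines (assuming, as above, that config_json holds a comma-separated start_urls string) that additionally skips scheme-less entries such as the 'Missing value' default:

from scrapy.http import Request

def start_requests(self):
    self.name = str(self.config_json.get('name', 'Missing value'))
    for url in str(self.config_json.get('start_urls', '')).split(','):
        url = url.strip()
        # Skip empty or placeholder entries so a missing config key
        # can't turn into a scheme-less request like 'Missing%20value'.
        if url.startswith(('http://', 'https://')):
            yield Request(url, dont_filter=True)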