Variables in Scrapy

Can I use variables inside start_urls? Please look at the scripts below.

This first script runs fine:
from scrapy.spider import Spider
from scrapy.selector import Selector

from example.items import ExampleItem


class ExampleSpider(Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/search-keywords=['0750692995']",
        "http://www.example.com/search-keywords=['0205343929']",
        "http://www.example.com/search-keywords=['0874367379']",
    ]

    def parse(self, response):
        hxs = Selector(response)
        item = ExampleItem()
        item['url'] = response.url
        item['price'] = hxs.select("//li[@class='mpbold']/a/text()").extract()
        item['title'] = hxs.select("//span[@class='title L']/text()").extract()
        return item
But I want something like this:
from scrapy.spider import Spider
from scrapy.selector import Selector

from example.items import ExampleItem


class ExampleSpider(Spider):
    name = "example"
    allowed_domains = ["example.com"]
    pro_id = ["0750692995", "0205343929", "0874367379"]  # (I added this line)
    start_urls = [
        "http://www.example.com/search-keywords=['pro_id']",  # (and I changed this line)
    ]

    def parse(self, response):
        hxs = Selector(response)
        item = ExampleItem()
        item['url'] = response.url
        item['price'] = hxs.select("//li[@class='mpbold']/a/text()").extract()
        item['title'] = hxs.select("//span[@class='title L']/text()").extract()
        return item
I want to run the script by feeding the pro_id numbers into start_urls one at a time. Is there a way to do that? When I run the script, the requested URL is still literally "http://www.example.com/search-keywords=['pro_id']" instead of "http://www.example.com/search-keywords=0750692995". How should this script be written? Thanks for your help.
EDIT: after making the changes @paul t suggested, I get the following error:
2014-03-02 08:39:44+0700 [example] ERROR: Obtaining request from start requests
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1192, in run
    self.mainLoop()
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
    self.runUntilCurrent()
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "C:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\utils\reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
--- <exception caught here> ---
  File "C:\Python27\lib\site-packages\scrapy-0.22.2-py2.7.egg\scrapy\core\engine.py", line 111, in _next_request
    request = next(slot.start_requests)
  File "C:\Users\S\desktop\example\example\spiders\example_spider.py", line 13, in start_requests
    yield Request(self.start_urls_base % pro_id, dont_filter=True)
exceptions.NameError: global name 'Request' is not defined
3 Answers
0
I think you can solve this with a for loop inside a list comprehension, like this:
start_urls = [
    "http://www.example.com/search-keywords=" + i for i in pro_id
]
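For context, here is a minimal sketch of where that comprehension would sit, reusing the pro_id list from the question; it works in a class body because the comprehension's iterable, pro_id, is evaluated right there alongside the other class attributes:

from scrapy.spider import Spider


class ExampleSpider(Spider):
    name = "example"
    allowed_domains = ["example.com"]
    pro_id = ["0750692995", "0205343929", "0874367379"]
    # Expand the id list into full URLs once, at class-definition time.
    start_urls = [
        "http://www.example.com/search-keywords=" + i for i in pro_id
    ]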
0
First of all, you need to import Request:
from scrapy.http import Request
After that, you can follow Paul's suggestion:
def start_requests(self):
    for pro_id in self.pro_ids:
        yield Request(self.start_urls_base % pro_id, dont_filter=True)
5
One way to do this is to override the spider's start_requests() method:
from scrapy.http import Request  # needed, otherwise Request raises the NameError above


class ExampleSpider(Spider):
    name = "example"
    allowed_domains = ["example.com"]
    pro_ids = ["0750692995", "0205343929", "0874367379"]
    start_urls_base = "http://www.example.com/search-keywords=['%s']"

    def start_requests(self):
        for pro_id in self.pro_ids:
            yield Request(self.start_urls_base % pro_id, dont_filter=True)
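Putting the pieces together, here is a minimal sketch of the complete spider under the question's own assumptions (the example.com URL pattern, ExampleItem, and the XPath expressions all come from the question, not from a real site); the URL template below also drops the ['...'] quoting so the generated URLs match the search-keywords=0750692995 form the asker wants:

from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spider import Spider

from example.items import ExampleItem


class ExampleSpider(Spider):
    name = "example"
    allowed_domains = ["example.com"]
    pro_ids = ["0750692995", "0205343929", "0874367379"]
    start_urls_base = "http://www.example.com/search-keywords=%s"

    def start_requests(self):
        # Emit one request per product id instead of relying on start_urls.
        for pro_id in self.pro_ids:
            yield Request(self.start_urls_base % pro_id, dont_filter=True)

    def parse(self, response):
        hxs = Selector(response)
        item = ExampleItem()
        item['url'] = response.url
        # .select() is kept from the question; newer Scrapy spells this .xpath()
        item['price'] = hxs.select("//li[@class='mpbold']/a/text()").extract()
        item['title'] = hxs.select("//span[@class='title L']/text()").extract()
        return item

dont_filter=True tells the scheduler not to drop any of these requests as duplicates, which matches what the default start_requests() does for the URLs in start_urls.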