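I want to scrape http://www.3andena.com/. The site starts in Arabic by default and stores the language setting in cookies. Trying to reach the English version directly through the URL (http://www.3andena.com/home.php?sl=en) causes a problem and returns a server error.
So I want to set the cookie "store_language" to "en" and then scrape the site using that cookie value.
I am using a CrawlSpider with a few rules.
Here is the code: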
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log
from bkam.items import Product
from scrapy.http import Request
import re

class AndenaSpider(CrawlSpider):
    name = "andena"
    domain_name = "3andena.com"
    start_urls = ["http://www.3andena.com/Kettles/?objects_per_page=10"]
    product_urls = []

    rules = (
        # The following rule is for pagination
        Rule(SgmlLinkExtractor(allow=(r'\?page=\d+$'),), follow=True),
        # The following rule is for product details
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "products-dialog")]//table//tr[contains(@class, "product-name-row")]/td'), unique=True), callback='parse_product', follow=True),
    )

    def start_requests(self):
        yield Request('http://3andena.com/home.php?sl=en', cookies={'store_language':'en'})
        for url in self.start_urls:
            yield Request(url, callback=self.parse_category)

    def parse_category(self, response):
        hxs = HtmlXPathSelector(response)
        self.product_urls.extend(hxs.select('//td[contains(@class, "product-cell")]/a/@href').extract())
        for product in self.product_urls:
            yield Request(product, callback=self.parse_product)

    def parse_product(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = Product()
        '''
        some parsing
        '''
        items.append(item)
        return items

SPIDER = AndenaSpider()
Here is the log:
2012-05-30 19:27:13+0000 [andena] DEBUG: Redirecting (301) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://3andena.com/home.php?sl=en>
2012-05-30 19:27:14+0000 [andena] DEBUG: Redirecting (302) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098>
2012-05-30 19:27:14+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/Kettles/?objects_per_page=10> (referer: None)
2012-05-30 19:27:15+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/B-and-D-Concealed-coil-pan-kettle-JC-62.html> (referer: http://www.3andena.com/Kettles/?objects_per_page=10)
Modify your code as follows:
Request objects accept an optional cookies keyword argument; see the documentation.
Straight from the Scrapy documentation for Requests and Responses, you will need something like this:
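This is how I do it as of Scrapy 0.24.6:
Scrapy calls make_requests_from_url with each URL in the spider's start_urls attribute. What the code above does is let the default implementation create the request and then add a foo cookie with the value bar. (Or it changes the cookie's value to bar, if the request produced by the default implementation happens to carry a foo cookie already.)
In case you wonder what happens with requests that are not created from start_urls: Scrapy's cookie middleware will remember the cookie set by the code above and set it on all future requests that share the same domain as the request on which you explicitly added the cookie.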