As you know, when we use Scrapy Splash with Crawlera, we use the following Lua script:
function use_crawlera(splash)
    -- Make sure you pass your Crawlera API key in the 'crawlera_user' arg.
    -- Have a look at the file spiders/quotes-js.py to see how to do it.
    -- Find your Crawlera credentials in https://app.scrapinghub.com/
    local user = splash.args.crawlera_user
    local host = 'proxy.crawlera.com'
    local port = 8010
    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'
    splash:on_request(function (request)
        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_header(session_header, session_id)
        request:set_proxy{host, port, username=user, password=''}
    end)
    splash:on_response_headers(function (response)
        -- type() always returns a string, so compare the header value itself to nil.
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
    })
    assert(splash:wait(3))
    return {
        html = splash:html(),
        cookies = splash:get_cookies(),
    }
end
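The Lua script has a session_id variable that I really need. How can I access it from the Scrapy response? I tried response.session_id, with no luck.

To access it, use the execute endpoint: meta['splash']['endpoint'] = 'execute'.

If you use scrapy.Request, render.json is the default endpoint, but for scrapy_splash.SplashRequest the default endpoint is render.html. These two examples show how to set the endpoint: https://github.com/scrapy-plugins/scrapy-splash#requests

Now you have access to it in the parse method.

For the X-Crawlera-Session header: use splash:set_result_header.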
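As a minimal sketch of the execute-endpoint route, the snippet below assumes the script from the question is changed in two ways that are not in the original: session_id is moved to file scope so main() can see it, and main() returns it under a key named session_id (my naming, purely illustrative). The helpers splash_meta and session_id_from are hypothetical names as well. It also assumes the scrapy-splash middlewares are enabled, so that the decoded JSON result is exposed as response.data.

```python
# Sketch only: the Lua below is the script from the question, changed so that
# main() can read session_id and return it. The 'session_id' key is an assumption.
LUA_SOURCE = """
local session_id = 'create'  -- moved to file scope so main() can read it

function use_crawlera(splash)
    local user = splash.args.crawlera_user
    local session_header = 'X-Crawlera-Session'
    splash:on_request(function (request)
        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_header(session_header, session_id)
        request:set_proxy{'proxy.crawlera.com', 8010, username=user, password=''}
    end)
    splash:on_response_headers(function (response)
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go{splash.args.url, headers=splash.args.headers,
                     http_method=splash.args.http_method})
    assert(splash:wait(3))
    return {
        html = splash:html(),
        cookies = splash:get_cookies(),
        session_id = session_id,  -- hypothetical key, read on the Python side
    }
end
"""


def splash_meta(crawlera_user):
    """Build the meta dict for scrapy.Request so that Splash runs the script."""
    return {
        'splash': {
            'endpoint': 'execute',  # needed to run a Lua script
            'args': {
                'lua_source': LUA_SOURCE,
                'crawlera_user': crawlera_user,
            },
        }
    }


def session_id_from(data):
    """Read the session ID from the decoded JSON result (response.data)."""
    return data.get('session_id')
```

In a spider this would be used roughly as `yield scrapy.Request(url, self.parse, meta=splash_meta(api_key))`, and inside parse as `session_id_from(response.data)`.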