How to get the session_id when using the Crawlera Lua script with Scrapy Splash?

Posted 2024-05-23 20:15:02


As you know, when we try to use Scrapy Splash together with Crawlera, we use the following Lua script:

function use_crawlera(splash)
    -- Make sure you pass your Crawlera API key in the 'crawlera_user' arg.
    -- Have a look at the file spiders/quotes-js.py to see how to do it.
    -- Find your Crawlera credentials in https://app.scrapinghub.com/
    local user = splash.args.crawlera_user

    local host = 'proxy.crawlera.com'
    local port = 8010
    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'

    splash:on_request(function (request)
        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_header(session_header, session_id)
        request:set_proxy{host, port, username=user, password=''}
    end)

    splash:on_response_headers(function (response)
        -- compare the header value itself; type() never returns nil, so the
        -- original check always passed and could overwrite session_id with nil
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    use_crawlera(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
    })
    assert(splash:wait(3))
    return {
        html = splash:html(),
        cookies = splash:get_cookies(),
    }
end

There is a session_id variable in the Lua script that I really need, but how can I access it from Scrapy's response?

I have tried response.session_id and a few similar attributes, but nothing works.


Tags: id, response, request, session, local, args, function, end
2 Answers
  1. In the Lua script, also return the HAR data (https://splash.readthedocs.io/en/stable/scripting-ref.html#splash-har):
    return {
        html = splash:html(),
        har = splash:har(),
        cookies = splash:get_cookies(),
    }
  2. Assuming you are using scrapy-splash (https://github.com/scrapy-plugins/scrapy-splash), make sure the execute endpoint is set on your request:

meta['splash']['endpoint'] = 'execute'

If you use a plain scrapy.Request, render.json is the default endpoint, but for scrapy_splash.SplashRequest the default endpoint is render.html. Have a look at these two examples to see how to set the endpoint: https://github.com/scrapy-plugins/scrapy-splash#requests (a sketch of such a request follows below).
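
For reference, a minimal sketch of such a request through the execute endpoint could look like the following. This is my own illustration, not part of the original answer; the file name use_crawlera.lua and the CRAWLERA_APIKEY settings key are hypothetical:

    import scrapy
    from scrapy_splash import SplashRequest

    # Hypothetical: the Lua script shown in the question, loaded as a string.
    with open('use_crawlera.lua') as f:
        LUA_SOURCE = f.read()

    class QuotesJsSpider(scrapy.Spider):
        name = 'quotes-js'

        def start_requests(self):
            yield SplashRequest(
                'http://quotes.toscrape.com/js/',
                callback=self.parse,
                endpoint='execute',  # run the Lua script instead of render.html
                args={
                    'lua_source': LUA_SOURCE,  # the Lua script shown above
                    # hypothetical settings key holding the Crawlera API key
                    'crawlera_user': self.settings['CRAWLERA_APIKEY'],
                    'timeout': 60,
                },
            )

        def parse(self, response):
            self.logger.info('Rendered %s', response.url)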

  3. Now you can access the X-Crawlera-Session header in your parse method, for example:
>>> headers = json.loads(response.text)['har']['log']['entries'][0]['response']['headers']
>>> next(x for x in headers if x['name'] == 'X-Crawlera-Session')
{u'name': u'X-Crawlera-Session', u'value': u'2124641382'}
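
Putting these pieces together, the spider's parse callback could pull the session id out of the HAR data roughly like this. This is a sketch under my own assumptions, not code from the original answer; with scrapy_splash the decoded JSON is normally available as response.data, and json.loads(response.text) as shown above is the fallback:

    import json

    def parse(self, response):
        # SplashJsonResponse exposes the decoded JSON as response.data;
        # otherwise fall back to parsing the body, as in the example above.
        data = getattr(response, 'data', None) or json.loads(response.text)
        headers = data['har']['log']['entries'][0]['response']['headers']
        session_id = next(
            (h['value'] for h in headers if h['name'] == 'X-Crawlera-Session'),
            None,  # header missing, e.g. the request did not go through Crawlera
        )
        self.logger.info('Crawlera session: %s', session_id)
        # The id can now be stored and reused, e.g. sent back to the Lua script
        # as a splash arg on follow-up requests (hypothetical usage).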
