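I'm trying to extract data (title, price, and description) from an AJAX request, but I can't get it to work, even after changing the User-Agent.
Link: https://scrapingclub.com/exercise/detail_header/
AJAX (the data to extract): https://scrapingclub.com/exercise/ajaxdetail_header/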
import scrapy

class UseragentSpider(scrapy.Spider):
    name = 'useragent'
    allowed_domains = ['scrapingclub.com/exercise/ajaxdetail_header/']
    start_urls = ['https://scrapingclub.com/exercise/ajaxdetail_header/']
    user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"

    def parse(self, response):
        cardb = response.xpath("//div[@class='card-body']")
        for thing in cardb:
            title = thing.xpath(".//h3")
            yield {'title': title}
Error log:
2020-09-07 20:34:39 [scrapy.core.engine] INFO: Spider opened
2020-09-07 20:34:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-07 20:34:39 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-07 20:34:40 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://scrapingclub.com/robots.txt> (referer: None)
2020-09-07 20:34:40 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://scrapingclub.com/exercise/ajaxdetail_header/> (referer: None)
2020-09-07 20:34:40 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://scrapingclub.com/exercise/ajaxdetail_header/>: HTTP status code is not handled or not allowed
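AJAX requests should send the 'X-Requested-With' header, but not all servers check for it. This server does check it. It does not, however, check User-Agent.
The server sends the data as JSON, so XPath is useless here.
I tested it with requests instead of Scrapy because that was simpler for me.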
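A minimal sketch of that test with requests. Two things here are assumptions rather than anything confirmed by the post: the header value 'XMLHttpRequest' (the conventional value browsers send for AJAX calls) and the JSON key names title, price and description. The helper names extract_item and fetch_item are made up for illustration.

```python
import requests

AJAX_URL = "https://scrapingclub.com/exercise/ajaxdetail_header/"

# Header that marks the request as AJAX. The value is the conventional
# one; the post itself only names the header (assumption).
AJAX_HEADERS = {"X-Requested-With": "XMLHttpRequest"}

# Key names in the JSON payload are assumed, not confirmed by the post.
FIELDS = ("title", "price", "description")


def extract_item(payload):
    """Pick the fields of interest out of the decoded JSON dict.

    Missing keys come back as None instead of raising KeyError.
    """
    return {key: payload.get(key) for key in FIELDS}


def fetch_item(url=AJAX_URL):
    """Request the AJAX endpoint with the extra header and decode the JSON."""
    response = requests.get(url, headers=AJAX_HEADERS, timeout=10)
    response.raise_for_status()  # a 403 here would mean the header was rejected
    return extract_item(response.json())


if __name__ == "__main__":
    print(fetch_item())
```

No User-Agent is set on purpose, since this server reportedly does not check it.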
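Edit:
The same works with Scrapy. I use the start_requests() method to create a Request() with the 'X-Requested-With' header.
You can put all the code in one file and run python script.py without creating a project.
Edit:
Using the DEFAULT_REQUEST_HEADERS setting works as well.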