Unable to extract JavaScript from a web page using Scrapy

Posted 2024-06-01 01:58:19


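I am scraping data from indeed.com for a data science project I am working on. I can successfully scrape most parts of the page, but I am having trouble scraping items from the JSON portion of the page.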

Does anyone know how I can extract the items below from this URL? view-source:https://www.indeed.com/viewjob?jk=41abec7fde3513dc&tk=1dn0mslbr352v000&from=serp&vjs=3&advn=9434814581076032&adid=197003786&sjdu=BbcXv7z69Xez4bal0Fx7iYB6jxzlBG3p6CfmfgjyGDErM4mqXgOsfEsOF5maJ2GRnKJsHskFl8aEbb4LlD5LibXOuIs0dzzHfVCmKB00C2c43rDVhEZX_8Zmg4zqEyqG5LEfQjRfoyOhULxXHTMitWOUjMOdLRt367-ZewSzfkqUSnPzHungl7uY7NcfOFLy

Items I want to extract:
\nPOT-Creation-Date:
\nPO-Revision-Date:
"jobLocation":"Arlington, TX

Below is a sample script I am running:

import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess
import boto3

class JobsSpider1(scrapy.Spider):
    name = "indeed"
    allowed_domains = ["indeed.com"]
    start_urls = ["https://www.indeed.com/jobs?q=\"owner+operator\"+\"truck\"&l=augusta"]

    custom_settings = {
        'FEED_FORMAT': 'json',
        'FEED_URI': 'me_test.json',
    }

    def parse(self, response):
        jobs = response.xpath('//div[@class="title"]')

        for job in jobs:
            title = job.xpath('a//@title').extract_first()
            posting_link = job.xpath('a//@href').extract_first()
            # Skip result entries that carry no link
            if not posting_link:
                continue
            posting_url = response.urljoin(posting_link)

            yield Request(posting_url, callback=self.parse_page, meta={'title': title, 'posting_url': posting_url})

        # The last results page has no rel="next" link, so guard against None
        relative_next_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        if relative_next_url:
            yield Request(response.urljoin(relative_next_url), callback=self.parse)

    def parse_page(self, response):
        posting_url = response.meta.get('posting_url')
        job_title = response.meta.get('title')

        # job_name = response.xpath('//*[@class="icl-u-xs-mb--xs icl-u-xs-mt--none  jobsearch-JobInfoHeader-title"]/text()').extract_first()
        job_descriptions = response.xpath('//*[@class="jobsearch-jobDescriptionText"]/ul').extract_first()
        job_listing_header = response.xpath('//*[@class="jobSectionHeader"]/ul').extract_first()
        posted_on_date = response.xpath('//*[@class="jobsearch-JobMetadataFooter"]/text()').extract_first()
        job_location = response.xpath('//*[@class="jobsearch-InlineCompanyRating icl-u-xs-mt--xs  jobsearch-DesktopStickyContainer-companyrating"]/div[3]/text()').extract_first()

        yield {
            'job_title': job_title,
            'posting_url': posting_url,
            # 'job_name': job_name,
            'job_listing_header': job_listing_header,
            'job_location': job_location,
            'job_descriptions': job_descriptions,
            'posted_on_date': posted_on_date,
        }
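
One possible way to get at values such as "jobLocation" is to treat the page source as text and search the inline JavaScript for the key, rather than using XPath. The sketch below is only an illustration under assumptions: it assumes the value appears in the raw HTML that Scrapy downloads (as the view-source URL suggests) and is written as a double-quoted string. The helper name extract_embedded_value and the sample fragment (including the variable name window.pageData) are made up for this example and are not taken from the real page.

```python
import json
import re


def extract_embedded_value(page_source, key):
    """Return the string stored under `key` in inline JSON/JavaScript, or None."""
    # Match "key":"value", allowing escaped characters (such as \") in the value
    pattern = r'"' + re.escape(key) + r'"\s*:\s*("(?:[^"\\]|\\.)*")'
    match = re.search(pattern, page_source)
    if match is None:
        return None
    quoted = match.group(1)
    try:
        # Decode the quoted literal, including escapes such as \n or \u00e9
        return json.loads(quoted)
    except json.JSONDecodeError:
        # JavaScript-only escapes (e.g. \x27) are not valid JSON; return raw text
        return quoted[1:-1]


# Hypothetical page fragment, invented for illustration only
sample_source = (
    '<script>window.pageData = '
    '{"jobLocation":"Arlington, TX","jobTitle":"Driver"};</script>'
)
print(extract_embedded_value(sample_source, "jobLocation"))
```

Inside parse_page, the same helper could be called as extract_embedded_value(response.text, "jobLocation") and the result added to the yielded item. If the value is instead filled in by JavaScript after the page loads, it will not be present in response.text and this approach will return None.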
