Include the original URL from an Excel sheet in the scraped output

Posted 2024-06-11 23:37:35


I'm scraping some pages with Scrapy. I refer to an Excel sheet to find the start URLs, and I want those exact start URLs to appear in the results, not the redirected ones. I need the originals so that I can do an Excel lookup against them.

The problem is that the only output I seem to get contains the destination URL.

My code is below:

from scrapy.spiders import Spider
from scrapy.selector import Selector
from ICcom5.items import ICcom5Item
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.response import get_base_url
from scrapy.spiders import CSVFeedSpider
from scrapy.http import Request
from scrapy.loader import ItemLoader
from scrapy.item import Item, Field
import requests
import csv
import sys

class MySpider(Spider):
    name = "ICcom5"
    start_urls = [l.strip() for l in open('items5.csv').readlines()]

    def parse(self, response):
        item = Item()
        titles = response.xpath('//div[@class="jobsearch-JobMetadataFooter"]')
        items = []
        for titles in titles:
            item = ICcom5Item()
            home_url = ("http://www.indeed.co.uk")
            item ['_pageURL'] = response.request.url
            item ['description'] = ' '.join(titles.xpath('//div[@class="jobsearch-jobDescriptionText"]//text()').extract())
            item ['role_title_link'] = titles.xpath('//span[@id="originalJobLinkContainer"]/a/@href').extract()
            items.append(item)
        return items

Very simple code, but I'm struggling to work out what I can do with it from the rather patchy documentation.


I've modified the code as suggested, but I still can't get the original URLs from the source spreadsheet into the output. Example URLs look like this:

https://www.indeed.co.uk/rc/clk?jk=a47eb72131f3d588&fccid=c7414b794cb89c1c&vjs=3
https://www.indeed.co.uk/rc/clk?jk=8c7f045caddb116d&fccid=473601b0f30a6c9c&vjs=3
https://www.indeed.co.uk/company/Agilysts-Limited/jobs/Back-End-Java-Developer-3ec6efc3ebc256c5?fccid=d1f7896a8bd9f15e&vjs=3

2 Answers

This is my final working code, with help from Tomáš Linhart:

class MySpider(Spider):
    name = "ICcom5"
    # start URLs come straight from the spreadsheet export
    start_urls = [l.strip() for l in open('items5.csv').readlines()]

    def parse(self, response):
        items = []
        for titles in response.xpath('//div[@class="jobsearch-JobMetadataFooter"]'):
            item = ICcom5Item()
            # the first entry in redirect_urls is the original spreadsheet URL
            redirect_urls = response.request.meta.get('redirect_urls')
            item['_pageURL'] = redirect_urls[0] if redirect_urls else response.request.url
            item['description'] = ' '.join(titles.xpath('//div[@class="jobsearch-jobDescriptionText"]//text()').extract())
            item['role_title_link'] = titles.xpath('//span[@id="originalJobLinkContainer"]/a/@href').extract()
            items.append(item)
        return items

You can use response.request.url inside the parse method to get the URL you originally requested.

Update: either I'm misreading the documentation, or this is a bug. Specifically, the docs state:

HTTP redirections will cause the original request (to the URL before redirection) to be assigned to the redirected response (with the final URL after redirection)

I really would expect the original request URL to be available under response.request.url.
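
To see what actually ends up where once the redirect middleware has run, here is a minimal diagnostic sketch; the spider name redirect_check is just for illustration, and it assumes the first Indeed URL above still redirects. It only logs the relevant attributes instead of extracting anything:

import scrapy

class RedirectCheckSpider(scrapy.Spider):
    name = 'redirect_check'
    start_urls = [
        'https://www.indeed.co.uk/rc/clk?jk=a47eb72131f3d588&fccid=c7414b794cb89c1c&vjs=3',
    ]

    def parse(self, response):
        # after RedirectMiddleware has followed the redirect, both of these
        # hold the final URL, not the one from start_urls
        self.logger.info('response.url:         %s', response.url)
        self.logger.info('response.request.url: %s', response.request.url)
        # the URLs the request passed through are kept in the redirect_urls
        # meta key, so its first element is the original start URL
        self.logger.info('redirect_urls:        %s', response.request.meta.get('redirect_urls'))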

Anyway, as described in the RedirectMiddleware documentation, there is an alternative approach: you can use the redirect_urls key of request.meta to get the list of URLs the request passed through. Below is a modified (simplified) version of your code as a proof of concept:

# -*- coding: utf-8 -*-
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = [
        'https://www.indeed.co.uk/rc/clk?jk=a47eb72131f3d588&fccid=c7414b794cb89c1c&vjs=3',
        'https://www.indeed.co.uk/rc/clk?jk=8c7f045caddb116d&fccid=473601b0f30a6c9c&vjs=3',
        'https://www.indeed.co.uk/company/Agilysts-Limited/jobs/Back-End-Java-Developer-3ec6efc3ebc256c5?fccid=d1f7896a8bd9f15e&vjs=3'
    ]

    def parse(self, response):
        for title in response.xpath('//div[@class="jobsearch-JobMetadataFooter"]'):
            item = {}
            redirect_urls = response.request.meta.get('redirect_urls')
            item['_pageURL'] = redirect_urls[0] if redirect_urls else response.request.url
            item['description'] = ' '.join(title.xpath('//div[@class="jobsearch-jobDescriptionText"]//text()').extract())
            item['role_title_link'] = title.xpath('//span[@id="originalJobLinkContainer"]/a/@href').extract()
            yield item
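
If this spider is saved as, say, myspider.py (the filename is just an assumption here), it can be run without a full Scrapy project via scrapy runspider myspider.py -o jobs.jl, which writes one JSON object per line. Each line should then have _pageURL set to the original URL from the spreadsheet rather than the redirected one.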

Also, note that the original code you posted has a few other problems, in particular:

  • in the parse method you return items, which is a list, but only dict (or Item, Request) objects are allowed
  • for titles in titles: probably does something you didn't intend (both points are illustrated in the sketch below)
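
For reference, a minimal sketch of how those two points map back onto the original spider: a differently named loop variable, and yield instead of collecting a list. It reuses the asker's ICcom5Item, which is assumed to define the three fields used here:

    def parse(self, response):
        for footer in response.xpath('//div[@class="jobsearch-JobMetadataFooter"]'):
            item = ICcom5Item()
            redirect_urls = response.request.meta.get('redirect_urls')
            # fall back to the request URL when no redirect happened
            item['_pageURL'] = redirect_urls[0] if redirect_urls else response.request.url
            item['description'] = ' '.join(footer.xpath('//div[@class="jobsearch-jobDescriptionText"]//text()').extract())
            item['role_title_link'] = footer.xpath('//span[@id="originalJobLinkContainer"]/a/@href').extract()
            yield item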
