Scraping multiple fields

Posted 2024-03-28 22:57:21


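Very new and inexperienced programmer here!

I'm building a scraping project that pulls the company names and locations from this site and writes them to a JSON file.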

https://www.f6s.com/programs?type[]=accelerator&page=93&page_alt=1&sort=open

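Right now my scraper pulls the company names, but it also pulls in the dates. On top of that, the JSON output is split into two parts: first a list of company names, then a list of locations (which include extra information I don't need).

How can I extract the company name and location and format them so that each company name is associated with its own location?

I think my problem is that the location isn't marked up with a class of its own.

Any advice on how to format the JSON output would also be much appreciated!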


My project directory:

myproject/
    scrapy.cfg
    __init__.py
    items.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        byub.py
        F6sSpider.py

My spider file:

[spider code not preserved in the source]

My terminal:

[terminal output not preserved in the source]

My JSON output:

[
{"program": ["K - LAUNCHPAD 2018"]},
{"program": ["Z Nation Lab Real Estate Cohort"]},
{"program": ["C-mint-International"]},
{"program": ["StartOut Growth Lab - 2018 Fall Cohort"]},
{"program": ["IBA Application"]},
{"program": ["WATT Factory Accelerator Programme 2018"]},
{"program": ["AdvantEdge Founder's Adda"]},
{"program": ["SpinLab - The HHL Accelerator"]},
{"program": ["Shell LiveWIRE Accelerator"]},
{"program": ["Shell France Accelerator "]},
{"program": ["ELEVATE by TheVentury"]},
{"program": ["F6S R&D Money Back"]},
{"location": ["\n                    Jun 1-Jul 20                         \u2022\n                    Berlin, Germany    \n                "]},
{"location": ["\n                    Mumbai, India    \n                "]},
{"location": ["\n                    Atlanta, United States    \n                "]},
{"location": ["\n                    Jul 8-Dec 31                         \u2022\n                    San Francisco, United States    \n                "]},
{"location": ["\n                    Mar 19-May 16                         \u2022\n                    Los Angeles, United States    \n                "]},
{"location": ["\n                    Jun 3-Nov 30                         \u2022\n                    Gent, Belgium    \n                "]},
{"location": ["\n                    Delhi, India    \n                "]},
{"location": ["\n                    Leipzig, Germany    \n                "]},
{"location": ["\n                    Jun 20 '18-Jun 21  '19                        \u2022\n                    Paris, France    \n                "]},
{"location": ["\n                    Jun 20 '18-Jun 1  '19                        \u2022\n                    Paris, France    \n                "]},
{"location": ["\n                    Sep 5 '18-Feb 14  '19                        \u2022\n                    Vienna, Austria    \n                "]},
{"location": ["\n                    London, United Kingdom    \n                "]}
]

Thanks!


2 Answers

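For the specific case where a date range precedes the location, you need to clean up the location text yourself, for example: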

import scrapy


class CompanySpider(scrapy.Spider):
    name = "Company"

    def start_requests(self):
        urls = [
            'https://www.f6s.com/programs?type[]=accelerator&sort=open',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Each result block holds one program together with its location
        for title in response.css('div.result-description'):
            # Raw text may start with a date range followed by a bullet
            location = title.css('div.subtitle span::text').extract_first(default='')
            # Drop the bullet (U+2022), then split into stripped lines
            lines = [line.strip() for line in location.replace('\u2022', '').splitlines()]
            # Discard blank lines; the place name is the last line left
            lines = [line for line in lines if line]
            yield {
                'program': title.css('div.title a.action.main.noline::text').extract_first(),
                'location': lines[-1] if lines else None,
            }

Output:

[output not preserved in the source]
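The location clean-up in the spider above can be tried on its own, without running a crawl. The sketch below is a variant that keeps the last non-empty line, not necessarily the answer's exact logic. The helper name `clean_location` is hypothetical, and the two sample strings are copied from the JSON output in the question.

```python
def clean_location(raw):
    """Return the last non-empty line of a scraped location string."""
    if raw is None:
        return None
    # Remove the bullet (U+2022) that separates the date range from the place
    lines = [line.strip() for line in raw.replace("\u2022", "").splitlines()]
    # Keep only lines that still contain text
    lines = [line for line in lines if line]
    return lines[-1] if lines else None


# Sample values taken from the question's JSON output
with_dates = "\n                    Jun 1-Jul 20                         \u2022\n                    Berlin, Germany    \n                "
without_dates = "\n                    Mumbai, India    \n                "

print(clean_location(with_dates))     # Berlin, Germany
print(clean_location(without_dates))  # Mumbai, India
```

Taking the last non-empty line handles both shapes of input, with or without a leading date range, so no special case is needed.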

It looks like you don't need two for loops; a single loop is enough:

def parse(self, response):
    for title in response.css('div.result-description'):
        yield {
            'program': title.css('div.title a.action.main.noline::text').extract(),
            'location': title.css('div.subtitle span::text').extract(),
        }
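Because `.extract()` returns a list for every field, each item yielded by this loop still has list values such as `{"program": ["..."]}`. As a rough sketch of how such an item could be flattened into one record per program before writing JSON, the example below uses only the standard `json` module. The helper `flatten_item` is hypothetical, and the sample item pairs the second program and second location from the question's output on the assumption that they belong together.

```python
import json


def flatten_item(item):
    """Turn list-valued fields (as returned by .extract()) into plain strings."""
    flat = {}
    for key, values in item.items():
        # Take the first match, if any, and trim surrounding whitespace
        flat[key] = values[0].strip() if values else None
    return flat


# Hypothetical item, shaped like what the single-loop parse() yields
raw_item = {
    "program": ["Z Nation Lab Real Estate Cohort"],
    "location": ["\n                    Mumbai, India    \n                "],
}

flat_item = flatten_item(raw_item)
# One JSON object per program, with both fields side by side
print(json.dumps([flat_item], indent=2, ensure_ascii=False))
```

Inside the spider, calling `.extract_first()` instead of `.extract()` gives plain strings directly, which avoids the flattening step.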
