Scrapy data flow, Items, and Item Loaders

Posted 2024-04-24 05:35:56

I am reading the Architecture Overview page of the Scrapy documentation, but I still have some questions about the data and/or control flow.

Scrapy architecture

Default file structure of a Scrapy project

scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
    ...

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class MyprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

I assume this would become:

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

As a result, trying to populate an undeclared field of a Product instance raises an error:

>>> product = Product(name='Desktop PC', price=1000)
>>> product['lala'] = 'test'
Traceback (most recent call last):
    ...
KeyError: 'Product does not support field: lala'

Question 1

If we create class CrowdfundingItem in items.py, where, when, and how does our crawler become aware of items.py?

Is this done in...

__init__.py?
my_crawler.py?
the … of mycrawler.py?
settings.py?
pipelines.py?
the … of pipelines.py?
somewhere else?

Question 2

Once I have declared an item such as Product, how do I store data by creating instances of Product in a context like the one below?

import scrapy

class MycrawlerSpider(CrawlSpider):
    name = 'mycrawler'
    allowed_domains = ['google.com']
    start_urls = ['https://www.google.com/']
    def parse(self, response):
        options = Options()
        options.add_argument('-headless')
        browser = webdriver.Firefox(firefox_options=options)
        browser.get(self.start_urls[0])
        elements = browser.find_elements_by_xpath('//section')
        count = 0
        for ele in elements:
             name = browser.find_element_by_xpath('./div[@id="name"]').text
             price = browser.find_element_by_xpath('./div[@id="price"]').text

             # If I am not sure how many items there will be,
             # and hence I cannot declare them explicitly,
             # how I would go about creating named instances of Product?

             # Obviously the code below will not work, but how can you accomplish this?

             count += 1
             varName + count = Product(name=name, price=price)
             ...

Finally, suppose we give up on naming the Product instances altogether and simply create unnamed instances.

for ele in elements:
    name = browser.find_element_by_xpath('./div[@id="name"]').text
    price = browser.find_element_by_xpath('./div[@id="price"]').text
    Product(name=name, price=price)

If these instances really are stored somewhere, where are they stored? Does creating them this way make it impossible to access them?

1 answer

User
#1 · Posted 2024-04-24 05:35:56

Using an Item is optional; Items are just a convenient way to declare your data model and apply validation. You can also use a plain dict.
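To make the difference concrete, here is a rough sketch that uses only the standard library. DeclaredFields is a hypothetical class, not part of Scrapy; it only imitates the behaviour shown in the question, where assigning an undeclared field raises KeyError, and contrasts it with a plain dict, which accepts any key.

```python
class DeclaredFields(dict):
    """Hypothetical dict subclass that only accepts a fixed set of keys.

    It imitates the declared-field check of an Item; it is not Scrapy code.
    """

    fields = ("name", "price")

    def __init__(self, **kwargs):
        super().__init__()
        for key, value in kwargs.items():
            self[key] = value  # go through __setitem__ so the check applies

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        super().__setitem__(key, value)


# A plain dict accepts any key, including a misspelled one.
plain = {"name": "Desktop PC", "price": 1000}
plain["lala"] = "test"

# The declared container rejects keys that were never declared.
declared = DeclaredFields(name="Desktop PC", price=1000)
try:
    declared["lala"] = "test"
    rejected = False
except KeyError:
    rejected = True
```

Whether that check is worth the extra class depends on how much you want typos in field names to fail early.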
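If you choose to use an Item, you need to import it in order to use it in your spider. It is not discovered automatically. In your case (assuming the project package is named myproject, as in the layout shown in the question):
from myproject.items import CrowdfundingItem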
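When the spider runs its parse method on each page, you can load the extracted data into an Item or a dict. Once it is loaded, yield it, and it is passed back to the Scrapy engine for downstream processing in item pipelines or feed exporters. This is how Scrapy "stores" the scraped data.
For example: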
yield Product(name='Desktop PC', price=1000)  # uses Item
yield {'name': 'Desktop PC', 'price': 1000}  # plain dict
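The hand-off from parse to the rest of the framework can be pictured with a toy model that uses only the standard library. This is not how Scrapy is implemented; parse, PricePipeline, and run are hypothetical stand-ins for a spider callback, an item pipeline, and the engine loop. The point is that each yielded item is consumed by whoever iterates over the generator, so no per-item variable names are needed.

```python
from typing import Iterable, Iterator


def parse(rows: Iterable[tuple]) -> Iterator[dict]:
    """Stand-in for a spider callback: yields one item per scraped row."""
    for name, price in rows:
        # No variable name per item: each dict is handed off as it is yielded.
        yield {"name": name, "price": price}


class PricePipeline:
    """Hypothetical pipeline stage; real Scrapy pipelines also expose process_item."""

    def process_item(self, item: dict, spider=None) -> dict:
        item["price"] = float(item["price"])  # normalise the scraped text
        return item


def run(rows: Iterable[tuple]) -> list:
    """Toy 'engine': pulls items out of parse() and passes them downstream."""
    pipeline = PricePipeline()
    exported = []  # stands in for an exporter writing items to a file
    for item in parse(rows):
        exported.append(pipeline.process_item(item))
    return exported


items = run([("Desktop PC", "1000"), ("Laptop", "1500")])
print(items)
```

By contrast, a bare Product(name=name, price=price) expression that is neither yielded nor assigned is simply discarded, which is why the last snippet in the question stores nothing.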
