Scrapy: a separate output file for each start_url

# -*- coding: utf-8 -*-
import scrapy


class AllCategoriesSpider(scrapy.Spider):
    name = 'vieles'
    allowed_domains = ['examplewiki.de']
    start_urls = [
        'http://www.exampleregelwiki.de/index.php/categoryA.html',
        'http://www.exampleregelwiki.de/index.php/categoryB.html',
        'http://www.exampleregelwiki.de/index.php/categoryC.html',
    ]

    #"Titel": :
    def parse(self, response):
        urls = response.css('a.ulSubMenu::attr(href)').extract()  # links to den subpages
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        yield {
            "Titel": response.css("li.active.last::text").extract(),
            "Content": response.css('div.ce_text.first.last.block').extract(),
        }

1 answer

User

#1 · Posted 2024-05-19 02:50:10

The code in the question doesn't use real URLs, so I tested with my own page.
I had to change the CSS selectors, and I use different fields.

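I save the data as CSV because it is easier to append to.
With JSON you would have to read all items from the file, add the new item, and write all items back to the same file.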
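To make the difference concrete, here is a minimal sketch using only the standard library. The function names and file paths are hypothetical and not part of the answer's code; it assumes the JSON file holds a single array of items.

```python
import csv
import json
import os


def append_csv(filename, row):
    """Add one row to a CSV file; existing rows are left untouched."""
    with open(filename, 'a', newline='', encoding='utf-8') as f:
        csv.writer(f).writerow(row)


def append_json(filename, item):
    """Add one item to a JSON array; the whole file is read and rewritten."""
    items = []
    if os.path.exists(filename):
        with open(filename, encoding='utf-8') as f:
            items = json.load(f)
    items.append(item)
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(items, f)
```

The CSV version only ever writes the new row, while the JSON version loads and rewrites the whole file on every call, which gets slower as the file grows.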

I created an extra field, Category, so that the pipeline can later use it as the filename.

items.py

import scrapy

class CategoryItem(scrapy.Item):
    Title = scrapy.Field()
    Date = scrapy.Field()
    # extra field, used later as the filename
    Category = scrapy.Field()

In the spider, I get the category from the URL and send it to parse_details using meta in the Request.
In parse_details, I add the category to the Item.

spiders/example.py

(The spider listing is missing from this copy of the answer.)

In the pipeline, I read the category and use it to open a file in append mode and save the item.

pipelines.py

import csv

class CategoryPipeline(object):

    def process_item(self, item, spider):

        # get category and use it as filename
        filename = item['Category'] + '.csv'

        # open file for appending
        with open(filename, 'a', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)

            # write only selected elements 
            row = [item['Title'], item['Date']]
            writer.writerow(row)

            # alternatively, write all fields in one row
            # warning: item is a dictionary, so item.values() may not always return the values in the same order
            #writer.writerow(item.values())

        return item
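The pipeline above reopens the file for every item. As a possible variation (a sketch, not part of the answer's code), the files could be opened once per category and closed when the spider finishes, using the `open_spider` and `close_spider` hooks that Scrapy calls on item pipelines. The class name `CachedCategoryPipeline` is hypothetical.

```python
import csv


class CachedCategoryPipeline:
    """Keep one open CSV file per category instead of reopening it per item."""

    def open_spider(self, spider):
        # category -> open file handle / csv writer
        self.files = {}
        self.writers = {}

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()

    def process_item(self, item, spider):
        category = item['Category']
        if category not in self.writers:
            # first item of this category: open its file for appending
            f = open(category + '.csv', 'a', newline='', encoding='utf-8')
            self.files[category] = f
            self.writers[category] = csv.writer(f)
        # write only the selected fields, as in the pipeline above
        self.writers[category].writerow([item['Title'], item['Date']])
        return item
```

This trades a little bookkeeping for fewer open/close calls. With only a handful of categories either version should work.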

In the settings, I had to uncomment the pipeline entry to activate it.

settings.py

ITEM_PIPELINES = {
    'category.pipelines.CategoryPipeline': 300,
}

Full code on GitHub: python-examples/scrapy/save-categories-in-separated-files

By the way: I think you could also write to the file directly in parse_details.
