How to fix the dictionary output

Posted 2024-06-09 10:31:50


I'm scraping a news page with Scrapy; each item is basically a title, a meta line and a text excerpt. The code itself runs fine, but I have a problem with the dictionary output: it prints all the titles first, then all the meta lines, and finally all the excerpts. What I need instead is the title, meta line and excerpt for each news item, one item after another. I'm guessing something is wrong with the for loop or with the selectors.

Thanks for your help.

My code:

import scrapy


class testspider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://oilprice.com/Latest-Energy-News/World-News']    

    def parse(self, response):
        all_news = response.xpath('//div[@class="tableGrid__column tableGrid__column--articleContent category"]')

        for singlenews in all_news:         
            title_item = singlenews.xpath('//div[@class="categoryArticle__content"]//a//text()').extract()
            meta_item = singlenews.xpath('//div[@class="categoryArticle__content"]//p[@class="categoryArticle__meta"]//text()').extract()
            extract_item = singlenews.xpath('//div[@class="categoryArticle__content"]//p[@class="categoryArticle__excerpt"]//text()').extract()      

            yield {
                'title_data' : title_item,
                'meta_data' :  meta_item,
                'extract_data' : extract_item        
            }

Output:

{'title_data': ['Global Energy-Related CO2 Emissions Stopped Rising In 2019', 'BHP Is Now The World’s Top Copper Miner', 'U.S. Budget Proposal Includes Sale Of 15 Mln Barrels Strategic Reserve Oil', ...],
 'meta_data': ['Feb 11, 2020 at 12:02 | Tsvetana Paraskova', 'Feb 11, 2020 at 11:27 | MINING.com', 'Feb 11, 2020 at 09:59 | Irina Slav', ...],
 'extract_data': ['The world’s energy-related carbon dioxide (CO2) emissions remained flat in 2019, halting two years of emissions increases, as lower emissions in advanced economies offset growing emissions elsewhere, the International Energy…', 'BHP Group on Monday became the world’s largest copper miner based on production after Chile’s copper commission announced a slide in output at state-owned Codelco.\r\nHampered by declining grades Codelco…', 'The budget proposal President Trump released yesterday calls for the sale of 15 million barrels of oil from the Strategic Petroleum Reserve of the United States.\r\nThe proceeds from the…', ...]}
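
In other words, what I'm after is one dictionary per news item, something like this (pieced together from the data above):

{'title_data': 'Global Energy-Related CO2 Emissions Stopped Rising In 2019', 'meta_data': 'Feb 11, 2020 at 12:02 | Tsvetana Paraskova', 'extract_data': 'The world’s energy-related carbon dioxide (CO2) emissions remained flat in 2019, …'}
{'title_data': 'BHP Is Now The World’s Top Copper Miner', 'meta_data': 'Feb 11, 2020 at 11:27 | MINING.com', 'extract_data': 'BHP Group on Monday became the world’s largest copper miner based on production …'}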

2 Answers

When you use // in an XPath expression, the search is performed over the whole document, so

title_item = singlenews.xpath('//div[@class="categoryArticle__content"]//a//text()').extract()

will return a list with all the text from every div that matches the filter div[@class="categoryArticle__content"].

What you need to do is filter relative to singlenews, so try the following:

title_item = singlenews.xpath('./div[@class="categoryArticle__content"]//a//text()').extract()

Reference: https://devhints.io/xpath
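
For reference, here is a minimal sketch of the whole spider with that relative-path fix applied. It iterates over the categoryArticle__content divs inside the column (a slight variation on the single line above, and an assumption about the page structure), and uses .get() to return the first matching text node per field:

import scrapy

class testspider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://oilprice.com/Latest-Energy-News/World-News']

    def parse(self, response):
        all_news = response.xpath('//div[@class="tableGrid__column tableGrid__column--articleContent category"]')

        # Assumption: each categoryArticle__content div wraps exactly one news item.
        for singlenews in all_news.xpath('.//div[@class="categoryArticle__content"]'):
            # The leading "." keeps every query scoped to the current article
            # instead of searching the whole document again.
            yield {
                'title_data': singlenews.xpath('.//a//text()').get(),
                'meta_data': singlenews.xpath('.//p[@class="categoryArticle__meta"]//text()').get(),
                'extract_data': singlenews.xpath('.//p[@class="categoryArticle__excerpt"]//text()').get(),
            }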

From your output it looks like your code extracts all of the title, meta_data and extract_data values at once and saves them in a single dictionary. If you want one dictionary per news item on the site you are scraping, you should first collect all the data you need and then parse it into dictionaries however you prefer. So your code would look something like this:

def parse(self, response):
    all_news = response.xpath('//div[@class="tableGrid__column tableGrid__column--articleContent category"]')  
    titles = all_news.xpath('//div[@class="categoryArticle__content"]//a//text()').extract()
    meta_items = all_news.xpath('//div[@class="categoryArticle__content"]//p[@class="categoryArticle__meta"]//text()').extract()
    extract_items = all_news.xpath('//div[@class="categoryArticle__content"]//p[@class="categoryArticle__excerpt"]//text()').extract()      

    # at this point titles, meta_items and extract_items should be three parallel lists of the same length, and you can now combine them as you need

    news_items = []
    for i in range(len(titles)): 
        news = { 'title': titles[i], 'meta_data': meta_items[i], 'extract_data': extract_items[i] }
        news_items.append(news)
    return news_items

This will return the news posts the way you want them.
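
As a side note (not part of the original answer), the index-based loop at the end can also be written with zip, which pairs the three lists without manual indexing:

    # Same pairing as the range(len(titles)) loop above, written with zip.
    news_items = [
        {'title': title, 'meta_data': meta, 'extract_data': excerpt}
        for title, meta, excerpt in zip(titles, meta_items, extract_items)
    ]
    return news_items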
