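I am scraping news from a page with Scrapy. Each news item is basically a title, a meta line, and a text excerpt. The code runs fine, but I have a problem with the dictionary output. It first lists all the titles, then all the meta lines, and finally all the excerpts. What I need is one news item after another, each with its own title, meta line, and excerpt. I guess something is wrong with the for loop or the selectors.
Thanks for your help.
My code: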
import scrapy

class testspider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://oilprice.com/Latest-Energy-News/World-News']

    def parse(self, response):
        all_news = response.xpath('//div[@class="tableGrid__column tableGrid__column--articleContent category"]')
        for singlenews in all_news:
            title_item = singlenews.xpath('//div[@class="categoryArticle__content"]//a//text()').extract()
            meta_item = singlenews.xpath('//div[@class="categoryArticle__content"]//p[@class="categoryArticle__meta"]//text()').extract()
            extract_item = singlenews.xpath('//div[@class="categoryArticle__content"]//p[@class="categoryArticle__excerpt"]//text()').extract()
            yield {
                'title_data' : title_item,
                'meta_data' : meta_item,
                'extract_data' : extract_item
            }
Output:
{'title_data': ['Global Energy-Related CO2 Emissions Stopped Rising In 2019', 'BHP
Is Now The World’s Top Copper Miner', 'U.S. Budget Proposal Includes Sale Of 15
Mln Barrels Strategic Reserve Oil', ... , 'meta_data': ['Feb 11, 2020 at 12:02
| Tsvetana Paraskova', 'Feb 11, 2020 at 11:27 | MINING.com ', 'Feb 11, 2020 at
09:59 | Irina Slav', ... , 'extract_data': ['The world’s energy-related carbon
dioxide (CO2) emissions remained flat in 2019, halting two years of emissions
increases, as lower emissions in advanced economies offset growing emissions
elsewhere, the International Energy…', 'BHP Group on Monday became the world’s
largest copper miner based on production after Chile’s copper commission announced
a slide in output at state-owned Codelco.\r\nHampered by declining grades
Codelco…', 'The budget proposal President Trump released yesterday calls for the
sale of 15 million barrels of oil from the Strategic Petroleum Reserve of the
United States.\r\nThe proceeds from the…', ... , ']}
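When an XPath expression starts with //, the search runs over the whole document rather than the current node, so each query returns a list with the text from every div[@class="categoryArticle__content"] on the page. What you need is a path relative to singlenews: start each expression inside the loop with .// instead of //. Reference: https://devhints.io/xpath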
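From the output, it looks like your code extracts title_data, meta_data and extract_data for all items at once and stores them in a single dictionary. If you want one dictionary per news item on the site you are scraping, first collect all the data you need, then split it into per-item dictionaries in whatever shape you like. That returns the news posts the way you want them.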
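One way to do the "collect first, then split" step is to zip the three lists together. The sketch below is an illustration, not the answerer's original code. The helper name pair_up is hypothetical, and it assumes each selector returns exactly one string per article, in the same order. The sample values are taken from the output in the question, with the excerpts shortened.

```python
def pair_up(titles, metas, excerpts):
    """Combine three parallel lists into one dictionary per news item."""
    # zip stops at the shortest list, so a missing value in one list
    # silently drops the trailing items rather than raising an error.
    return [
        {'title_data': t, 'meta_data': m, 'extract_data': e}
        for t, m, e in zip(titles, metas, excerpts)
    ]


# In the spider these lists would come from the three .extract() calls.
titles = [
    'Global Energy-Related CO2 Emissions Stopped Rising In 2019',
    'BHP Is Now The World’s Top Copper Miner',
    'U.S. Budget Proposal Includes Sale Of 15 Mln Barrels Strategic Reserve Oil',
]
metas = [
    'Feb 11, 2020 at 12:02 | Tsvetana Paraskova',
    'Feb 11, 2020 at 11:27 | MINING.com ',
    'Feb 11, 2020 at 09:59 | Irina Slav',
]
excerpts = [
    'The world’s energy-related carbon dioxide (CO2) emissions remained flat in 2019…',
    'BHP Group on Monday became the world’s largest copper miner…',
    'The budget proposal President Trump released yesterday calls for the sale…',
]

news = pair_up(titles, metas, excerpts)
for item in news:
    print(item['title_data'], '|', item['meta_data'])
```

Inside parse(), yielding each element of the returned list gives one item per article. This only lines up correctly if the three lists have the same length, which is worth checking before zipping.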