Scrapy does not return any scraped data

# -*- coding: utf-8 -*-
import scrapy

class EeentertainmentSpider(scrapy.Spider):
    name = 'eeentertainment'
    allowed_domains = ['www.entertainmentearth.com/exclusives.asp']
    start_urls = ['http://www.entertainmentearth.com/exclusives.asp/']

    def parse(self, response):
        #Extracting the content using css selectors
        titles = response.css('.title::text').extract()

        #Give the extracted content row wise
        for item in zip(titles):
            #create a dictionary to store the scraped info
            scraped_info = {
                'title' : item[0],
            }

            #yield or give the scraped info to scrapy
            yield scraped_info
        pass

1 answer

User

#1 · Posted 2024-04-25 08:52:12

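Your spider has several problems:

The allowed_domains list should contain only domain names, not full URLs (see the documentation).
The URL in start_urls has a trailing / (it should be http://www.entertainmentearth.com/exclusives.asp).
I'm not sure what you are trying to do with zip here, but I'm almost certain it is not what you intended.
The pass at the end of the parse method is redundant.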
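
To make the first and third points concrete, here is a minimal standard-library sketch (not part of the original spider). It derives a bare domain from the start URL and shows what zip does when given a single list. The sample titles are made up for illustration, and stripping a leading "www." is just one assumed way to get a value suitable for allowed_domains.

```python
from urllib.parse import urlparse

start_url = "http://www.entertainmentearth.com/exclusives.asp"

# allowed_domains expects a bare host name, not a URL with a path.
# Taking the netloc and dropping a leading "www." is one simple way to get it.
host = urlparse(start_url).netloc
domain = host[4:] if host.startswith("www.") else host
print(domain)

# Hypothetical sample data standing in for the extracted titles.
titles = ["First title", "Second title"]

# zip() over a single iterable only wraps every element in a 1-tuple,
# which is why the original code had to write item[0].
wrapped = list(zip(titles))
print(wrapped)

# Iterating over the list directly gives the same values with less indirection.
items = [{"title": title} for title in titles]
print(items)
```

With a single list, zip adds nothing; it becomes useful only when several parallel lists (for example titles and prices) have to be combined row by row.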

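From the screenshot you provided, I can see that you are trying to scrape the image titles from the page. With the notes above taken into account, here is a working version of the spider: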

# -*- coding: utf-8 -*-
import scrapy

class EeentertainmentSpider(scrapy.Spider):
    name = 'eeentertainment'
    allowed_domains = ['entertainmentearth.com']
    start_urls = ['http://www.entertainmentearth.com/exclusives.asp']

    def parse(self, response):
        titles = response.css('img::attr(title)').extract()
        for title in titles:
            scraped_info = {
                'title' : title,
            }
            yield scraped_info
