Processing scraped items with a Scrapy pipeline

Published 2024-04-25 09:34:26


I am running Scrapy from a Python script.

I was told that in Scrapy, the response is built in parse() and further processed in pipeline.py.

So far, my framework is set up like this:

Python script

def script(self):
    process = CrawlerProcess(get_project_settings())
    response = process.crawl('pitchfork_albums', domain='pitchfork.com')
    process.start()  # the script will block here until the crawling is finished

Spider

class PitchforkAlbums(scrapy.Spider):
    name = "pitchfork_albums"
    allowed_domains = ["pitchfork.com"]
    #creates objects for each URL listed here
    start_urls = [
                    "http://pitchfork.com/reviews/best/albums/?page=1",
                    "http://pitchfork.com/reviews/best/albums/?page=2",
                    "http://pitchfork.com/reviews/best/albums/?page=3"                   
    ]
    def parse(self, response):
        for sel in response.xpath('//div[@class="album-artist"]'):
            item = PitchforkItem()
            item['artist'] = sel.xpath('//ul[@class="artist-list"]/li/text()').extract()
            item['album'] = sel.xpath('//h2[@class="title"]/text()').extract()
        yield item

items.py

class PitchforkItem(scrapy.Item):

    artist = scrapy.Field()
    album = scrapy.Field()

settings.py

ITEM_PIPELINES = {
   'blogs.pipelines.PitchforkPipeline': 300,
}

pipelines.py

import json

class PitchforkPipeline(object):
    def __init__(self):
        self.file = open('tracks.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        for i in item:
            return i['album'][0]
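As an aside, a pipeline's `process_item` is expected to return the item itself (or raise `DropItem`), so that later pipelines in `ITEM_PIPELINES` still receive it; returning a string from inside the loop, as above, breaks that contract. A minimal corrected sketch (written without the Scrapy dependency for illustration; `open_spider`/`close_spider` are the standard pipeline hooks, and the file is opened in text mode because `json.dumps` returns `str`):

```python
import json

class PitchforkPipeline(object):
    """Writes one JSON line per item and passes the item on unchanged."""

    def open_spider(self, spider):
        # Text mode, since json.dumps returns str, not bytes.
        self.file = open('tracks.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item  # always return the item so later pipelines see it
```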

If I just `return item` in pipelines.py, I get data like this (one `response` per HTML page):

{'album': [u'Sirens',
           u'I Had a Dream That You Were Mine',
           u'Sunergy',
           u'Skeleton Tree',
           u'My Woman',
           u'JEFFERY',
           u'Blonde / Endless',
           u' A Mulher do Fim do Mundo (The Woman at the End of the World) ',
           u'HEAVN',
           u'Blank Face LP',
           u'blackSUMMERS\u2019night',
           u'Wildflower',
           u'Freetown Sound',
           u'Trans Day of Revenge',
           u'Puberty 2',
           u'Light Upon the Lake',
           u'iiiDrops',
           u'Teens of Denial',
           u'Coloring Book',
           u'A Moon Shaped Pool',
           u'The Colour in Anything',
           u'Paradise',
           u'HOPELESSNESS',
           u'Lemonade'],
 'artist': [u'Nicolas Jaar',
            u'Hamilton Leithauser',
            u'Rostam',
            u'Kaitlyn Aurelia Smith',
            u'Suzanne Ciani',
            u'Nick Cave & the Bad Seeds',
            u'Angel Olsen',
            u'Young Thug',
            u'Frank Ocean',
            u'Elza Soares',
            u'Jamila Woods',
            u'Schoolboy Q',
            u'Maxwell',
            u'The Avalanches',
            u'Blood Orange',
            u'G.L.O.S.S.',
            u'Mitski',
            u'Whitney',
            u'Joey Purp',
            u'Car Seat Headrest',
            u'Chance the Rapper',
            u'Radiohead',
            u'James Blake',
            u'White Lung',
            u'ANOHNI',
            u'Beyonc\xe9']}

What I would like to do in pipelines.py is to be able to get individual songs for each item, like this:

[u'Sirens']
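For what it's worth, the flat lists in the dump above could be paired up after the fact; a sketch, using the field names from items.py (the helper name is made up):

```python
def split_albums(item):
    """Pair up the parallel 'artist' and 'album' lists into one dict per album."""
    return [{'artist': artist, 'album': album}
            for artist, album in zip(item['artist'], item['album'])]

bulk = {'album': ['Sirens', 'I Had a Dream That You Were Mine'],
        'artist': ['Nicolas Jaar', 'Hamilton Leithauser']}
print(split_albums(bulk)[0])  # {'artist': 'Nicolas Jaar', 'album': 'Sirens'}
```

Note, though, that in the dump above the two lists do not even line up (26 artists vs. 24 albums, because some albums are collaborations), so zipping silently misaligns them. That is exactly why the answer below recommends building one item per album in the spider instead.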

1 Answer

I suggest you build well-structured items in your spider. In the Scrapy workflow, the spider is used to build well-formed items, e.g. parsing the HTML and populating the item instances, while the pipelines are used to operate on items, e.g. filtering and storing them.

For your application, if I understand correctly, each item should be an entry describing one album. So when parsing the HTML, you should build items of that kind, rather than cramming everything into a single item.

So, in the `parse` function of spider.py, you should:

  1. Put the `yield item` statement inside the `for` loop, not after it. This way, each album generates one item.
  2. Be careful with relative XPath selectors in Scrapy. To select self-and-descendants with a relative XPath, use `.//` instead of `//`; to select direct children, use `./` instead of `/`.
  3. Ideally, the album title should be a scalar and the album artists should be a list, so try `extract_first()` to make the album title a scalar.

    def parse(self, response):
        for sel in response.xpath('//div[@class="album-artist"]'):
            item = PitchforkItem()
            item['artist'] = sel.xpath('./ul[@class="artist-list"]/li/text()').extract()
            item['album'] = sel.xpath('./h2[@class="title"]/text()').extract_first()
            yield item
    
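The effect of point 1 can be seen without Scrapy at all; a toy sketch (the function names are made up):

```python
def yield_inside(rows):
    # One item per row: this is what moving `yield` into the loop achieves.
    for row in rows:
        item = {'album': row}
        yield item

def yield_outside(rows):
    # `yield` after the loop: only the item from the *last* iteration survives.
    for row in rows:
        item = {'album': row}
    yield item

rows = ['Sirens', 'Lemonade']
print(len(list(yield_inside(rows))))   # 2
print(list(yield_outside(rows)))       # [{'album': 'Lemonade'}]
```

In your original spider the second pitfall was masked by the first: the absolute `//ul` and `//h2` selectors matched the whole page rather than each `div`, so the single item yielded after the loop happened to contain every album at once.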

Hope this helps.
