Returning data to the previous callback function?

Posted 2024-05-19 21:56:54

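I'm trying to scrape the Twitter handles of the sports teams playing in various tournaments. I start from a page that links to every tournament, follow each link to the page listing all the teams in that tournament, and then follow each team link to the team page, which has the Twitter handle. I'm stuck at the team-page step: I don't know how to return the Twitter name to the previous callback so that I can collect all the Twitter names for a tournament in one list.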

In my last callback, parse_twitter, I tried returning the result as a dict and adding it to the item created for parse_schedule, but without much luck.

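import scrapy
from scrapy.selector import Selector

# Spider methods; FrisbeeItem is the project's own Item class, defined elsewhere
def parse(self, response):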
    # Get list of tournaments
    tournaments = Selector(response).xpath('//td/a')
    del tournaments[0]

    # Go through each tournament
    for tourney in tournaments:
        item = FrisbeeItem()
        item['tournament_name'] = tourney.xpath('./text()').extract()[0]
        item['tournament_url'] = tourney.xpath('./@href').extract()[0]

        # make the URL to the teams in the tournament
        tournament_schedule = item['tournament_url'] + '/schedule/Men/CollegeMen/'

        # Request to tournament page
        yield scrapy.Request(url=tournament_schedule, callback=self.parse_schedule, meta={'item' : item})

def parse_schedule(self, response):
    item = response.meta.get('item')

    # Get the list of teams
    tourney_teams = Selector(response).xpath('//div[@class = "pool"]//td/a')

    # For each team in the tournament, get name and URL to team page
    for team in tourney_teams:
        team_name = team.xpath('./text()').extract()[0]
        team_url = 'https://play.usaultimate.org/' + team.xpath('./@href').extract()[0]

        # Request to team page
        yield scrapy.Request(url=team_url, callback=self.parse_twitter, meta={'item': item, 'team_name': team_name})



def parse_twitter(self, response):
    item = response.meta.get('item')
    team_name = response.meta.get('team_name')

    result = {}
    # Get the list containing the twitter
    team_twitter = Selector(response).xpath('//dl[@id="CT_Main_0_dlTwitter"]//a/text()').extract()

    # If no Twitter handle is listed, use an empty string
    if len(team_twitter) == 0:
        result = {'name': team_name, 'twitter': ''}
    else:
        result = {'name': team_name, 'twitter': team_twitter[0]}

    item['tournament_teams'] = result

    yield item

I'd like to end up with a file in roughly the following format:

    {'tournament_name': X,
     'teams': [{'team_name': team1, 'twitter_name': twitter1},
               {'team_name': team2, 'twitter_name': twitter2},
               {'team_name': team3, 'twitter_name': twitter3},
               ...]
     }
    {'tournament_name': Y,
     'teams': [{'team_name': team1, 'twitter_name': twitter1},
               {'team_name': team2, 'twitter_name': twitter2},
               {'team_name': team3, 'twitter_name': twitter3},
               ...]
     }

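So basically there should be only one item per tournament, containing the name and Twitter handle of every team in that tournament.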
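One way the "one item per tournament" idea might be sketched is a small accumulator that counts how many team pages are still outstanding and only hands back the finished item when the last one arrives. The class name `TournamentCollector` and its methods are hypothetical, not part of Scrapy; the assumption is that an instance would be created in parse_schedule, passed along in `meta`, and that parse_twitter would yield only when `add` returns something other than None.

```python
class TournamentCollector:
    """Hypothetical helper: gathers team results for one tournament."""

    def __init__(self, tournament_name, expected_teams):
        self.tournament_name = tournament_name
        self.pending = expected_teams  # team pages not parsed yet
        self.teams = []

    def add(self, team_name, twitter_name):
        """Record one team; return the full item after the last team, else None."""
        self.teams.append({'team_name': team_name, 'twitter_name': twitter_name})
        self.pending -= 1
        if self.pending == 0:
            return {'tournament_name': self.tournament_name, 'teams': self.teams}
        return None


# Simulate two team pages coming back for tournament X
collector = TournamentCollector('X', expected_teams=2)
first = collector.add('team1', 'twitter1')   # still waiting on one page
second = collector.add('team2', '')          # last page: item is complete
```

A caveat with this kind of counter: if a team request fails, or is dropped by Scrapy's duplicate filter because the same team URL was already requested, the callback never runs and the count never reaches zero. The expected count would have to match the requests actually issued, and an `errback` on each request would probably need to decrement it as well.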

Right now, the spider code I posted emits one item per team page (one item for each team in each tournament).
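Since the spider already emits one record per team, another possible route is to leave the callbacks alone and merge the flat records afterwards, for example over the exported file or in an item pipeline's `close_spider`. The sketch below assumes each flat record is a dict with the `tournament_name` and `tournament_teams` keys used in the spider; the function name `group_by_tournament` is made up for illustration.

```python
from collections import defaultdict


def group_by_tournament(flat_items):
    """Merge per-team records into one record per tournament.

    Tournaments appear in the order they are first seen.
    """
    grouped = defaultdict(list)
    for record in flat_items:
        team = record['tournament_teams']
        grouped[record['tournament_name']].append(
            {'team_name': team['name'], 'twitter_name': team['twitter']}
        )
    return [
        {'tournament_name': name, 'teams': teams}
        for name, teams in grouped.items()
    ]


# Flat records shaped like the items parse_twitter currently yields
flat = [
    {'tournament_name': 'X', 'tournament_teams': {'name': 'team1', 'twitter': 'twitter1'}},
    {'tournament_name': 'Y', 'tournament_teams': {'name': 'team1', 'twitter': ''}},
    {'tournament_name': 'X', 'tournament_teams': {'name': 'team2', 'twitter': 'twitter2'}},
]
grouped = group_by_tournament(flat)
```

This avoids any bookkeeping about which requests have finished, at the cost of holding all records in memory until the crawl is done.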

