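How to scrape content from multiple tables on a web page
I want to extract content from multiple tables on a web page. The HTML looks roughly like this: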
<div class="fixtures-table full-table-medium" id="fixtures-data">
<h2 class="table-header"> Date 1 </h2>
<table class="table-stats">
<tbody>
<tr class='preview' id='match-row-EFBO755307'>
<td class='details'>
<p>
<span class='team-home teams'>
<a href='random_team'>team 1</a>
</span>
<span class='team-away teams'>
<a href='random_team'>team 2</a>
</span>
</p>
</td>
</tr>
<tr class='preview' id='match-row-EFBO755307'>
<td class='match-details'>
<p>
<span class='team-home teams'>
<a href='random_team'>team 3</a>
</span>
<span class='team-away teams'>
<a href='random_team'>team 4</a>
</span>
</p>
</td>
</tr>
</tbody>
</table>
<h2 class="table-header"> Date 2 </h2>
<table class="table-stats">
<tbody>
<tr class='preview' id='match-row-EFBO755307'>
<td class='match-details'>
<p>
<span class='team-home teams'>
<a href='random_team'>team X</a>
</span>
<span class='team-away teams'>
<a href='random_team'>team Y</a>
</span>
</p>
</td>
</tr>
<tr class='preview' id='match-row-EFBO755307'>
<td class='match-details'>
<p>
<span class='team-home teams'>
<a href='random_team'>Team A</a>
</span>
<span class='team-away teams'>
<a href='random_team'>Team B</a>
</span>
</p>
</td>
</tr>
</tbody>
</table>
</div>
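Each date has more matches below it (9, 2, or 1, depending on how many were played that day), and there are 63 tables in total (one per day).
I want to extract the matches for each date, including which team is the home team and which is the away team.
I am using Scrapy's command-line shell and tried the following commands: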
title = sel.xpath("//td[@class = 'match-details']")[0]
l_home = title.xpath("//span[@class = 'team-home teams']/a/text()").extract()
That printed a list of all the home teams, and the following printed a list of all the away teams:
l_Away = title.xpath("//span[@class = 'team-away teams']/a/text()").extract()
This command gave me a list of all the dates:
sel.xpath("/html/body/div[3]/div/div/div/div[4]/div[2]/div/h2/text()").extract()
What I want is, for each date, the matches played that day (including which team is home and which is away).
My items.py should look like this:
date = Field()
home_team = Field()
away_team2 = Field()
Please help me write the parse function and the Item class.
Thanks in advance.
1 Answer
Here is an example of the logic, run from the scrapy shell:
>>> for table in response.xpath('//table[@class="table-stats"]'):
...     date = table.xpath('./preceding-sibling::h2[1]/text()').extract()[0]
...     print(date)
...     for match in table.xpath('.//tr[@class="preview" and @id]'):
...         home_team = match.xpath('.//span[@class="team-home teams"]/a/text()').extract()[0]
...         away_team = match.xpath('.//span[@class="team-away teams"]/a/text()').extract()[0]
...         print(home_team, away_team)
...
Date 1
team 1 team 2
team 3 team 4
Date 2
team X team Y
Team A Team B
In the parse() method, you would instantiate an Item inside the inner loop and yield it:
def parse(self, response):
    for table in response.xpath('//table[@class="table-stats"]'):
        date = table.xpath('./preceding-sibling::h2[1]/text()').extract()[0]
        for match in table.xpath('.//tr[@class="preview" and @id]'):
            home_team = match.xpath('.//span[@class="team-home teams"]/a/text()').extract()[0]
            away_team = match.xpath('.//span[@class="team-away teams"]/a/text()').extract()[0]
            item = MyItem()
            item['date'] = date
            item['home_team'] = home_team
            item['away_team'] = away_team
            yield item
where MyItem would be:
from scrapy.item import Item, Field

class MyItem(Item):
    date = Field()
    home_team = Field()
    away_team = Field()