Recursively scraping data from every table on a page with Scrapy

0 votes
1 answer
774 views
Asked 2025-04-18 15:13

I'm using the 64-bit version of Python 2.7 from Python.org on 64-bit Windows Vista. I have some code that scrapes a table with a specific name from a web page:

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.utils.markup import remove_tags
from scrapy.cmdline import execute
import csv

filepath = "C:\\Python27\\Football Data\\test.txt"

# Truncate the output file before the crawl starts
# (the "with" block closes the file automatically, so no explicit close() is needed)
with open(filepath, "w") as f:
    f.write("")

class MySpider(Spider):

    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney"]    

    def parse(self, response):
        sel = Selector(response)

        titles = sel.xpath("normalize-space(//title)")
        print 'titles:', titles.extract()[0]

        # Rows come only from the table with the hard-coded id "player-fixture"
        rows = sel.xpath('//table[@id="player-fixture"]//tbody//tr')

        for row in rows:

            print 'date:', "".join( row.css('.date::text').extract() ).strip()
            print 'result:', "".join( row.css('.result a::text').extract() ).strip()
            print 'team_home:', "".join( row.css('.team.home a::text').extract() ).strip()
            print 'team_away:', "".join( row.css('.team.away a::text').extract() ).strip()
            print 'info:', "".join( row.css('.info::text').extract() ).strip(), "".join( row.css('.info::attr(title)').extract() ).strip()
            print 'rating:', "".join( row.css('.rating::text').extract() ).strip()
            print 'incidents:', ", ".join( row.css('.incidents-icon::attr(title)').extract() ).strip()
            print '-'*40

            # Build comma-suffixed fields (this snippet never actually writes them out)
            date = "".join( row.css('.date::text').extract() ).strip() + ','
            result = "".join( row.css('.result a::text').extract() ).strip() + ','
            team_home = "".join( row.css('.team.home a::text').extract() ).strip() + ','
            team_away = "".join( row.css('.team.away a::text').extract() ).strip() + ','
            info = "".join( row.css('.info::text').extract() ).strip() + ','
            rating = "".join( row.css('.rating::text').extract() ).strip() + ','
            incident = " ".join( row.css('.incidents-icon::attr(title)').extract() ).strip() + ','

I also have some code that scrapes the text of articles from multiple pages of the same site:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Articles"]
    download_delay = 1

    # Follow every link under /Articles and hand each page to parse_item
    rules = [Rule(SgmlLinkExtractor(allow=('/Articles',)), follow=True, callback='parse_item')]

    def parse_item(self, response):
        # Strip the markup from every <p> element and print the page text
        paragraphs = response.selector.xpath("//p").extract()
        text = "".join(remove_tags(paragraph).encode('utf-8') for paragraph in paragraphs)
        print text


execute(['scrapy','crawl','goal3'])

What I really want, though, is to scrape the data from every table encountered on any page. The code above only finds the table named "player-fixture" on the pages it crawls, and not every page has that table.

Before I start combing through the site's HTML to see which pages carry tables with particular names, is there a way to have Scrapy scrape the data from any table it comes across?

Thanks

1 answer

0

If the table id you're after has a limited set of possible values, you can use the or operator in your XPath to catch every possibility.

For example, you could write '//table[@id="player-fixture" or @id="other-value"]//tbody//tr'
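Dropped into the first spider's parse() method, that might look like the minimal sketch below; "other-value" is a placeholder, not a real id from the site:

        # Hypothetical second id: replace "other-value" with a real table id
        rows = sel.xpath('//table[@id="player-fixture" or @id="other-value"]//tbody//tr')

        for row in rows:
            # Pull the raw cell text so the loop works for either table
            print [cell.strip() for cell in row.xpath('.//td//text()').extract()]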

If there are too many possible values, you can try relying on a more stable anchor instead, such as a surrounding div.

For example, //div[@att="value"]/table/tbody/tr
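As a rough sketch inside the spider's parse() method, keeping the answer's @att="value" placeholder (you would substitute an attribute that actually wraps the tables on the site):

        # Anchor on a wrapper div instead of each table's own id
        rows = sel.xpath('//div[@att="value"]/table/tbody/tr')

        for row in rows:
            # Print each row's cell text without knowing the table's name
            print [cell.strip() for cell in row.xpath('.//td//text()').extract()]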
