Cannot find data shown on the page in the source code when using Scrapy

Posted 2024-06-17 13:32:33


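I am using the Python.org version 2.7 64-bit on Windows Vista 64-bit. I am using a combination of Scrapy and regex to extract data from a JavaScript item called 'DataStore.prime' on the following page:

http://www.whoscored.com/Regions/252/Tournaments/26/Seasons/4057/Stages/8273

The crawler I am using is: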

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json


class ExampleSpider(CrawlSpider):
    name = "goal4"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Regions/252/Tournaments/26"]
    download_delay = 1

    #rules = [Rule(SgmlLinkExtractor(allow=('/Seasons',)), follow=True, callback='parse_item')]
    rules = [Rule(SgmlLinkExtractor(allow=('/Tournaments/26'),deny=('/News', '/Fixtures'),), follow=False, callback='parse_item')]

    def parse_item(self, response):

        regex = re.compile('DataStore\.prime\(\'ws-stage-stat\', { stageId: \d+, type: \d+, teamId: -?\d+, against: \d+, field: \d+ }, \[\[\[.*?\]\]', re.S)

        match2h = re.search(regex, response.body)

        if match2h is not None:
            match3h = match2h.group()

            match3h = str(match3h)
            match3h = match3h \
                 .replace('title=', '').replace('"', '').replace("'", '').replace('[', '').replace(']', '') \
                 .replace(' ', ',').replace(',,', ',') \
                 .replace('[', '') \
                 .replace(']','') \
                 .replace("DataStore.prime", '') \
                 .replace('(', ''). replace('-', '').replace('wsstagestat,', '')

            match3h = re.sub("{.*?},", '', match3h)

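What I am after are the fixtures and scores under "FA Cup Fixtures". You can use the calendar on the page to select the game week you want. If you look at the source code, it only contains the most recent game week (since this is now last season, that is the FA Cup Final).

The page source does not contain the data for earlier weeks. The calendar appears to generate a JavaScript item called: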

^{pr2}$

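This (if I understand correctly) seems to control which game week is selected for display. What I would like to know is:

1) Is this assumption correct?
2) Is there anything in the source code that points to other URLs where the scores for each week are stored? (I am fairly sure there is not, but I am quite new to JavaScript.)

Thanks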


1 answer
User
#1 · Posted 2024-06-17 13:32:33

There is an XHR request that loads the fixtures. Simulate that request to get the data.

For example, the fixtures for Jan 2014:

from ast import literal_eval
from datetime import datetime
import requests

date = datetime(year=2014, month=1, day=1)
url = 'http://www.whoscored.com/tournamentsfeed/8273/Fixtures/'

params = {'d': date.strftime('%Y%m'), 'isAggregate': 'false'}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}

response = requests.get(url, params=params, headers=headers)

# response.text is the decoded body; print() with one argument works on Python 2 and 3
fixtures = literal_eval(response.text)
print(fixtures)

Prints:

^{pr2}$
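To cover more than one month, the same query parameters can be built once per month. This is only a sketch: it assumes the feed accepts the same 'd' (YYYYMM) and 'isAggregate' parameters for every month, as in the request above. The helper names month_range and fixture_params and the date range are made up for illustration, and no request is actually sent here:

```python
from datetime import date

FEED_URL = 'http://www.whoscored.com/tournamentsfeed/8273/Fixtures/'


def month_range(start, end):
    """Yield the first day of each month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def fixture_params(day):
    """Build the query parameters for the month containing `day`."""
    return {'d': day.strftime('%Y%m'), 'isAggregate': 'false'}


# Arbitrary example range: November 2013 through February 2014.
all_params = [fixture_params(d)
              for d in month_range(date(2013, 11, 1), date(2014, 2, 1))]
```

Each dict in all_params could then be passed as params= to requests.get(FEED_URL, ...) along with the User-Agent header, ideally with a delay between requests.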

Note that the response is not JSON but a dump of a Python list, which you can load with ast.literal_eval():

Safely evaluate an expression node or a Unicode or Latin-1 encoded string containing a Python expression. The string or node provided may only consist of the following Python literal structures: strings, numbers, tuples, lists, dicts, booleans, and None.
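The difference is easy to see on a small sample. The string below is invented for illustration and is not real output from the feed. It uses single-quoted strings, which are valid Python literal syntax but not valid JSON, so json.loads rejects it while literal_eval parses it:

```python
import json
from ast import literal_eval

# Hypothetical sample in the style of a Python list dump (not real feed data).
sample = "[[1, 'Team A', 'Team B', '2 : 1'], [2, 'Team C', 'Team D', '0 : 0']]"

try:
    json.loads(sample)
    parsed_as_json = True
except ValueError:  # JSON decode errors are ValueError (or a subclass of it)
    parsed_as_json = False

# literal_eval only accepts literals (lists, strings, numbers, ...),
# so it does not execute arbitrary code the way eval() would.
rows = literal_eval(sample)
```

After this runs, parsed_as_json is False and rows is an ordinary list of lists that can be indexed as usual.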
