Web scraping: why does my code return a different result?

Posted 2024-05-13 06:47:58


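When I open the website in a browser and view its source, I see a well-organized result. Why does my code not return the same thing?

All I want is the date, the teams, and the scores.

Here is my Python code: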

import requests
from bs4 import BeautifulSoup

# Fetch the page; this returns only the HTML the server sends, before any JavaScript runs.
r = requests.get("https://www.scoreboard.com/mls/results/")
soup = BeautifulSoup(r.content, "lxml")
print(soup.prettify())

This is what I get when I search the output for "Los Angeles". My code returns:

[image: screenshot of the Python output]

But when I open the website's source at https://www.scoreboard.com/mls/results/ it shows me:

[image: screenshot of the page source in the browser]

I don't understand why the result from Python is completely different.


2 answers

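This site uses its own feed syntax. It appears to use ~ as the row delimiter, ¬ as the object (field) delimiter, and ÷ as the key/value separator. So the following: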

SA÷1¬~ZA÷USA: MLS¬ZEE÷CQv5qrFt¬ZB÷200¬ZY÷USA¬ZC÷zRdKgb4m¬ZD÷t¬ZE÷KM2qHMND¬ZF÷0¬ZO÷0¬ZG÷1¬ZH÷200_CQv5qrFt¬ZJ÷2¬ZL÷/mls/¬ZX÷04USA 003......0000000000179000MLS 003......000¬ZS÷2018¬ZCC÷0¬ZAF÷USA¬~AA÷bTyQbGCR¬AD÷1528943400¬ADE÷1528943400¬AB÷3¬CR÷3¬AC÷3¬CX÷San Jose Earthquakes¬RW÷0¬AX÷1¬BX÷-1¬HMC÷1¬WQ÷¬WM÷JOS¬AE÷San Jose Earthquakes¬JA÷Ms72iE3l¬WU÷san-jose-earthquakes¬AS÷0¬AZ÷0¬AG÷2¬BA÷1¬BC÷1¬WN÷ENG¬AF÷New England Revolution¬JB÷G466jYIf¬WV÷new-england-revolution¬AS÷0¬AZ÷0¬AH÷2¬BB÷2¬BD÷0¬AW÷1¬

becomes, in JSON form (each JSON object here represents one row):

{
    "SA": "1"
},
{
    "ZA": "USA: MLS",
    "ZEE": "CQv5qrFt",
    "ZB": "200",
    "ZY": "USA",
    "ZC": "zRdKgb4m",
    "ZD": "t",
    "ZE": "KM2qHMND",
    "ZF": "0",
    "ZO": "0",
    "ZG": "1",
    "ZH": "200_CQv5qrFt",
    "ZJ": "2",
    "ZL": "/mls/",
    "ZX": "04USA         003......0000000000179000MLS         003......000",
    "ZS": "2018",
    "ZCC": "0",
    "ZAF": "USA"
}

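If you look at https://www.scoreboard.com/x/js/core_500_1495000000.js, it contains minified code. Replace eval with console.log there to print the whole unpacked code, then search it for the key names such as ZEE, ZA, ZD, and so on. You arrive at the following: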

{
    "sportId": "1"
},
{
    "tournamentName": "USA: MLS",
    "tournamentTemplateId": "CQv5qrFt",
    "countryId": "200",
    "countryName": "USA",
    "tournamentStageId": "zRdKgb4m",
    "tournamentType": "t",
    "tournamentId": "KM2qHMND",
    "sourceType": "0",
    "hasLiveTable": "0",
    "statsType": "1",
    "tournamentTemplateKey": "200_CQv5qrFt",
    "tournamentStageType": "2",
    "tournamentTemplateUrl": "/mls/",
    "sortKey": "04USA         003......0000000000179000MLS         003......000",
    "seasonUrl": "2018",
    "stagesCount": "0",
    "categoryCaption": "USA"
}

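That covers the tournament description. The rows that follow it describe the individual entries in the table, for example one row: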

{
    "eventId": "GOSl9rMa",
    "matchStartUtime": "1528938000",
    "eventStartUtime": "1528938000",
    "eventStageTypeId": "3",
    "eventStageTypeFromEventStageId": "3",
    "eventStageId": "3",
    "sortParticipant": "Colorado Rapids",
    "cricketVisibleRunRate": "0",
    "hasLineups": "1",
    "gameTime": "-1",
    "hasMatchComments": "1",
    "cricketRecentOvers": "",
    "home3CharName": "COL",
    "homeParticipantName": "Colorado Rapids",
    "eventParticipantId": "2BPTi8xM",
    "participantNameUrl": "colorado-rapids",
    "winner": "0",
    "ftWinner": "0",
    "homeCurrentResult": "2",
    "homeResultPeriod1": "2",
    "homeResultPeriod2": "0",
    "away3CharName": "CHI",
    "awayParticipantName": "Chicago Fire",
    "awayParticipantId": "t2OXjSiS",
    "awayParticipantNameUrl": "chicago-fire",
    "winner": "0",
    "ftWinner": "0",
    "awayRedCardCount": "1",
    "awayCurrentResult": "2",
    "awayResultPeriod1": "2",
    "awayResultPeriod2": "0",
    "hasLiveCenter": "1"
}

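Note that the same key can appear more than once in this format (winner and ftWinner above, for example), so the analogy with JSON is not strict.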
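The delimiter scheme described above can be sketched as a small parser. This is only an illustration under the assumption that the three delimiters never appear inside a value; the function name parse_feed is made up for this example, and the sample string is a shortened version of the feed excerpt shown earlier. Each row is kept as a list of (key, value) pairs rather than a dict, so repeated keys survive.

```python
ROW_SEP = "~"     # separates rows
FIELD_SEP = "¬"   # separates fields within a row
KV_SEP = "÷"      # separates a key from its value


def parse_feed(feed):
    """Split a feed string into rows, each a list of (key, value) pairs.

    Pairs are kept in order in a list (not a dict) because the same key
    may occur more than once in a single row.
    """
    rows = []
    for raw_row in feed.split(ROW_SEP):
        fields = []
        for raw_field in raw_row.split(FIELD_SEP):
            if not raw_field:
                continue  # skip the empty piece left by a trailing delimiter
            key, _, value = raw_field.partition(KV_SEP)
            fields.append((key, value))
        if fields:
            rows.append(fields)
    return rows


# Shortened sample built from the feed excerpt quoted in the answer.
sample = (
    "SA÷1¬~ZA÷USA: MLS¬ZEE÷CQv5qrFt¬"
    "~AA÷bTyQbGCR¬AS÷0¬AZ÷0¬AG÷2¬AS÷0¬AZ÷0¬AH÷2¬"
)
rows = parse_feed(sample)
for row in rows:
    print(row)
```

If you do not care about repeated keys, dict(row) collapses a row into a plain dictionary, keeping the last value seen for each key.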

requests.get(url)

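This call sends an HTTP request to the URL, and the web server returns the page's source code. If you press CTRL + U in Chrome, the source you see is the same as what Python fetched.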

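The reason you see a different result is that the data is loaded after the page itself has loaded, by the website's JavaScript. In other words, the data you need is loaded via Ajax.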

You can open Chrome -> F12 -> Network -> XHR and then refresh the page you want to scrape.

Watch the requests Chrome logs there. You can usually find the Ajax data this way, although sometimes you have to convert its format.

For your site, I found two addresses that serve the Ajax data:

  • https://www.scoreboard.com/x/feed/mc_8

  • https://d.scoreboard.com/x/feed/tr_1_200_CQv5qrFt_155_1_8_en-usa_1

But you will still need to do some work, such as converting the format according to the corresponding JS code.
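As a rough sketch of that conversion step, the code below turns one match row into the date, teams, and score the question asks for. The key mapping (AD, AE, AF, AG, AH) is an assumption: it was inferred by lining up the sample feed row with the named-key example in the other answer, not taken from the site's JS. The helper names fetch_feed and summarize_match are hypothetical, and fetch_feed is defined but deliberately not called, since it is not known whether the feed URLs accept a plain request without extra headers.

```python
from datetime import datetime, timezone
import urllib.request

# Assumed mapping from short feed keys to readable names (inferred, not confirmed).
KEY_NAMES = {
    "AD": "matchStartUtime",
    "AE": "homeParticipantName",
    "AF": "awayParticipantName",
    "AG": "homeCurrentResult",
    "AH": "awayCurrentResult",
}


def fetch_feed(url):
    """Download one feed URL as text. Not called in this sketch; the server
    may require headers or cookies that a plain request does not send."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8")


def summarize_match(row):
    """Extract date, teams, and score from a single '¬'-delimited match row."""
    fields = {}
    for item in row.split("¬"):
        key, sep, value = item.partition("÷")
        if sep and key in KEY_NAMES:
            # Keep the first occurrence if a key is repeated.
            fields.setdefault(KEY_NAMES[key], value)
    start = datetime.fromtimestamp(int(fields["matchStartUtime"]), tz=timezone.utc)
    return {
        "date": start.strftime("%Y-%m-%d %H:%M UTC"),
        "home": fields["homeParticipantName"],
        "away": fields["awayParticipantName"],
        "score": f'{fields["homeCurrentResult"]}-{fields["awayCurrentResult"]}',
    }


# Shortened match row taken from the feed excerpt quoted in the other answer.
row = (
    "AA÷bTyQbGCR¬AD÷1528943400¬AE÷San Jose Earthquakes¬AG÷2¬"
    "AF÷New England Revolution¬AH÷2¬"
)
match = summarize_match(row)
print(match)
```

A full feed would first be split on ~ into rows, and only rows that contain an AD key would be passed to summarize_match.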
