Scraping a page that contains JavaScript with Python Beautiful Soup

Asked 2025-04-18 14:39

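I am trying to scrape data from this page: http://www.scoresway.com/?sport=basketball&page=match&id=45926

However, I am having trouble getting some of the data.

The second table on the page contains the home team's statistics, which are split into "basic" and "advanced" stats. The code below prints the home team's "basic" totals.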

from bs4 import BeautifulSoup  # Beautiful Soup 4 (the maintained package)
import requests

gameId = 45926
url = 'http://www.scoresway.com/?sport=basketball&page=match&id=' + str(gameId)
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# Second table on the page -> last row (the totals) -> print each cell's text
for x in soup.find_all('table')[1].find_all('tr')[-1].find_all('td'):
    print(x.get_text())

To see the "advanced" stats, you can click the "Advanced" link, which shows them on the same page. I want to scrape that information too, but I don't know how.

1 Answer

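The page makes a separate request for the advanced tab. You can simulate that request yourself and then parse the response with BeautifulSoup.

For example, the following code prints all the players in the table: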

import requests
from bs4 import BeautifulSoup


ADVANCED_URL = "http://www.scoresway.com/b/block.teama_people_match_stat.advanced?has_wrapper=true&match_id=45926&sport=basketball&localization_id=www"

response = requests.get(ADVANCED_URL)
soup = BeautifulSoup(response.text, 'html.parser')
# soup('td', class_='name') is shorthand for soup.find_all('td', class_='name')
print([td.text.strip() for td in soup('td', class_='name')])

Output:

['T. Chandler  *',
 'K. Durant  *',
 'L. James  *',
 'R. Westbrook',
 ...
 'C. Anthony']
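
Some of the names in that output end with a `*`. The answer does not say what the marker means, so the following is only a sketch, assuming you want to separate it from the name before using the list. `split_marker` is a hypothetical helper, not part of BeautifulSoup:

```python
def split_marker(raw_name):
    """Split a scraped cell such as 'T. Chandler  *' into (name, has_marker).

    Assumption: a trailing '*' is a marker and not part of the player's name.
    """
    text = raw_name.strip()
    has_marker = text.endswith('*')
    if has_marker:
        # Drop the marker and the spaces that separated it from the name
        text = text[:-1].rstrip()
    return text, has_marker


names = ['T. Chandler  *', 'R. Westbrook']
print([split_marker(n) for n in names])
```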

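If you look at ADVANCED_URL, you will see that match_id and sport are the only "dynamic" GET parameters. To reuse this code on other similar pages of the site, you need to fill in match_id and sport dynamically. Here is a sample implementation: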

from bs4 import BeautifulSoup
import requests

BASE_URL = 'http://www.scoresway.com/?sport={sport}&page=match&id={match_id}'
ADVANCED_URL = "http://www.scoresway.com/b/block.teama_people_match_stat.advanced?has_wrapper=true&match_id={match_id}&sport={sport}&localization_id=www"


def get_match(sport, match_id):
    # basic: totals row of the second table on the match page
    r = requests.get(BASE_URL.format(sport=sport, match_id=match_id))
    soup = BeautifulSoup(r.content, 'html.parser')

    for x in soup.find_all('table')[1].find_all('tr')[-1].find_all('td'):
        print(x.get_text())

    # advanced: player names from the separately requested block
    response = requests.get(ADVANCED_URL.format(sport=sport, match_id=match_id))
    soup = BeautifulSoup(response.text, 'html.parser')
    print([td.text.strip() for td in soup('td', class_='name')])


get_match('basketball', 45926)
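
As an alternative to formatting the query string by hand, the advanced URL could be assembled from a dict of parameters. This is a minimal sketch using only the standard library; `build_advanced_url` and `ADVANCED_ENDPOINT` are hypothetical names, and the parameter values are copied from the ADVANCED_URL in this answer:

```python
from urllib.parse import urlencode

# Path of the advanced-stats block, without the query string
ADVANCED_ENDPOINT = "http://www.scoresway.com/b/block.teama_people_match_stat.advanced"


def build_advanced_url(sport, match_id):
    """Return the URL of the advanced-stats block for one match."""
    params = {
        "has_wrapper": "true",
        "match_id": match_id,      # dynamic
        "sport": sport,            # dynamic
        "localization_id": "www",
    }
    # urlencode escapes each value and joins the pairs with '&'
    return ADVANCED_ENDPOINT + "?" + urlencode(params)


print(build_advanced_url("basketball", 45926))
```

The same dict can also be passed as `requests.get(ADVANCED_ENDPOINT, params=params)`, in which case requests builds the query string itself.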

Answered 2025-04-18 by Python大师
