How do I convert a scraped HTML document into a DataFrame?
I'm trying to scrape football player data from the FBRef website. What I get back from the site is a bs4.element.ResultSet object.
My code:
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
res = requests.get("https://fbref.com/en/comps/9/stats/Premier-League-Stats")
comp = re.compile("<!--|-->")
soup = BeautifulSoup(comp.sub("",res.text),'lxml')
all_data = soup.findAll("tbody")
player_data = all_data[2]
The data looks like this:
<tr><th class="right" **...** href="/en/players/774cf58b/Max-Aarons">Max Aarons</a></td><td **...** data-stat="position">DF</td><td class="left" data-stat="team"><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td><td class="center" data-stat="age">24-084</td><td class="center" data-stat="birth_year">2000</td><td**...** </a></td></tr>
<tr><th class="right" **...** href="/en/players/77816c91/Benie-Adama-Traore">Bénie Adama Traore</a></td><td **...** data-stat="position">FW,MF</td><td class="left" data-stat="team"><a href="/en/squads/1df6b87e/Sheffield-United-Stats">Sheffield Utd</a></td><td class="center" data-stat="age">21-119</td><td class="center" data-stat="birth_year">2002 **...** </a></td></tr>
**...**
I want to build a pandas DataFrame from this data, like this:
**Name Position Team Age Birth Year** **...**
Max Aarons DF Bournemouth 24 2000
Benie Adama Traore FW Sheffield Utd 21 2002
**...**
I looked at similar questions here and tried applying their solutions, without success.
2 Answers
1
I suggest using pd.read_html, which reads the HTML straight into a DataFrame:
import re
from io import StringIO
import pandas as pd
import requests
res = requests.get("https://fbref.com/en/comps/9/stats/Premier-League-Stats")
comp = re.compile("<!--|-->")
df = pd.read_html(StringIO(comp.sub("", res.text)))[2] # <-- locate the right table
print(df)
Output:
Unnamed: 0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 Unnamed: 3_level_0 Unnamed: 4_level_0 Unnamed: 5_level_0 Unnamed: 6_level_0 Playing Time Performance Expected Progression Per 90 Minutes Unnamed: 36_level_0
Rk Player Nation Pos Squad Age Born MP Starts Min 90s Gls Ast G+A G-PK PK PKatt CrdY CrdR xG npxG xAG npxG+xAG PrgC PrgP PrgR Gls Ast G+A G-PK G+A-PK xG xAG xG+xAG npxG npxG+xAG Matches
0 1 Max Aarons eng ENG DF Bournemouth 24-085 2000 14 12 1085 12.1 0 1 1 0 0 0 1 0 0.0 0.0 0.8 0.8 19 40 22 0.00 0.08 0.08 0.00 0.08 0.00 0.07 0.07 0.00 0.07 Matches
1 2 Bénie Adama Traore ci CIV FW,MF Sheffield Utd 21-120 2002 8 3 387 4.3 0 0 0 0 0 0 0 0 0.3 0.3 0.5 0.8 7 9 14 0.00 0.00 0.00 0.00 0.00 0.06 0.13 0.19 0.06 0.19 Matches
2 3 Tyler Adams us USA MF Bournemouth 25-044 1999 1 0 20 0.2 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0 1 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 Matches
3 4 Tosin Adarabioyo eng ENG DF Fulham 26-187 1997 15 13 1173 13.0 1 0 1 1 0 0 1 0 0.6 0.6 0.1 0.6 5 39 3 0.08 0.00 0.08 0.08 0.08 0.04 0.01 0.05 0.04 0.05 Matches
4 5 Elijah Adebayo eng ENG FW Luton Town 26-082 1998 23 13 1162 12.9 9 0 9 9 0 0 1 0 5.6 5.6 0.7 6.3 14 19 85 0.70 0.00 0.70 0.70 0.70 0.43 0.05 0.49 0.43 0.49 Matches
5 6 Simon Adingra ci CIV FW Brighton 22-088 2002 21 16 1446 16.1 6 1 7 6 0 0 2 0 3.1 3.1 2.3 5.4 72 32 199 0.37 0.06 0.44 0.37 0.44 0.19 0.14 0.34 0.19 0.34 Matches
...
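To get from that output to the five columns the question asks for, one option is to drop the upper header level, select the columns, and rename them. The sketch below is only an illustration: `raw` is a small hand-built frame standing in for the `read_html` result so that it runs offline, and the helper name `tidy`, the choice to keep only the first listed position, and the conversion of `Age` to whole years are assumptions read off the desired output in the question.

```python
import pandas as pd

# Hand-built stand-in for the frame returned by pd.read_html above:
# two header levels, with only a few of the real columns reproduced.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 1_level_0", "Player"),
    ("Unnamed: 3_level_0", "Pos"),
    ("Unnamed: 4_level_0", "Squad"),
    ("Unnamed: 5_level_0", "Age"),
    ("Unnamed: 6_level_0", "Born"),
    ("Playing Time", "MP"),
])
raw = pd.DataFrame(
    [
        ["Max Aarons", "DF", "Bournemouth", "24-085", 2000, 14],
        ["Bénie Adama Traore", "FW,MF", "Sheffield Utd", "21-120", 2002, 8],
    ],
    columns=columns,
)


def tidy(df):
    """Reduce the two-level frame to Name/Position/Team/Age/Birth Year."""
    out = df.copy()
    # Keep only the lower header level ("Player", "Pos", ...).
    out.columns = out.columns.get_level_values(-1)
    out = out[["Player", "Pos", "Squad", "Age", "Born"]].rename(
        columns={"Player": "Name", "Pos": "Position",
                 "Squad": "Team", "Born": "Birth Year"}
    )
    # Assumption: only the first listed position is wanted ("FW,MF" -> "FW").
    out["Position"] = out["Position"].str.split(",").str[0]
    # Age is given as "years-days"; keep the years part as an integer.
    out["Age"] = out["Age"].str.split("-").str[0].astype(int)
    return out


result = tidy(raw)
print(result)
```

On the real table you would call `tidy(df)` on the frame from `read_html` instead of `raw`. The string operations assume the `Age` and `Pos` columns hold strings in every row.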
1
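To build a pandas DataFrame from the scraped data, you can iterate over the row tags, extract the relevant information from each one, and append it to a list. You can then create the DataFrame from that list. Here is how: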
import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
res = requests.get("https://fbref.com/en/comps/9/stats/Premier-League-Stats")
# Strip the HTML comment markers first, as in the question's code,
# so that tables wrapped in comments are parsed too.
comp = re.compile("<!--|-->")
soup = BeautifulSoup(comp.sub("", res.text), 'lxml')
player_data = soup.find_all("tbody")[2]
data = []
for row in player_data.find_all("tr"):
    link = row.find("a")
    if link is None:
        continue  # skip any row that has no player link
    name = link.text
    position = row.find("td", {"data-stat": "position"}).text
    team = row.find("td", {"data-stat": "team"}).text
    age = row.find("td", {"data-stat": "age"}).text
    birth_year = row.find("td", {"data-stat": "birth_year"}).text
    data.append([name, position, team, age, birth_year])
df = pd.DataFrame(data, columns=['Name', 'Position', 'Team', 'Age', 'Birth Year'])
print(df)
This code creates a DataFrame with the columns 'Name', 'Position', 'Team', 'Age', and 'Birth Year' from the scraped data.
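If you want to try the row-by-row approach without hitting the site, here is a minimal offline sketch. The `SAMPLE` markup is a simplified stand-in for the rows shown in the question (the real rows have many more cells and attributes), the helper names `cell_text` and `rows_to_frame` are made up for this example, and it uses the built-in `html.parser` rather than `lxml`. It also assumes you want `Age` as whole years, as in the desired output.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Simplified stand-in for the scraped <tbody>; not the real page markup.
SAMPLE = """
<table><tbody>
<tr><th>1</th><td><a href="/en/players/774cf58b/Max-Aarons">Max Aarons</a></td>
<td data-stat="position">DF</td>
<td data-stat="team"><a href="/en/squads/4ba7cbea/Bournemouth-Stats">Bournemouth</a></td>
<td data-stat="age">24-084</td><td data-stat="birth_year">2000</td></tr>
<tr><th>2</th><td><a href="/en/players/77816c91/Benie-Adama-Traore">Bénie Adama Traore</a></td>
<td data-stat="position">FW,MF</td>
<td data-stat="team"><a href="/en/squads/1df6b87e/Sheffield-United-Stats">Sheffield Utd</a></td>
<td data-stat="age">21-119</td><td data-stat="birth_year">2002</td></tr>
<tr><th>Rk</th></tr>
</tbody></table>
"""


def cell_text(row, stat):
    """Return the stripped text of the <td> with this data-stat, or None."""
    cell = row.find("td", {"data-stat": stat})
    return cell.get_text(strip=True) if cell else None


def rows_to_frame(tbody):
    """Build a DataFrame with one record per row that has a player link."""
    records = []
    for row in tbody.find_all("tr"):
        link = row.find("a")
        if link is None:  # e.g. the last sample row, which has no link
            continue
        age = cell_text(row, "age")
        records.append({
            "Name": link.get_text(strip=True),
            "Position": cell_text(row, "position"),
            "Team": cell_text(row, "team"),
            # Age is given as "years-days"; keep the years part.
            "Age": int(age.split("-")[0]) if age else None,
            "Birth Year": cell_text(row, "birth_year"),
        })
    return pd.DataFrame(records)


soup = BeautifulSoup(SAMPLE, "html.parser")
df = rows_to_frame(soup.find("tbody"))
print(df)
```

Collecting dicts instead of lists keeps each value next to its column name, so adding another `data-stat` column later is a one-line change.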