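Table not showing up when web scraping in Python
I ran into something interesting and can't figure out what is going on. I am trying to scrape data from this HTML: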
<div class="table_wrapper setup_long long setup_commented commented" id="all_roster">
<div class="section_heading assoc_roster" id="roster_sh">
<span class="section_anchor" data-label="Roster" id="roster_link"></span><h2>Roster</h2> <div class="section_heading_text">
<ul><li>*ProBowl, +<a href="/about/allpro.htm">1st-tm All-Pro</a></li>
</ul>
</div>
</div><div class="placeholder"></div>
<!--
<div class="table_container" id="div_roster">
<table class="per_match_toggle sortable stats_table" id="roster" data-cols-to-freeze=",2">
<caption>Roster Table</caption>
<colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
<thead>
<tr>
<th aria-label="No." data-stat="uniform_number" scope="col" class=" poptip sort_default_asc center" data-tip="Uniform number" >No.</th>
<th aria-label="Player" data-stat="player" scope="col" class=" poptip sort_default_asc show_partial_when_sorting left" >Player</th>
<th aria-label="Age" data-stat="age" scope="col" class=" poptip sort_default_asc center" data-tip="Player's age on December 31st of that year" >Age</th>
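The table I want to extract is the "Roster" table. In theory, pd.read_html(link, match = 'Roster') should do this, but when I try it I get no results. In fact, when I call pd.read_html on its own, it returns a list as expected, but the list contains only one table. I looked at the HTML and confirmed that the table is there, so I decided to try Beautiful Soup.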
table = soup.find_all("div", id = 'all_roster' )
table[0]
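As you can see, the table is there... but I cannot get either Beautiful Soup or read_html to find it. It is as if everything after "<!--" does not exist. Does anyone have any advice? table[0].findAll('tr') returns an empty list.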
1 Answer
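The table is inside an HTML comment, delimited by <!-- and -->. HTML parsers treat everything between those markers as comment text rather than as elements, which is why neither pd.read_html nor find_all('tr') sees the table. To read it, first strip the comment markers, then parse the result with, for example, pd.read_html: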
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.pro-football-reference.com/teams/kan/2023_roster.htm"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# The roster table sits inside an HTML comment within this wrapper div.
div = soup.select_one("#all_roster")
# Strip the comment delimiters so the table markup becomes parseable HTML.
html = str(div).replace("<!--", "").replace("-->", "")
df = pd.read_html(StringIO(html))[0]
print(df)
Output:
No. Player Age Pos G GS Wt Ht College/Univ BirthDate Yrs AV Drafted (tm/rnd/yr)
0 73.0 Nick Allegretti 27 G 17 1.0 310.0 6-4 Illinois 4/21/1996 4 1.0 Kansas City Chiefs / 7th / 216th pick / 2019
1 97.0 Felix Anudike-Uzomah 21 DE 17 0.0 255.0 NaN Kansas St. 1/24/2002 Rook 1.0 Kansas City Chiefs / 1st / 31st pick / 2023
2 81.0 Blake Bell 32 TE 17 3.0 252.0 6-6 Oklahoma 8/7/1991 8 0.0 San Francisco 49ers / 4th / 117th pick / 2015
3 32.0 Nick Bolton 23 LB 8 8.0 237.0 5-11 Missouri 3/10/2000 2 4.0 Kansas City Chiefs / 2nd / 58th pick / 2021
4 40.0 Ekow Boye-Doe 24 CB 6 0.0 171.0 6-0 Kansas St. 11/4/1999 Rook 0.0 NaN
5 26.0 Deon Bush 30 DB 6 0.0 200.0 6-0 Miami (FL) 8/14/1993 7 0.0 Chicago Bears / 4th / 124th pick / 2016
6 89.0 Matt Bushman 28 TE 1 0.0 245.0 6-5 BYU 11/3/1995 2 0.0 NaN
7 7.0 Harrison Butker 28 K 17 0.0 199.0 6-4 Georgia Tech 7/14/1995 6 5.0 Carolina Panthers / 7th / 233rd pick / 2017
...
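As an alternative to string replacement, the comment nodes can be pulled out with Beautiful Soup's Comment class and parsed again on their own. The following is only a minimal sketch, not part of the original answer: the helper name tables_in_comments and the inline sample HTML are hypothetical, and the sample is a cut-down stand-in for the real page.

```python
from bs4 import BeautifulSoup, Comment


def tables_in_comments(html: str, table_id: str) -> list[list[str]]:
    """Return the rows of the table with the given id found inside an HTML comment.

    Hypothetical helper for illustration; returns [] if no comment holds the table.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Comment nodes are a kind of string node, so filter string nodes by type.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # Re-parse the comment text so its markup becomes real elements.
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return [
                [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
                for row in table.find_all("tr")
            ]
    return []


# Assumed sample: a reduced version of the structure shown in the question.
sample = """
<div id="all_roster">
<!--
<table id="roster">
<tr><th>No.</th><th>Player</th></tr>
<tr><td>73</td><td>Nick Allegretti</td></tr>
</table>
-->
</div>
"""

rows = tables_in_comments(sample, "roster")
print(rows)
```

This avoids touching comment markers elsewhere in the fragment. If a DataFrame is wanted instead of plain rows, the re-parsed table could presumably be passed to pd.read_html via StringIO(str(table)), in the same way as the code above.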