Table not showing up when web scraping in Python

2 votes
1 answer
33 views
Asked 2025-04-13 03:21

I've run into something interesting and I can't figure out what is going on. I'm trying to scrape data from this HTML:

<div class="table_wrapper setup_long long setup_commented commented" id="all_roster">
 <div class="section_heading assoc_roster" id="roster_sh">
 <span class="section_anchor" data-label="Roster" id="roster_link"></span><h2>Roster</h2> <div class="section_heading_text">
 <ul><li>*ProBowl, +<a href="/about/allpro.htm">1st-tm All-Pro</a></li>
 </ul>
 </div>
 </div><div class="placeholder"></div>
 <!--
 
 <div class="table_container" id="div_roster">
     
     <table class="per_match_toggle sortable stats_table" id="roster" data-cols-to-freeze=",2">
     <caption>Roster Table</caption>
     
 
    <colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
    <thead>      
       <tr>
          <th aria-label="No." data-stat="uniform_number" scope="col" class=" poptip sort_default_asc center" data-tip="Uniform number" >No.</th>
          <th aria-label="Player" data-stat="player" scope="col" class=" poptip sort_default_asc show_partial_when_sorting left" >Player</th>
          <th aria-label="Age" data-stat="age" scope="col" class=" poptip sort_default_asc center" data-tip="Player's age on December 31st of that year" >Age</th>

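What I want to extract is the "Roster" table. In theory I should be able to do this with pd.read_html(link, match = 'Roster'), but when I do, I get nothing back. In fact, when I call pd.read_html on its own, it returns a list as expected, but the list contains only one table. I looked at the HTML and confirmed that the table is there, so I decided to try Beautiful Soup.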

table = soup.find_all("div", id = 'all_roster' )
table[0]

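As you can see, the table is there... but I can't get either Beautiful Soup or read_html to find it. It is as if everything after "<!--" doesn't exist. Does anyone have a suggestion? table[0].findAll('tr') returns an empty list.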

1 Answer

2

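The table is inside an HTML comment (<!-- ... -->), so parsers treat it as comment text rather than as elements. To read the table, first strip the comment markers, then hand the result to something like pd.read_html: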

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.pro-football-reference.com/teams/kan/2023_roster.htm"

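# Fetch and parse the page.
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# The wrapper div holds the roster table inside an HTML comment.
div = soup.select_one("#all_roster")
# Strip the comment markers so the table becomes real markup, then parse it.
df = pd.read_html(StringIO(str(div).replace("-->", "").replace("<!--", "")))[0]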

print(df)

Output:

     No.                    Player  Age  Pos   G    GS     Wt     Ht                       College/Univ   BirthDate   Yrs    AV                             Drafted (tm/rnd/yr)
0   73.0           Nick Allegretti   27    G  17   1.0  310.0    6-4                           Illinois   4/21/1996     4   1.0    Kansas City Chiefs / 7th / 216th pick / 2019
1   97.0      Felix Anudike-Uzomah   21   DE  17   0.0  255.0    NaN                         Kansas St.   1/24/2002  Rook   1.0     Kansas City Chiefs / 1st / 31st pick / 2023
2   81.0                Blake Bell   32   TE  17   3.0  252.0    6-6                           Oklahoma    8/7/1991     8   0.0   San Francisco 49ers / 4th / 117th pick / 2015
3   32.0               Nick Bolton   23   LB   8   8.0  237.0   5-11                           Missouri   3/10/2000     2   4.0     Kansas City Chiefs / 2nd / 58th pick / 2021
4   40.0             Ekow Boye-Doe   24   CB   6   0.0  171.0    6-0                         Kansas St.   11/4/1999  Rook   0.0                                             NaN
5   26.0                 Deon Bush   30   DB   6   0.0  200.0    6-0                         Miami (FL)   8/14/1993     7   0.0         Chicago Bears / 4th / 124th pick / 2016
6   89.0              Matt Bushman   28   TE   1   0.0  245.0    6-5                                BYU   11/3/1995     2   0.0                                             NaN
7    7.0           Harrison Butker   28    K  17   0.0  199.0    6-4                       Georgia Tech   7/14/1995     6   5.0     Carolina Panthers / 7th / 233rd pick / 2017

...
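
As an alternative to replacing the comment markers in the string, the comment node itself can be pulled out with Beautiful Soup's Comment type and parsed on its own. The following is only a minimal sketch of that idea: SAMPLE_HTML is a made-up, cut-down stand-in for the real page, and commented_table_html is a hypothetical helper name, not part of any library. It also shows why find_all('tr') comes back empty on the original div.

```python
from bs4 import BeautifulSoup, Comment

# Made-up stand-in for the page: the table sits inside an HTML comment.
SAMPLE_HTML = """
<div id="all_roster">
  <div class="placeholder"></div>
  <!--
  <div class="table_container" id="div_roster">
    <table id="roster">
      <caption>Roster Table</caption>
      <thead><tr><th>No.</th><th>Player</th><th>Age</th></tr></thead>
      <tbody>
        <tr><td>73</td><td>Nick Allegretti</td><td>27</td></tr>
        <tr><td>32</td><td>Nick Bolton</td><td>23</td></tr>
      </tbody>
    </table>
  </div>
  -->
</div>
"""


def commented_table_html(html, wrapper_id, table_id):
    """Hypothetical helper: return the markup of a table hidden in a comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find(id=wrapper_id)
    if wrapper is None:
        return None
    # A comment is a single string node of type Comment; its body is not parsed as tags.
    for comment in wrapper.find_all(string=lambda s: isinstance(s, Comment)):
        inner = BeautifulSoup(comment, "html.parser")  # parse the comment body as HTML
        table = inner.find("table", id=table_id)
        if table is not None:
            return str(table)
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# Searching the wrapper directly finds no rows, because they live in the comment.
rows_before = soup.find(id="all_roster").find_all("tr")

table_html = commented_table_html(SAMPLE_HTML, "all_roster", "roster")
rows_after = BeautifulSoup(table_html, "html.parser").find_all("tr")
print(len(rows_before), len(rows_after))
```

The returned string could then be passed to pd.read_html(StringIO(table_html)) in the same way as in the answer above. Compared with the string replacement, this touches only the comment that actually contains the table.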
