BeautifulSoup returns nothing even though the element exists

Posted 2024-04-29 02:41:22


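I've gone through most of the solutions to similar questions but haven't found one that works. More importantly, I haven't found an explanation of why this happens, other than that JavaScript or something similar is being invoked on the site.

I'm trying to scrape the "Officials" table for a game from this page: http://www.pro-football-reference.com/boxscores/201309050den.htm

My code is: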

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
html = urlopen(url)
bsObj = BeautifulSoup(html, "lxml")
officials = bsObj.findAll("table",{"id":"officials"})

for entry in officials:
    print(str(entry))

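I'm just printing to the console for now, but I get an empty list with findAll, or None with find. I also tried the basic html.parser, with no luck.

Can someone with a better understanding of HTML tell me exactly what is different about this page? Thanks in advance!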


3 Answers

Try the following code:

from selenium import webdriver
import time
from bs4 import BeautifulSoup


driver = webdriver.Chrome()
url= "http://www.pro-football-reference.com/boxscores/201309050den.htm"
driver.maximize_window()
driver.get(url)

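# Give the page's JavaScript time to finish manipulating the DOM
time.sleep(5)
# page_source is the rendered DOM, not the raw HTTP response
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
officials = soup.findAll("table",{"id":"officials"})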

for entry in officials:
    print(str(entry))


driver.quit()

It will print:

<table class="suppress_all sortable stats_table now_sortable" data-cols-to-freeze="0" id="officials"><thead><tr class="thead onecell"><td class=" center" colspan="2" data-stat="onecell">Officials</td></tr></thead><caption>Officials Table</caption><tbody>
<tr data-row="0"><th class=" " data-stat="ref_pos" scope="row">Referee</th><td class=" " data-stat="name"><a href="/officials/ColeWa0r.htm">Walt Coleman</a></td></tr>
<tr data-row="1"><th class=" " data-stat="ref_pos" scope="row">Umpire</th><td class=" " data-stat="name"><a href="/officials/ElliRo0r.htm">Roy Ellison</a></td></tr>
<tr data-row="2"><th class=" " data-stat="ref_pos" scope="row">Head Linesman</th><td class=" " data-stat="name"><a href="/officials/BergJe1r.htm">Jerry Bergman</a></td></tr>
<tr data-row="3"><th class=" " data-stat="ref_pos" scope="row">Field Judge</th><td class=" " data-stat="name"><a href="/officials/GautGr0r.htm">Greg Gautreaux</a></td></tr>
<tr data-row="4"><th class=" " data-stat="ref_pos" scope="row">Back Judge</th><td class=" " data-stat="name"><a href="/officials/YettGr0r.htm">Greg Yette</a></td></tr>
<tr data-row="5"><th class=" " data-stat="ref_pos" scope="row">Side Judge</th><td class=" " data-stat="name"><a href="/officials/PattRi0r.htm">Rick Patterson</a></td></tr>
<tr data-row="6"><th class=" " data-stat="ref_pos" scope="row">Line Judge</th><td class=" " data-stat="name"><a href="/officials/BaynRu0r.htm">Rusty Baynes</a></td></tr>
</tbody></table>

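You can't see it because it isn't there. Try turning JS off and opening the page in your browser, and you will see that the table is missing: the site does some JS DOM manipulation.

Your options are:

  1. In your case the HTML you want is there, just inside a comment, so use BeautifulSoup to extract it from the comment.
  2. Use Selenium or an equivalent tool to render the JS (which is exactly what your browser does).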
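Option 1 could look roughly like the following. This is a minimal sketch, not the answerer's own code: it uses bs4's `Comment` class to walk the comment nodes and re-parse the one that contains the table. The helper name `find_commented_table` and the inline `SAMPLE_HTML` are hypothetical stand-ins for illustration; with the real page you would pass the downloaded HTML instead.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the downloaded page: the table sits inside an HTML comment.
SAMPLE_HTML = """
<html><body>
<div>
<!--
<table id="officials">
<tr><th>Referee</th><td>Walt Coleman</td></tr>
</table>
-->
</div>
</body></html>
"""


def find_commented_table(html, table_id):
    """Return the first <table id=table_id> found inside an HTML comment, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment nodes are string nodes whose type is bs4.Comment.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        if table_id not in comment:
            continue
        # Re-parse the comment body as HTML and look for the table there.
        inner = BeautifulSoup(str(comment), "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


table = find_commented_table(SAMPLE_HTML, "officials")
print(table.td.get_text())
```

A plain `find("table", id="officials")` on the same sample returns None, because the parser treats the whole comment as a single string node rather than as markup.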

It is in the source, just commented out, and removing the comment delimiters with a regex is simple:

from bs4 import BeautifulSoup
import requests
import re

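url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
# Use .text (str) rather than .content (bytes): a str pattern cannot be applied to bytes in Python 3
html = requests.get(url).text
# Strip the comment delimiters so the commented-out markup is parsed as regular HTML
bsObj = BeautifulSoup(re.sub("<!--|-->","", html), "lxml")
officials = bsObj.find_all("table",{"id":"officials"})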

for entry in officials:
    print(entry)

There is only one table, so you don't need find_all, and your loop is somewhat pointless; just use find:

In [1]: from bs4 import BeautifulSoup
   ...: import requests
   ...: import re
   ...: url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
   ...: 
   ...: html = requests.get(url).text
   ...: bsObj = BeautifulSoup(re.sub("<!--|-->","", html), "lxml")
   ...: officials = bsObj.find(id="officials")
   ...: print(officials)
   ...: 

<table class="suppress_all sortable stats_table" data-cols-to-freeze="0" id="officials"><caption>Officials Table</caption><tr class="thead onecell"><td class=" center" colspan="2" data-stat="onecell">Officials</td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Referee</th><td class=" " data-stat="name"><a href="/officials/ColeWa0r.htm">Walt Coleman</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Umpire</th><td class=" " data-stat="name"><a href="/officials/ElliRo0r.htm">Roy Ellison</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Head Linesman</th><td class=" " data-stat="name"><a href="/officials/BergJe1r.htm">Jerry Bergman</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Field Judge</th><td class=" " data-stat="name"><a href="/officials/GautGr0r.htm">Greg Gautreaux</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Back Judge</th><td class=" " data-stat="name"><a href="/officials/YettGr0r.htm">Greg Yette</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Side Judge</th><td class=" " data-stat="name"><a href="/officials/PattRi0r.htm">Rick Patterson</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Line Judge</th><td class=" " data-stat="name"><a href="/officials/BaynRu0r.htm">Rusty Baynes</a></td></tr>
</table>

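Once the table has been found, its rows can be turned into plain data. The following is a sketch under assumptions: it relies on the `data-stat="ref_pos"` and `data-stat="name"` attributes visible in the printed output above, and the helper name `parse_officials` and the shortened `TABLE_HTML` sample are hypothetical, used here in place of a live request.

```python
from bs4 import BeautifulSoup

# Shortened copy of the printed table, used instead of fetching the page.
TABLE_HTML = """
<table id="officials"><caption>Officials Table</caption>
<tr class="thead onecell"><td data-stat="onecell" colspan="2">Officials</td></tr>
<tr><th data-stat="ref_pos" scope="row">Referee</th><td data-stat="name"><a href="/officials/ColeWa0r.htm">Walt Coleman</a></td></tr>
<tr><th data-stat="ref_pos" scope="row">Umpire</th><td data-stat="name"><a href="/officials/ElliRo0r.htm">Roy Ellison</a></td></tr>
</table>
"""


def parse_officials(table):
    """Return a list of (position, name) tuples from the officials table."""
    rows = []
    for tr in table.find_all("tr"):
        position = tr.find("th", attrs={"data-stat": "ref_pos"})
        name = tr.find("td", attrs={"data-stat": "name"})
        # The header row has neither cell, so it is skipped.
        if position is not None and name is not None:
            rows.append((position.get_text(strip=True), name.get_text(strip=True)))
    return rows


table = BeautifulSoup(TABLE_HTML, "html.parser").find(id="officials")
for position, name in parse_officials(table):
    print(f"{position}: {name}")
```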
