How do I extract a link from an HTML page in Python?

Asked 2025-04-14 16:01

Starting from this Python code,

...
resp = logout_session.get(logout_url, headers=headers, verify=False, allow_redirects=False)
soup = BeautifulSoup(resp.content, "html.parser")
print(soup.prettify())

I successfully made an API call, and the response content looks like this:

<!DOCTYPE html>
<html>
 <head>...</head>
 <body>
  <div class="container">
   <div class="title logo" id="header">
    <img alt="" id="business-logo-login" src="/customviews/image/business_logo:f0a067275aba3c71c62cffa2f50ac69c/"/>
   </div>
   <div class="input-group alert alert-success text-center" id="title" role="alert">
    Successfully signed out
   </div>
   <div class="input-group alert text-center">
    <a href="/saml-idp/portal/">
     Login again
    </a>
   </div>
   <div>
    <p>
     You will be redirected to https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/ after 5 seconds ...
    </p>
    <script language="javascript" nonce="">
     window.onload = window.setTimeout(function() {
    window.location.replace("https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU");}, 5000);
    </script>
   </div>
  </div>
 </body>
</html>

Now I want to extract this link from that content (it is the URL passed to window.location.replace in the script):

https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU 

Does anyone know how to do this in Python?

2 Answers

1

You can try something like this:

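from bs4 import BeautifulSoup

api = """your response above"""  # the HTML text of the response
soup = BeautifulSoup(api, "html.parser")
# Take the text of the first <script> tag; the URL is its first double-quoted string
scr = soup.select_one("script").string
print(scr.split('"')[1])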

The output should be the URL.
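The snippet above takes the first `<script>` element on the page, so it would pick the wrong one if the elided `<head>` also contains a script. A minimal sketch of a more defensive variant follows. It swaps BeautifulSoup for the standard library's `html.parser` so it runs without extra packages; `ScriptCollector`, `find_redirect_url`, and the `example.invalid` sample page are hypothetical names made up for illustration, not part of the original response.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._in_script:
            self.scripts[-1] += data


def find_redirect_url(html):
    """Return the first double-quoted string after window.location.replace, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for script in collector.scripts:
        if "window.location.replace" in script:
            after = script.split("window.location.replace", 1)[1]
            parts = after.split('"')
            if len(parts) >= 3:
                return parts[1]
    return None


# Made-up sample page: the first script is unrelated, the second one redirects.
sample = """<html><head><script>var x = "not-a-url";</script></head>
<body><script language="javascript" nonce="">
window.onload = window.setTimeout(function() {
window.location.replace("https://example.invalid/logout/?SAMLResponse=abc");}, 5000);
</script></body></html>"""

print(find_redirect_url(sample))
```

With the real response you would pass `resp.text` instead of `sample`. The function returns `None` rather than raising when no script contains the redirect call.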

1

Try this:

import re

# resp = requests.get(...)

url = re.search(r'window\.location\.replace\("([^"]+)', resp.text).group(1)
print(url)

The output is:

https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU
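One caveat with the regex approach: `re.search` returns `None` when the page has no redirect call, so chaining `.group(1)` directly would raise `AttributeError`. A minimal sketch that guards against this and also pulls the `SAMLResponse` query parameter out of the URL is shown here. The names `REDIRECT_RE` and `extract_redirect_url` and the `example.invalid` sample string are assumptions made up for illustration.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Same pattern idea as the answer: capture everything up to the closing double quote.
REDIRECT_RE = re.compile(r'window\.location\.replace\("([^"]+)"')


def extract_redirect_url(text):
    """Return the URL passed to window.location.replace, or None if absent."""
    match = REDIRECT_RE.search(text)
    return match.group(1) if match else None


# Made-up stand-in for resp.text
sample_text = 'window.location.replace("https://example.invalid/proxy_logout/?SAMLResponse=abc123");'

url = extract_redirect_url(sample_text)
print(url)

if url is not None:
    # parse_qs maps each parameter name to a list of values
    query = parse_qs(urlsplit(url).query)
    print(query.get("SAMLResponse"))
```

Note that `parse_qs` decodes percent-escapes and turns `+` into a space, so if you need the parameter exactly as it appears in the URL, work with `urlsplit(url).query` directly.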
