How do I extract a link from an HTML page in Python?
Starting from this Python code,
...
resp = logout_session.get(logout_url, headers=headers, verify=False, allow_redirects=False)
soup = BeautifulSoup(resp.content, "html.parser")
print(soup.prettify())
I successfully made an API call, and the response content looks like this:
<!DOCTYPE html>
<html>
<head>...</head>
<body>
<div class="container">
<div class="title logo" id="header">
<img alt="" id="business-logo-login" src="/customviews/image/business_logo:f0a067275aba3c71c62cffa2f50ac69c/"/>
</div>
<div class="input-group alert alert-success text-center" id="title" role="alert">
Successfully signed out
</div>
<div class="input-group alert text-center">
<a href="/saml-idp/portal/">
Login again
</a>
</div>
<div>
<p>
You will be redirected to https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/ after 5 seconds ...
</p>
<script language="javascript" nonce="">
window.onload = window.setTimeout(function() {
window.location.replace("https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU");}, 5000);
</script>
</div>
</div>
</body>
</html>
Now I want to extract this link from that content:
https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU
Does anyone know how to do this in Python?
2 Answers
1
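You could try something like this:
from bs4 import BeautifulSoup as bs  # bs4 exports BeautifulSoup; it has no name `bs`

api = """your response above"""
soup = bs(api, "html.parser")
# select_one returns the first <script> tag; .string is its JavaScript source
scr = soup.select_one('script').string
# The URL is the first double-quoted string in that script
print(scr.split('"')[1])
The output should be the URL, provided the redirect script is the first <script> tag on the page.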
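The approach above picks the first `<script>` tag, which may not be the redirect script if the elided `<head>` also contains scripts. As a rough sketch of a more defensive variant, the block below scans every script for the `window.location.replace(...)` call. It swaps in the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. The names `ScriptCollector` and `find_redirect_url`, the trimmed sample HTML, and the extra script in `<head>` are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append to the current entry.
        if self._in_script:
            self.scripts[-1] += data


def find_redirect_url(html_text):
    """Return the URL passed to window.location.replace, or None if absent."""
    collector = ScriptCollector()
    collector.feed(html_text)
    collector.close()
    for source in collector.scripts:
        match = re.search(r'window\.location\.replace\("([^"]+)"', source)
        if match:
            return match.group(1)
    return None


# Trimmed stand-in for the response in the question; the <head> script is an
# assumption added to show that the first script is not always the right one.
html = """
<html>
<head><script>var unrelated = "not-a-url";</script></head>
<body>
<script language="javascript" nonce="">
window.onload = window.setTimeout(function() {
window.location.replace("https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU");}, 5000);
</script>
</body>
</html>
"""

redirect_url = find_redirect_url(html)
print(redirect_url)
```

With the real response, `find_redirect_url(resp.text)` would be the equivalent call.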
1
Try this:
import re
# resp = requests.get(...)
url = re.search(r'window\.location\.replace\("([^"]+)', resp.text).group(1)
print(url)
The output is:
https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU
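One caveat with the regex approach: `re.search` returns `None` when the page contains no such call, so `.group(1)` would raise `AttributeError`. A minimal sketch of a more forgiving version follows. It returns `None` instead of raising, and it accepts either quote style around the URL. The helper name `extract_redirect_url` and the sample strings are hypothetical stand-ins for `resp.text`.

```python
import re

# Same idea as the pattern above, extended (as an assumption) to allow single
# or double quotes and optional whitespace after the opening parenthesis.
REDIRECT_RE = re.compile(r"""window\.location\.replace\(\s*(["'])(.+?)\1""")


def extract_redirect_url(page_text):
    """Return the URL passed to window.location.replace, or None if not found."""
    match = REDIRECT_RE.search(page_text)
    return match.group(2) if match else None


# Stand-in for resp.text, cut down to the relevant line of the script.
sample = 'window.location.replace("https://idpftc.business.com/saml/Gy736KPK3v1aWDPECRZKAn/proxy_logout/?SAMLResponse=3VjJkuNIjv2VtKijLJObJIphlWnGfd93Xtoo7vsukvr6ZkRU");}, 5000);'

url = extract_redirect_url(sample)
missing = extract_redirect_url("<html><body>no redirect here</body></html>")
print(url)
print(missing)
```

The caller can then test `if url is None:` and handle a page that carries no redirect, rather than catching an exception.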