BeautifulSoup: getting the href

350 votes

1 answer

690735 views

Asked 2025-04-16 16:34

I have the following soup:

<a href="some_url">next</a>
<span class="class">...</span>

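I want to extract the link from it, that is, "some_url".

I can do this when there is only one tag, but here there are two. I can also get the text 'next', but that is not what I want.

Also, is there somewhere I can find a good API description with examples? I am using the standard documentation, but I would like something better organized.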

Tags: data extraction, web scraping, HTML parsing, web crawler, API documentation, beautifulsoup, link extraction

1 Answer

543

You can use the find_all method to find every a element that has an href attribute, and print each one:

# Python 2 with BeautifulSoup 3 (legacy)
from BeautifulSoup import BeautifulSoup

html = '''<a href="some_url">next</a>
<span class="class"><a href="another_url">later</a></span>'''

soup = BeautifulSoup(html)

# BeautifulSoup 3 only provides the camelCase name findAll
for a in soup.findAll('a', href=True):
    print "Found the URL:", a['href']

# The output would be:
# Found the URL: some_url
# Found the URL: another_url

# Python 3 with BeautifulSoup 4
from bs4 import BeautifulSoup

html = '''<a href="https://some_url.com">next</a>
<span class="class">
<a href="https://some_other_url.com">another_url</a></span>'''

# Name the parser explicitly; otherwise bs4 picks one and emits a warning
soup = BeautifulSoup(html, 'html.parser')

for a in soup.find_all('a', href=True):
    print("Found the URL:", a['href'])

# The output would be:
# Found the URL: https://some_url.com
# Found the URL: https://some_other_url.com
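
For the exact soup in the question, where only one link is wanted, a loop is not strictly needed. The following is a minimal sketch, assuming BeautifulSoup 4 and the built-in html.parser; the variable names first_link, first_href, and all_hrefs are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '''<a href="some_url">next</a>
<span class="class">...</span>'''

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches
first_link = soup.find("a", href=True)
first_href = first_link["href"] if first_link is not None else None
print(first_href)

# Tag.get() returns None instead of raising KeyError
# when an <a> tag has no href attribute
all_hrefs = [a.get("href") for a in soup.find_all("a")]
print(all_hrefs)
```

Indexing with a['href'] raises KeyError when the attribute is absent, so a.get('href') is probably the safer choice whenever the search is not already filtered with href=True.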

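Note that if you are using an older version of BeautifulSoup (before version 4), this method is called findAll. In version 4, BeautifulSoup's method names were changed to follow PEP 8, so you should use find_all instead.

If you want every tag that has an href, whatever its name, you can omit the name parameter: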

href_tags = soup.find_all(href=True)
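
As a rough illustration of what omitting the name does, the sketch below runs that call on a made-up snippet (the link, a, and anchor markup here is an assumption, not part of the question) and lists which tags match. It assumes BeautifulSoup 4 with html.parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tags carry an href, one <a> does not
html = '''<link rel="stylesheet" href="style.css">
<a href="some_url">next</a>
<a name="anchor_without_href">skip me</a>'''

soup = BeautifulSoup(html, "html.parser")

# href=True matches any tag that has an href attribute, regardless of tag name
href_tags = soup.find_all(href=True)

# Pair each tag's name with its href value, in document order
pairs = [(tag.name, tag["href"]) for tag in href_tags]
print(pairs)
```

Here the link element is picked up alongside the first a element, while the a element without an href is skipped.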

Answered 2025-04-16 by Python大师
