Finding specific links with BeautifulSoup

12 votes

3 answers

20012 views

数据工程师

Asked 2025-04-17 04:07

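Hi, I can't figure out how to find links that begin with a specific string. findAll('a') works, but it returns far too many results. I only want to list the links that start with http://www.nhl.com/ice/boxscore.htm?id=

Can anyone help me?

Thanks a lot!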

data-extraction, web-scraping, html-parsing, beautifulsoup, crawling, link-filtering

3 answers

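You can fetch all the links first and then keep only the ones you need. This is quick in practice, even though the filtering happens after the search.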

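prefix = "http://www.nhl.com/ice/boxscore.htm?id="
listOfAllLinks = soup.findAll('a')
listOfLinksINeed = []

for link in listOfAllLinks:
    # Check the href attribute, not the tag itself; anchors without href give ''
    href = link.get('href', '')
    if href.startswith(prefix):
        listOfLinksINeed.append(href)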
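
For a version that can be pasted and run as-is, here is a minimal sketch of the same filter-afterwards idea. It assumes BeautifulSoup 4 (the `bs4` package) on Python 3, which this answer does not specify; the sample HTML and the helper name `boxscore_links` are made up for illustration.

```python
from bs4 import BeautifulSoup

PREFIX = "http://www.nhl.com/ice/boxscore.htm?id="

# Made-up sample page: one unrelated link, one anchor without href, two matches.
SAMPLE_HTML = """
<html><body>
  <a href="something">yep</a>
  <a name="top">no href here</a>
  <a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a>
  <a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a>
</body></html>
"""


def boxscore_links(html, prefix=PREFIX):
    """Return the href of every <a> whose href starts with prefix."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that have no href attribute at all.
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith(prefix)]


print(boxscore_links(SAMPLE_HTML))
```

Using `startswith` rather than a substring test keeps the check tied to the exact prefix from the question, so a link that merely mentions www.nhl.com elsewhere in its URL is not picked up.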

Answered 2025-04-17 by Python大师


You may not need BeautifulSoup at all, since your search is so specific.

>>> import re
>>> # Stop at a quote, whitespace, or angle bracket so each match is a single URL
>>> links = re.findall(r'http://www\.nhl\.com/ice/boxscore\.htm\?id=[^"\'\s<>]+', str(doc))
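
To see what a pattern like this actually captures, here is a small sketch using only the standard library. The sample string is made up, and the character class `[^"'\s<>]+` is one assumption about where a URL ends inside HTML; `\d+` would be stricter if the ids are always numeric. A greedy `.+` runs to the end of the line, so on single-line HTML it swallows everything after the first match.

```python
import re

# Made-up single-line document, similar to what a minified page looks like.
doc = ('<a href="something">yep</a>'
       '<a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a>'
       '<a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a>')

base = r'http://www\.nhl\.com/ice/boxscore\.htm\?id='

# Greedy ".+" keeps going to the end of the line, merging both links into one match.
greedy = re.findall(base + r'.+', doc)

# Stopping at quotes, whitespace, or angle brackets yields one URL per match.
bounded = re.findall(base + r'[^"\'\s<>]+', doc)

print(len(greedy))
print(bounded)
```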

Answered 2025-04-17 by Python大师


First, create a test document and parse it with BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> doc = '<html><body><div><a href="something">yep</a></div><div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div><a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>'
>>> soup = BeautifulSoup(doc, 'html.parser')
>>> print(soup.prettify())
<html>
 <body>
  <div>
   <a href="something">
    yep
   </a>
  </div>
  <div>
   <a href="http://www.nhl.com/ice/boxscore.htm?id=3">
    somelink
   </a>
  </div>
  <a href="http://www.nhl.com/ice/boxscore.htm?id=7">
   another
  </a>
 </body>
</html>

Next, we can find all <a> tags whose href attribute starts with http://www.nhl.com/ice/boxscore.htm?id= by passing a regular expression as the href filter:

>>> import re
>>> soup.findAll('a', href=re.compile(r'^http://www\.nhl\.com/ice/boxscore\.htm\?id='))
[<a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a>, <a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a>]
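
If you want the URL strings rather than the tag objects, or would rather avoid a regular expression, the sketch below shows two options. It assumes BeautifulSoup 4 (`bs4`) on Python 3 and reuses the same test document. As far as I know, BeautifulSoup applies a compiled pattern to each attribute value with `search()`, which is why the `^` anchor is needed to make it a prefix match.

```python
import re
from bs4 import BeautifulSoup

doc = ('<html><body><div><a href="something">yep</a></div>'
       '<div><a href="http://www.nhl.com/ice/boxscore.htm?id=3">somelink</a></div>'
       '<a href="http://www.nhl.com/ice/boxscore.htm?id=7">another</a></body></html>')
soup = BeautifulSoup(doc, "html.parser")

# Option 1: regex filter on href, then pull the attribute out of each tag.
pattern = re.compile(r'^http://www\.nhl\.com/ice/boxscore\.htm\?id=')
urls_regex = [a["href"] for a in soup.find_all("a", href=pattern)]

# Option 2: CSS attribute-prefix selector (^=), no regex needed.
urls_css = [a["href"] for a in
            soup.select('a[href^="http://www.nhl.com/ice/boxscore.htm?id="]')]

print(urls_regex)
print(urls_css)
```

Both lists should contain the same two boxscore URLs; the selector form avoids having to escape the dots and the question mark.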

Answered 2025-04-17 by Python大师

