Extracting anchor tag values with BeautifulSoup

1 vote

2 answers

6823 views

Asked 2025-04-16 22:09

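I am trying to use BeautifulSoup to extract data from a website. The data is essentially a set of search results, in this case pharmacies in a particular area. The HTML of the page I want to extract from looks like this: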

<a id="body_BusinessSearchResultSummaryList_repBusinessList_lnkBusinessProfile_1" class="sr-item-link" href="http://www.mocality.co.ke/b/applegene-pharmacy/applegene/brooklyn/health-and-beauty-medical/_/airtime-chemist-cosmetics-medicine/d42f7388-3f9b-4a34-8971-dc6ae9692586?skw=pharmacys&amp;rcnt=10">Applegene Pharmacy</a>

The id of this anchor tag is incremented for each result, so the id of the next link ends in 2:

<a id="body_BusinessSearchResultSummaryList_repBusinessList_lnkBusinessProfile_2" class="sr-item-link" href="http://www.mocality.co.ke/b/natros-pharmacy/natrosoh/innercore/medical-services/_/_/0cfe6a11-7bee-41f8-8d2e-6a472557201f?skw=pharmacys&amp;rcnt=10">Natros Pharmacy</a>

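I am using the findAll('a') method, but that returns every anchor tag on the page. How can I use BeautifulSoup to parse this content and extract the values of only these specific anchor tags?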

Tags: data extraction, web scraping, HTML parsing, beautifulsoup, search results, anchor tags

2 Answers

Use the keyword arguments of find, which restrict the match by attribute:

find("a", id="whatever_1")
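
As a minimal sketch of the keyword-argument form, assuming BeautifulSoup 4 (the `bs4` package) and a shortened stand-in for the markup in the question (the `example.com` URLs are placeholders, not the real links):

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the search-result markup; the hrefs are placeholders.
html = """
<a id="body_BusinessSearchResultSummaryList_repBusinessList_lnkBusinessProfile_1" class="sr-item-link" href="http://example.com/applegene">Applegene Pharmacy</a>
<a id="body_BusinessSearchResultSummaryList_repBusinessList_lnkBusinessProfile_2" class="sr-item-link" href="http://example.com/natros">Natros Pharmacy</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Keyword arguments passed to find() are matched against tag attributes.
link = soup.find(
    "a",
    id="body_BusinessSearchResultSummaryList_repBusinessList_lnkBusinessProfile_2",
)

link_text = link.get_text()  # text between <a> and </a>
link_href = link["href"]     # value of the href attribute
print(link_text, link_href)
```

Note that find returns only the first match, or None when nothing matches, so real code should check the result before indexing into it.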

You can also pass a (boolean) function to find or findAll:

def isRight(tag):
    return ...

findAll(isRight)
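
The body of the function is left open above. One possible way to fill it in, assuming BeautifulSoup 4 and assuming the goal is to keep only anchors whose id starts with the result-link prefix from the question (the name `is_result_link` and the sample markup are hypothetical):

```python
from bs4 import BeautifulSoup

# Prefix shared by the ids of the result links in the question.
PREFIX = "body_BusinessSearchResultSummaryList_repBusinessList_lnkBusinessProfile_"


def is_result_link(tag):
    """Return True for <a> tags whose id starts with the result-link prefix."""
    # tag.get() returns the default ("") when the tag has no id attribute.
    return tag.name == "a" and tag.get("id", "").startswith(PREFIX)


# Hypothetical sample page: two result links plus two unrelated links.
html = """
<a id="nav_home" href="/">Home</a>
<a href="/about">About</a>
<a id="body_BusinessSearchResultSummaryList_repBusinessList_lnkBusinessProfile_1" href="http://example.com/applegene">Applegene Pharmacy</a>
<a id="body_BusinessSearchResultSummaryList_repBusinessList_lnkBusinessProfile_2" href="http://example.com/natros">Natros Pharmacy</a>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() calls the function once per tag and keeps those that return True.
names = [a.get_text() for a in soup.find_all(is_result_link)]
print(names)
```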

Answered 2025-04-16 by Python大师


# BeautifulSoup 4 (bs4); with BeautifulSoup 3 the import is `from BeautifulSoup import BeautifulSoup`
from bs4 import BeautifulSoup

txt = '''<a id="body_BusinessSearchResultSummaryList_repBusinessList_lnkBusinessProfile_1" class="sr-item-link" href="http://www.mocality.co.ke/b/natros-pharmacy/natrosoh/innercore/medical-services/_/_/0cfe6a11-7bee-41f8-8d2e-6a472557201f?skw=pharmacys&amp;rcnt=10">Natros Pharmacy</a>
<a id="body_BusinessSearchResultSummaryList_repBusinessList_lnkBusinessProfile_2" class="sr-item-link" href="http://www.mocality.co.ke/b/natros-pharmacy/natrosoh/innercore/medical-services/_/_/0cfe6a11-7bee-41f8-8d2e-6a472557201f?skw=pharmacys&amp;rcnt=10">Natros Pharmacy</a>'''
# Prefix shared by the ids of all result links
match = 'body_BusinessSearchResultSummaryList_repBusinessList_lnkBusinessProfile'

soup = BeautifulSoup(txt, 'html.parser')
for a in soup.find_all('a'):
    # Keep only anchors that have an id starting with the shared prefix
    if a.has_attr('id') and a['id'].startswith(match):
        print(a['href'], a.contents)
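
The prefix check in the loop above can also be pushed into the search itself. The following is a sketch rather than part of the original answer, assuming BeautifulSoup 4; the `example.com` URLs and the extra navigation link are placeholders added for illustration:

```python
import re

from bs4 import BeautifulSoup

# Placeholder markup: one unrelated link followed by two result links.
html = """
<a id="nav_home" class="nav-link" href="/">Home</a>
<a id="body_BusinessSearchResultSummaryList_repBusinessList_lnkBusinessProfile_1" class="sr-item-link" href="http://example.com/applegene">Applegene Pharmacy</a>
<a id="body_BusinessSearchResultSummaryList_repBusinessList_lnkBusinessProfile_2" class="sr-item-link" href="http://example.com/natros">Natros Pharmacy</a>
"""

soup = BeautifulSoup(html, "html.parser")

# A compiled regular expression as an attribute filter: the id must be the
# shared prefix followed by the result number.
id_pattern = re.compile(
    r"^body_BusinessSearchResultSummaryList_repBusinessList_lnkBusinessProfile_\d+$"
)
by_id = soup.find_all("a", id=id_pattern)

# The result links also share a CSS class, which can be matched with class_.
by_class = soup.find_all("a", class_="sr-item-link")

results = [(a.get_text(), a["href"]) for a in by_id]
class_names = [a.get_text() for a in by_class]
print(results)
```

Matching on the class is shorter, but it relies on the site using `sr-item-link` only for result links, which the question does not confirm.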

Answered 2025-04-16 by Python大师
