How do I extract the URL from <a href="TextWithUrlBehind">Something</a> with BeautifulSoup?

Asked 2025-04-11 23:19 by 数据工程师

I am trying to extract some links and their text from a web page and save them to a .json file.

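I have already parsed tbody > tr > td in the HTML, and each td contains a structure like <a href="TextWithUrlBehind">Something</a>.

However, when I inspect the element, TextWithUrlBehind is clickable and there is a link behind it. That link is not in the usual <a href=https//...> form.

So the href I extract is the str TextWithUrlBehind, and that is what ends up in the .json file, together with the text (also a str) Something.

The code looks roughly like this: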

rows = test_results_table.find_all("tr")
                
# Iterate over each table row and look for an anchor in its first cell
for row in rows:
    first_cell = row.find("td")
    if first_cell:
        anchor_tag = first_cell.find("a", href=True)
        self._debug_print("Anchor tag content:", anchor_tag)
        if anchor_tag:
            href = anchor_tag["href"]
            text = anchor_tag.get_text(strip=True)
            links.append({"href": href, "text": text})
            self._debug_print("Content extracted:", {"href": href, "text": text})
        else:
            self._debug_print("No anchor tag found in cell:", first_cell)
    else:
        self._debug_print("No table cell found in row:", row)

I don't understand how this link is attached in the HTML, or which of BeautifulSoup's built-in functions could help me retrieve it.

html-parsing beautifulsoup web-parsing data-scraping hyperlinks link-extraction json-processing web-elements

1 Answer

from bs4 import BeautifulSoup as bs
import requests as rq

# Replace <your url> with the URL you want to scrape
url = '<your url>'

# Fetch the page and parse it with the built-in HTML parser
r = rq.get(url)
soup = bs(r.content, "html.parser")
links = soup.find_all("a")

# Build a dict mapping each link's clickable text to its href
dct = {}
for x in links:
    # Note: x.string is None if the anchor has more than one child node
    key = x.string
    val = x.get("href")
    dct[key] = val

print(dct)

The output is a dictionary whose keys are the clickable text and whose values are the links that text points to.
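One possible explanation, which the question does not confirm, is that an href such as TextWithUrlBehind is a relative URL. A browser resolves a relative href against the address of the page it appears on (or against a <base href="..."> element, if the page declares one), which would be why the text is clickable even though it does not start with https://. BeautifulSoup returns the attribute exactly as written, so the resolution has to be done separately. Below is a minimal sketch using urljoin from the standard library. It works on a list of {"href", "text"} dicts like the one the loop in the question builds. The page URL and the helper name resolve_links are assumptions made up for illustration.

```python
from urllib.parse import urljoin


def resolve_links(links, page_url):
    """Return copies of the link dicts with each href made absolute.

    `links` is a list of {"href": ..., "text": ...} dicts, and
    `page_url` is the address the HTML was downloaded from.
    """
    return [
        {"href": urljoin(page_url, link["href"]), "text": link["text"]}
        for link in links
    ]


# Hypothetical page address and extracted values, for illustration only
page_url = "https://example.com/reports/index.html"
links = [{"href": "TextWithUrlBehind", "text": "Something"}]

resolved = resolve_links(links, page_url)
print(resolved)
```

Hrefs that are already absolute pass through urljoin unchanged, so the same helper can be applied to every extracted link. If the link turns out to be attached by JavaScript rather than by a relative href, this approach will not recover it.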

Answered 2025-04-11 by Python大师
