Getting relative links from an HTML page

1 vote

1 answer

915 views

Asked 2025-04-18 11:27

I want to extract the relative links from a web page. Someone suggested this approach:

find_re = re.compile(r'\bhref\s*=\s*("[^"]*"|\'[^\']*\'|[^"\'<>=\s]+)', re.IGNORECASE)

But what it returns is:

1. Every link on the page, absolute and relative alike.

2. Link values that may still be wrapped in either "" or '' quotes.
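To make the two symptoms concrete, here is a minimal sketch that runs the same regex on a small sample. The `sample` string and the `raw` and `cleaned` names are made up for illustration and are not part of the original question. The quotes show up because the capture group includes them, and they can be stripped afterwards:

```python
import re

# The regex from the question, unchanged.
find_re = re.compile(r'\bhref\s*=\s*("[^"]*"|\'[^\']*\'|[^"\'<>=\s]+)', re.IGNORECASE)

# Hypothetical sample markup mixing double-quoted, single-quoted and unquoted hrefs.
sample = '<a href="http://google.com">a</a> <a href=\'test2\'>b</a> <a href=here/we/go>c</a>'

# The capture group includes the surrounding quotes, so they appear in the result.
raw = find_re.findall(sample)
print(raw)

# Strip one layer of leading/trailing quote characters from each match.
cleaned = [value.strip('"\'') for value in raw]
print(cleaned)
```

This only removes the quoting. The list still contains absolute and relative links together, so a separate filtering step is needed.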

data-extraction web-scraping html-parsing relative-links

1 Answer

Use the right tool for the job: an HTML parser such as BeautifulSoup.

You can pass a function as the attribute value to find_all() and check whether the href starts with http:

from bs4 import BeautifulSoup

data = """
<div>
<a href="http://google.com">test1</a>
<a href="test2">test2</a>
<a href="http://amazon.com">test3</a>
<a href="here/we/go">test4</a>
</div>
"""
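# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")
# Keep only <a> tags whose href exists and does not start with "http".
print(soup.find_all('a', href=lambda x: x is not None and not x.startswith('http')))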

Alternatively, use urlparse and check the network location (netloc) part:

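from urllib.parse import urlparse

def is_relative(url):
    # A URL with no network location (host) is treated as relative.
    return url is not None and not urlparse(url).netloc

print(soup.find_all('a', href=is_relative))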

Both approaches print:

[<a href="test2">test2</a>, 
 <a href="here/we/go">test4</a>]
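The two checks agree on the sample above, but they may disagree on other href values. The following is a minimal sketch that compares them on plain strings, so it runs without BeautifulSoup. The helper names `relative_by_prefix` and `relative_by_netloc` and the extra href values are assumptions made for illustration and do not come from the original answer:

```python
from urllib.parse import urlparse

def relative_by_prefix(href):
    # First approach: anything not starting with "http" counts as relative.
    return not href.startswith("http")

def relative_by_netloc(href):
    # Second approach: anything without a network location counts as relative.
    return not urlparse(href).netloc

# Hypothetical href values; the first three appear in the answer's sample.
hrefs = [
    "test2",
    "here/we/go",
    "http://google.com",
    "//cdn.example.com/lib.js",    # protocol-relative URL
    "httpdocs/index.html",         # relative path that happens to start with "http"
    "mailto:someone@example.com",  # has a scheme but no network location
]

for href in hrefs:
    print(href, relative_by_prefix(href), relative_by_netloc(href))
```

In this sketch the prefix check treats `//cdn.example.com/lib.js` as relative and `httpdocs/index.html` as absolute, while the netloc check does the opposite in both cases. Both checks accept the `mailto:` value, so if such links matter, additionally requiring an empty `urlparse(href).scheme` may be worth considering.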

Answered 2025-04-18 by Python大师
