Getting relative links from an HTML page

Posted 2024-05-20 02:31:59

I want to extract only the relative URLs from an HTML page. Someone suggested:

find_re = re.compile(r'\bhref\s*=\s*("[^"]*"|\'[^\']*\'|[^"\'<>=\s]+)', re.IGNORECASE)

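But it returns:

1/ all URLs in the page, absolute and relative alike.

2/ URLs that are still wrapped in "" or '', depending on how each one was quoted in the page.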
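
To make the two complaints concrete, here is a small sketch that runs the suggested pattern over a hypothetical one-line sample (the `html` string is an assumption, not taken from the question). It shows that the capture group keeps the surrounding quotes and that absolute URLs are still mixed in after the quotes are stripped.

```python
import re

find_re = re.compile(r'\bhref\s*=\s*("[^"]*"|\'[^\']*\'|[^"\'<>=\s]+)', re.IGNORECASE)

# Hypothetical sample markup mixing the three quoting styles the pattern accepts.
html = """<a href="http://google.com">a</a> <a href='test2'>b</a> <a href=here/we/go>c</a>"""

# The capture group includes whatever quotes surrounded each value.
raw = find_re.findall(html)
print(raw)

# Stripping the quotes is simple, but nothing here separates relative from absolute.
cleaned = [value.strip('"\'') for value in raw]
print(cleaned)
```

So the pattern alone locates href values but does not classify them; some additional filtering step is still needed.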

1 answer

User

#1 · Posted 2024-05-20 02:31:59

Use the right tool for the job: an HTML parser, such as BeautifulSoup.

You can pass a function as the href attribute value to find_all() and check whether each href starts with http:

from bs4 import BeautifulSoup

data = """
<div>
<a href="http://google.com">test1</a>
<a href="test2">test2</a>
<a href="http://amazon.com">test3</a>
<a href="here/we/go">test4</a>
</div>
"""
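# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(data, "html.parser")
# The function receives each tag's href value, or None when the attribute is missing.
print(soup.find_all('a', href=lambda x: x and not x.startswith('http')))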

Alternatively, use urlparse and check the network location part: an href whose parsed netloc is empty names no host, so it can be treated as relative.
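
The network-location check could look roughly like the sketch below. It is a reconstruction under assumptions, not the answerer's original code: `is_relative` and the `hrefs` list are illustrative names, and the import uses the Python 3 location of `urlparse` (`urllib.parse`).

```python
from urllib.parse import urlparse  # in Python 2 this lived in the urlparse module


def is_relative(href):
    """Return True when href has no network location (host) part."""
    # href may be None for tags that lack the attribute; treat that as not relative.
    if not href:
        return False
    return not urlparse(href).netloc


# Illustrative href values matching the sample markup above.
hrefs = ["http://google.com", "test2", "http://amazon.com", "here/we/go"]
relative = [h for h in hrefs if is_relative(h)]
print(relative)
```

With the soup object from the first snippet, the same function can be passed as the filter: soup.find_all('a', href=is_relative). One caveat of checking only netloc is that values such as mailto: links also have an empty network location, so they would pass this test.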

Both solutions print:

[<a href="test2">test2</a>, 
 <a href="here/we/go">test4</a>]
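
The output above is a list of tag objects. If only the URL strings are needed and installing BeautifulSoup is not an option, roughly the same filter can be sketched with the standard library's html.parser instead. This swaps in a different parser from the one the answer uses, and `RelativeLinkCollector` is a hypothetical name introduced here for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class RelativeLinkCollector(HTMLParser):
    """Collect href values of <a> tags that have no network location."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this hook.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and not urlparse(value).netloc:
                self.links.append(value)


data = """
<div>
<a href="http://google.com">test1</a>
<a href="test2">test2</a>
<a href="http://amazon.com">test3</a>
<a href="here/we/go">test4</a>
</div>
"""

collector = RelativeLinkCollector()
collector.feed(data)
print(collector.links)
```

Because the parser hands over attribute values with their quotes already removed, the quoting problem from the question does not arise here.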
