How do I extract a URL from an HTML anchor with Python 3?

-2 votes

1 answer

1666 views

Asked 2025-04-18 15:55

I want to extract a URL from the HTML source of a web page.
For example:

xyz.com source code:
<a rel="nofollow" href="example/hello/get/9f676bac2bb3.zip">Download XYZ</a>

I want to extract:

example/hello/get/9f676bac2bb3.zip

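How can I extract this URL?

I don't know regular expressions well. I also don't know how to install Beautiful Soup 4 or lxml on Windows; I got errors when I tried to install those libraries.

Here is what I tried: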

C:\Users\admin\Desktop>python
Python 3.3.2 (v3.3.2:d047928ae3f6, May 16 2013, 00:03:43) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> r = re.compile('(?<=href=").*?(?=")')
>>> r.findall(url)
['/example/hello/get/9f676bac2bb3.zip']
>>> url
'<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> r.findall(url)[0]
'/example/hello/get/9f676bac2bb3.zip'
>>> a = "https://xyz.com"
>>> print(a + r.findall(url)[0])
https://xyz.com/example/hello/get/9f676bac2bb3.zip
>>>

But this is only a hard-coded HTML sample. How do I get the source of a web page and then run my code against it?


1 Answer

You can handle this with the built-in xml.etree.ElementTree:

>>> import xml.etree.ElementTree as ET
>>> url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> ET.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'

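This works for this particular example, but xml.etree.ElementTree is an XML parser, not an HTML parser, and it raises an error on markup that is not well-formed XML, which real-world HTML often is not. Consider using BeautifulSoup instead: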

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(url, 'html.parser').a.get('href')
'/example/hello/get/9f676bac2bb3.zip'

Alternatively, you can use lxml.html:

>>> import lxml.html
>>> lxml.html.fromstring(url).attrib.get('href')
'/example/hello/get/9f676bac2bb3.zip'

Personally, I prefer BeautifulSoup, because it makes parsing HTML simple, clear, and fun.
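Since the question mentions trouble installing third-party packages, it may help to see that the standard library's html.parser module can do the same job without installing anything. The following is only a minimal sketch; the names HrefCollector and extract_hrefs are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names are lowercased by HTMLParser before this method is called.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Hypothetical helper: return all anchor hrefs found in an HTML string."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


url = '<a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
print(extract_hrefs(url))  # ['/example/hello/get/9f676bac2bb3.zip']
```

Unlike xml.etree.ElementTree, html.parser does not require well-formed markup, so it should cope with typical real-world pages, although BeautifulSoup or lxml remain more convenient for anything beyond simple attribute extraction.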

To follow the link and download the file, you need to build a full URL, including the scheme and domain (urljoin() can help with that), and then call urlretrieve(). Example:

>>> BASE_URL = 'http://example.com'
>>> from urllib.parse import urljoin
>>> from urllib.request import urlretrieve
>>> href = BeautifulSoup(url, 'html.parser').a.get('href')
>>> urlretrieve(urljoin(BASE_URL, href))
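To see why urljoin() is preferable to plain string concatenation, here is a small sketch of how it resolves the two href forms that appear in the question (with and without a leading slash). The page address in BASE_URL is a hypothetical one, chosen only so that the difference is visible.

```python
from urllib.parse import urljoin

BASE_URL = "https://xyz.com/downloads/index.html"  # hypothetical page address

# A path starting with "/" is resolved against the site root.
rooted = urljoin(BASE_URL, "/example/hello/get/9f676bac2bb3.zip")
# A path without a leading "/" is resolved against the page's directory.
relative = urljoin(BASE_URL, "example/hello/get/9f676bac2bb3.zip")
# An href that is already absolute is returned unchanged.
absolute = urljoin(BASE_URL, "http://example.com/file.zip")

print(rooted)    # https://xyz.com/example/hello/get/9f676bac2bb3.zip
print(relative)  # https://xyz.com/downloads/example/hello/get/9f676bac2bb3.zip
print(absolute)  # http://example.com/file.zip
```

Note also that urlretrieve() accepts an optional second argument, the local filename to save to; without it, the file is written to a temporary location and the path is returned as the first element of the result tuple.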

Update (for the different HTML mentioned in the comments):

>>> from bs4 import BeautifulSoup
>>> data = '<html> <head> <body><example><example2> <a rel="nofollow" href="/example/hello/get/9f676bac2bb3.zip">XYZ</a> </example2></example></body></head></html>'
>>> BeautifulSoup(data, 'html.parser').find('a', string='XYZ').get('href')
'/example/hello/get/9f676bac2bb3.zip'
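The question also asks how to get the source of a live page before extracting from it. One possible way, using only the standard library, is sketched below. It reuses the regular expression from the question; fetch_html and absolute_links are hypothetical helper names, and the fallback to UTF-8 is an assumption for servers that do not declare a charset.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# The pattern from the question: text between href=" and the next ".
HREF_RE = re.compile(r'(?<=href=").*?(?=")')


def fetch_html(page_url):
    """Download a page and decode it to text (assumes UTF-8 if undeclared)."""
    with urlopen(page_url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def absolute_links(html, page_url):
    """Return every double-quoted href in html as an absolute URL."""
    return [urljoin(page_url, href) for href in HREF_RE.findall(html)]
```

With these in place, something like absolute_links(fetch_html("https://xyz.com"), "https://xyz.com") would fetch the page and list its links, where xyz.com is the placeholder site from the question. The regular expression only matches double-quoted href values, so for arbitrary pages an HTML parser is the more robust choice.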

Answered 2025-04-18 by Python大师
