How to extract all link names from an HTML page

-1 votes

3 answers

972 views

Asked 2025-04-30 19:56

Without using any libraries...

I want to get the titles of all the links on a web page. Here is my code:

import re
import urllib.request

url = "http://einstein.biz/"
m = urllib.request.urlopen(url)
msg = m.read()  # raw bytes of the page
titleregex = re.compile(r'<a\s*href=[\'|"].*?[\'"].*?>(.+?)</a>')
titles = titleregex.findall(str(msg))
print(titles)

The titles I get are:

['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store', '\\xe6\\x97\\xa5\\xe6\\x9c\\xac\\xe8\\xaa\\x9e', '<img\\n\\t\\tsrc="http://corbisrightsceleb.122.2O7.net/b/ss/corbisrightsceleb/1/H.14--NS/0"\\n\\t\\theight="1" width="1" border="0" alt="" />']

That is not ideal. I would like to get only the following:

['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store']

How should I change my code?


3 Answers

I prefer lxml.html over BeautifulSoup because it supports XPath and CSS selectors.

import requests
import lxml.html

res = requests.get("http://einstein.biz/")
doc = lxml.html.fromstring(res.content)
links = doc.cssselect("a")  # needs the separate cssselect package
for link in links:
    print(link.text)        # text directly inside the <a>, may be None

Answered 2025-04-30 by Python大师


I suggest you take a look at BeautifulSoup, as @serge mentioned. To make it more convincing, here is code that does exactly what you need.

from bs4 import BeautifulSoup
soup = BeautifulSoup(msg, 'html.parser')  # Feed BeautifulSoup your HTML (msg from the question).
for link in soup.find_all('a'):           # Look at all the 'a' tags.
    print(link.string)                    # Print out the descriptions.

Output:

Photo Gallery
Bio
Quotes
Links
Contact
official store

Answered 2025-04-30 by Python大师


When working with HTML or XML files, you really should use BeautifulSoup.

>>> url="http://einstein.biz/"
>>> import urllib.request
>>> m = urllib.request.urlopen(url)
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(m, 'html.parser')
>>> s = soup.find_all('a')
>>> [i.string for i in s]
['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store', '日本語', None]
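
The list above still contains '日本語' and a None for the image-only link, both of which the question wants to leave out. A minimal sketch of one way to filter them, assuming "wanted" means non-empty, pure-ASCII text; the helper name `keep_ascii_titles` is hypothetical and not part of BeautifulSoup:

```python
def keep_ascii_titles(strings):
    """Keep only non-empty, pure-ASCII link texts (hypothetical helper)."""
    kept = []
    for s in strings:
        # Entries can be None, e.g. for a link that wraps only an <img>.
        if s is None:
            continue
        text = s.strip()
        # str.isascii() needs Python 3.7+; it rejects labels such as '日本語'.
        if text and text.isascii():
            kept.append(text)
    return kept


# The list printed by the session above, used here as sample input.
raw = ['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact',
       'official store', '日本語', None]
titles = keep_ascii_titles(raw)
print(titles)
```

Whether ASCII-only is the right rule depends on the page; it would also drop legitimate labels that contain accented characters.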

Update:

>>> import re
>>> import urllib.request
>>> url="http://einstein.biz/"
>>> m = urllib.request.urlopen(url)
>>> msg = m.read()
>>> regex = re.compile(r'(?s)<a\s*href=[\'"].*?[\'"][^<>]*>([A-Za-z][^<>]*)</a>')
>>> titles = regex.findall(str(msg))
>>> print(titles)
['Photo Gallery', 'Bio', 'Quotes', 'Links', 'Contact', 'official store']
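
Since the question asks for a solution without third-party libraries, the standard library's `html.parser.HTMLParser` may be an alternative to a regex. A minimal sketch, assuming only plain-ASCII link text is wanted; the class name `LinkTextParser` and the sample HTML are illustrative and not taken from the site:

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text inside each <a>...</a> element (illustrative class)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._parts = None  # list of text chunks while inside an <a>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._parts is not None:
            text = "".join(self._parts).strip()
            # Skip image-only links (no text) and non-ASCII labels.
            if text and text.isascii():
                self.titles.append(text)
            self._parts = None


# Made-up sample markup. For a live page, decode the bytes first, e.g.
# html = urllib.request.urlopen(url).read().decode("utf-8")  # assumes UTF-8
html = (
    '<a href="/bio">Bio</a>'
    '<a class="x" href="/store">official store</a>'
    '<a href="/jp">日本語</a>'
    '<a href="/t"><img\n\t\tsrc="p.gif" alt="" /></a>'
)
parser = LinkTextParser()
parser.feed(html)
parser.close()
print(parser.titles)
```

Decoding the bytes, rather than calling `str(msg)`, also avoids the escaped `\\xe6...` sequences seen in the question, because `str()` on a bytes object returns its `b'...'` repr instead of the text.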

Answered 2025-04-30 by Python大师

