Get all links from HTML, including "show more" links

Asked 2025-04-17 16:08

I am using Python and BeautifulSoup to parse HTML.

The code I am using is:

from BeautifulSoup import BeautifulSoup
import urllib2
import re

url = "http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query"

main_url = urllib2.urlopen(url)
content = main_url.read()
soup = BeautifulSoup(content)

for a in soup.findAll('a',href=True):
    print a[href]

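But I am not getting output links like this one: http://www.wikipathways.org/index.php/Pathway:WP26

Another important point: there are 107 pathways in total, but I cannot get all of the links, because the remaining ones depend on the "show links" button at the bottom of the page.

So, how can I get all of the links (107 links) from this URL?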

web scraping HTML beautifulsoup data parsing web automation pagination links extraction url retrieval

2 Answers

I suggest you use lxml. It is faster and parses HTML better, and it is worth the time to learn.

from lxml.html import parse
dom = parse('http://www.wikipathways.org//index.php?query=signal+transduction+pathway&species=Homo+sapiens&title=Special%3ASearchPathways&doSearch=1&ids=&codes=&type=query').getroot()
links = dom.cssselect('a')

That should get you started.
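As a possible next step, here is a minimal sketch of turning the parsed anchors into a list of pathway URLs. It assumes Python 3 with lxml installed, uses an XPath query instead of cssselect (so the separate cssselect package is not needed), and runs on a small hypothetical HTML string rather than the live page. The "Pathway:" substring filter is an assumption based on the example URL in the question, and pathway_links is an illustrative helper name.

```python
from lxml import html

# Hypothetical sample markup standing in for the downloaded search page.
SAMPLE_HTML = """
<html><body>
  <a href="/index.php/Pathway:WP26">Signal transduction pathway</a>
  <a href="/index.php/Help:Contents">Help</a>
  <a name="top">anchor without href</a>
</body></html>
"""

BASE_URL = "http://www.wikipathways.org/"


def pathway_links(markup, base_url=BASE_URL):
    """Return absolute hrefs of anchors whose URL contains 'Pathway:'."""
    doc = html.fromstring(markup)
    # Rewrite relative hrefs against base_url so the results are full URLs.
    doc.make_links_absolute(base_url)
    # XPath selects the href attribute of every <a> element that has one.
    hrefs = doc.xpath("//a/@href")
    # Assumption: pathway pages are the links containing "Pathway:".
    return [h for h in hrefs if "Pathway:" in h]


print(pathway_links(SAMPLE_HTML))
```

With the real page, the markup would come from the parse(...) call shown above. Only links present in the downloaded HTML can be found this way.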

Answered 2025-04-17 by Python大师

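Your problem is on line 8, content = url.read(). That line does not actually read the web page. Since url is just a string, it does nothing useful (if anything, you should get an error).

main_url is the object you want to read from, so change line 8 to: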

content = main_url.read()

You have another error: print a[href]. Here href is treated as an undefined variable name, but the attribute key must be a string, so it should be:

print a['href']
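For readers on Python 3, where urllib2 and the old BeautifulSoup import no longer exist, the same idea can be sketched with the standard library alone. This is not the code from the question: it swaps BeautifulSoup for html.parser so it runs without third-party packages, and LinkCollector, extract_links, and the sample markup are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; the key is looked up as
        # the string "href", the same fix as a['href'] in the answer above.
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def extract_links(markup, base_url):
    """Parse markup and return the absolute URLs of all anchors with an href."""
    parser = LinkCollector(base_url)
    parser.feed(markup)
    parser.close()
    return parser.links


# Hypothetical sample markup; in practice this would be the decoded page body.
sample = '<a href="/index.php/Pathway:WP26">WP26</a> <a name="top">no href</a>'
for link in extract_links(sample, "http://www.wikipathways.org/"):
    print(link)
```

Anchors without an href are skipped, which mirrors the href=True filter in the original findAll call.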

Answered 2025-04-17 by Python大师