Extracting links from a web page with Python and BeautifulSoup

Question

How can I retrieve the links on a web page with Python and copy the URL each link points to?

Answer 1

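Others have recommended BeautifulSoup, but lxml is the better choice. Despite its name, it is also meant for parsing and scraping HTML. It is much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup does (which is BeautifulSoup's claim to fame). If you don't want to learn the lxml API, it also offers a BeautifulSoup-compatible interface.

Ian Bicking agrees.

Unless you are on Google App Engine or some other platform where anything that is not pure Python is disallowed, there is no reason to use BeautifulSoup anymore.

lxml.html also supports CSS3 selectors, so this sort of task is trivial.

An example using lxml and XPath looks like this: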

import urllib.request
import lxml.html

# Download the page (Python 3; on Python 2 use urllib.urlopen instead)
connection = urllib.request.urlopen('http://www.nytimes.com')

dom = lxml.html.fromstring(connection.read())

# Select the URL in the href attribute of every <a> tag (link)
for link in dom.xpath('//a/@href'):
    print(link)
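
For the restricted case mentioned above, where only pure Python is allowed, here is a minimal standard-library sketch of the same idea. The class name LinkCollector, the sample markup, and the example.com base URL are all made up for illustration. It also resolves relative links against the page URL, which is usually what you want when copying link addresses.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


# Hypothetical sample markup; in practice this would be the downloaded page.
sample_html = (
    '<p><a href="/about">About</a> '
    '<a name="top">no href</a> '
    '<a href="http://example.org/x">X</a></p>'
)

collector = LinkCollector("http://example.com/index.html")
collector.feed(sample_html)
print(collector.links)
```

This is slower to write and less forgiving than lxml or BeautifulSoup, so treat it as a fallback rather than a replacement.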

Answer 2

For completeness' sake, here is the BeautifulSoup 4 version, which also uses the encoding supplied by the server:

from bs4 import BeautifulSoup
import urllib.request

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print(link['href'])

The Python 2 version:

from bs4 import BeautifulSoup
import urllib2

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, parser, from_encoding=resp.info().getparam('charset'))

for link in soup.find_all('a', href=True):
    print link['href']

And a version using the requests library, which works as written in both Python 2 and 3:

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)

for link in soup.find_all('a', href=True):
    print(link['href'])

The soup.find_all('a', href=True) call finds all <a> elements that have an href attribute; elements without the attribute are skipped.
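
To see the filtering in isolation, here is a small sketch run against a hand-written snippet instead of a live page. The sample markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two real links and one anchor that has no href.
sample_html = """
<a href="/first">first</a>
<a name="anchor-only">no href</a>
<a href="https://example.com/second">second</a>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# href=True keeps only the <a> tags that carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```

The named anchor in the middle is dropped, so indexing with a["href"] never raises a KeyError.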

Development of BeautifulSoup 3 stopped in March 2012; new projects really should always use BeautifulSoup 4.

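Note that you should leave decoding the HTML from bytes to BeautifulSoup. You can tell BeautifulSoup about the character set found in the HTTP response headers to assist in decoding, but that information can be wrong and can conflict with the <meta> header information found in the HTML itself. That is why the code above uses the BeautifulSoup internal class method EncodingDetector.find_declared_encoding(), so that such embedded encoding hints win over a misconfigured server.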

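With requests, the response.encoding attribute defaults to Latin-1 if the response has a text/* MIME type, even if no character set was returned. This is consistent with the HTTP RFCs but painful when parsing HTML, so you should ignore that attribute when no charset is set in the Content-Type header.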
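
The decision described in the last two paragraphs can be pulled out into a small helper, which makes it easier to reason about. This is only a sketch: choose_encoding is a hypothetical name, not part of requests or BeautifulSoup, and it simply mirrors the three lines of the requests example above.

```python
def choose_encoding(content_type, http_encoding, html_encoding):
    """Return the hint to pass as from_encoding, or None to let bs4 detect it.

    content_type  -- raw Content-Type header value (may be None or empty)
    http_encoding -- what the HTTP client reports, e.g. resp.encoding
    html_encoding -- encoding declared inside the document, if any
    """
    # Trust the HTTP encoding only if the header explicitly carried a charset;
    # otherwise it is likely just the Latin-1 default for text/* responses.
    header_has_charset = "charset" in (content_type or "").lower()
    trusted_http = http_encoding if header_has_charset else None
    # An encoding declared in the document itself wins over the header.
    return html_encoding or trusted_http


# No charset in the header: ignore the Latin-1 default.
print(choose_encoding("text/html", "ISO-8859-1", None))
# Header charset only: use it.
print(choose_encoding("text/html; charset=utf-8", "utf-8", None))
# Header and document disagree: the document's declaration wins.
print(choose_encoding("text/html; charset=ISO-8859-1", "ISO-8859-1", "utf-8"))
```

When the helper returns None, BeautifulSoup falls back to its own detection, which is the behaviour recommended above.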

Answer 3

Here is a short snippet using the SoupStrainer class in BeautifulSoup:

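import httplib2
from bs4 import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

# Parse only the <a> tags; naming the parser explicitly avoids bs4's
# "no parser was explicitly specified" warning
for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])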

The BeautifulSoup documentation is actually quite good and covers a number of typical scenarios:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Edit: I used the SoupStrainer class because it is more efficient in both memory and speed, provided you know in advance what you are parsing.
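
To illustrate what "parse only" means in practice, here is a small sketch on hand-written markup (the sample HTML is made up). Everything that is not an <a> tag never makes it into the tree at all, which is where the savings come from.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page with a heading, some text, and two links.
sample_html = """
<html><body>
<h1>Title</h1>
<p>Intro <a href="/one">one</a></p>
<div><a href="/two">two</a><span>ignored</span></div>
</body></html>
"""

only_links = SoupStrainer("a")
soup = BeautifulSoup(sample_html, "html.parser", parse_only=only_links)

hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)

# The heading was never added to the tree, so looking it up finds nothing.
print(soup.find("h1"))
```

The trade-off is that you cannot go back and query the rest of the page from the same soup object; if you need more than the links, parse the whole document instead.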
