<p>HTTP servers almost always return a <code>Content-Type</code> header when responding to a <code>GET</code> or <code>HEAD</code> request for a URL:</p>
<p><img src="https://i.stack.imgur.com/S15cR.png" alt="enter image description here"/></p>
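<p>To process a large number of URLs, the fastest approach is to retrieve only the headers, without downloading the whole file, and check the MIME type in the <code>Content-Type</code> response header (here is the list of <a href="http://en.wikipedia.org/wiki/Internet_media_type#Type_image" rel="nofollow noreferrer">image mime types</a> you have to check for; they all start with <code>image/</code>, so that is what you are looking for).</p>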
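<p>As a minimal sketch of the "starts with <code>image/</code>" check, the snippet below parses a <code>Content-Type</code> value with Python's standard <code>email.message</code> module, so parameters such as <code>charset</code> and differences in letter case do not affect the result. The helper name <code>is_image_content_type</code> is an illustrative assumption, not something from a library.</p>

```python
from email.message import Message


def is_image_content_type(header_value):
    """Return True if a Content-Type header value names an image/* media type.

    Hypothetical helper for illustration: it reuses the header parser from the
    standard library instead of matching the raw string by hand.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_maintype() returns the lowercased part before the slash and
    # falls back to "text" when the value is missing or malformed.
    return msg.get_content_maintype() == "image"
```

<p>With this helper, <code>is_image_content_type("image/png")</code> is true, while a value such as <code>"text/html; charset=utf-8"</code> is rejected.</p>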
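<p>For example, using pycurl (if you are on Windows, you can get it with pip or <a href="http://pycurl.sourceforge.net/download/" rel="nofollow noreferrer">here</a>; for 64-bit Windows, <a href="http://www.lfd.uci.edu/~gohlke/pythonlibs/#pycurl" rel="nofollow noreferrer">here</a>), something like the following checks the response headers (I am not very proficient in Python, so I suggest you look up how to parse the Content-Type header to check for image MIME types more robustly, and wrap the logic properly in a function):</p>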
<pre><code>#!/usr/bin/env python3
import re
from io import BytesIO

import pycurl

# Matches a Content-Type header line whose value starts with "image/".
IMAGE_HEADER = re.compile(r"^Content-Type:\s*image/", re.IGNORECASE | re.MULTILINE)


def check_image(url):
    """Return True if the response headers for url report an image/* Content-Type."""
    headers = BytesIO()
    c = pycurl.Curl()
    try:
        c.setopt(c.URL, url)
        c.setopt(c.HEADER, 1)
        c.setopt(pycurl.SSL_VERIFYPEER, 0)  # do not verify the SSL certificate
        c.setopt(pycurl.SSL_VERIFYHOST, 0)
        c.setopt(c.NOBODY, 1)  # headers only, no body
        c.setopt(c.HEADERFUNCTION, headers.write)
        c.setopt(pycurl.WRITEFUNCTION, lambda data: None)  # discard anything else
        c.perform()
    finally:
        c.close()
    # pycurl delivers headers as bytes; ISO-8859-1 decoding never fails.
    text = headers.getvalue().decode("iso-8859-1")
    return IMAGE_HEADER.search(text) is not None


if not check_image('http://www.wikipedia.org/'):
    print('The resource in http://www.wikipedia.org/ is not an image')
if check_image('https://encrypted-tbn1.gstatic.com/images?q=tbn%3AANd9GcTwC6cNpAen5dgGgTmmH2SG75xhvTN-oRliaOgG-3meNQVm-GdpUu7SQX5wpA'):
    print('The resource in https://encrypted-tbn1.gstatic.com/images?q=tbn%3AANd9GcTwC6cNpAen5dgGgTmmH2SG75xhvTN-oRliaOgG-3meNQVm-GdpUu7SQX5wpA is an image')
</code></pre>
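<p>If installing pycurl is not an option, the same header-only check can be sketched with the standard library alone. The version below is an assumption-laden alternative rather than the approach above: the names <code>head_is_image</code> and <code>check_many</code>, the 10-second timeout and the pool of 8 threads are all illustrative choices. It sends a <code>HEAD</code> request with <code>urllib</code> and, since the goal is to get through many URLs, runs the checks in a small thread pool.</p>

```python
from concurrent.futures import ThreadPoolExecutor
from http.client import HTTPException
from urllib.request import Request, urlopen


def head_is_image(url, timeout=10):
    """Send a HEAD request and report whether the Content-Type is image/*.

    Hypothetical helper: any failure (bad URL, unreachable host, HTTP error
    status) is treated as "not an image".
    """
    try:
        request = Request(url, method="HEAD")  # raises ValueError on a malformed URL
        with urlopen(request, timeout=timeout) as response:
            # The part before the slash, lowercased; "text" if the header is absent.
            return response.headers.get_content_maintype() == "image"
    except (OSError, ValueError, HTTPException):
        # URLError and HTTPError are subclasses of OSError.
        return False


def check_many(urls, checker=head_is_image, max_workers=8):
    """Map each URL to True/False, running the checks in a thread pool."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(checker, urls)))
```

<p>Calling <code>check_many(list_of_urls)</code> returns a dictionary keyed by URL. The <code>checker</code> parameter is there so that a different check, such as the pycurl-based function above, can be passed in instead. Note that some servers answer <code>HEAD</code> differently from <code>GET</code>, so results are only as reliable as the headers the server sends.</p>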