requests - get content type/size without fetching the whole page/content

Published 2024-06-12 02:32:43


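I have a simple website crawler that works well, but sometimes it gets stuck on large content such as ISO images, .exe files, and other large files. Guessing the content type from the file extension is probably not the best idea.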

Is it possible to get the content type and content length/size without fetching the whole content/page?

Here is my code:

requests.adapters.DEFAULT_RETRIES = 2
url = url.decode('utf8', 'ignore')
urlData = urlparse.urlparse(url)
urlDomain = urlData.netloc
session = requests.Session()
customHeaders = {}
if maxRedirects is None:
    session.max_redirects = self.maxRedirects
else:
    session.max_redirects = maxRedirects
self.currentUserAgent = self.userAgents[random.randrange(len(self.userAgents))]
customHeaders['User-agent'] = self.currentUserAgent
try:
    response = session.get(url, timeout=self.pageOpenTimeout, headers=customHeaders)
    currentUrl = response.url
    currentUrlData = urlparse.urlparse(currentUrl)
    currentUrlDomain = currentUrlData.netloc
    domainWWW = 'www.' + str(urlDomain)
    headers = response.headers
    contentType = str(headers['content-type'])
except Exception:
    logging.basicConfig(level=logging.DEBUG, filename=self.exceptionsFile)
    logging.exception("Get page exception:")
    response = None

3 answers

Yes.

You can use the Session.head method to issue a HEAD request:

response = session.head(url, timeout=self.pageOpenTimeout, headers=customHeaders)
contentType = response.headers['content-type']

A HEAD request is like a GET request, except that the message body is not sent.

Here is a quote from Wikipedia:

HEAD Asks for the response identical to the one that would correspond to a GET request, but without the response body. This is useful for retrieving meta-information written in response headers, without having to transport the entire content.
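One detail worth knowing when using HEAD in a crawler: requests does not follow redirects for HEAD unless you pass allow_redirects=True, so the headers you get back may describe a redirect rather than the final resource. A minimal sketch of the idea follows. The function names parse_metadata and head_metadata and the 10 second default timeout are made up for illustration, and session is assumed to be a requests.Session (anything with a compatible head() method would work).

```python
from typing import Mapping, Optional, Tuple


def parse_metadata(headers: Mapping[str, str]) -> Tuple[Optional[str], Optional[int]]:
    """Extract (media type, size in bytes) from response headers.

    Hypothetical helper. With requests, response.headers is case-insensitive;
    with a plain dict the keys must match exactly.
    """
    raw_type = headers.get("Content-Type")
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = raw_type.split(";", 1)[0].strip().lower() if raw_type else None

    raw_length = headers.get("Content-Length")
    try:
        size = int(raw_length) if raw_length is not None else None
    except ValueError:
        # A malformed Content-Length is treated as unknown.
        size = None
    return media_type, size


def head_metadata(session, url, timeout=10):
    """Send a HEAD request and return (media type, size) for the final URL.

    `session` is assumed to be a requests.Session.
    """
    # HEAD does not follow redirects by default in requests, so ask for it.
    response = session.head(url, timeout=timeout, allow_redirects=True)
    return parse_metadata(response.headers)
```

With a real session this would be called as head_metadata(requests.Session(), url). Either value can come back as None, because servers are not required to send Content-Type or Content-Length.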

Sorry, my mistake, I should have read the documentation more carefully. Here is the answer: http://docs.python-requests.org/en/latest/user/advanced/#advanced (Body Content Workflow)

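import requests

TOO_LONG = 10 * 1024 * 1024  # example limit in bytes (assumed value)
tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'
# stream=True defers downloading the body until it is actually read
r = requests.get(tarball_url, stream=True)
if int(r.headers.get('content-length', 0)) > TOO_LONG:
    r.close()  # release the connection without reading the body
    # log request too long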
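The Content-Length check above only helps when the server actually sends that header. For responses without it (chunked transfers, for example), one option is to read the streamed body in chunks and give up once a limit is exceeded. This is a rough sketch, not the library's recommended recipe: read_capped is a hypothetical name, and response is assumed to be a requests.Response obtained with stream=True.

```python
def read_capped(response, max_bytes, chunk_size=8192):
    """Return the body as bytes, or None if it is larger than max_bytes.

    Hypothetical helper; `response` is assumed to come from
    requests.get(url, stream=True) or session.get(url, stream=True).
    """
    declared = response.headers.get("Content-Length")
    if declared is not None and declared.isdigit() and int(declared) > max_bytes:
        # The server announced a body that is too large: skip it entirely.
        response.close()
        return None

    chunks = []
    received = 0
    # iter_content pulls the body from the socket piece by piece.
    for chunk in response.iter_content(chunk_size=chunk_size):
        received += len(chunk)
        if received > max_bytes:
            # No (or an understated) Content-Length: stop as soon as the cap is hit.
            response.close()
            return None
        chunks.append(chunk)
    return b"".join(chunks)
```

A crawler could call this right after checking the content type, so that at most max_bytes plus one chunk is ever downloaded for an oversized file.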

Use requests.head() for this. It does not return the message body. If you are only interested in the headers, you should use the head method. See the linked page for details.

h = requests.head(some_link)
header = h.headers
content_type = header.get('content-type')
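Not every server handles HEAD properly; some answer with 405 Method Not Allowed or 501 Not Implemented. In that case a streamed GET that is closed before the body is read gives you the same headers. A small sketch under those assumptions (probe_headers is a hypothetical name, session is assumed to be a requests.Session, and the status codes checked are just the two common ones):

```python
def probe_headers(session, url, timeout=10):
    """Return the response headers for url without downloading the body.

    Hypothetical helper; `session` is assumed to be a requests.Session.
    """
    # Try the cheap request first; follow redirects so the headers
    # describe the final resource.
    response = session.head(url, timeout=timeout, allow_redirects=True)
    if response.status_code in (405, 501):
        # HEAD was rejected: fall back to a streamed GET and close it
        # before any of the body is read.
        response = session.get(url, timeout=timeout, stream=True)
        response.close()
    return response.headers
```

The headers stay available after close(), so the caller can still look up content-type and content-length on the returned mapping.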
