Python: Best algorithm for avoiding downloads of unchanged pages in a crawler

1 vote

2 answers

737 views

数据工程师

Asked 2025-04-17 03:24

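I'm writing a crawler that periodically checks a list of news sites for new articles.

I've read about several ways to avoid downloading pages unnecessarily, and basically found five header elements that can help determine whether a page has changed:

HTTP status
ETag
Last-Modified (can be combined with an If-Modified-Since request)
Expires
Content-Length

The excellent FeedParser.org seems to implement some of these approaches.

I'm looking for the best code in Python (or any similar language) that makes this kind of decision.

It might look something like this: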

def shouldDownload(url, prev_etag, prev_lastmod, prev_expires, prev_content_length):
    # retrieve the headers, do the magic here and return the decision
    decision = True  # placeholder: download unless the page is known to be unchanged
    return decision

data-scraping expires crawler content-length etag http-headers last-modified web-change-detection

2 Answers

You need to pass a dictionary of headers (or the result of urlopen) to shouldDownload:

def shouldDownload(url, headers, prev_etag, prev_lastmod, prev_expires, prev_content_length):
    # The server sends Last-Modified; If-Modified-Since is a request header.
    return (prev_content_length != headers.get("Content-Length")
            or prev_lastmod != headers.get("Last-Modified")
            or prev_expires != headers.get("Expires")
            or prev_etag != headers.get("ETag"))
    # or the optimistic way (true when nothing changed, so negate it):
    # return not (prev_content_length == headers.get("Content-Length") and prev_lastmod == headers.get("Last-Modified") and prev_expires == headers.get("Expires") and prev_etag == headers.get("ETag"))

Do this when you open the URL:

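# My urllib is a little fuzzy, but I believe `urlopen()` doesn't
#  read the whole body until `.read()` is called, and you can still
#  get the headers with `.headers`.  Worst case is you may have to
#  `read(50)` or so to get them.
# (urllib2 in Python 2 became urllib.request in Python 3.)
from urllib.request import urlopen

# MYURL and the prev_* values are whatever you stored on the previous visit.
s = urlopen(MYURL)
try:
    if shouldDownload(MYURL, s.headers, prev_etag, prev_lastmod, prev_expires, prev_content_length):
        source = s.read()
        # do stuff with source
    # otherwise skip this URL (use `continue` here if this runs inside a loop)
# except HTTPError, etc if you need to
finally:
    s.close()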
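
The prev_* arguments to shouldDownload have to come from an earlier visit. Here is a minimal sketch of one way to capture them, assuming a hypothetical helper called extract_validators (it is not part of urllib) and a headers object with a `.get()` method, such as the `.headers` attribute of a urlopen() response:

```python
from email.message import Message


def extract_validators(headers):
    """Collect the header values worth remembering for the next visit.

    `headers` is anything with a `.get()` method. Missing headers come
    back as None, which simply never matches on the next comparison.
    """
    return {
        "etag": headers.get("ETag"),
        "last_modified": headers.get("Last-Modified"),
        "expires": headers.get("Expires"),
        "content_length": headers.get("Content-Length"),
    }


# Hand-built header set for illustration. email.message.Message lookups are
# case-insensitive, like the headers on a urlopen() response.
example = Message()
example["ETag"] = '"abc123"'
example["Last-Modified"] = "Thu, 17 Apr 2025 03:24:00 GMT"
saved = extract_validators(example)
```

The resulting dictionary can be stored per URL (in a file or database of your choosing) and passed back in on the next crawl. Note that a plain dict, unlike the response headers object, is case-sensitive.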

Answered 2025-04-17 by Python大师


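The only thing you need to check before making a request is Expires. If-Modified-Since is not something the server sends you; it is something you send the server.

What you want to do is send an HTTP GET with an If-Modified-Since header indicating when you last retrieved the resource. If you get back status code 304 instead of the usual 200, the resource has not been modified since then, and you should use your stored copy (the server will not send a new one).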
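
A minimal sketch of such a conditional GET with Python 3's urllib.request follows. The function names and the `opener` parameter are assumptions made for illustration, not part of any library. The sketch also sends If-None-Match, the request header that pairs with a stored ETag, and relies on urlopen() raising HTTPError for a 304 response.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def build_conditional_request(url, last_modified=None, etag=None):
    """Build a GET request carrying the validators saved from the last fetch."""
    request = Request(url)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    if etag:
        request.add_header("If-None-Match", etag)
    return request


def fetch_if_changed(url, last_modified=None, etag=None, opener=urlopen):
    """Return (changed, body, headers); body and headers are None on a 304.

    `opener` is a hypothetical hook so the network call can be swapped out.
    """
    request = build_conditional_request(url, last_modified, etag)
    try:
        response = opener(request)
    except HTTPError as err:
        if err.code == 304:  # Not Modified: keep using the stored copy
            return False, None, None
        raise  # any other HTTP error is the caller's problem
    try:
        return True, response.read(), response.headers
    finally:
        response.close()
```

When `changed` is true, the returned headers hold the new Last-Modified and ETag values to store for the next visit.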

Also, you should keep the Expires header from the last time you retrieved the document, and not issue the GET at all if your stored copy has not yet expired.
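
The Expires check can be sketched as below, assuming the header value was saved as the raw string the server sent. The name is_still_fresh and its `now` parameter are hypothetical. Values that cannot be parsed as an HTTP date (some servers send "0") are treated as already expired.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_still_fresh(expires_header, now=None):
    """Return True if the stored Expires value lies in the future."""
    if not expires_header:
        return False  # no Expires header was saved: ask the server
    try:
        expires_at = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        return False  # unparsable values such as "0" count as expired
    if expires_at.tzinfo is None:
        # HTTP dates are in GMT, so treat a naive result as UTC.
        expires_at = expires_at.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now < expires_at
```

A crawler loop would call this first and only send the GET when it returns False.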

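Translating this into Python is left as an exercise, but adding an If-Modified-Since header to a request, storing the Expires header from the response, and checking the response's status code are all straightforward.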

Answered 2025-04-17 by Python大师
