urllib2 loses cookies on a 302 redirect

Asked 2025-04-16 15:04

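I am using urllib2 with CookieJar and HTTPCookieProcessor to simulate logging in to a page so that I can automate a file upload.

I have seen some related questions and answers, but none of them solves my problem. I lose my cookie when I simulate the login, which ends in a 302 redirect. The 302 response is where the server sets the cookie, but urllib2's HTTPCookieProcessor does not seem to save the cookie during the redirect. I tried creating an HTTPRedirectHandler class to ignore the redirect, but that did not work. I also tried referencing the CookieJar globally so that I could handle the cookie from the HTTPRedirectHandler, but (1) it did not work, because I was handling the headers of the redirect and the CookieJar function I was using, extract_cookies, needs a full request, and (2) it is an ugly way to handle it.

I could use some guidance, as I am still fairly new to Python. I think I am mostly heading in the right direction, but I may be focusing on the wrong thing.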

import cookielib
import urllib2

cj = cookielib.CookieJar()
cookieprocessor = urllib2.HTTPCookieProcessor(cj)


class MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):
  def http_error_302(self, req, fp, code, msg, headers):
    global cj
    cookie = headers.get("set-cookie")
    if cookie:
      # Doesn't work (extract_cookies expects a response object, not the headers), but you get the idea
      cj.extract_cookies(headers, req)

    return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)

  http_error_301 = http_error_303 = http_error_307 = http_error_302

# Oh yeah.  I'm using a proxy too, to follow traffic.
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8888'})
opener = urllib2.build_opener(MyHTTPRedirectHandler, cookieprocessor, proxy)
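A note on the extract_cookies call in the handler above: its signature is extract_cookies(response, request), and it reads the headers by calling response.info(), so a bare headers object is not enough. The sketch below shows this in isolation. It is written for Python 3, where cookielib became http.cookiejar and urllib2 became urllib.request, and FakeResponse, cookies_from_response, the example.com URL and the cookie value are all made up for illustration. Inside http_error_302 the fp argument should already be such a response object, so cj.extract_cookies(fp, req) may be closer to what the handler intended, though that is untested against the real site.

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Hypothetical stand-in for a response: all CookieJar needs is info()."""

    def __init__(self, set_cookie):
        self._msg = email.message.Message()
        self._msg["Set-Cookie"] = set_cookie

    def info(self):
        # extract_cookies() calls this to get the response headers
        return self._msg


def cookies_from_response(url, set_cookie):
    """Feed one Set-Cookie header through a CookieJar and return name -> value."""
    jar = http.cookiejar.CookieJar()
    request = urllib.request.Request(url)
    # Argument order is (response, request), not (headers, request)
    jar.extract_cookies(FakeResponse(set_cookie), request)
    return {cookie.name: cookie.value for cookie in jar}
```

Calling cookies_from_response("http://example.com/login", "session=abc123; Path=/") stores the cookie in the jar, which is the step the handler was trying to perform by hand.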

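Addendum: I also tried mechanize, without success. This is probably a new question, but I am raising it here because the goal is the same:

This simple mechanize code fails when used on a URL that returns a 302 (http://fxfeeds.mozilla.com/firefox/headlines.xml). Note that the same thing happens without set_handle_robots(False); I just wanted to make sure that was not the cause: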

import urllib2, mechanize

browser = mechanize.Browser()
browser.set_handle_robots(False)
opener = mechanize.build_opener(*(browser.handlers))
r = opener.open("http://fxfeeds.mozilla.com/firefox/headlines.xml")

Output:

Traceback (most recent call last):
  File "redirecttester.py", line 6, in <module>
    r = opener.open("http://fxfeeds.mozilla.com/firefox/headlines.xml")
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_opener.py", line 204, in open
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_urllib2_fork.py", line 457, in http_response
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_opener.py", line 221, in error
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_urllib2_fork.py", line 332, in _call_chain
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_urllib2_fork.py", line 571, in http_error_302
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_opener.py", line 188, in open
  File "build/bdist.macosx-10.6-universal/egg/mechanize/_mechanize.py", line 71, in http_request
AttributeError: OpenerDirector instance has no attribute '_add_referer_header'

Any ideas?


4 Answers

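I just got the code below working, at least when trying to read Atom content from this URL: http://www.fudzilla.com/home?format=feed&type=atom

I can't guarantee that the snippet below runs as is, but it may give you an idea of where to start: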

import cookielib
import urllib2

def open_with_cookies(request):
    # Hypothetical wrapper name: the original snippet was the body of a larger function
    cookie_jar = cookielib.LWPCookieJar()
    cookie_handler = urllib2.HTTPCookieProcessor(cookie_jar)
    handlers = [cookie_handler]  # plus others; we have proxy and progress handlers
    # See http://code.google.com/p/feedparser/source/browse/trunk/feedparser/feedparser.py#2848
    # for the implementation of _FeedURLHandler
    opener = urllib2.build_opener(*(handlers + [_FeedURLHandler()]))
    opener.addheaders = []  # may not be needed, but see the comments around the link referred to below
    try:
        # See http://code.google.com/p/feedparser/source/browse/trunk/feedparser/feedparser.py#2954
        # for how request is built
        return opener.open(request)
    finally:
        opener.close()

Answered 2025-04-16 by Python大师


It depends on how the redirect is implemented. If it is done through an HTTP Refresh, mechanize has an HTTPRefreshProcessor you can use. Try creating an opener like this:

import mechanize

cj = mechanize.CookieJar()
opener = mechanize.build_opener(
    mechanize.HTTPCookieProcessor(cj),
    mechanize.HTTPRefererProcessor,
    mechanize.HTTPEquivProcessor,
    mechanize.HTTPRefreshProcessor)
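To tell the two kinds of redirect apart when looking at captured traffic, something like the sketch below may help. It is plain Python 3 and does not use mechanize; parse_refresh and describe_redirect are hypothetical helper names, and the headers are assumed to be a plain dict with exactly "Location" and "Refresh" as keys, which real header objects do not guarantee.

```python
def parse_refresh(value):
    """Split a Refresh value such as '5; url=http://example.com/' into (delay, url)."""
    delay_part, _, rest = value.partition(";")
    delay = float(delay_part.strip())
    rest = rest.strip()
    url = None
    if rest[:4].lower() == "url=":
        # Drop the 'url=' prefix and any surrounding quotes
        url = rest[4:].strip().strip("'\"")
    return delay, url


def describe_redirect(status, headers):
    """Return ('http', target), ('refresh', target) or (None, None)."""
    if status in (301, 302, 303, 307) and "Location" in headers:
        return ("http", headers["Location"])
    if "Refresh" in headers:
        return ("refresh", parse_refresh(headers["Refresh"])[1])
    return (None, None)
```

A 302 with a Location header is the kind of redirect the question describes; a 200 response carrying a Refresh header is the case HTTPRefreshProcessor is meant for.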

Answered 2025-04-16 by Python大师


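I recently ran into the same problem, but to save time I gave up on it and switched to mechanize. It can be used as a drop-in replacement for urllib2, and it behaves the way you would expect a browser to, handling Referer headers, redirects, and cookies.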

import mechanize
cj = mechanize.CookieJar()
browser = mechanize.Browser()
browser.set_cookiejar(cj)
browser.set_proxies({'http': '127.0.0.1:8888'})

# Use browser's handlers to create a new opener
opener = mechanize.build_opener(*browser.handlers)

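The Browser object itself can also be used to open pages (through its .open() method). It keeps state internally and returns a response object for every call, so you have a lot of flexibility.

Also, if you don't need to inspect the cookiejar manually or pass it to something else, you can skip creating and assigning that object.

I know this does not really solve the problem, or explain why urllib2 cannot provide this out of the box or at least without a lot of tweaking, but if you are short on time and just want it to work, use mechanize.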
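For anyone who wants to check what the stock handlers do without touching the real site, here is a minimal sketch. It is written for Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar, so it is not the asker's exact setup. LoginHandler, fetch_through_redirect, the /login and /home paths and the session=abc123 cookie are all invented for the test. A throwaway local server sets a cookie on a 302, and the page behind the redirect echoes back whatever Cookie header it received.

```python
import http.cookiejar
import http.server
import threading
import urllib.request


class LoginHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical test server: /login sets a cookie on a 302, anything else echoes it."""

    def do_GET(self):
        if self.path == "/login":
            self.send_response(302)
            self.send_header("Set-Cookie", "session=abc123; Path=/")
            self.send_header("Location", "/home")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            # Echo the Cookie header so the client can see what it sent
            body = (self.headers.get("Cookie") or "").encode("ascii")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def fetch_through_redirect():
    """Open /login, follow the 302, and return (echoed Cookie header, cookie names in the jar)."""
    server = http.server.HTTPServer(("127.0.0.1", 0), LoginHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        jar = http.cookiejar.CookieJar()
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),  # bypass any proxy so the local server is reachable
            urllib.request.HTTPCookieProcessor(jar),
        )
        url = "http://127.0.0.1:%d/login" % server.server_address[1]
        with opener.open(url, timeout=5) as response:
            echoed = response.read().decode("ascii")
        return echoed, [cookie.name for cookie in jar]
    finally:
        server.shutdown()
        server.server_close()
```

If the echoed value contains the cookie, the default cookie processor is picking it up on the redirect, and the loss on the real site probably comes from somewhere else, such as a domain or path mismatch or a custom handler getting in the way. That is a hint for where to look, not a diagnosis.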

Answered 2025-04-16 by Python大师
