Browsing an NTLM-protected website with Python NTLM

Question

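I've been asked to write a script that logs in to a corporate portal, goes to a particular page, downloads it, compares it with an earlier version, and then emails a certain person depending on what changed. The later steps are relatively easy, but the first step is giving me a lot of trouble.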
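For reference, the "later steps" (compare with the previous version, then prepare an email) might look roughly like this. It is only a sketch using the standard library; the function names and the sender and recipient addresses are hypothetical, and actually sending the message (for example with smtplib) is left out:

```python
import difflib
from email.mime.text import MIMEText


def page_diff(old_text, new_text):
    """Return a unified diff of two page snapshots ('' if they are identical)."""
    diff = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="previous",
        tofile="current",
        lineterm="",
    )
    return "\n".join(diff)


def build_notification(old_text, new_text, sender, recipient):
    """Build an email describing the change, or return None if nothing changed.

    sender and recipient are placeholder addresses supplied by the caller.
    """
    body = page_diff(old_text, new_text)
    if not body:
        return None
    msg = MIMEText(body)
    msg["Subject"] = "Portal page changed"
    msg["From"] = sender
    msg["To"] = recipient
    return msg
```

The returned message object could then be handed to whatever mail transport is available.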

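I tried to connect with urllib2 (I'd like to do this in Python) and failed. After about four or five hours of searching online, I found that the reason I couldn't connect is that the page uses NTLM authentication. I've tried many different ways of connecting, all to no avail. Based on this NTLM example, I made the following attempt: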

import urllib2
from ntlm import HTTPNtlmAuthHandler

user = 'username'
password = "password"
url = "https://portal.whatever.com/"

passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, user, password)
# create the NTLM authentication handler
auth_NTLM = HTTPNtlmAuthHandler.HTTPNtlmAuthHandler(passman)

# create and install the opener
opener = urllib2.build_opener(auth_NTLM)
urllib2.install_opener(opener)

# create a header
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
header = { 'Connection' : 'Keep-alive', 'User-Agent' : user_agent}

response = urllib2.urlopen(urllib2.Request(url, None, header))

When I run this (with a real username, password, and URL), I get the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "ntlm2.py", line 21, in <module>
    response = urllib2.urlopen(urllib2.Request(url, None, header))
  File "C:\Python27\lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 400, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 432, in error
    result = self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 619, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\Python27\lib\urllib2.py", line 400, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 432, in error
    result = self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 372, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 619, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "C:\Python27\lib\urllib2.py", line 400, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 513, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 438, in error
     return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 372, in _call_chain
     result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 521, in http_error_default
     raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
  urllib2.HTTPError: HTTP Error 401: Unauthorized

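What interests me most is that the final line reports a 401 error. From what I've read, a 401 is the first message sent back to the client when NTLM authentication begins. I had assumed the purpose of python-ntlm was to handle the NTLM process for me. Is that understanding wrong, or am I using it incorrectly? Also, I'm not limited to Python; if there's an easier way to do this in another language, please let me know (from what I've found, there doesn't seem to be one). Thanks!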
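To make the handshake concrete: in NTLM over HTTP, the server's first 401 carries a bare `WWW-Authenticate: NTLM` header, the client replies with a Type 1 (negotiate) token, the server answers with a second 401 whose header carries a base64 Type 2 (challenge) token, and the client finishes with a Type 3 token. Every NTLM token starts with the signature `NTLMSSP\0` followed by a 4-byte little-endian message type. A minimal sketch, using only the standard library, for telling which kind of 401 came back; the names `ntlm_message_type` and `describe_401` are made up for illustration and are not part of python-ntlm:

```python
import base64
import struct

# Every NTLM message begins with this 8-byte signature.
NTLM_SIGNATURE = b"NTLMSSP\x00"


def ntlm_message_type(header_value):
    """Inspect one WWW-Authenticate / Authorization header value.

    Returns 1, 2 or 3 for an NTLM token of that type, 0 for a bare
    "NTLM" offer without a token, and None if the value is not NTLM.
    """
    parts = header_value.strip().split(None, 1)
    if not parts or parts[0].upper() != "NTLM":
        return None
    if len(parts) == 1:
        return 0
    try:
        raw = base64.b64decode(parts[1])
    except (TypeError, ValueError):
        return None
    if len(raw) < 12 or not raw.startswith(NTLM_SIGNATURE):
        return None
    # Bytes 8..11 hold the message type as a little-endian unsigned int.
    return struct.unpack("<I", raw[8:12])[0]


def describe_401(www_authenticate_values):
    """Classify a 401 response by its WWW-Authenticate header values."""
    types = [ntlm_message_type(v) for v in www_authenticate_values]
    if 2 in types:
        return "challenge"  # Type 2 received; the client must answer with Type 3
    if 0 in types:
        return "offer"      # NTLM is advertised, but no token was exchanged
    return "no-ntlm"        # the server is not asking for NTLM at all
```

With urllib2 on Python 2, the header values can probably be read from the caught `HTTPError`, for example with `e.info().getheaders('WWW-Authenticate')`. A result of "offer" on the final 401 would suggest either that the handshake never started for that URL or that the credentials were rejected, which may be worth checking given the two 302 redirects in the traceback.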
