NTLM authentication for web scraping with Scrapy

10 votes · 2 answers · 2,773 views
Asked 2025-04-18 11:06

I am trying to scrape data from a website that requires a login.
I have logged in successfully using requests with HttpNtlmAuth, with the following code:

import requests
from requests_ntlm import HttpNtlmAuth

s = requests.Session()
url = "https://website.com/things"
response = s.get(url, auth=HttpNtlmAuth('DOMAIN\\USERNAME', 'PASSWORD'))

I would like to try Scrapy for this, but I have not managed to get the authentication working.

I found a middleware that looks like it should do the job, but I don't think I am using it correctly:

https://github.com/reimund/ntlm-middleware/blob/master/ntlmauth.py

In my settings.py file I have:

SPIDER_MIDDLEWARES = { 'test.ntlmauth.NtlmAuthMiddleware': 400, }

and in my spider class I have:

http_user = 'DOMAIN\\USER'
http_pass = 'PASS'

I have not been able to get this to work.
If anyone can point me in the right direction for scraping a site that requires NTLM authentication, I would greatly appreciate it.

2 Answers

5

Thanks to the comment from @SpaceDog above: I ran into a similar problem while trying to crawl an intranet website that uses NTLM authentication. The crawler would only see the first page, because the LinkExtractor inside the CrawlSpider never kicked in.

Here is the working solution I found using scrapy 1.0.5.

NTLM_Middleware.py

import requests
from requests_ntlm import HttpNtlmAuth
from scrapy.http import HtmlResponse

class NTLM_Middleware(object):

    def process_request(self, request, spider):
        # Fetch the page with requests + NTLM instead of Scrapy's downloader,
        # then wrap the result in an HtmlResponse so that CrawlSpider's
        # LinkExtractor can still follow the links it contains.
        url = request.url
        usr = getattr(spider, 'http_usr', '')
        pwd = getattr(spider, 'http_pass', '')
        s = requests.session()
        response = s.get(url, auth=HttpNtlmAuth(usr, pwd))
        # requests has already decompressed the body, which is why
        # HttpCompressionMiddleware is disabled in settings.py below.
        return HtmlResponse(url, status=response.status_code,
                            headers=dict(response.headers),
                            body=response.content)
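
One possible refinement (not part of the original answer): the middleware above opens a new requests session, and therefore performs a fresh NTLM handshake, for every page it fetches. A hypothetical variant that keeps a single authenticated session on the middleware instance could look like the sketch below; the class name and the Content-Type handling are assumptions of mine, not something taken from the answer.

import requests
from requests_ntlm import HttpNtlmAuth
from scrapy.http import HtmlResponse

class NTLMSessionMiddleware(object):
    # Hypothetical variant: reuse one requests.Session (and its NTLM
    # handshake) for the whole crawl instead of creating one per request.

    def __init__(self):
        self.session = None

    def process_request(self, request, spider):
        if self.session is None:
            self.session = requests.Session()
            self.session.auth = HttpNtlmAuth(
                getattr(spider, 'http_usr', ''),
                getattr(spider, 'http_pass', ''))
        resp = self.session.get(request.url)
        # Pass only the Content-Type header so Scrapy can pick the right
        # encoding; the body has already been decompressed by requests.
        return HtmlResponse(
            request.url,
            status=resp.status_code,
            headers={'Content-Type': resp.headers.get('Content-Type', 'text/html')},
            body=resp.content)

It would be registered in DOWNLOADER_MIDDLEWARES in the settings.py shown next, exactly like NTLM_Middleware.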

settings.py

import logging

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'scrapy intranet'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS=16


# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'intranet.NTLM_Middleware.NTLM_Middleware': 200,
    # Disabled because requests already returns a decompressed body.
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': None,
}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # ITEM_PIPELINES must be a dict, so the pipeline needs an order value.
    'scrapyelasticsearch.scrapyelasticsearch.ElasticSearchPipeline': 500,
}

ELASTICSEARCH_SERVER='localhost'
ELASTICSEARCH_PORT=9200
ELASTICSEARCH_USERNAME=''
ELASTICSEARCH_PASSWORD=''
ELASTICSEARCH_INDEX='intranet'
ELASTICSEARCH_TYPE='pages_intranet'
ELASTICSEARCH_UNIQ_KEY='url'
ELASTICSEARCH_LOG_LEVEL=logging.DEBUG

spiders/intranetspider.py

# -*- coding: utf-8 -*-
import scrapy

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from bs4 import BeautifulSoup

class PageItem(scrapy.Item):
    body = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()

class IntranetspiderSpider(CrawlSpider):
    # Credentials read by NTLM_Middleware via getattr(spider, ...)
    http_usr = 'DOMAIN\\user'
    http_pass = 'pass'
    name = "intranetspider"
    protocol = 'https://'
    allowed_domains = ['intranet.mydomain.ca']
    start_urls = ['https://intranet.mydomain.ca/']
    rules = (Rule(LinkExtractor(), callback="parse_items", follow=True),)

    def parse_items(self, response):
        self.logger.info('Crawling page %s', response.url)
        item = PageItem()

        soup = BeautifulSoup(response.body, 'html.parser')

        # Remove script tags and JavaScript from the content
        for script in soup.findAll('script'):
            script.extract()

        item['body'] = soup.get_text(" ", strip=True)
        # Also fill the title field declared on PageItem
        item['title'] = soup.title.string if soup.title else ''
        item['url'] = response.url

        return item
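
For completeness, here is one way to run this spider from a script rather than via the scrapy crawl command; the import path intranet.spiders.intranetspider is an assumption based on the project layout implied above.

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Assumed import path; adjust to wherever IntranetspiderSpider actually lives.
from intranet.spiders.intranetspider import IntranetspiderSpider

process = CrawlerProcess(get_project_settings())  # picks up settings.py above
process.crawl(IntranetspiderSpider)
process.start()  # blocks until the crawl is finished
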
6

I finally figured out what was going on.

1: This is called a "downloader middleware", not a "spider middleware".

DOWNLOADER_MIDDLEWARES = { 'test.ntlmauth.NTLM_Middleware': 400, }

2: The middleware I was trying to use needed significant modification. Here is the version that works for me:

from scrapy.http import Response
import requests
from requests_ntlm import HttpNtlmAuth

class NTLM_Middleware(object):

    def process_request(self, request, spider):
        # Fetch the page with requests + NTLM and hand it back to Scrapy.
        url = request.url
        pwd = getattr(spider, 'http_pass', '')
        usr = getattr(spider, 'http_user', '')
        s = requests.session()
        response = s.get(url, auth=HttpNtlmAuth(usr, pwd))
        return Response(url, response.status_code, {}, response.content)

In the spider itself, all you need to set are these two variables:

http_user = 'DOMAIN\\USER'
http_pass = 'PASS'
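
For context, here is a minimal sketch of what that looks like inside a spider class; the spider name, start URL, and parse logging are placeholders of mine, and only the two http_* attributes matter to the middleware.

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'                            # placeholder name
    start_urls = ['https://website.com/things']  # placeholder URL

    # Read by NTLM_Middleware via getattr(spider, 'http_user'/'http_pass')
    http_user = 'DOMAIN\\USER'
    http_pass = 'PASS'

    def parse(self, response):
        # The response body was fetched through the NTLM middleware.
        self.logger.info('Got %d bytes from %s', len(response.body), response.url)

With DOWNLOADER_MIDDLEWARES configured as shown above, every request this spider makes goes through the NTLM-authenticated requests session.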
