I want to scrape the movie titles from this page: https://www.imdb.com/list/ls055386972/. I wrote the following code:
import scrapy
from scrapy import Spider
from scrapy.http import Request
import re
import pymysql
import sys
import hashlib
from datetime import *
#import time
import csv
import os
import requests

class MoviesSpider(scrapy.Spider):
    name = 'movies'  # name of the spider
    allowed_domains = ['imdb.com/list/ls055386972/']
    start_urls = ['http://imdb.com/list/ls055386972//']

    def parse(self, response):
        #events = response.xpath('//*[@property="url"]/@href').extract()
        links = response.xpath('//h3[@class]/a/@href').extract()
        final_links = []
        for link in links:
            final_link = 'http://www.imdb.com' + link
            final_links.append(final_link)
        for final_link in final_links:
            absolute_url = response.urljoin(final_link)
            yield Request(absolute_url, callback=self.parse_movies)
        #process next page url
        #next_page_url = response.xpath('//a[text() = "Next"]/@href').extract_first()
        #absolute_next_page_url = response.urljoin(next_page_url)
        #yield Request(absolute_next_page_url)

    def parse_movies(self, response):
        title = response.xpath('//div[@class = "title_wrapper"]/h1[@class]/text()').extract_first()
        yield {
            'title': title,
        }
But it doesn't scrape anything. I get the following error messages:
2019-03-04 18:08:37 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://www.imdb.com/list/ls055386972//> (referer: None)
2019-03-04 18:08:37 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <404 https://www.imdb.com/list/ls055386972//>: HTTP status code is not handled or not allowed
2019-03-04 18:08:37 [scrapy.core.engine] INFO: Closing spider (finished)
Printing final_links yields the correct links to the individual movie pages:
[u'https://www.imdb.com/title/tt0068646/?ref_=ttls_li_tt', u'https://www.imdb.com/title/tt0108052/?ref_=ttls_li_tt', u'https://www.imdb.com/title/tt0050083/?ref_=ttls_li_tt', u'https://www.imdb.com/title/tt0118799/?ref_=ttls_li_tt', u'https://www.imdb.com/title/tt0060196/?ref_=ttls_li_tt',..........]
You are getting a 404 because your start URL is incorrect. You need to remove the trailing forward slash from start_urls. Also, your allowed_domains is incorrect: it should contain only the domain, not a partial URL. See the documentation.
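A minimal sketch of the two fixes (assuming the rest of the spider stays as posted). Note also that Scrapy's response.urljoin() already resolves relative hrefs against the response URL, so the manual 'http://www.imdb.com' + link prefixing in parse() is unnecessary; the plain urllib.parse.urljoin below shows the same resolution:

```python
from urllib.parse import urljoin

# Corrected class attributes for the spider:
allowed_domains = ['imdb.com']                           # domain only, no path
start_urls = ['https://www.imdb.com/list/ls055386972/']  # single trailing slash

# response.urljoin() delegates to urljoin() with the response URL as the
# base, so a relative href from the list page resolves on its own:
href = '/title/tt0068646/?ref_=ttls_li_tt'
print(urljoin(start_urls[0], href))
# https://www.imdb.com/title/tt0068646/?ref_=ttls_li_tt
```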