Pagination with urllib and Scrapy

import datetime
import urllib.request
import urllib.error
import urllib.parse
import socket

import scrapy
from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader

from properties.items import PropertiesItem


class BasicSpider(scrapy.Spider):
    name = "manual"
    allowed_domains = ["web"]

    # Start on the first index page
    start_urls = (
        'http://scrapybook.s3.amazonaws.com/properties/index_00000.html',
    )

    def parse(self, response):
        # Get the next index URLs and yield Requests
        next_selector = response.xpath('//*[contains(@class,"next")]//@href')
        for url in next_selector.extract():
            yield Request(urllib.parse.urljoin(response.url, url))

        # Get item URLs and yield Requests
        item_selector = response.xpath('//*[@itemprop="url"]/@href')
        for url in item_selector.extract():
            yield Request(urllib.parse.urljoin(response.url, url),
                          callback=self.parse_item)

    def parse(self, response):
        l = ItemLoader(item=PropertiesItem(), response=response)
        l.add_xpath('title', '//*[@itemprop="name"]/text()')
        return l.load_item()

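2 Answers

User

#1 · Edited 2024-04-26 02:31:56

You appear to have two parse functions. Only the second one takes effect, because it overrides the first.

Just rename the second one to parse_item, which is the name the rest of your code already refers to.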
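The override described above can be reproduced without Scrapy. The following is a minimal sketch, assuming two stand-in classes (BrokenSpider and FixedSpider are hypothetical names, not part of the question's code) whose methods return strings instead of yielding requests:

```python
class BrokenSpider:
    def parse(self, response):
        return "index page"

    # Same name as above: this definition rebinds BrokenSpider.parse,
    # so the first method is no longer reachable.
    def parse(self, response):
        return "item page"


class FixedSpider:
    def parse(self, response):
        return "index page"

    # Renamed, so both methods now coexist on the class.
    def parse_item(self, response):
        return "item page"


broken_result = BrokenSpider().parse(None)
fixed_index = FixedSpider().parse(None)
fixed_item = FixedSpider().parse_item(None)

print(broken_result)  # item page
print(fixed_index)    # index page
print(fixed_item)     # item page
```

In the question's spider the effect is the same: Scrapy calls parse for the start URL, but the parse it finds is the item loader, so no pagination requests are ever yielded.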

User

#2 · Edited 2024-04-26 02:31:56

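Let's start with the incorrect use of Python packages.

Request is used without being imported. Fix it with:
from scrapy import Request

urljoin is reached through the long urllib.parse.urljoin path. Import the function first:
from urllib.parse import urljoin
Then call urljoin directly instead of urllib.parse.urljoin. Change it in these two lines:
yield Request(urllib.parse.urljoin(response.url, url))
yield Request(urllib.parse.urljoin(response.url, url), callback=self.parse_item)

parse_item is never called, because no method with that name exists. Fix the second method definition:
def parse(self, response):  # rename parse to parse_item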
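To see what the urljoin calls in the spider produce, here is a small standard-library sketch. The base URL is the start URL from the question; the two relative href values are assumptions made up for illustration, not values read from the actual page.

```python
from urllib.parse import urljoin

# Start URL taken from the question's spider.
base = "http://scrapybook.s3.amazonaws.com/properties/index_00000.html"

# Hypothetical href values, standing in for what the XPath queries extract.
next_href = "index_00001.html"
item_href = "property_000000.html"

# A relative href replaces the last path segment of the base URL.
next_url = urljoin(base, next_href)
item_url = urljoin(base, item_href)

# An href that is already absolute is returned unchanged.
absolute_url = urljoin(base, "http://example.com/other.html")

print(next_url)      # http://scrapybook.s3.amazonaws.com/properties/index_00001.html
print(item_url)      # http://scrapybook.s3.amazonaws.com/properties/property_000000.html
print(absolute_url)  # http://example.com/other.html
```

Inside the spider, the same call becomes urljoin(response.url, url), and its result is what gets passed to Request.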

PS: If this code comes from the Learning Scrapy book, here is a complete Git example for Python 3:

https://github.com/Rahulsharma0810/Scrapy-Pagination-URLJOIN-Example
