Scrapy crawler not working from the home page

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from .. import items


class GinakSpider(CrawlSpider):
    name = "ginak"
    start_urls = [
        "http://www.shop.ginakdesigns.com/main.sc"
    ]
    rules = [Rule(SgmlLinkExtractor(allow=[r'category\.sc\?categoryId=\d+'])),
             Rule(SgmlLinkExtractor(allow=[r'product\.sc\?productId=\d+&categoryId=\d+']), callback='parse_item')]

    def parse_item(self, response):
        sel = Selector(response)
        self.log(response.url)
        item = items.GinakItem()
        item['name'] = sel.xpath('//*[@id="wrapper2"]/div/div/div[1]/div/div/div[2]/div/div/div[1]/div[1]/h2/text()').extract()
        item['price'] = sel.xpath('//*[@id="listPrice"]/text()').extract()
        item['description'] = sel.xpath('//*[@id="wrapper2"]/div/div/div[1]/div/div/div[2]/div/div/div[1]/div[4]/div/p/text()').extract()
        item['category'] = sel.xpath('//*[@id="breadcrumbs"]/a[2]/text()').extract()
        return item

1 Answer

Anonymous user

#1 · Posted 2024-04-20 13:19:46

The problem is that a jsessionid is inserted into the links you are trying to extract, for example:

<a href="/category.sc;jsessionid=EA2CAA7A3949F4E462BBF466E03755B7.m1plqscsfapp05?categoryId=16">

Fix it by using a non-greedy .*? match for any characters, instead of expecting \? right after the file name:

rules = [Rule(SgmlLinkExtractor(allow=[r'category\.sc.*?categoryId=\d+']), callback='parse_item'),
         Rule(SgmlLinkExtractor(allow=[r'product\.sc.*?productId=\d+&categoryId=\d+']), callback='parse_item')]
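To see why the original pattern misses these links, here is a minimal sketch using only Python's re module. It assumes the link extractor applies each allow pattern as a regex search against the link URL. The helper name matches and the session-free URL are made up for illustration; the session URL is the href quoted above.

```python
import re

# Pattern from the question: "?" must come right after "category.sc".
strict = re.compile(r'category\.sc\?categoryId=\d+')
# Pattern from the answer: any characters (non-greedy) may sit in between.
relaxed = re.compile(r'category\.sc.*?categoryId=\d+')

# href quoted in the answer, with the jsessionid segment before the query string.
session_url = ("/category.sc;jsessionid=EA2CAA7A3949F4E462BBF466E03755B7"
               ".m1plqscsfapp05?categoryId=16")
# Hypothetical link without a session id, for comparison.
plain_url = "/category.sc?categoryId=16"


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


strict_on_session = matches(strict, session_url)
relaxed_on_session = matches(relaxed, session_url)
strict_on_plain = matches(strict, plain_url)
relaxed_on_plain = matches(relaxed, plain_url)

print("strict  / session url:", strict_on_session)
print("relaxed / session url:", relaxed_on_session)
print("strict  / plain url:  ", strict_on_plain)
print("relaxed / plain url:  ", relaxed_on_plain)
```

The strict pattern only matches the link without a session id, while the relaxed one matches both, so the relaxed pattern should keep working whether or not the server appends a jsessionid.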

Hope that helps.
