CrawlSpider doesn't work

Posted 2024-06-16 18:21:48


It isn't collecting data from the titles. I did it the same way as in the example, but it still doesn't work. Here is my code:

toster.py:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from toster.items import DjangoItem


class DjangoSpider(CrawlSpider):
    name = "django"
    allowed_domains = ["www.toster.ru"]
    start_urls = [
        'http://www.toster.ru/tag/django/questions',
    ]

    rules = [
        Rule(LinkExtractor(
            allow=['/tag/django/questions\?page=\d']),
            callback='parse_item',
            follow=True)
    ]

    def parse_item(self, response):
        selector_list = response.css('div.thing')

        for selector in selector_list:
            item = DjangoItem()
            item['title'] = selector.xpath('div/h2/a/text()').extract()

            yield item

Any help?


1 Answer
Forum user
#1 · Posted on 2024-06-16 18:21:48

There are several problems in the code:

  • remove www. from allowed_domains
  • fix the regular expression in the link extractor - replace \d with \d+
  • set unique=False so that Scrapy follows the pagination pages
  • fix the extraction logic in parse_item() - for instance, there are no elements with a thing class on these pages

Fixed version (works for me):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from toster.items import DjangoItem


class DjangoSpider(CrawlSpider):
    name = "django"
    allowed_domains = ["toster.ru"]  # no "www." here
    start_urls = [
        'http://www.toster.ru/tag/django/questions',
    ]

    rules = [
        Rule(LinkExtractor(
            # \d+ matches multi-digit page numbers; unique=False lets
            # Scrapy follow the pagination pages
            allow=['/tag/django/questions\?page=\d+'], unique=False),
            callback='parse_item',
            follow=True)
    ]

    def parse_item(self, response):
        # each question block is a div with the "question__content" class
        selector_list = response.css('div.question__content')

        for selector in selector_list:
            item = DjangoItem()
            item['title'] = selector.css('a.question__title-link[itemprop=url]::text').extract_first().strip()

            yield item
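
The question doesn't show toster/items.py, which DjangoItem is imported from. A minimal sketch of what it might look like, assuming the item only needs the single title field the spider populates:

# toster/items.py -- hypothetical item definition for this spider
import scrapy


class DjangoItem(scrapy.Item):
    title = scrapy.Field()

With that in place, the spider can be run as usual, for example with scrapy crawl django -o titles.json to dump the collected titles to a JSON file.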
