CrawlerRunner() does not go through Scrapy's pipeline file

Published 2024-06-16 09:05:25


I am trying to call a Scrapy spider from a Django views.py file. The spider does get called, but its output only shows up in the command prompt and is not saved to the Django model for rendering on the page. I ran the spider separately to verify that Scrapy and Django are connected and working, but when it runs automatically through CrawlerRunner() the data is not saved, so some component must be missing from the CrawlerRunner() setup in the Django views.py file. Below is the Django views.py file that calls the spider:

from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_http_methods
from django.shortcuts import redirect

@csrf_exempt
@require_http_methods(['POST', 'GET'])
def scrape(request):
    import sys
    from newscrawler.spiders import news_spider
    from newscrawler.pipelines import NewscrawlerPipeline
    from scrapy import signals
    from twisted.internet import reactor
    from scrapy.crawler import Crawler, CrawlerRunner
    from scrapy.settings import Settings
    from scrapy.utils.project import get_project_settings
    from scrapy.utils.log import configure_logging
    from crochet import setup

    setup()
    configure_logging()

    runner = CrawlerRunner(get_project_settings())
    d = runner.crawl(news_spider.NewsSpider)

    return redirect("../getnews/")

My spider crawls a news website and saves the URL, image and title of the top news stories. But instead of saving those three fields through the Django model, the output is only printed in cmd. Can anyone help?

The items file from the Scrapy project

import scrapy
from scrapy_djangoitem import DjangoItem

import sys

import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'News_Aggregator.settings'

from news.models import Headline

class NewscrawlerItem(DjangoItem):
    # define the fields for your item here like:
    django_model = Headline
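For context, a parse callback in the spider would populate this DjangoItem along the following lines. This is only a sketch: the field names url, image and title are inferred from the question's description, and the CSS selectors are hypothetical placeholders, not taken from the actual NewsSpider.

```python
# Hypothetical parse callback inside NewsSpider; the selectors below are
# placeholders, not the real ones from the project.
def parse(self, response):
    for article in response.css("article.top-news"):
        item = NewscrawlerItem()
        item["title"] = article.css("h2::text").get()
        item["url"] = article.css("a::attr(href)").get()
        item["image"] = article.css("img::attr(src)").get()
        yield item  # each yielded item is handed to the enabled pipelines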

The pipeline file

class NewscrawlerPipeline(object):
    def process_item(self, item, spider):
        print("10000000000000000")  # debug marker: confirms the pipeline is reached
        item.save()
        return item
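The pipeline only runs if it is registered in the Scrapy project settings that CrawlerRunner loads. A minimal sketch of the relevant entry in newscrawler/settings.py (the priority value 300 is a conventional default, an assumption, not shown in the post):

```python
# newscrawler/settings.py (sketch) -- the settings module that
# Settings.setmodule() / get_project_settings() must be able to load.
BOT_NAME = "newscrawler"
SPIDER_MODULES = ["newscrawler.spiders"]

# Without this mapping, yielded items never reach
# NewscrawlerPipeline.process_item, so nothing is saved to the Django model.
ITEM_PIPELINES = {
    "newscrawler.pipelines.NewscrawlerPipeline": 300,  # lower number = runs earlier
}
```

If CrawlerRunner is constructed without these settings, Scrapy falls back to defaults in which no pipelines are enabled, which matches the symptom described above.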

1 Answer

Posted 2024-06-16 09:05:25

I found that CrawlerRunner could not access my Scrapy project's settings file, which is what enables Scrapy's pipelines.py and lets it save the data into the Django model. The modified code for the Django views.py file (which calls the spider) is:

import os
import sys
import time
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_http_methods
from django.shortcuts import redirect
from news.models import Headline
from newscrawler.spiders import news_spider
from newscrawler.pipelines import NewscrawlerPipeline
from scrapy import signals
from twisted.internet import reactor
from scrapy.crawler import Crawler, CrawlerRunner
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from newscrawler import settings as my_settings
from scrapy.utils.log import configure_logging
from crochet import setup

@csrf_exempt
@require_http_methods(['POST', 'GET'])
def scrape(request):
    Headline.objects.all().delete()
    crawler_settings = Settings()

    setup()
    configure_logging()
    crawler_settings.setmodule(my_settings)  # load the project settings, including ITEM_PIPELINES
    runner = CrawlerRunner(settings=crawler_settings)
    d = runner.crawl(news_spider.NewsSpider)
    time.sleep(8)  # crude wait for the crawl to finish before redirecting
    return redirect("../getnews/")

Hope this helps anyone who wants to call a Scrapy spider from a Django views.py file and save the scraped data to a Django model. Thanks.
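As a side note, the time.sleep(8) above is a fragile way to wait for the crawl: it blocks for a fixed time whether the crawl is done or not. crochet's wait_for decorator can instead block until the Deferred returned by runner.crawl() fires. A sketch under the same project layout (it assumes crochet and the Scrapy project are installed, so it is not runnable on its own):

```python
from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings
from newscrawler import settings as my_settings
from newscrawler.spiders import news_spider

setup()  # install crochet's reactor management once, at import time

@wait_for(timeout=60.0)  # block the calling Django thread until the crawl finishes
def run_spider():
    crawler_settings = Settings()
    crawler_settings.setmodule(my_settings)  # load ITEM_PIPELINES etc.
    runner = CrawlerRunner(settings=crawler_settings)
    return runner.crawl(news_spider.NewsSpider)  # returns a Deferred

# In the view: call run_spider(), then redirect -- no arbitrary sleep needed.
```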
