Running Scrapy from a Python script
I have been trying to run Scrapy from a Python script, because I need to fetch the data and save it to my database. When I run it with the Scrapy command,
scrapy crawl argos
everything works fine, but when I try to run it from a script by following this link,
http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script
I get this error:
$ python pricewatch/pricewatch.py update
Traceback (most recent call last):
  File "pricewatch/pricewatch.py", line 39, in <module>
    main()
  File "pricewatch/pricewatch.py", line 31, in main
    update()
  File "pricewatch/pricewatch.py", line 24, in update
    setup_crawler("argos.co.uk")
  File "pricewatch/pricewatch.py", line 13, in setup_crawler
    settings = get_project_settings()
  File "/Library/Python/2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/utils/project.py", line 58, in get_project_settings
    settings_module = import_module(settings_module_path)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named settings
I don't understand why get_project_settings() fails here while running the spider with the scrapy command from the terminal works fine.
Here is a screenshot of my project.
Here is the code of pricewatch.py:
import commands
import sys
from database import DBInstance
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log
from spiders.argosspider import ArgosSpider
from scrapy.utils.project import get_project_settings
import settings

def setup_crawler(domain):
    spider = ArgosSpider(domain=domain)
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

def update():
    #print "Enter a product to update:"
    #product = raw_input()
    #print product
    #db = DBInstance()
    setup_crawler("argos.co.uk")
    log.start()
    reactor.run()

def main():
    try:
        if sys.argv[1] == "update":
            update()
        elif sys.argv[1] == "database":
            pass  # not implemented yet
            #db = DBInstance()
    except IndexError:
        print "You must select a command from Update, Search, History"

if __name__ == '__main__':
    main()
2 Answers
0
Most of this answer is copied from this answer, which I believe answers your question and also provides a decent example.
Consider a project with the following structure.
my_project/
    main.py              # Where we are running scrapy from
    scraper/
        run_scraper.py   # Call from main goes here
        scrapy.cfg       # deploy configuration file
        scraper/         # project's Python module, you'll import your code from here
            __init__.py
            items.py     # project items definition file
            pipelines.py # project pipelines file
            settings.py  # project settings file
            spiders/     # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py  # Contains the QuotesSpider class
Basically, the command scrapy startproject scraper was executed in my project folder (my_project). I then added a run_scraper.py file to the outer scraper folder, a main.py file to the root folder, and a quotes_spider.py file to the spiders folder.
My main file is:
from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()
My run_scraper.py file is:
from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os


class Scraper:
    def __init__(self):
        settings_file_path = 'scraper.scraper.settings'  # The path seen from root, i.e. from main.py
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider  # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
Also note that your settings may need a look, since the paths need to be set relative to the root folder (my_project, not scraper). So in my case:
SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'
and so on...
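To see why the dotted paths must start from the root folder, here is a minimal stdlib-only sketch (no Scrapy needed) that recreates the layout above in a temporary directory. Once the root is on sys.path — which is effectively what running main.py from the root does — the module path `scraper.scraper.spiders` resolves:

```python
import importlib
import os
import sys
import tempfile

# Recreate the layout above: my_project/scraper/scraper/spiders/
root = tempfile.mkdtemp()
spiders_dir = os.path.join(root, "scraper", "scraper", "spiders")
os.makedirs(spiders_dir)
for d in (os.path.join(root, "scraper"),
          os.path.join(root, "scraper", "scraper"),
          spiders_dir):
    open(os.path.join(d, "__init__.py"), "w").close()

# main.py lives in the root, so the root is on sys.path when it runs;
# dotted module paths therefore have to start from the root.
sys.path.insert(0, root)
mod = importlib.import_module("scraper.scraper.spiders")
print(mod.__name__)  # scraper.scraper.spiders
```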
2
I have solved the problem myself: just put the pricewatch.py file in the project's top-level directory and run it from there, and it works.
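This works because Python prepends the directory containing the script being run to sys.path. When pricewatch.py sits in the top-level directory next to settings.py, `import settings` (which get_project_settings() triggers) succeeds. A stdlib-only sketch of that behaviour, using throwaway file names for illustration:

```python
import os
import subprocess
import sys
import tempfile

# A tiny project: top/settings.py and top/show_path.py side by side
top = tempfile.mkdtemp()
with open(os.path.join(top, "settings.py"), "w") as f:
    f.write("BOT_NAME = 'demo'\n")
with open(os.path.join(top, "show_path.py"), "w") as f:
    f.write("import settings\nprint(settings.BOT_NAME)\n")

# Python puts the script's own directory first on sys.path, so a
# script living next to settings.py can always 'import settings'.
out = subprocess.check_output([sys.executable,
                               os.path.join(top, "show_path.py")])
print(out.decode().strip())  # demo
```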