Scrapinghub MySQL Pipeline

Published 2024-04-18 14:06:20


I'm trying to create a Scrapy pipeline that exports the scraped data to a MySQL database. I've already written the script (pipelines.py):

from datetime import datetime
from hashlib import md5
from scrapy import log
from scrapy.exceptions import DropItem
from twisted.enterprise import adbapi


class mySQLStorePipeline(object):

    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbargs = dict(
            host=settings['HIDDEN'],
            db=settings['parsedjobs'],
            user=settings['scrapinghub'],
            passwd=settings['HIDDEN'],
            charset='utf8',
            use_unicode=True,
        )
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)
        return cls(dbpool)

    def process_item(self, item, spider):
        # Run the upsert in a pool thread and always return the item
        # so any later pipelines still receive it.
        d = self.dbpool.runInteraction(self._do_upsert, item, spider)
        d.addErrback(self._handle_error, item, spider)
        d.addBoth(lambda _: item)
        return d

    def _do_upsert(self, conn, item, spider):
        """Perform an insert or update."""
        sn = spider.name
        guid = self._get_guid(item)
        now = datetime.utcnow().replace(microsecond=0).isoformat(' ')

        conn.execute("""SELECT EXISTS(
            SELECT 1 FROM masterjobs WHERE guid = %s
        )""", (guid, ))
        ret = conn.fetchone()[0]

        if ret:
            conn.execute("""
                UPDATE masterjobs
                SET name=%s, website=%s, description=%s, url=%s, updated=%s
                WHERE guid=%s
            """, (item['name'], sn, item['description'], item['url'], now, guid))
            spider.log("Item updated in db: %s %r" % (guid, item))
        else:
            conn.execute("""
                INSERT INTO masterjobs (guid, website, name, description, url, updated)
                VALUES (%s, %s, %s, %s, %s, %s)
            """, (guid, sn, item['name'], item['description'], item['url'], now))
            spider.log("Item stored in db: %s %r" % (guid, item))

    def _handle_error(self, failure, item, spider):
        """Handle errors raised during the db interaction."""
        log.err(failure)

    def _get_guid(self, item):
        """Generate a unique identifier for a given item."""
        return md5(item['url']).hexdigest()
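
The pipeline is enabled from my project's settings.py roughly as follows (the package name myproject is a placeholder, not my real project name):

# settings.py -- sketch of how the pipeline is wired up;
# "myproject" stands in for the actual project package.
ITEM_PIPELINES = {
    'myproject.pipelines.mySQLStorePipeline': 300,
}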

I'd like to turn all of this into an egg so it can be uploaded to Scrapinghub. How do I do that? I wrote a setup.py file and tried to package it, but I keep getting an error saying it can't find the package.
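
For reference, this is the shape of the setup.py I've been trying, a minimal sketch assuming the Scrapy project package is named myproject (a placeholder):

# setup.py -- minimal sketch; "myproject" is a placeholder package name
from setuptools import setup, find_packages

setup(
    name='myproject',
    version='1.0',
    packages=find_packages(),
    # Scrapy/Scrapinghub locate the project settings through this entry point
    entry_points={'scrapy': ['settings = myproject.settings']},
)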


Tags: name, from, import, self, log, url, db, settings