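I am scraping data and storing it in a database. The problem is that only one object gets saved, even though I know about 40 objects were collected.
How can I make the script save all of the objects?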
class PresstvPipeline(object):
    def __init__(self):
        engine = db_connect()
        create_presstv_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, items, spider):
        session = self.Session()
        for title, link, date in zip(items['title'], items['link'], items['date']):
            print(title, link, date)
            item = Presstv(title = title, link = link, date = date)
            if session.query(Presstv).filter_by(link=item.link).first() == None:
                try:
                    session.add(item)
                    session.commit()
                    logger.info('Item saved')
                except:
                    session.rollback()
                    raise
                finally:
                    session.close()
            return item
presstv_url = "http://www.url.ir/Default/Section/1"
presstv_xpath = '//html/body/section/div/div/section/div[2]/section/ul'
presstv_pipeline = PresstvPipeline()

def presstv_extract_item(element):
    return {
        'title': element.xpath('li/div/div/p/text()'),
        'link': element.xpath('li/div/div/a/@href'),
        'date': element.xpath('li/div/div/div/text()'),
    }

def spider_html(input_url, extract_function, input_xpath, pipeline):
    tree = lxml.html.parse(input_url)
    for element in tree.xpath(input_xpath):
        pipeline.process_item(extract_function(element), None)

presstv = spider_html(presstv_url, presstv_extract_item, presstv_xpath, presstv_pipeline)
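You are closing the session inside the for loop, so nothing happens on the subsequent iterations. It is actually worse than that: you return the item inside the loop, which means the loop never even runs the remaining iterations. Move the rollback/close handling outside the loop, and move the return outside as well. You do not need the rollback, since the session gets closed anyway.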
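A minimal sketch of that restructuring, under some assumptions: the `Presstv` model and its columns are guessed, because the question does not show them, and an in-memory SQLite engine is passed in directly in place of the question's `db_connect()` and `create_presstv_table()` helpers. The loop sits inside a single `try`, the commit happens once after the loop, the session is closed in `finally`, and the `return` comes after everything else. Returning `items` rather than a single `Presstv` object is a choice made here, not something the answer prescribes.

```python
from sqlalchemy import Column, Integer, String, create_engine

try:  # SQLAlchemy 1.4 and later
    from sqlalchemy.orm import declarative_base, sessionmaker
except ImportError:  # older releases
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

Base = declarative_base()


class Presstv(Base):
    # Assumed table layout; the real model is not shown in the question.
    __tablename__ = "presstv"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    link = Column(String)
    date = Column(String)


class PresstvPipeline(object):
    def __init__(self, engine):
        # Stand-in for db_connect() / create_presstv_table(engine).
        Base.metadata.create_all(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, items, spider):
        session = self.Session()
        try:
            for title, link, date in zip(items["title"], items["link"], items["date"]):
                # Skip rows whose link is already stored. The session's default
                # autoflush makes rows added earlier in this loop visible here.
                if session.query(Presstv).filter_by(link=link).first() is None:
                    session.add(Presstv(title=title, link=link, date=date))
            session.commit()  # one commit for the whole batch
        finally:
            session.close()  # closing discards anything left uncommitted
        return items  # returned only after every row has been handled


# Hypothetical data, for demonstration only.
engine = create_engine("sqlite://")  # in-memory database
pipeline = PresstvPipeline(engine)
batch = {
    "title": ["first", "second", "third"],
    "link": ["/a", "/b", "/c"],
    "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
}
pipeline.process_item(batch, None)
```

Committing once per batch means a failure on any row leaves the whole batch unsaved. If each row should be saved independently, the commit can move back inside the loop, as long as the close and the return stay outside it.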