希望能给我一些关于如何优化代码的建议。。。理想情况下,我希望继续使用pandas,但是假设我可以使用一些漂亮的sqlite技巧来获得一些好的加速。对于其他的“要点”,我想知道Cython是否能帮上忙?在
因为从代码上看不出来。。对于上下文,我必须写出数百万个非常小的sqlite文件(“uncompressedDir”中的文件),并将它们输出到一个更大的“master”sqlite DB(“第6个一月数据库"). 在
提前谢谢大家!在
%%cython -a
import os
import pandas as pd
import sqlite3
import time
import sys
def main():
rootDir = "/Users/harryrobinson/Desktop/dataForMartin/"
unCompressedDir = "/Users/harryrobinson/Desktop/dataForMartin/unCompressedSqlFiles/"
with sqlite3.connect(rootDir+'6thJan.db') as conn:
destCursor = conn.cursor()
createTable = "CREATE TABLE IF NOT EXISTS userData(TimeStamp, Category, Action, Parameter1Name, Parameter1Value, Parameter2Name, Parameter2Value, formatVersion, appVersion, userID, operatingSystem)"
destCursor.execute(createTable)
for i in os.listdir(unCompressedDir):
try:
with sqlite3.connect(unCompressedDir+i) as connection:
cursor = connection.cursor()
cursor.execute('SELECT * FROM Events')
df_events = pd.DataFrame(cursor.fetchall())
cursor.execute('SELECT * FROM Global')
df_global = pd.DataFrame(cursor.fetchall())
cols = ['TimeStamp', 'Category', 'Action', 'Parameter1Name', 'Parameter1Value', 'Parameter2Name', 'Parameter2Value']
df_events = df_events.drop(0,axis=1)
df_events.columns = cols
df_events['formatVersion'] = df_global.iloc[0,0]
df_events['appVersion'] = df_global.iloc[0,1]
df_events['userID'] = df_global.iloc[0,2]
df_events['operatingSystem'] = df_global.iloc[0,3]
except Exception as e:
print(e, sys.exc_info()[-1].tb_lineno)
try:
df_events.to_sql("userData", conn, if_exists="append", index=False)
except Exception as e:
print("Sqlite error, {0} - line {1}".format(e, sys.exc_info()[-1].tb_lineno))
更新:通过添加事务而不是添加到\u-sql,将时间缩短了一半
重新考虑使用Pandas作为登台工具(离开库进行数据分析)。只需编写纯SQL查询,可以使用SQLite的ATTACH查询外部数据库。在
相关问题 更多 >
编程相关推荐