Large (6 million row) pandas df causes a memory error with to_sql when chunksize=100, but a 100,000-row df saves easily with no chunksize


I created a large database in Pandas, about 6 million rows of text data. I wanted to save it as a SQL database file, but when I try to save it I get an out-of-memory RAM error. I even reduced the chunksize to 100 and it still crashes.

However, if I take a smaller version of the dataframe with 100,000 rows and save it to a database with no chunksize specified, it saves with no problem.

This is my code:

from sqlalchemy import create_engine
engine = create_engine("sqlite:///databasefile.db")
dataframe.to_sql("CS_table", engine, chunksize=100)

My understanding is that since it only processes 100 rows at a time, the RAM usage should reflect what it costs to save 100 rows. Is something else going on behind the scenes? Perhaps multi-threading?

Before I ran this code I was using 4.8 GB of RAM, out of the 12.8 GB Google Colab provides. Running the code above eats up all the RAM until the environment crashes.
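
For reference, one way to watch the RAM climb while the write runs (a minimal sketch, not part of my original code; it assumes psutil, which is normally available in Colab):

import threading
import time

import psutil

def log_rss(interval=5):
    """Print this process's resident memory every `interval` seconds."""
    proc = psutil.Process()
    while True:
        print(f"RSS: {proc.memory_info().rss / 1e9:.2f} GB")
        time.sleep(interval)

# Start the logger in a background thread, then run to_sql as usual.
threading.Thread(target=log_rss, daemon=True).start()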

I would like to be able to save my pandas dataframe to a SQL file without my environment crashing. The environment I am in is Google Colab. The pandas dataframe is 2 columns, ~6 million rows. Each cell contains about this much text:

"The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data."

Edit:

I keyboard-interrupted the run at various stages, once right after the first jump in RAM and again later. This is the result of interrupting it right before the crash:

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-24-68b60fe221fe>", line 1, in <module>
    dfAllT.to_sql("CS_table22", engine, chunksize = 100)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 2531, in to_sql
    dtype=dtype, method=method)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 460, in to_sql
    chunksize=chunksize, dtype=dtype, method=method)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 1174, in to_sql
    table.insert(chunksize, method=method)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 686, in insert
    exec_insert(conn, keys, chunk_iter)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 599, in _execute_insert
    conn.execute(self.table.insert(), data)
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 988, in execute
    return meth(self, multiparams, params)
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
    distilled_params,
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1468, in _handle_dbapi_exception
    util.reraise(*exc_info)
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/util/compat.py", line 129, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/base.py", line 1224, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.6/dist-packages/sqlalchemy/engine/default.py", line 547, in do_executemany
    cursor.executemany(statement, parameters)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 1823, in showtraceback
    stb = value._render_traceback_()
AttributeError: 'KeyboardInterrupt' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 1132, in get_records
    return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 313, in wrapped
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 358, in _fixed_getinnerframes
    records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
  File "/usr/lib/python3.6/inspect.py", line 1488, in getinnerframes
    frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
  File "/usr/lib/python3.6/inspect.py", line 1446, in getframeinfo
    filename = getsourcefile(frame) or getfile(frame)
  File "/usr/lib/python3.6/inspect.py", line 696, in getsourcefile
    if getattr(getmodule(object, filename), '__loader__', None) is not None:
  File "/usr/lib/python3.6/inspect.py", line 739, in getmodule
    f = getabsfile(module)
  File "/usr/lib/python3.6/inspect.py", line 708, in getabsfile
    _filename = getsourcefile(object) or getfile(object)
  File "/usr/lib/python3.6/inspect.py", line 693, in getsourcefile
    if os.path.exists(filename):
  File "/usr/lib/python3.6/genericpath.py", line 19, in exists
    os.stat(path)
KeyboardInterrupt

I ran it one more time right before the crash, and this seems to produce yet another, different result:

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-28-f18004debe33>", line 1, in <module>
    dfAllT.to_sql("CS_table25", engine, chunksize = 100)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 2531, in to_sql
    dtype=dtype, method=method)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 460, in to_sql
    chunksize=chunksize, dtype=dtype, method=method)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 1174, in to_sql
    table.insert(chunksize, method=method)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 686, in insert
    exec_insert(conn, keys, chunk_iter)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 598, in _execute_insert
    data = [dict(zip(keys, row)) for row in data_iter]
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/sql.py", line 598, in <listcomp>
    data = [dict(zip(keys, row)) for row in data_iter]
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 1823, in showtraceback
    stb = value._render_traceback_()
AttributeError: 'KeyboardInterrupt' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 1132, in get_records
    return _fixed_getinnerframes(etb, number_of_lines_of_context, tb_offset)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 313, in wrapped
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/IPython/core/ultratb.py", line 358, in _fixed_getinnerframes
    records = fix_frame_records_filenames(inspect.getinnerframes(etb, context))
  File "/usr/lib/python3.6/inspect.py", line 1488, in getinnerframes
    frameinfo = (tb.tb_frame,) + getframeinfo(tb, context)
  File "/usr/lib/python3.6/inspect.py", line 1446, in getframeinfo
    filename = getsourcefile(frame) or getfile(frame)
  File "/usr/lib/python3.6/inspect.py", line 696, in getsourcefile
    if getattr(getmodule(object, filename), '__loader__', None) is not None:
  File "/usr/lib/python3.6/inspect.py", line 742, in getmodule
    os.path.realpath(f)] = module.__name__
  File "/usr/lib/python3.6/posixpath.py", line 388, in realpath
    path, ok = _joinrealpath(filename[:0], filename, {})
  File "/usr/lib/python3.6/posixpath.py", line 421, in _joinrealpath
    newpath = join(path, name)
KeyboardInterrupt
---------------------------------------------------------------------------

Other things I have tried (sketched together below):

Using dropna to remove all None/NaN values

dfAllT = dfAllT.applymap(str) to make sure all my values are strings

dfAllT.reset_index(drop=True, inplace=True) to make sure the index is not misaligned.
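
Put together, those cleanup steps looked roughly like this (a sketch, not my exact code):

# Drop missing values, force every cell to str, and rebuild the index.
dfAllT = dfAllT.dropna()
dfAllT = dfAllT.applymap(str)
dfAllT.reset_index(drop=True, inplace=True)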

Edit:

As mentioned in the comments, I have now tried using to_sql in a loop:

for i in range(586147):
    print(i)
    dfAllT.iloc[i*10000:(i+1)*10000].to_sql('CS_table', engine, if_exists= 'append')

This eventually eats up my RAM anyway and causes a crash roughly halfway through. I am wondering whether this means sqlite is holding everything in memory, and whether there is a workaround.
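
One way to check whether rows are actually reaching the file on disk during the loop, rather than piling up in memory, is to watch the database file grow (a sketch; the filename assumes the databasefile.db engine from above):

import os

# Inside the loop, after each to_sql call:
size_gb = os.path.getsize("databasefile.db") / 1e9
print(f"iteration {i}: database file is {size_gb:.2f} GB on disk")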

Edit:

I tried a few more things: shorter chunks, disposing of the engine after each step and creating a new one. It still ate up all the RAM in the end and crashed.

import gc
import sqlalchemy

for i in range(586147):
    print(i)
    engine = sqlalchemy.create_engine("sqlite:///CSTitlesSummariesData.db")
    dfAllT.iloc[i*10:(i+1)*10].to_sql('CS_table', engine, index=False, if_exists='append')
    engine.dispose()
    gc.collect()

My thoughts:

So it looks like the whole database is somehow being kept in active memory.

The pandas dataframe this comes from is about 5 GB, or at least that is how much RAM was in use before I tried converting it to sqlite. My system crashes at about 12.72 GB. I would imagine the sqlite database takes up less RAM than the pandas dataframe.
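
As a sanity check on that 5 GB figure, the dataframe's own footprint can be measured directly; deep=True counts the Python string objects, which dominate here:

# Bytes used by the dataframe, including the text payload of every cell.
mem_gb = dfAllT.memory_usage(deep=True).sum() / 1e9
print(f"dfAllT uses roughly {mem_gb:.2f} GB")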


1 Answer

From stepping through the code, I think it's this line, which creates a bunch of DataFrames:

chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])

That looks like it could be a bug. Specifically, this happens before the database insertion, in the preparation step.
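
To make that concrete, here is a simplified sketch of what the preparation step does per chunk, pieced together from the lines in the traceback above (not the actual pandas source):

# Column-oriented data gets zipped into per-row dicts before executemany is called.
keys = ["col_a", "col_b"]                              # column names
data_list = [["row1a", "row2a"], ["row1b", "row2b"]]   # one list per column

start_i, end_i = 0, 2                                  # bounds of the current chunk
chunk_iter = zip(*[arr[start_i:end_i] for arr in data_list])
data = [dict(zip(keys, row)) for row in chunk_iter]
# data == [{'col_a': 'row1a', 'col_b': 'row1b'}, {'col_a': 'row2a', 'col_b': 'row2b'}]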

One trick you can do is hit CTRL-C while the memory is rapidly increasing and see which line execution is paused on (my bet is that it is this one).

Edit from the user:

By using an explicit loop (rather than relying on chunksize), i.e.

for i in range(100): df.iloc[i*100000:(i+1)*100000].to_sql(...)

this still leads to a memory error, but it lets the user continue the loop from where it stopped before the crash.

A more robust solution might be to "maybe try a raw connection, rather than using an SQL Engine?", but the user did not get a chance to try this.
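
For anyone who wants to try that route, a raw-connection version might look something like the sketch below (untested; the table and column names are placeholders, and it assumes the dataframe has exactly two text columns):

import sqlite3

def write_in_batches(df, db_path="databasefile.db", table="CS_table", batch=10000):
    """Insert a 2-column dataframe into sqlite in fixed-size batches via the DB-API."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute(f"CREATE TABLE IF NOT EXISTS {table} (col_a TEXT, col_b TEXT)")
    buf = []
    for row in df.itertuples(index=False, name=None):  # plain tuples, no per-row dicts
        buf.append(row)
        if len(buf) >= batch:
            cur.executemany(f"INSERT INTO {table} VALUES (?, ?)", buf)
            conn.commit()
            buf.clear()
    if buf:
        cur.executemany(f"INSERT INTO {table} VALUES (?, ?)", buf)
        conn.commit()
    conn.close()

# write_in_batches(dfAllT)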
