pandas to_sql 截断了我的数据

3 投票

2 回答

7378 浏览

提问于 2025-04-18 00:05

我在使用 df.to_sql(con=con_mysql, name='testdata', if_exists='replace', flavor='mysql') 把一个数据框导出到 MySQL 数据库时，发现那些包含长字符串的列（比如网址）被截断了，只保留了63个字符。我在导出时收到了来自 ipython notebook 的以下警告：

/usr/local/lib/python2.7/site-packages/pandas/io/sql.py:248: 警告：第3行的 'url' 列数据被截断 cur.executemany(insert_query, data)

其他行也有类似的警告。

有没有什么方法可以调整一下，让数据完整地导出呢？我可以在 MySQL 中设置正确的数据结构，然后再导出到那里。但我希望能通过简单的调整，直接从 Python 就能搞定。

数据库 mysql 数据结构数据截断 pandas 数据框数据导出 ipython notebook

2 个回答

受到@joris的回答启发，我决定直接在pandas的源代码中修改，然后重新编译。

cd /usr/local/lib/python2.7/dist-packages/pandas-0.14.1-py2.7-linux-x86_64.egg/pandas/io
sudo pico sql.py

我修改了第871行

'mysql': 'VARCHAR (63)',

改成了

'mysql': 'VARCHAR (255)',

然后只重新编译了那一个文件

sudo python -m py_compile sql.py

重启我的脚本后，_to_sql()这个函数成功写入了一个表格。

我本以为重新编译会导致pandas出问题，但似乎没有。

这里是我用来把数据框写入mysql的脚本，供参考。

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sqlalchemy 
from sqlalchemy import create_engine
df = pd.read_csv('10k.csv')
## ... dataframe munging
df = df.where(pd.notnull(df), None) # workaround for NaN bug
engine = create_engine('mysql://user:password@localhost:3306/dbname')
con = engine.connect().connection
df.to_sql("issues", con, 'mysql', if_exists='replace', index=True, index_label=None)

回答于 2025-04-18 由 Python大师

分享举报

如果你使用的是 pandas 0.13.1 或更早版本，那么63位数字的限制是固定的，因为代码中有这么一行：https://github.com/pydata/pandas/blob/v0.13.1/pandas/io/sql.py#L278

作为一种解决方法，你可以尝试修改那个函数 get_sqltype：

from pandas.io import sql

def get_sqltype(pytype, flavor):
    sqltype = {'mysql': 'VARCHAR (63)',    # <-- change this value to something sufficient higher
               'sqlite': 'TEXT'}

    if issubclass(pytype, np.floating):
        sqltype['mysql'] = 'FLOAT'
        sqltype['sqlite'] = 'REAL'
    if issubclass(pytype, np.integer):
        sqltype['mysql'] = 'BIGINT'
        sqltype['sqlite'] = 'INTEGER'
    if issubclass(pytype, np.datetime64) or pytype is datetime:
        sqltype['mysql'] = 'DATETIME'
        sqltype['sqlite'] = 'TIMESTAMP'
    if pytype is datetime.date:
        sqltype['mysql'] = 'DATE'
        sqltype['sqlite'] = 'TIMESTAMP'
    if issubclass(pytype, np.bool_):
        sqltype['sqlite'] = 'INTEGER'

    return sqltype[flavor]

sql.get_sqltype = get_sqltype

然后只需使用你的代码就可以正常工作了：

df.to_sql(con=con_mysql, name='testdata', if_exists='replace', flavor='mysql')

从 pandas 0.14 开始，sql模块在后台使用了 sqlalchemy，字符串会被转换成 sqlalchemy 的 TEXT 类型，这又会被转换成 mysql 的 TEXT 类型（而不是 VARCHAR），这样就可以存储超过63位的字符串了：

engine = sqlalchemy.create_engine('mysql://scott:tiger@localhost/foo')
df.to_sql('testdata', engine, if_exists='replace')

不过，如果你仍然使用 DBAPI 连接而不是 sqlalchemy 引擎，那么这个问题依然存在，但这种选择已经不推荐使用，建议你给 to_sql 提供一个 sqlalchemy 引擎。

回答于 2025-04-18 由 Python大师

分享举报

pandas to_sql 截断了我的数据

2 个回答

撰写回答