Pandas read_csv always crashes on a small file

8 votes

2 answers

4709 views

Asked 2025-04-18 17:44

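I am trying to import a relatively small CSV file into Python with pandas for analysis. The file has 217 rows and 87 columns and is about 15 KB. It is not well structured, but I still want to import it as is, because it is raw data and I do not want to process it by hand outside Python (for example in Excel). Unfortunately, every attempt ends in a crash with the message "The kernel appears to have died. It will restart automatically."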

https://www.wakari.io/sharing/bundle/uniquely/ReadCSV

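From what I have read, read_csv can crash, but usually only on very large files, so I do not understand what is going wrong here. The crash is the same with a local installation (Anaconda 64-bit, IPython (Py 2.7) Notebook) and on Wakari.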

Can anyone help? Many thanks!

Code:

# I have a somewhat ugly, illustrative csv file, but it is not too big: 217 rows, 87 columns.
# File can be downloaded at http://www.win2day.at/download/lo_1986.csv

# In[1]:

file_csv = 'lo_1986.csv'
# Print every line with its index to inspect the raw content.
with open(file_csv, mode="r") as f:
    for x, line in enumerate(f):
        print("%d: %s" % (x, line))


# Now I'd like to import this csv into Python using pandas, but this always leads to a crash:
# "The kernel appears to have died. It will restart automatically."

# In[ ]:

import pandas as pd
pd.read_csv(file_csv, delimiter=';')

# What am I doing wrong?

data-processing ipython data-analysis data-import pandas csv anaconda kernel-crash

2 Answers

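Thank you very much for your comments. I fully agree that this CSV file is messy. Unfortunately, this is how the Austrian national lottery publishes its draw results and prize data.

I kept experimenting and looked into some of the special characters. In the end, the solution turned out to be surprisingly simple: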

pd.read_csv(file_csv, delimiter=';', encoding='latin-1', engine='python')

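Adding the encoding made the special characters display correctly, but the real game changer was the engine parameter. To be honest, I do not fully understand why, but it works now.

Thanks again!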
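As a minimal sketch of that call, the snippet below reads a tiny in-memory stand-in for the file. The sample rows and column names are hypothetical, not taken from lo_1986.csv, and the code assumes Python 3 with a reasonably recent pandas:

```python
import io

import pandas as pd

# Hypothetical sample data: semicolon-separated rows encoded as Latin-1.
# The byte 0xe0 is "à" in Latin-1 but is not a complete sequence in UTF-8.
raw = "Datum;Zahl;Text\n05.01.1986;7;à 1,50\n".encode("latin-1")

df = pd.read_csv(
    io.BytesIO(raw),
    delimiter=";",
    encoding="latin-1",  # decode the bytes as Latin-1 instead of UTF-8
    engine="python",     # use the pure-Python parser instead of the C parser
)
print(df)
```

With a file on disk, the path would take the place of the `io.BytesIO` object, as in the one-line call above.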

Answered 2025-04-18 by Python大师

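This is because the file contains bytes that are not valid UTF-8 (for example 0xe0).

If you pass the encoding parameter when calling read_csv(), you get this error message instead of a crash.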

>>> df = pandas.read_csv("/tmp/lo_1986.csv", delimiter=";", encoding="utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 400, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 205, in _read
    return parser.read()
  File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 608, in read
    ret = self._engine.read(nrows)
  File "/Users/antkong/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1028, in read
    data = self._reader.read(nrows)
  File "parser.pyx", line 706, in pandas.parser.TextReader.read (pandas/parser.c:6745)
  File "parser.pyx", line 728, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:6964)
  File "parser.pyx", line 804, in pandas.parser.TextReader._read_rows (pandas/parser.c:7780)
  File "parser.pyx", line 890, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:8793)
  File "parser.pyx", line 950, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:9484)
  File "parser.pyx", line 1026, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:10642)
  File "parser.pyx", line 1051, in pandas.parser.TextReader._string_convert (pandas/parser.c:10905)
  File "parser.pyx", line 1278, in pandas.parser._string_box_utf8 (pandas/parser.c:15657)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 0: unexpected end of data
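The same decoding failure can be reproduced without pandas. This is a small sketch assuming Python 3; the variable names are illustrative only:

```python
# A lone 0xe0 byte, like the one reported in the traceback.
bad = b"\xe0"

try:
    bad.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    # In UTF-8, 0xe0 starts a multi-byte sequence, so it cannot stand alone.
    utf8_error = exc

print(utf8_error)

# Latin-1 maps every byte value to a character, so decoding cannot fail.
latin1_text = bad.decode("latin-1")
print(repr(latin1_text))
```

Whether Latin-1 is the encoding the file was really written in is an assumption; it decodes without error, but the resulting characters should be checked against the expected text.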

You can preprocess the file to remove these characters before letting pandas read it.
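One possible way to do that preprocessing is sketched below, assuming Python 3. The helper `find_invalid_utf8` and the sample bytes are hypothetical and only illustrate the idea:

```python
def find_invalid_utf8(data):
    """Return (offset, byte value) pairs for bytes that fail UTF-8 decoding."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as exc:
            bad_at = pos + exc.start  # exc.start is relative to the slice
            problems.append((bad_at, data[bad_at]))
            pos = bad_at + 1  # skip the offending byte and keep scanning
    return problems


# Hypothetical sample standing in for the raw file content,
# e.g. the result of open("lo_1986.csv", "rb").read().
sample = b"a;b\n1;\xe0 2\n"

problems = find_invalid_utf8(sample)
print(problems)

# Drop the undecodable bytes; note that this discards data.
cleaned = sample.decode("utf-8", errors="ignore")
print(repr(cleaned))
```

Removing bytes loses information, so decoding with the file's actual encoding is usually preferable when that encoding is known.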

The attached image marks the invalid characters in the file.


Answered 2025-04-18 by Python大师
