Values not loaded when using pandas read_csv
When I run the following code:
import pandas as pd
with open('data/training.csv', 'r') as f:
    data2 = pd.read_csv(f, sep='\t', index_col=0)
EventID = pd.date_range('1/1/2000', periods=250000)
df = pd.DataFrame(data2, index=EventID, columns=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])
print(df[:3])
print(data2)
The output I get is:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 \
2000-01-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2000-01-02 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2000-01-03 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
17 18 19 20
2000-01-01 NaN NaN NaN NaN ...
2000-01-02 NaN NaN NaN NaN ...
2000-01-03 NaN NaN NaN NaN ...
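I know the values in the CSV file are not all "NaN", so why does the output look like this? How can I get the correct output, with the numbers shown for each row?
When I comment out the "EventID" line and the line that adds "columns":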
import pandas as pd
with open('data/training.csv', 'r') as f:
    df = pd.read_csv(f, sep='\t', index_col=0)
# EventID = pd.date_range('1/1/2000', periods=250000)
# df = pd.DataFrame(data2, index=EventID, columns=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])
print(df[:3])
The output I get in the terminal is:
/usr/bin/python2.7 /home/amit/PycharmProjects/HB/Read.py
Empty DataFrame
Columns: []
Index: [100000,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,197.76,1.582,1.396,0.2,32.638,1.017,0.381,51.626,2.273,-2.414,16.824,-0.277,258.733,2,67.435,2.15,0.444,46.062,1.24,-2.475,113.497,0.00265331133733,s, 100001,160.937,68.768,103.235,48.146,-999.0,-999.0,-999.0,3.473,2.078,125.157,0.879,1.414,-999.0,42.014,2.039,-3.011,36.918,0.501,0.103,44.704,-1.916,164.546,1,46.226,0.725,1.158,-999.0,-999.0,-999.0,46.226,2.23358448717,b, 100002,-999.0,162.172,125.953,35.635,-999.0,-999.0,-999.0,3.148,9.336,197.814,3.776,1.414,-999.0,32.154,-0.705,-2.093,121.409,-0.953,1.052,54.283,-2.186,260.414,1,44.251,2.053,-2.028,-999.0,-999.0,-999.0,44.251,2.34738894364,b]
[3 rows x 0 columns]
Process finished with exit code 0
I don't understand what "3 rows x 0 columns" means.
1 Answer
I don't know exactly what your data looks like, so I am working from what you posted in the question:
In [76]:
%%file temp.csv
100000,138.47,51.655,97.827,27.98,0.91,124.711,2.666,3.064,41.928,197.76,1.582,1.396,0.2,32.638,1.017,0.381,51.626,2.273,-2.414,16.824,-0.277,258.733,2,67.435,2.15,0.444,46.062,1.24,-2.475,113.497,0.00265331133733,s, 100001,160.937,68.768,103.235,48.146,-999.0,-999.0,-999.0,3.473,2.078,125.157,0.879,1.414,-999.0,42.014,2.039,-3.011,36.918,0.501,0.103,44.704,-1.916,164.546,1,46.226,0.725,1.158,-999.0,-999.0,-999.0,46.226,2.23358448717,b, 100002,-999.0,162.172,125.953,35.635,-999.0,-999.0,-999.0,3.148,9.336,197.814,3.776,1.414,-999.0,32.154,-0.705,-2.093,121.409,-0.953,1.052,54.283,-2.186,260.414,1,44.251,2.053,-2.028,-999.0,-999.0,-999.0,44.251,2.34738894364,b
In [77]:
# Check the delimiter: sep must match the file (this sample is comma-delimited, not tab-delimited)
# Pass data2.values, not data2, to pd.DataFrame
with open('temp.csv', 'r') as f:
    data2 = pd.read_csv(f, sep=',', index_col=0, header=None)
EventID = pd.date_range('1/1/2000', periods=1)
df = pd.DataFrame(data2.values, index=EventID, columns=range(98))
print(df[:3])
0 1 2 3 4 5 6 7 \
2000-01-01 138.47 51.655 97.827 27.98 0.91 124.711 2.666 3.064
8 9 ... 88 89 90 91 92 93 94 \
2000-01-01 41.928 197.76 ... 1 44.251 2.053 -2.028 -999 -999 -999
95 96 97
2000-01-01 44.251 2.347389 b
[1 rows x 98 columns]
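The delimiter comment in the code above is also the likely explanation for the "[3 rows x 0 columns]" output in the question. The following is only a minimal sketch, assuming a small comma-delimited sample with a header row; the column names EventId, a and b are made up for illustration and are not taken from the real file. With sep='\t' and no tabs in the data, each line is read as a single field, and index_col=0 then turns that only field into the index, leaving no columns.

```python
import io
import pandas as pd

# Hypothetical two-row sample: comma-delimited, with a header row.
csv_text = "EventId,a,b\n100000,138.47,51.655\n100001,160.937,68.768\n"

# Wrong separator: there are no tabs, so every line is one single field,
# and index_col=0 moves that only field into the index.
wrong = pd.read_csv(io.StringIO(csv_text), sep="\t", index_col=0)
print(wrong.shape)  # (2, 0)

# Matching separator: the first field becomes the index, the rest are columns.
right = pd.read_csv(io.StringIO(csv_text), sep=",", index_col=0)
print(right.shape)  # (2, 2)
```

If the real file turns out to be tab-delimited after all, sep='\t' is correct and only the fix discussed next is needed.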
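pd.DataFrame(data2.values, ...) is the key here. data2 is a DataFrame with its own index. When you pass it to the DataFrame constructor together with a new time-series index, pandas tries to align the original index with the new one, and because none of the labels match, no values carry over.
As a result, pd.DataFrame(data2, ...) produces a DataFrame full of NaN. The fix is to pass the values to the constructor as a numpy.array, that is, to write pd.DataFrame(data2.values, ...).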
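The difference can be reproduced on a tiny frame. This is a sketch under stated assumptions: the two-row data2 below is a made-up stand-in for the real file, and the column labels "a" and "b" are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for data2: integer row labels, string column labels.
data2 = pd.DataFrame(
    [[138.47, 51.655], [160.937, 68.768]],
    index=[100000, 100001],
    columns=["a", "b"],
)
dates = pd.date_range("1/1/2000", periods=2)

# Passing the DataFrame itself: pandas looks up the new row and column
# labels in data2, finds none of them, and fills every cell with NaN.
aligned = pd.DataFrame(data2, index=dates, columns=[1, 2])
print(aligned.isna().all().all())  # True

# Passing the raw values: a plain array has no labels to align,
# so the numbers are placed by position under the new labels.
positional = pd.DataFrame(data2.values, index=dates, columns=[1, 2])
print(positional)
```

Note that the positional version requires the new index and columns to have exactly the same lengths as the array, which is why the answer uses periods=1 and range(98) for its one-row sample.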