Pandas read_csv(): keep 0 as 0 (do not convert it to NaN)

Posted 2024-06-11 14:22:20


I am trying to read a CSV file. Here is a sample of it:

datetime,check,lat,lon,co_alpha,atn,status,bc
2012-10-27 15:00:59,2,0,0,2.427,,,
2012-10-27 15:01:00,2,0,0,2.407,,,
2012-10-27 15:02:49,2,0,0,2.207,-17.358,0,-16162
2012-10-27 15:02:50,2,0,0,2.207,-17.354,0,8192
2012-10-27 15:02:51,1,0,0,2.207,-17.358,0,-8152
2012-10-27 15:02:52,1,0,0,2.207,-17.358,0,648
2012-10-27 15:06:03,0,51.195076,4.444407,2.349,-17.289,0,4909
2012-10-27 15:06:04,0,51.195182,4.44427,2.344,-17.289,0,587
2012-12-05 09:21:34,,,,,42.960,1,16430
2012-12-05 09:21:35,,,,,42.962,1,3597

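The problem is that in columns containing only ints, 0 is converted to NaN (for example the columns 'check' and 'status': they contain only ints, but each column is read as float because it also has genuinely missing values). I only want empty values to be converted to NaN, not 0.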

What I get is:

[output not preserved in the source]

So the 'check' and 'status' columns contain many NaNs where the file has 0. In the 'lat' and 'lon' columns, 0 is not converted to NaN.

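  • Using na_values='' and keep_default_na=False does not help. Is there a way to specify that int 0 should not be converted to NaN? Or is this a bug?

  • I can use the dtype keyword to set the data type of specific columns to int. That keeps 0 as 0, but these columns also contain real NaNs (empty values). In that case the empty values are converted to 0 as well, because an int column cannot hold NaN.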
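The dtype limitation in the second point can be sketched around on recent pandas versions, which is an assumption here since the question targets pandas 0.10. The nullable integer dtype "Int64" (capital I) keeps 0 as 0 and stores empty fields as missing. The three-row CSV below is a shortened, hypothetical variant of the sample data.

```python
import io

import pandas as pd

# Shortened, hypothetical variant of the sample data from the question.
csv_text = """datetime,check,status
2012-10-27 15:00:59,2,
2012-10-27 15:06:03,0,0
2012-12-05 09:21:34,,1
"""

# "Int64" (capital I) is the nullable integer dtype of recent pandas
# versions; the plain NumPy "int64" dtype cannot hold missing values.
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    dtype={"check": "Int64", "status": "Int64"},
)

print(df)
print(df.dtypes)
```

With this dtype the columns stay integer-typed, so no float conversion is involved at all.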


Edit: after upgrading to pandas 0.10.1, it works correctly even without specifying keep_default_na and na_values:

>>> pd.read_clipboard(sep=',', parse_dates=True, index_col=0)
                     check        lat       lon  co_alpha     atn  status     bc
datetime                                                                        
2012-10-27 15:00:59      2   0.000000  0.000000     2.427     NaN     NaN    NaN
2012-10-27 15:01:00      2   0.000000  0.000000     2.407     NaN     NaN    NaN
2012-10-27 15:02:49      2   0.000000  0.000000     2.207 -17.358       0 -16162
2012-10-27 15:02:50      2   0.000000  0.000000     2.207 -17.354       0   8192
2012-10-27 15:02:51      1   0.000000  0.000000     2.207 -17.358       0  -8152
2012-10-27 15:02:52      1   0.000000  0.000000     2.207 -17.358       0    648
2012-10-27 15:06:03      0  51.195076  4.444407     2.349 -17.289       0   4909
2012-10-27 15:06:04      0  51.195182  4.444270     2.344 -17.289       0    587
2012-12-05 09:21:34    NaN        NaN       NaN       NaN  42.960       1  16430
2012-12-05 09:21:35    NaN        NaN       NaN       NaN  42.962       1   3597

1 answer
User
#1 · Posted 2024-06-11 14:22:20

You have to set keep_default_na to False first:

df = pd.read_clipboard(sep=',', index_col=0, keep_default_na=False, na_values='')

In [2]: df
Out[2]: 
                     check        lat       lon  co_alpha     atn  status     bc
datetime                                                                        
2012-10-27 15:00:59      2   0.000000  0.000000     2.427     NaN     NaN    NaN
2012-10-27 15:01:00      2   0.000000  0.000000     2.407     NaN     NaN    NaN
2012-10-27 15:02:49      2   0.000000  0.000000     2.207 -17.358       0 -16162
2012-10-27 15:02:50      2   0.000000  0.000000     2.207 -17.354       0   8192
2012-10-27 15:02:51      1   0.000000  0.000000     2.207 -17.358       0  -8152
2012-10-27 15:02:52      1   0.000000  0.000000     2.207 -17.358       0    648
2012-10-27 15:06:03      0  51.195076  4.444407     2.349 -17.289       0   4909
2012-10-27 15:06:04      0  51.195182  4.444270     2.344 -17.289       0    587
2012-12-05 09:21:34    NaN        NaN       NaN       NaN  42.960       1  16430
2012-12-05 09:21:35    NaN        NaN       NaN       NaN  42.962       1   3597
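For a version that does not depend on the clipboard, the same options can be passed to read_csv. This is a minimal sketch that assumes a recent pandas version and reads a shortened copy of the sample data from an in-memory string. The variable name csv_text is only illustrative.

```python
import io

import pandas as pd

# Four rows taken from the sample data in the question.
csv_text = """datetime,check,lat,lon,co_alpha,atn,status,bc
2012-10-27 15:00:59,2,0,0,2.427,,,
2012-10-27 15:02:49,2,0,0,2.207,-17.358,0,-16162
2012-10-27 15:06:03,0,51.195076,4.444407,2.349,-17.289,0,4909
2012-12-05 09:21:34,,,,,42.960,1,16430
"""

# Treat only the empty string as missing; the literal 0 is left alone.
df = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,
    parse_dates=True,
    keep_default_na=False,
    na_values=[""],
)

print(df)
print(df.isna().sum())  # number of missing values per column
```

The 'check' and 'status' columns still come back as float, because NaN forces a float dtype, but the zeros stay zeros and only the empty fields become NaN.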

The relevant part of the docstring:

keep_default_na : bool, default True
    If na_values are specified and keep_default_na is False the default NaN
    values are overridden, otherwise they're appended to

na_values : list-like or dict, default None
    Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values
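The "overridden" versus "appended to" wording in the docstring can be illustrated with a small sketch. It assumes a recent pandas version, where the literal string NA is among the default missing-value markers. The two-column CSV is hypothetical.

```python
import io

import pandas as pd

# Hypothetical data: one literal "NA", one empty field, one 0.
csv_text = "name,code\nalpha,NA\nbeta,\ngamma,0\n"

# keep_default_na=True (the default): na_values is appended to the
# built-in list, so both "NA" and the empty field are read as missing.
appended = pd.read_csv(io.StringIO(csv_text), na_values=[""])

# keep_default_na=False: na_values replaces the built-in list, so only
# the empty field is read as missing and "NA" stays a plain string.
overridden = pd.read_csv(
    io.StringIO(csv_text), keep_default_na=False, na_values=[""]
)

print(appended["code"].isna().tolist())    # [True, True, False]
print(overridden["code"].isna().tolist())  # [False, True, False]
```

In neither case is 0 treated as missing, which matches the behaviour reported after the upgrade to pandas 0.10.1.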
