将pandas列转换为datetime64,包括缺少的值

2024-05-15 04:29:51 发布

您现在位置:Python中文网/ 问答频道 /正文

使用Pandas处理一些基于timeseries的数据,这些数据包含日期、数字、类别等

我遇到的问题是让pandas从CSV创建的数据帧中正确地处理我的日期/时间列。我的数据中有18个日期列,它们不是连续的,原始CSV中的未知值的字符串值为“未知”。有些列的所有单元格中都有一个有效的日期时间,并通过pandas read_csv方法正确地猜测了它们的数据类型。但是,在特定的数据样本中,有些列的所有单元格都是“未知”的,并且这些单元格被作为对象进行类型化。

加载CSV的代码如下:

self.datecols = ['Claim Date', 'Lock Date', 'Closed Date', 'Service Date', 'Latest_Submission', 'Statement Date 1', 'Statement Date 2', 'Statement Date 3', 'Patient Payment Date 1', 'Patient Payment Date 2', 'Patient Payment Date 3', 'Primary 1 Payment Date', 'Primary 2 Payment Date', 'Primary 3 Payment Date', 'Secondary 1 Payment Date', 'Secondary 2 Payment Date', 'Tertiary Payment Date']
self.csvbear = pd.read_csv(file_path, index_col="Claim ID", parse_dates=True, na_values=['Unknown'])
self.csvbear = pd.DataFrame.convert_objects(self.csvbear, convert_dates='coerce')
print self.csvbear.dtypes
print self.csvbear['Tertiary Payment Date'].values

print self.csvbear.dtypes的输出

Prac                            object
Doctor Name                     object
Practice Name                   object
Specialty                       object
Speciality Code                  int64
Claim Date              datetime64[ns]
Lock Date               datetime64[ns]
Progress Note Locked            object
Aging by Claim Date              int64
Aging by Lock Date               int64
Closed Date             datetime64[ns]
Service Date            datetime64[ns]
Week Number                      int64
Month                   datetime64[ns]
Current Insurance               object
...
Secondary 2 Deductible        float64
Secondary 2 Co Insurance      float64
Secondary 2 Member Balance    float64
Secondary 2 Paid              float64
Secondary 2 Witheld           float64
Secondary 2 Ins                object
Tertiary Payment Date          object
Tertiary Payment ID           float64
Tertiary Allowed              float64
Tertiary Deductible           float64
Tertiary Co Insurance         float64
Tertiary Member Balance       float64
Tertiary Paid                 float64
Tertiary Witheld              float64
Tertiary Ins                  float64
Length: 96, dtype: object
[nan nan nan ..., nan nan nan]
Press any key to continue . . .

如您所见,第三方付款日期列应该是datetime64数据类型,但它只是一个对象,其实际内容只是NaN(从字符串“Unknown”的read-csv函数放在这里)。

如何可靠地将所有日期列转换为将datetime64作为“未知”单元格的dtype和NaT?


Tags: csv数据selfdateobjectnanpaymentns
2条回答

如果您有一个all-nan列,它将不会被read_csv正确强制。最简单的方法就是这样做(如果一个列已经是datetime64[ns]的话,它将直接传递)。

In [3]: df = DataFrame(dict(A = Timestamp('20130101'), B = np.random.randn(5), C = np.nan))

In [4]: df
Out[4]: 
                    A         B   C
0 2013-01-01 00:00:00 -0.859994 NaN
1 2013-01-01 00:00:00 -2.562136 NaN
2 2013-01-01 00:00:00  0.410673 NaN
3 2013-01-01 00:00:00  0.480578 NaN
4 2013-01-01 00:00:00  0.464771 NaN

[5 rows x 3 columns]

In [5]: df.dtypes
Out[5]: 
A    datetime64[ns]
B           float64
C           float64
dtype: object

In [6]: df['A'] = pd.to_datetime(df['A'])

In [7]: df['C'] = pd.to_datetime(df['C'])

In [8]: df
Out[8]: 
                    A         B   C
0 2013-01-01 00:00:00 -0.859994 NaT
1 2013-01-01 00:00:00 -2.562136 NaT
2 2013-01-01 00:00:00  0.410673 NaT
3 2013-01-01 00:00:00  0.480578 NaT
4 2013-01-01 00:00:00  0.464771 NaT

[5 rows x 3 columns]

In [9]: df.dtypes
Out[9]: 
A    datetime64[ns]
B           float64
C    datetime64[ns]
dtype: object

convert_objects不会强制将列转换为datetime,除非它至少有一个非nan的东西是日期(这就是示例失败的原因)。to_datetime可能更具攻击性,因为它“知道”您确实想要转换它。

我喜欢你使用DataFrame.convert_objects的方法,比我以前尝试过的方法优雅得多。

查看API文档:http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

parse_dates : boolean, list of ints or names, list of lists, or dict

If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’

我想你现在正处于分析的数据争论阶段。将数据格式化为正确的格式通常是分析中最长的部分。有些事情就是表现不好,所以在特殊情况下硬编码是必要的。

因此,既然您知道哪些列不能正确解析,我建议您回到代码中,在read_csv阶段解析那些列。这里有个建议:

self.csvbear = pd.read_csv(file_path, index_col="Claim ID", parse_dates=[column, numbers, go, here], na_values=['Unknown'])

您将注意到parse_dates=True已更改为parse_dates=[column, numbers, go, here]。对于行为不正常的列,这可能是到达所需位置的最快速的暴力方法。

相关问题 更多 >

    热门问题