NumPy select default condition returns the wrong values

Posted 2024-04-29 20:16:31

I have the following code:

from datetime import datetime
import numpy as np
import pandas as pd

datetime_const = datetime(2021, 3, 31)
tmp_df1['datetime2'] = pd.to_datetime(tmp_df1['datetime1'], format='%Y-%m-%d')
tmp_df1['test_col_1'] = (tmp_df1['value1'] < 0.0002) & (tmp_df1['datetime2'] < (datetime_const + pd.DateOffset(months=12)))
tmp_df1['test_col_2'] = (tmp_df1['value1'] >= 0.0002) & ((((tmp_df1['datetime2'] - datetime_const ).dt.days/365)*tmp_df1['value1']) < 0.0002)
tmp_df1['test_col_3'] = datetime_const + pd.DateOffset(months=12)
tmp_df1['test_col_4'] = datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
tmp_df1['test_col_5'] = tmp_df1['datetime2']
tmp_df1['datetime3'] = np.select(
    [
        (tmp_df1['value1'] < 0.0002) & (tmp_df1['datetime2'] < (datetime_const + pd.DateOffset(months=12))),
        (tmp_df1['value1'] >= 0.0002) & ((((tmp_df1['datetime2'] - datetime_const ).dt.days/365)*tmp_df1['value1']) < 0.0002)
    ],
    [
        datetime_const + pd.DateOffset(months=12),
        datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
    ],
    default=tmp_df1['datetime2']
)

datetime1 is an object-dtype column, so I converted it to datetime64 and assigned the result to datetime2.

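value1 is a float-dtype column holding decimal numbers, though it does contain missing values (the .info() output below reports 25438 non-null entries out of 26558).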

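I created test_col_1 through test_col_5 to check the individual conditions and choices in my np.select call, and they all look correct when assigned as separate df columns.

However, the datetime3 column assigned from np.select comes back as object dtype holding strange large numbers such as 1628121600000000000. I expected it to hold a datetime64 value from one of the two choices, or else the default datetime2 column value.

See the sample .info() output and df rows below: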

Data columns (total 8 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   datetime2                   26558 non-null  datetime64[ns]
 1   value1                      25438 non-null  float64       
 2   test_col_1                  26558 non-null  bool          
 3   test_col_2                  26558 non-null  bool          
 4   test_col_3                  26558 non-null  datetime64[ns]
 5   test_col_4                  25438 non-null  datetime64[ns]
 6   test_col_5                  26558 non-null  datetime64[ns]
 7   datetime3                   26558 non-null  object        
dtypes: bool(2), datetime64[ns](4), float64(1), object(1)
memory usage: 1.5+ MB

            datetime2   value1  test_col_1  test_col_2 test_col_3 test_col_4 test_col_5        datetime3
0           2021-06-30 0.00058       False        True 2022-03-31 2021-08-05 2021-06-30        1628121600000000000
1           2022-03-31 0.00044       False       False 2022-03-31 2021-09-13 2022-03-31        1648684800000000000
2           2024-06-07 0.00860       False       False 2022-03-31 2021-04-08 2024-06-07        1717718400000000000
3           2021-09-30 0.00867       False       False 2022-03-31 2021-04-08 2021-09-30        1632960000000000000
4           2021-08-31 0.00144       False       False 2022-03-31 2021-05-21 2021-08-31        1630368000000000000
5           2021-08-31 0.00144       False       False 2022-03-31 2021-05-21 2021-08-31        1630368000000000000
6           2021-04-08 0.00474       False        True 2022-03-31 2021-04-15 2021-04-08        1618444800000000000
7           2023-10-01 0.11506       False       False 2022-03-31 2021-04-01 2023-10-01        1696118400000000000
8           2023-09-29 0.12067       False       False 2022-03-31 2021-04-01 2023-09-29        1695945600000000000
9           2021-05-31 0.02508       False       False 2022-03-31 2021-04-03 2021-05-31        1622419200000000000

I'm completely confused by this behavior. Could someone explain what is going on?

Thanks in advance.


1 Answer

Answer 1 · posted 2024-04-29 20:16:31

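When np.select is used here, the dates appear to be converted to their int64 epoch representation (nanoseconds since the epoch). A simple fix is to cast the result back with astype: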

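from datetime import datetime
import numpy as np
import pandas as pd

# dummy data; datetime_const is the constant from the question
datetime_const = datetime(2021, 3, 31)
tmp_df1 = pd.DataFrame([['2021-06-30', 0.00058],['2023-10-01', 0.11506 ]],
                       columns= ['datetime2','value1'])
tmp_df1['datetime2'] = pd.to_datetime(tmp_df1['datetime2'], format='%Y-%m-%d')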


tmp_df1['datetime3'] = np.select(
    [
        (tmp_df1['value1'] < 0.0002) & (tmp_df1['datetime2'] < (datetime_const + pd.DateOffset(months=12))),
        (tmp_df1['value1'] >= 0.0002) & ((((tmp_df1['datetime2'] - datetime_const ).dt.days/365)*tmp_df1['value1']) < 0.0002)
    ],
    [
        datetime_const + pd.DateOffset(months=12),
        datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
    ],
    default=tmp_df1['datetime2']
).astype('datetime64[ns]') ### < - add this

print(tmp_df1)
   datetime2   value1  datetime3
0 2021-06-30  0.00058 2021-08-04
1 2023-10-01  0.11506 2023-10-01

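More detailed explanation

I think the problem lies in your two choices: the first one is a single value, while the second one is a Series. You can see that when the first choice is also a Series (with a datetime dtype), it works.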
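
The dtype mismatch between the two choices can be inspected directly. The sketch below assumes that np.select settles on a common dtype for its choices in roughly the way `np.result_type` does; the exact internals may vary between NumPy versions, and the names `scalar_choice` and `series_choice` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for datetime_const + pd.DateOffset(months=12), which is a single Timestamp.
scalar_choice = pd.Timestamp("2022-03-31")
# Stand-in for the second choice, which is a datetime Series.
series_choice = pd.Series(pd.to_datetime(["2021-08-04", "2021-04-01"]))

scalar_arr = np.asarray(scalar_choice)   # a lone Timestamp becomes an object array
series_arr = np.asarray(series_choice)   # a datetime Series stays datetime64
common = np.result_type(scalar_arr, series_arr)

print(scalar_arr.dtype, series_arr.dtype, common)
```

Because one input is object, the common dtype is object too, which is consistent with the object-dtype result seen in the question.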

# dummy
tmp_df1 = pd.DataFrame([['2021-06-30', 0.00058],['2023-10-01', 0.11506 ]],
                       columns= ['datetime2','value1'])
tmp_df1['datetime2'] = pd.to_datetime(tmp_df1['datetime2'], format='%Y-%m-%d')

If I use your approach, I get the long integer representation (just as you do):

np.select(
    ...
    [
        datetime_const + pd.DateOffset(months=12),
        datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
    ],...
)
# gives
array([1628035200000000000, 1696118400000000000], dtype=object)

But replacing datetime_const in the first choice with a Series (which is not what your use case needs) gives:

np.select(
    ...
    [
        tmp_df1['datetime2'] + pd.DateOffset(months=12), # here replace the constant by the column datetime2 for example
        datetime_const + pd.to_timedelta(((0.0002/tmp_df1['value1'])*365).round(), unit='D')
    ],
    ...
)
# get the good date format (wrong value of course)
array(['2021-08-04T00:00:00.000000000', '2023-10-01T00:00:00.000000000'],
      dtype='datetime64[ns]')
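
If the constant has to stay in the first choice, one possible alternative to the astype cast is to hand np.select a NumPy datetime64 scalar instead of a pandas Timestamp, so that every choice already has a datetime dtype. This is a sketch under assumptions rather than a tested drop-in: the helper names `one_year_on` and `elapsed_years` are introduced here for readability, and the third data row is hypothetical, added only so that the first condition fires.

```python
from datetime import datetime
import numpy as np
import pandas as pd

datetime_const = datetime(2021, 3, 31)

# Two rows from the answer's dummy data plus one hypothetical row (value1 < 0.0002).
df = pd.DataFrame({
    "datetime2": pd.to_datetime(["2021-06-30", "2023-10-01", "2021-05-01"]),
    "value1": [0.00058, 0.11506, 0.0001],
})

one_year_on = datetime_const + pd.DateOffset(months=12)          # a pandas Timestamp
elapsed_years = (df["datetime2"] - datetime_const).dt.days / 365

result = np.select(
    [
        (df["value1"] < 0.0002) & (df["datetime2"] < one_year_on),
        (df["value1"] >= 0.0002) & (elapsed_years * df["value1"] < 0.0002),
    ],
    [
        # Convert the scalar to numpy.datetime64 so it is not treated as a Python object.
        pd.Timestamp(one_year_on).to_datetime64(),
        datetime_const + pd.to_timedelta((0.0002 / df["value1"] * 365).round(), unit="D"),
    ],
    default=df["datetime2"],
)
df["datetime3"] = result
print(df)
```

With all choices and the default being datetime64, the result keeps a datetime dtype and no cast is needed afterwards.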
