Forcing null to be converted consistently to NaN when using toPandas

import numpy as np
from pyspark.sql.functions import col

# sc is assumed here to be an object that exposes createDataFrame
# (for example a SparkSession or SQLContext)
sparkTest = sc.createDataFrame(
    [
        (1, 1),
        (2, None),
        (None, None),
    ],
    ['a', 'b']
)
sparkTest.show()  # all None values are neatly converted to null

pdTest1 = sparkTest.toPandas()
pdTest1  # all None values are NaN
np.isnan(pdTest1['b'])  # this is a Series of dtype bool

pdTest2 = sparkTest.filter(col('b').isNull()).toPandas()
pdTest2  # the null value in column a is still NaN, but the two nulls in column b are now None
np.isnan(pdTest2['b'])  # this throws an error

1 answer

User

Reply 1 · posted 2024-05-14 20:37:51

np.isnan can be applied to NumPy arrays of a native dtype (such as np.float64), but it raises a TypeError when applied to an object array:

pdTest1['b']
0    1.0
1    NaN
2    NaN
Name: b, dtype: float64

pdTest2['b']
0    None
1    None
Name: b, dtype: object
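
The same difference can be reproduced without a Spark session. The sketch below builds two Series by hand that are assumed to mirror the dtypes shown above (float_col and object_col are illustrative names, not part of the original code) and checks how np.isnan treats each one:

```python
import numpy as np
import pandas as pd

# Hand-built stand-ins for pdTest1['b'] and pdTest2['b'] (assumed equivalents)
float_col = pd.Series([1.0, np.nan, np.nan], name="b")        # dtype float64
object_col = pd.Series([None, None], name="b", dtype=object)  # dtype object

# Works: float64 is a native NumPy dtype
float_mask = np.isnan(float_col)

# Fails: np.isnan has no implementation for object arrays
try:
    np.isnan(object_col)
    object_error = None
except TypeError as exc:
    object_error = exc

print(float_mask.tolist())
print(type(object_error).__name__)
```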

If you have pandas available, you can use pandas.isnull:

import pandas as pd


pd.isnull(pdTest1['b'])
0    False
1     True
2     True
Name: b, dtype: bool


pd.isnull(pdTest2['b'])
0    True
1    True
Name: b, dtype: bool

This behaves consistently for both np.nan and None.
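
As a small illustration of that consistency, the sketch below applies pd.isnull to a single object-dtype Series that mixes None, np.nan, and a regular value. The Series is made up for the example and does not come from the Spark DataFrame above:

```python
import numpy as np
import pandas as pd

# An object column holding both kinds of missing value plus a real number
mixed = pd.Series([None, np.nan, 1.0], dtype=object)

# pd.isnull flags None and NaN alike, regardless of the column dtype
missing_mask = pd.isnull(mixed)

print(missing_mask.tolist())
```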

Alternatively, you can (where possible) cast the pdTest2['b'] array to one of the native NumPy types (for example np.float64) so that np.isnan works, for example:

import numpy as np
import pyspark.sql.functions as f

pdTest2 = sparkTest.filter(f.col('b').isNull()).toPandas()
np.isnan(pdTest2['b'].astype(np.float64))
0    True
1    True
Name: b, dtype: bool
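
If several columns are affected, the same cast could be applied to the whole DataFrame after calling toPandas. The helper below is only a sketch: object_columns_to_float is a hypothetical name, and it assumes that any object column that cannot be cast (for example one holding strings) should simply be left unchanged.

```python
import numpy as np
import pandas as pd


def object_columns_to_float(df):
    """Return a copy of df with object columns cast to float64 where possible.

    Hypothetical helper: columns that cannot be cast are left as they are.
    """
    out = df.copy()
    for name in out.columns:
        if out[name].dtype == object:
            try:
                # None becomes NaN when an object column is cast to float64
                out[name] = out[name].astype(np.float64)
            except (TypeError, ValueError):
                # Not numeric (e.g. strings): keep the original column
                pass
    return out


# Made-up frame standing in for the result of toPandas()
example = pd.DataFrame({
    "a": pd.Series([np.nan, 2.0]),
    "b": pd.Series([None, None], dtype=object),
    "c": pd.Series(["x", None], dtype=object),
})
converted = object_columns_to_float(example)
print(converted.dtypes)
print(np.isnan(converted["b"]).tolist())
```

The cast is tried column by column so that one non-numeric column does not prevent the others from being converted.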
