Comparison of a float with `np.nan` in a Spark DataFrame

Posted 2024-04-23 05:30:28


Is this expected behavior? I was about to raise an issue about Spark, but this seems like such basic functionality that it is hard to imagine a bug here. What am I missing?

Python

>>> import numpy as np
>>> np.nan < 0.0
False
>>> np.nan > 0.0
False

PySpark

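The original PySpark snippet was not preserved; a minimal sketch of the kind of comparison being asked about (the column name and values here are illustrative) might be:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative single-column DataFrame containing a NaN value
df = spark.createDataFrame([(float("nan"),), (1.0,)], ["x"])

# Unlike plain Python/NumPy, Spark treats NaN as larger than any other
# numeric value, so this comparison is expected to return true for the NaN row
df.select((df["x"] > 0.0).alias("x_gt_0")).show()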

1 answer

Answer #1 · Posted 2024-04-23 05:30:28

This is both expected and documented behavior. Quoting the NaN Semantics section of the official Spark SQL Guide (emphasis mine):

There is specially handling for not-a-number (NaN) when dealing with float or double types that does not exactly match standard floating point semantics. Specifically:

  • NaN = NaN returns true.
  • In aggregations, all NaN values are grouped together.
  • NaN is treated as a normal value in join keys.
  • NaN values go last when in ascending order, larger than any other numeric value.
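The last bullet is easy to check directly; a minimal sketch, assuming an illustrative session, column name, and data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(float("nan"),), (1.0,), (0.0,)], ["x"])

# Ascending sort: the NaN row is expected to come last, after 1.0
df.orderBy("x").show()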

You are simply seeing the difference between plain Python / NumPy behavior and Spark's behavior. In particular, Spark considers NaN values equal:

spark.sql("""
    WITH table AS (SELECT CAST('NaN' AS float) AS x, cast('NaN' AS float) AS y) 
    SELECT x = y, x != y FROM table
""").show()
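The result table was not preserved in the post; given the semantics above, the two NaN casts compare equal, so the output should look roughly like this (exact column headers may vary by Spark version):

+-------+-------------+
|(x = y)|(NOT (x = y))|
+-------+-------------+
|   true|        false|
+-------+-------------+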

whereas plain Python

float("NaN") == float("NaN"), float("NaN") != float("NaN")
(False, True)

and NumPy

np.nan == np.nan, np.nan != np.nan
(False, True)

do not.

You can check the relevant docstring for additional examples.

So to get the desired result, you have to check for NaN explicitly:

from pyspark.sql.functions import col, isnan, when

# If either side is NaN, return False; otherwise fall back to the normal comparison
when(isnan("_1") | isnan("_2"), False).otherwise(col("_1") > col("_2"))
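For context, a minimal usage sketch, assuming a DataFrame whose columns carry the default tuple names _1 and _2:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan, when

spark = SparkSession.builder.getOrCreate()

# Hypothetical data; createDataFrame names tuple columns _1 and _2 by default
df = spark.createDataFrame([(float("nan"), 0.0), (2.0, 1.0)])

nan_safe_gt = when(isnan("_1") | isnan("_2"), False).otherwise(col("_1") > col("_2"))

# The NaN row should now yield false, matching Python/NumPy expectations
df.select(nan_safe_gt.alias("gt")).show()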
