StringIndexer NumberFormatException for a value that does not appear in the column

Posted 2021-09-27 06:42:49


Below are all the distinct values in the column I am trying to encode. state_msg is a string column.

df.groupBy('state_msg').count().show()
+----------+--------+
| state_msg|   count|
+----------+--------+
|Redirected|      28|
|      Busy|  164790|
|  Canceled| 1063663|
|  Finished|36100201|
|Terminated|   12982|
|    Failed|  941183|
| Timed out| 5726363|
|     Error| 1957993|
|  Off-line|  186322|
| Not found|  592259|
+----------+--------+

I am trying to one-hot encode this column:

import pyspark.sql.functions as func

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol='state_msg', outputCol='state_msg_index')
indexed_df = indexer.fit(df).transform(df)
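
For reference, here is a minimal sketch of the label-to-index mapping the fit step would be expected to produce for these ten values. It assumes StringIndexer's default frequencyDesc ordering, where the most frequent label gets index 0.0. It is plain Python with no Spark involved, the counts are copied from the groupBy output above, and frequency_desc_index and expected_index are hypothetical names that are not part of the original code. It models only the mapping and does not reproduce the exception.

```python
# Counts copied from the first groupBy output in the question.
counts = {
    "Redirected": 28,
    "Busy": 164790,
    "Canceled": 1063663,
    "Finished": 36100201,
    "Terminated": 12982,
    "Failed": 941183,
    "Timed out": 5726363,
    "Error": 1957993,
    "Off-line": 186322,
    "Not found": 592259,
}


def frequency_desc_index(label_counts):
    """Hypothetical helper: map each label to a float index, most frequent first.

    Ties are broken alphabetically here; that is an assumption of this sketch
    and does not matter for these counts, which are all distinct.
    """
    ordered = sorted(label_counts, key=lambda label: (-label_counts[label], label))
    return {label: float(position) for position, label in enumerate(ordered)}


expected_index = frequency_desc_index(counts)

for label, index in expected_index.items():
    print(f"{label:>10} -> {index}")
```

Under that assumption the mapping has exactly ten entries, all keyed by the labels listed above, so a value such as "1234567890" would never be one of the labels the indexer learns from this column.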

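But I get the exception below, which makes no sense to me: according to the distinct values produced by the groupBy above, "1234567890" is not a possible value of state_msg.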

    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NumberFormatException: For input string: "1234567890"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:583)
    at java.lang.Integer.parseInt(Integer.java:615)

df.groupBy('state_msg').count().show(n=100)
+----------+--------+
| state_msg|   count|
+----------+--------+
|Redirected|      28|
|      Busy|  165241|
|  Canceled| 1067515|
|  Finished|36270559|
|Terminated|   12997|
|    Failed|  944131|
| Timed out| 5745550|
|     Error| 1959041|
|  Off-line|  186899|
| Not found|  593823|
+----------+--------+

df.agg(func.countDistinct('state_msg').alias('count')).show()

+-----+
|count|
+-----+
|   10|
+-----+