Below are all the distinct values in the column I am trying to encode. state_msg is a string column.
df.groupBy('state_msg').count().show()
+----------+--------+
| state_msg|   count|
+----------+--------+
|Redirected|      28|
|      Busy|  164790|
|  Canceled| 1063663|
|  Finished|36100201|
|Terminated|   12982|
|    Failed|  941183|
| Timed out| 5726363|
|     Error| 1957993|
|  Off-line|  186322|
| Not found|  592259|
+----------+--------+
I am trying to one-hot encode this column:
import pyspark.sql.functions as func
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol='state_msg', outputCol='state_msg_index')
indexed_df = indexer.fit(df).transform(df)
But I get the exception below, which makes no sense: according to the distinct values produced by the groupBy above, "1234567890" is not a possible value of state_msg.
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NumberFormatException: For input string: "1234567890"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:583)
at java.lang.Integer.parseInt(Integer.java:615)
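The trace ends in Integer.parseInt, so something is converting a string to a 32-bit integer, and since state_msg is a string column, the offending value may come from a different column. Below is a minimal plain-Python sketch for checking which raw values from a sample would be rejected by that kind of parse. The helpers parses_as_java_int and find_unparseable are hypothetical names, not part of PySpark, and the regular expression only approximates Java's rules (Java also accepts some non-ASCII digits).

```python
import re

# Range of a Java 32-bit signed int.
INT_MIN, INT_MAX = -2**31, 2**31 - 1

# Optional sign followed by ASCII digits only; no whitespace allowed.
_INT_PATTERN = re.compile(r"[+-]?[0-9]+")


def parses_as_java_int(text):
    """Approximate whether Java's Integer.parseInt(text) would succeed."""
    if text is None or not _INT_PATTERN.fullmatch(text):
        return False
    return INT_MIN <= int(text) <= INT_MAX


def find_unparseable(values):
    """Return the values that would likely raise NumberFormatException."""
    return [v for v in values if not parses_as_java_int(v)]


# Hypothetical sample of raw strings, e.g. collected from a suspect column.
sample = ["42", "-7", "2147483647", "2147483648", " 12", "12a", ""]
bad = find_unparseable(sample)
print(bad)
```

Note that the literal 1234567890 is itself within the 32-bit range, so this check would accept it. The value shown in the trace may therefore stand in for a different one, but that is only a guess.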
df.groupBy('state_msg').count().show(n=100)
+----------+--------+
| state_msg|   count|
+----------+--------+
|Redirected|      28|
|      Busy|  165241|
|  Canceled| 1067515|
|  Finished|36270559|
|Terminated|   12997|
|    Failed|  944131|
| Timed out| 5745550|
|     Error| 1959041|
|  Off-line|  186899|
| Not found|  593823|
+----------+--------+
df.agg(func.countDistinct('state_msg').alias('count')).show()
+-----+
|count|
+-----+
|   10|
+-----+
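For reference, here is a minimal plain-Python sketch of the label-to-index mapping the StringIndexer step would be expected to produce once it runs, assuming its default frequency-descending ordering (the most frequent label gets index 0). The counts are copied from the show(n=100) output above. The helpers frequency_desc_index and one_hot are hypothetical names for illustration, not PySpark APIs.

```python
# Counts copied from the groupBy output above.
state_counts = {
    "Redirected": 28,
    "Busy": 165241,
    "Canceled": 1067515,
    "Finished": 36270559,
    "Terminated": 12997,
    "Failed": 944131,
    "Timed out": 5745550,
    "Error": 1959041,
    "Off-line": 186899,
    "Not found": 593823,
}


def frequency_desc_index(counts):
    """Map each label to an index, most frequent label first (index 0)."""
    # Sort by descending count; break ties alphabetically (no ties occur here).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {label: idx for idx, (label, _) in enumerate(ordered)}


def one_hot(index, size):
    """Return a dense one-hot list with a 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


label_to_index = frequency_desc_index(state_counts)
for label, idx in sorted(label_to_index.items(), key=lambda kv: kv[1]):
    print(idx, label, one_hot(idx, len(label_to_index)))
```

This is only an illustration of the intended encoding. Spark's own OneHotEncoder returns sparse vectors and, by default, drops the last category, so its output would have one fewer dimension than the dense lists printed here.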