Problem setting input and output columns with VectorAssembler in Spark (Java)
I have a dataset with 5408 columns. The columns print like this:
_c0 | _c1 | _c2 | _c3 | _c4 | _c5 | _c6 | _c8 | _c9
-0.169 | -0.025 | -0.010 | -0.041 | -0.045 | -0.069 | 0.038 | 0.014 | -0.008
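I want to assemble all the feature columns into a single vector and keep the label separate, but I get an error saying that my columns are not supported:
Exception in thread "main" java.lang.IllegalArgumentException: Data type string of column _c0 is not supported.
Data type string of column _c1 is not supported.
Data type string of column _c2 is not supported.
Data type string of column _c3 is not supported.
Data type string of column _c4 is not supported.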
I want to build an ML model, so I have:
Dataset<Row> csvData = spark.read()
    .option("header", false)
    .option("inferSchema", true)
    .csv("src/main/resources/K9.data");
StringIndexer conditionIndexer = new StringIndexer()
    .setInputCol("_c5408")
    .setOutputCol("_c5408Index");
csvData = conditionIndexer.fit(csvData).transform(csvData);
//Cleaning data
List<org.apache.spark.sql.Column> list = new ArrayList<org.apache.spark.sql.Column>();
for (String col : csvData.columns()) {
    list.add(when(csvData.col(col).equalTo("?"), 0).otherwise(csvData.col(col)).alias(col));
}
csvData = csvData.select(list.toArray(new org.apache.spark.sql.Column[0]));
List<org.apache.spark.sql.Column> list_null = new ArrayList<org.apache.spark.sql.Column>();
for (String col : csvData.columns()) {
    list_null.add(when(csvData.col(col).isNull(), 0).otherwise(csvData.col(col)).alias(col));
}
csvData = csvData.select(list_null.toArray(new org.apache.spark.sql.Column[0]));
csvData = csvData.na().fill(0, csvData.columns());
csvData.groupBy(col("_c5408"), col("_c5408Index")).count().show();
ArrayList<String> inputColsList = new ArrayList<>(Arrays.asList(csvData.columns()));
//Make single features column for feature vectors
inputColsList.remove("_c5408Index");
VectorAssembler vectorAssembler = new VectorAssembler()
    .setInputCols(inputColsList.parallelStream().toArray(String[]::new))
    .setOutputCol("features");
Dataset<Row> modelInputData = vectorAssembler.transform(csvData)
    .select("_c5408Index", "features")
    .withColumnRenamed("_c5408Index", "label");
The error is thrown at the line Dataset<Row> modelInputData = vectorAssembler.transform(csvData).
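One thing that may be worth checking at this step: the code above removes only `_c5408Index` from the input list, so the raw label column `_c5408` still appears to be passed to `VectorAssembler` as a feature. The following is a minimal plain-Java sketch of building the input list while excluding both label columns. `FeatureColumns` and `featureColumns` are hypothetical names introduced here for illustration; they are not part of Spark.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class FeatureColumns {

    // Returns the column names that should feed the assembler, in their
    // original order, leaving out every name listed in `excluded`.
    // Hypothetical helper for illustration only.
    static String[] featureColumns(String[] allColumns, String... excluded) {
        Set<String> skip = new HashSet<>(Arrays.asList(excluded));
        return Arrays.stream(allColumns)
                .filter(c -> !skip.contains(c))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Stand-in for csvData.columns(); the real dataset has many more columns.
        String[] all = {"_c0", "_c1", "_c2", "_c5408", "_c5408Index"};
        String[] features = featureColumns(all, "_c5408", "_c5408Index");
        System.out.println(String.join(",", features));
        // With Spark on the classpath, the result could then be passed to
        // vectorAssembler.setInputCols(features).
    }
}
```

A sequential stream is used here instead of `parallelStream()`; with a few thousand names there is probably little to gain from parallelism.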
The schema is:
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: string (nullable = true)
|-- _c3: string (nullable = true)
|-- _c4: string (nullable = true)
|-- _c5: string (nullable = true)
|-- _c6: string (nullable = true)
|-- _c7: string (nullable = true)
|-- _c8: string (nullable = true)
|-- _c9: string (nullable = true)
|-- _c10: string (nullable = true)
|-- _c11: string (nullable = true)
|-- _c12: string (nullable = true)
|-- _c13: string (nullable = true)
|-- _c14: string (nullable = true)
|-- _c15: string (nullable = true)
|-- _c16: string (nullable = true)
|-- _c17: string (nullable = true)
|-- _c18: string (nullable = true)
|-- _c19: string (nullable = true)
|-- _c20: string (nullable = true)
|-- _c5408Index: double (nullable = false)
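The schema shows that the `_c` columns are still strings, and that is what `VectorAssembler` rejects, since it accepts numeric, boolean and vector columns. A likely cause is that the "?" markers keep `inferSchema` from reading the columns as numbers, and that `when(...).otherwise(...)` with the original column in the `otherwise` branch keeps the string type. Note also that `na().fill(0, ...)` with a numeric value only affects numeric columns. One possible approach, sketched here under assumptions, is to cast every feature column to double explicitly. The plain-Java helper below only builds Spark SQL expression strings; `CastExpressions` and `toDoubleExprs` are hypothetical names, and the sketch assumes "?" is the only non-numeric marker in the data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class CastExpressions {

    // Builds one Spark SQL expression per column. Columns listed in
    // `passThrough` are selected unchanged; every other column has "?"
    // turned into NULL, is cast to DOUBLE, and has NULL replaced by 0.0.
    // Hypothetical helper for illustration only.
    static String[] toDoubleExprs(String[] columns, Set<String> passThrough) {
        List<String> exprs = new ArrayList<>();
        for (String c : columns) {
            String quoted = "`" + c + "`";
            if (passThrough.contains(c)) {
                exprs.add(quoted);
            } else {
                exprs.add("coalesce(cast(nullif(" + quoted + ", '?') as double), 0.0) as " + quoted);
            }
        }
        return exprs.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Stand-in for csvData.columns(); the label columns are left untouched.
        String[] all = {"_c0", "_c1", "_c5408", "_c5408Index"};
        String[] exprs = toDoubleExprs(all, Set.of("_c5408", "_c5408Index"));
        for (String e : exprs) {
            System.out.println(e);
        }
        // With Spark on the classpath: csvData = csvData.selectExpr(exprs);
    }
}
```

If the data contains other non-numeric values, the cast yields NULL for them (and then 0.0) when ANSI mode is off, and may raise an error when ANSI mode is on, so this sketch should be checked against the actual data first.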
The dataset looks like this: [sample not included in the source]
0 answers