PySpark AnalysisException: cannot resolve column name

tempList = []
for col in Df.columns:
    new_name = col.strip()
    new_name = "".join(new_name.split())
    new_name = new_name.replace('.','')
    tempList.append(new_name)
Df = Df.toDF(*tempList)
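To see what the renaming loop above does to individual names, here is a minimal standalone sketch that works on plain strings. The helpers `clean_name` and `clean_names` and the sample names are hypothetical, introduced only for illustration; they are not part of PySpark.

```python
def clean_name(name):
    """Remove all whitespace and dots from a column name."""
    # split() with no arguments splits on any run of whitespace, so joining
    # the pieces drops leading, trailing and inner whitespace in one step
    # (which also makes a separate strip() call unnecessary).
    no_space = "".join(name.split())
    return no_space.replace(".", "")


def clean_names(names):
    """Apply clean_name to every entry, preserving order."""
    return [clean_name(n) for n in names]


# Hypothetical sample names, not taken from the question's data.
raw = [" A", "B ", "C.1", "D E"]
cleaned = clean_names(raw)
print(cleaned)
```

With a real DataFrame, the same result as the loop would presumably come from `Df.toDF(*clean_names(Df.columns))`.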

StructType(List(StructField(A,ShortType,true),StructField(B,ShortType,true),StructField(C,IntegerType,true),StructField(D,IntegerType,true),StructField(E,StringType,true),StructField(F,DoubleType,true),StructField(G,IntegerType,true)))

df = sc.parallelize([[1,2,3], [2,3,4]]).toDF(("a_1", "b", "c"))

def estimateCovariance(df):
    m = df.select(df['features']).map(lambda x: x[0]).mean()
    dfZeroMean = df.select(df['features']).map(lambda x: x[0]).map(lambda x: x-m)  # subtract the mean
    return dfZeroMean.map(lambda x: np.outer(x,x)).sum()/df.count()

def pca(df, k=2):
    cov = estimateCovariance(df)
    col = cov.shape[1]
    eigVals, eigVecs = eigh(cov)
    inds = np.argsort(eigVals)
    eigVecs = eigVecs.T[inds[-1:-(col+1):-1]]
    components = eigVecs[0:k]
    eigVals = eigVals[inds[-1:-(col+1):-1]]  # sort eigenvalues
    score = df.select(df['features']).map(lambda x: x[0]).map(lambda x: np.dot(x, components.T) )
    scoreDF = sqlContext.createDataFrame(score.map(lambda x: (DenseVector(x),)), ['pca_features'])
    # Return the `k` principal components, `k` scores, and all eigenvalues
    return components.T, scoreDF, eigVals

comp, score, eigVals = pca(df)
score.collect()

2 Answers

User
Answer 1 · edited 2024-04-20 08:13:16

From the article you linked to:
The input to our pca procedure consists of a Spark dataframe, which includes a column named features containing the features as DenseVectors.
A little further on, it gives an example of how to build a sample dataset:
>>> data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
...         (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
...         (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
>>> df = sqlContext.createDataFrame(data,["features"])
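Your dataset holds its data in several separate columns. You need to convert them into a single column of vectors. Spark ML has a tool for this, pyspark.ml.feature.VectorAssembler.
In your case, you need something like the following: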
from pyspark.ml.feature import VectorAssembler

vectorAssembler = VectorAssembler(inputCols=["a_1", "b", "c"], outputCol="features")
comp, score, eigVals = pca(vectorAssembler.transform(df))

User
Answer 2 · edited 2024-04-20 08:13:16

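It looks like you don't have a column named features. If I understand the question correctly, all the columns in this example are features, so you probably want to select all of them.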
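One way to surface this kind of problem before Spark raises an AnalysisException is to compare the column list against the names the function expects. The sketch below is only an illustration on plain Python lists: `require_columns` is a hypothetical helper, not a PySpark API, and the column names are the ones from the question.

```python
def require_columns(available, required):
    """Raise a descriptive error if any required column name is absent."""
    missing = [c for c in required if c not in available]
    if missing:
        raise ValueError(
            "missing column(s) %s; available columns: %s"
            % (missing, list(available))
        )


# Column names of the DataFrame built in the question.
columns = ["a_1", "b", "c"]

try:
    # pca() reads df['features'], which this DataFrame does not have.
    require_columns(columns, ["features"])
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

With a real DataFrame this would be called as `require_columns(df.columns, ["features"])` before `pca(df)`. If every column is meant to be a feature, passing `inputCols=df.columns` to a VectorAssembler would presumably assemble all of them into the features column.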
