Unicode error when converting an RDD to a Spark DataFrame

Posted 2024-04-25 12:15:31


I get the following error when I call the show method on a DataFrame.

Py4JJavaError: An error occurred while calling o14904.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23450.0 failed 1 times, most recent failure: Lost task 0.0 in stage 23450.0 (TID 120652, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/i854319/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main
    process()
  File "/Users/i854319/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/i854319/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<ipython-input-8-b76896bc4e43>", line 320, in <lambda>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-5: ordinal not in range(128)

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.next(PythonRDD.scala:156)

When I fetch only 12 rows, no error is thrown.

But when I call show on the full DataFrame, I get the error above.

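I created the DataFrame as follows. It is essentially a DataFrame holding the feature names and their importance values from a random forest model: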

vocab=np.array(self.cvModel.bestModel.stages[3].vocabulary)
if est_name=="rf":
    feature_importance=self.cvModel.bestModel.stages[5].featureImportances.toArray()
    argsort_feature_indices=feature_importance.argsort()[::-1]
elif est_name=="blr":
    feature_importance=self.cvModel.bestModel.stages[5].coefficients.toArray()
    argsort_feature_indices=abs(feature_importance).argsort()[::-1]
# Sort the features importance array in descending order and get the indices

feature_names=vocab[argsort_feature_indices]

self.features_df=sc.parallelize(zip(feature_names,feature_importance[argsort_feature_indices])).\
    map(lambda x: (str(x[0]),float(x[1]))).toDF(["Feature_name","Importance_value"])

1 answer
User
Reply #1 · Posted 2024-04-25 12:15:31

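I assume you are using Python 2. The problem is most likely the str(x[0]) part of your map call. x[0] appears to refer to a unicode string, and str is supposed to convert it to a byte string. However, it does so by implicitly assuming ASCII encoding, which only works for plain English text.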

That is not how the conversion should be done.
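The failure mode described above can be reproduced without Spark. The following is a minimal sketch, assuming a made-up feature name (`text`) that contains non-ASCII characters. On Python 2, `str(text)` on a unicode string behaves like an explicit `text.encode("ascii")`, so the sketch calls that directly, which also lets it run unchanged on Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical feature name: three ASCII letters followed by three CJK characters.
text = u"abc\u4e2d\u6587\u5b57"

# Python 2's str(text) implicitly encodes with the default codec (ASCII).
# Doing the same thing explicitly fails for any non-ASCII character.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    print("ASCII encoding failed:", exc)

# Naming the encoding explicitly succeeds and is reversible.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```

The explicit UTF-8 encode can represent every unicode character, which is why it does not depend on the text being plain English.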

The short answer: change str(x[0]) to x[0].encode('utf-8').
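As a rough sketch of that change, the lambda from the question can be written as a named function and tried on plain Python data first. The names `to_row`, `sample_names`, and `sample_importances` are illustrative stand-ins, not part of the original code, and the values are made up.

```python
# -*- coding: utf-8 -*-

# Hypothetical stand-ins for feature_names and the sorted importance values.
sample_names = [u"price", u"caf\u00e9", u"\u4e2d\u6587"]
sample_importances = [0.5, 0.3, 0.2]


def to_row(pair):
    """Mirror the question's lambda, but encode explicitly instead of calling str()."""
    name, value = pair
    # Explicit UTF-8 encoding replaces the implicit ASCII encoding done by str().
    return (name.encode("utf-8"), float(value))


rows = [to_row(pair) for pair in zip(sample_names, sample_importances)]
```

In the Spark pipeline, the same function would be passed to map in place of the lambda. Note that this fix targets Python 2: on Python 3, strings are already unicode text, str() does not raise here, and encode would turn the names into bytes objects.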

The long answer can be found, for example, here and here.
