java - Is there a solution for this collect() problem in PySpark?

I want to read the file names out of a zip file, but the collect() call fails:

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext()
spark = SparkSession(sc)
path = "/Desktop/SparkProject/ok.zip"


import io
import zipfile


def zip_extract(x):
    # binaryFiles yields (path, bytes) pairs; wrap the raw bytes in a file-like object
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    # map each archive member name to its uncompressed contents
    return {name: file_obj.open(name).read() for name in file_obj.namelist()}


zips = sc.binaryFiles(path)
files_data = zips.map(zip_extract).collect()
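
For context, binaryFiles returns an RDD of (file path, raw bytes) pairs, so a successful collect() should give a list with one dict per zip file, mapping member names to their contents. This is roughly how I intend to consume the result (the print format is just illustrative):

# Assuming collect() succeeds, files_data is a list of dicts, one per zip file.
for archive in files_data:
    for name, data in archive.items():
        print(name, len(data))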

Instead, the run fails with the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (DESKTOP-11E4VBV executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
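
For reference, "Python worker failed to connect back" usually indicates that the Spark executors could not launch or reach a Python worker process (common on Windows); a frequently suggested workaround, which I have not verified for this setup, is to point PySpark at the driver's own interpreter before creating the SparkContext:

import os
import sys

# Must run before SparkContext() is created, so worker processes
# are launched with the same interpreter that runs this script.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable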

(0) answers