PySpark reading an Avro file with a map complex type: exception while getting task result: java.io.InvalidClassException

Posted 2024-03-29 01:42:29


I ran into a problem when using code based on the example avro_inputformat.py:

schema = open('test_schema_without_map.avsc').read()
# Pass the reader schema to the Avro input format.
conf = {"avro.schema.input.key": schema}

avro_image_rdd = sc.newAPIHadoopFile(
    input_file,
    "org.apache.avro.mapreduce.AvroKeyInputFormat",
    "org.apache.avro.mapred.AvroKey",
    "org.apache.hadoop.io.NullWritable",
    keyConverter="org.apache.spark.examples.pythonconverters.AvroWrapperToJavaConverter",
    conf=conf
)

# Each element is (key, value); x[0] is the AvroKey converted to a Python dict.
output = avro_image_rdd.map(lambda x: x[0]).collect()
for k in output:
    print("Image filename : %s" % k)

When I run this, I get the following error



Job aborted due to stage failure: Exception while getting task result: java.io.InvalidClassException: scala.collection.convert.Wrappers$MutableMapWrapper; no valid constructor

when reading an Avro file with the following schema:

{
    "namespace": "test.avro",
    "type": "record",
    "name": "TestImage",
    "fields": [
        {"name": "filename", "type": "string"},
        {"name": "data", "type": "bytes"},
        {"name": "metadata", "type": {"type": "map", "values": "string"}}
    ]
}

However, the same code works fine when the schema does not contain the 'map' Avro complex type:

{
    "namespace": "test.avro",
    "type": "record",
    "name": "TestImage",
    "fields": [
        {"name": "filename", "type": "string"},
        {"name": "data", "type": "bytes"}
    ]
}
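
To isolate whether the problem is in the file itself or in the Spark converter path, the map-bearing file can also be read directly with the plain avro Python library, outside Spark. This is only a sketch; the path test_with_map.avro is a placeholder:

from avro.datafile import DataFileReader
from avro.io import DatumReader

# Read the file directly, without Spark or AvroWrapperToJavaConverter.
# "test_with_map.avro" is a placeholder path.
reader = DataFileReader(open("test_with_map.avro", "rb"), DatumReader())
for record in reader:
    # Each record comes back as a plain dict: filename, data, and the metadata map.
    print("%s -> %s" % (record["filename"], record["metadata"]))
reader.close()

If this works, the file and schema themselves are fine and the failure is specific to the Spark converter path.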

If anyone knows what the problem is, please share your experience.


Versions:

  • Spark
  • Avro 1.8.0

The content of the Avro file is:

records = [
    {
        "filename": "input_filename_1",
        "metadata": {"a": "1", "b": "23"},
        "data": "1,2,3,4,5,6,7,8,9,0"
    },
    {
        "filename": "input_filename_2",
        "metadata": {"c": "11", "d": "213"},
        "data": "10,11,12,13,14,15"
    }
]
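
For completeness, a minimal sketch of how records like these can be written to an Avro file with the avro Python library (1.8.x API); the schema and output filenames are placeholders:

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# Placeholder filenames. avro 1.8.x for Python 2 exposes avro.schema.parse;
# the avro-python3 package names the same function avro.schema.Parse.
schema = avro.schema.parse(open("test_schema_with_map.avsc").read())

writer = DataFileWriter(open("test_with_map.avro", "wb"), DatumWriter(), schema)
for record in records:
    # Under Python 3 the "data" field (Avro bytes) would need bytes values,
    # e.g. b"1,2,3,4,5"; the str literals above are fine under Python 2.
    writer.append(record)
writer.close()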
