Spark: how to convert tuples to a DataFrame


I have a train_rdd like (('a',1),('b',2),('c',3)). I convert it to a DataFrame with the following:

from pyspark.sql import Row
# Each record is a tuple of (key, value) pairs; turn it into a Row
train_label_df = train_rdd.map(lambda x: Row(**dict(x))).toDF()
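
For a well-formed record this works as expected. A quick local check, in plain Python (the variable x here is just an illustration), of what the mapping produces:

from pyspark.sql import Row

# A record with all keys present maps cleanly to a Row
x = (('a', 1), ('b', 2), ('c', 3))
print(Row(**dict(x)))  # Row(a=1, b=2, c=3)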

But some keys may be missing from some records in the RDD, and then this error occurs:

File "/mnt/hadoop/yarn/local/usercache/hdfs/appcache/application_/container_05_000017/pyspark.zip/pyspark/worker.py", line 253, in main
    process()
File "/mnt/hadoop/yarn/local/usercache/hdfs/appcache/application_/container_05_000017/pyspark.zip/pyspark/worker.py", line 248, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/hadoop/yarn/local/usercache/hdfs/appcache/application_/container_05_000002/pyspark.zip/pyspark/rdd.py", line 2440, in pipeline_func
File "/mnt/hadoop/yarn/local/usercache/hdfs/appcache/application_/container_05_000002/pyspark.zip/pyspark/rdd.py", line 2440, in pipeline_func
File "/mnt/hadoop/yarn/local/usercache/hdfs/appcache/application_/container_05_000002/pyspark.zip/pyspark/rdd.py", line 350, in func
File "/mnt/hadoop/yarn/local/usercache/hdfs/appcache/application_/container_05_000002/pyspark.zip/pyspark/rdd.py", line 1859, in combineLocally
File "/mnt/hadoop/yarn/local/usercache/hdfs/appcache/application_/container_05_000017/pyspark.zip/pyspark/shuffle.py", line 237, in mergeValues
    for k, v in iterator:
TypeError: cannot unpack non-iterable NoneType object
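
The unpacking failure is consistent with dict() being fed a record that is not a sequence of key/value pairs. A minimal local repro in plain Python (a sketch of the likely cause, not a capture of the cluster run):

print(dict((('a', 1), ('b', 2))))  # {'a': 1, 'b': 2} -- a tuple of pairs is fine
try:
    dict(('a', 1))                 # a bare pair is NOT a sequence of pairs
except ValueError as e:
    print(e)                       # dictionary update sequence element #0 has length 1; 2 is required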

Is there another way to convert an RDD of tuples to a DataFrame?


Update:

I also tried using createDataFrame:

from pyspark.sql.types import StructType, StructField, StringType

rdd = sc.parallelize([('a', 1), (('a', 1), ('b', 2)), (('a', 1), ('b', 2), ('c', 3))])
schema = StructType([
    StructField("a", StringType(), True),
    StructField("b", StringType(), True),
    StructField("c", StringType(), True),
])
train_label_df = sqlContext.createDataFrame(rdd, schema)
train_label_df.show()

which raises the error:

  File "/home/spark/python/pyspark/sql/types.py", line 1400, in verify_struct
    "length of fields (%d)" % (len(obj), len(verifiers))))
ValueError: Length of object (2) does not match with length of fields (3)
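
One workaround along these lines (a sketch; pad is a hypothetical helper, and the positional padding assumes keys always arrive in a-b-c order) is to normalize every record to a fixed-length tuple of strings before calling createDataFrame:

def pad(x):
    pairs = x if isinstance(x[0], tuple) else (x,)  # wrap a bare pair
    vals = [str(v) for _, v in pairs]               # schema declares StringType
    return tuple(vals + [None] * (3 - len(vals)))   # pad missing fields with None

train_label_df = sqlContext.createDataFrame(rdd.map(pad), schema)

The key-based approach in the answer below is more robust, since it does not depend on field order.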

1 Answer

You can map the tuples to dicts first:

# Normalize each record to a dict; a bare pair like ('a', 1) is wrapped in a list
rdd1 = rdd.map(lambda x: dict(x if isinstance(x[0], tuple) else [x]))
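
What this normalization does for each record shape (a hypothetical REPL check; f is just the same lambda given a name):

f = lambda x: dict(x if isinstance(x[0], tuple) else [x])
print(f(('a', 1)))              # {'a': 1}          -- bare pair, wrapped
print(f((('a', 1), ('b', 2))))  # {'a': 1, 'b': 2}  -- tuple of pairs, passed through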

Then do one of the following:

from pyspark.sql import Row 

cols = ["a", "b", "c"]

rdd1.map(lambda x: Row(**{c: x.get(c) for c in cols})).toDF().show()
+---+----+----+
|  a|   b|   c|
+---+----+----+
|  1|null|null|
|  1|   2|null|
|  1|   2|   3|
+---+----+----+

rdd1.map(lambda x: tuple(x.get(c) for c in cols)).toDF(cols).show()
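
If you prefer not to rely on toDF's schema inference over rows that contain None, a variant with an explicit schema (a sketch; LongType assumes the values stay integers):

from pyspark.sql.types import StructType, StructField, LongType

int_schema = StructType([StructField(c, LongType(), True) for c in cols])
sqlContext.createDataFrame(rdd1.map(lambda x: tuple(x.get(c) for c in cols)), int_schema).show()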
