IndexError: list index out of range when manually creating a Spark DataFrame?

Posted 2024-04-19 07:59:14


I'm trying to manually create a Spark DataFrame (a single column DT, with one row whose date is 2020-1-1):

DT
=======
2020-01-01

However, it fails with a list index out of range error:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder\
        .master(f'spark://{IP}:7077')\
        .config('spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version', '2')\
        .appName('g data')\
        .getOrCreate()

spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic')

dates = spark.createDataFrame([(pd.to_datetime('2020-1-1'))], ['DT'])

Traceback:

 in brand_tagging_since_until(spark, since, until)
---> 81         dates = spark.createDataFrame([(pd.to_datetime('2020-1-1'))], ['DT'])

/usr/local/bin/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
    746             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
    747         else:
--> 748             rdd, schema = self._createFromLocal(map(prepare, data), schema)
    749         jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
    750         jdf = self._jsparkSession.applySchemaToPythonRDD(jrdd.rdd(), schema.json())

/usr/local/bin/spark/python/pyspark/sql/session.py in _createFromLocal(self, data, schema)
    419             if isinstance(schema, (list, tuple)):
    420                 for i, name in enumerate(schema):
--> 421                     struct.fields[i].name = name
    422                     struct.names[i] = name
    423             schema = struct

2 Answers

There are two problems here, although only one of them shows up in your example. The immediate problem is that the constructor expects the value inside the tuple to be followed by a trailing comma. However, simply adding it will still fail silently, because the constructor does not know how to handle a pandas Timestamp object.
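As a quick plain-Python illustration of the first point (the value below is just a placeholder), parentheses on their own do not create a tuple; the trailing comma does, which is why createDataFrame was handed a bare value instead of a one-column row:

val = 42
print(type((val)))   # <class 'int'>   -- (val) is just val
print(type((val,)))  # <class 'tuple'> -- the trailing comma makes a one-element tuple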

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("timestamp").getOrCreate()

val = pd.to_datetime('2020-1-1')
spark.createDataFrame(
    data=[(val,)],
    schema=["DT"]
).show()
+---+
| DT|
+---+
| []|
+---+

If you want to use the constructor like this, you need to convert the value to a plain Python datetime object beforehand:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.appName("timestamp").getOrCreate()

val = pd.to_datetime('2020-1-1')
spark.createDataFrame(
    data=[(val.to_pydatetime(),)],
    schema=["DT"]
).show()
+-------------------+
|                 DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+

That said, it's not clear to me where this is most clearly documented. If you're curious, you can see this requirement in the Spark codebase or in the source docs.

If you pass a pandas DataFrame to the constructor, this conversion is handled under the hood:

df = pd.DataFrame({"DT": [val]})
spark.createDataFrame(
    data=df
).show()
+-------------------+
|                 DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+

A more direct way to create the DataFrame without relying on pandas:

import pyspark.sql.functions as F

dates = spark.createDataFrame([['2020-01-01']], ['DT']) \
             .withColumn('DT', F.col('DT').cast('timestamp'))

dates.show()
+-------------------+
|                 DT|
+-------------------+
|2020-01-01 00:00:00|
+-------------------+
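
For completeness, one more variation (a minimal sketch, reusing the spark session from above): a plain Python datetime can also be passed directly, and the constructor infers a timestamp column without any string cast.

from datetime import datetime

dates = spark.createDataFrame([(datetime(2020, 1, 1),)], ['DT'])
dates.show()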
