PySpark program raises "TypeError: Invalid argument, not a string or column"

[('name', 'string'), ('ingredients', 'string'), ('url', 'string'), ('image', 'string'), ('cookTime', 'string'), ('recipeYield', 'string'), ('datePublished', 'string'), ('prepTime', 'string'), ('description', 'string')]

def difficulty(cookTime, prepTime):
    if not cookTime or not prepTime:
        return "Unkown"
    total_duration = cookTime + prepTime
    if total_duration > 3600:
        return "Hard"
    elif total_duration > 1800 and total_duration < 3600:
        return "Medium"
    elif total_duration < 1800:
        return "Easy"
    else:
        return "Unkown"

func_udf = udf(difficulty, IntegerType())
new_emp_final_1 = new_emp_final_1.withColumn("difficulty", func_udf(new_emp_final_1.cookTime, new_emp_final_1.prepTime))
new_emp_final_1.show(20,False)

2 Answers

User

#1 · Edited 2024-05-14 10:44:55

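Looking at the UDF (difficulty), I see two things:

You are trying to add two strings (cookTime and prepTime) inside the UDF, which concatenates them instead of summing numbers.
The UDF should be declared with StringType(), not IntegerType(), because the function returns text labels.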
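The first point can be reproduced in plain Python, without Spark. This is a minimal sketch, assuming the column values arrive as strings such as '60' and '0'; the helper name total_seconds is hypothetical and only used for illustration.

```python
cook_time, prep_time = "60", "0"

# `+` on two strings concatenates them instead of adding numbers.
concatenated = cook_time + prep_time

# In Python 3, ordering comparisons between str and int raise TypeError.
try:
    concatenated > 3600
    comparison_failed = False
except TypeError:
    comparison_failed = True


def total_seconds(cook, prep):
    """Convert both string durations to int before adding (hypothetical helper)."""
    return int(cook) + int(prep)


total = total_seconds(cook_time, prep_time)
```

Converting with int() before the addition is what makes the later numeric comparisons valid.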

This example works for me:

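from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, StructType, StructField, IntegerType
import pandas as pd

# Reuses the active session if one already exists (for example in the pyspark shell).
spark = SparkSession.builder.getOrCreate()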

schema = StructType([StructField("name", StringType(), True), 
                 StructField('ingredients',StringType(),True), 
                 StructField('url',StringType(),True), 
                 StructField('image',StringType(),True), 
                 StructField('cookTime',StringType(),True), 
                 StructField('recipeYield',StringType(),True), 
                 StructField('datePublished',StringType(),True), 
                 StructField('prepTime',StringType(),True), 
                 StructField('description',StringType(),True)])


data = {
    "name": ['meal1', 'meal2'],
    "ingredients": ['ingredient11, ingredient12','ingredient21, ingredient22'],
    "url": ['URL1', 'URL2'],
    "image": ['Image1', 'Image2'],
    "cookTime": ['60', '3601'],
    "recipeYield": ['recipeYield1', 'recipeYield2'],
    "prepTime": ['0','3000'],
    "description": ['desc1','desc2']
    }

new_emp_final_1_pd = pd.DataFrame(data=data)
new_emp_final_1 = spark.createDataFrame(new_emp_final_1_pd)

def difficulty(cookTime, prepTime):
    # Missing or empty durations cannot be classified.
    if not cookTime or not prepTime:
        return "Unknown"

    # The columns are strings, so convert them before adding.
    total_duration = int(cookTime) + int(prepTime)
    if total_duration > 3600:
        return "Hard"
    elif total_duration > 1800:
        # Covers 1801 to 3600, so exactly 3600 is no longer left unclassified.
        return "Medium"
    else:
        # 1800 and below.
        return "Easy"

# The function returns text labels, so the UDF return type is StringType.
func_udf = udf(difficulty, StringType())
new_emp_final_1 = new_emp_final_1.withColumn(
    "difficulty",
    func_udf(new_emp_final_1.cookTime, new_emp_final_1.prepTime),
)
new_emp_final_1.show(20, False)
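Because the classification logic is an ordinary Python function, it can be checked on its own before it is wrapped in udf. The following is a minimal sketch, not the answerer's exact code: it assumes durations are strings of whole seconds, that a total of exactly 1800 counts as "Easy" and exactly 3600 as "Medium", and that non-numeric strings should map to "Unknown". The name classify_difficulty is hypothetical.

```python
def classify_difficulty(cook_time, prep_time):
    # Missing or empty values cannot be classified.
    if not cook_time or not prep_time:
        return "Unknown"
    try:
        total = int(cook_time) + int(prep_time)
    except ValueError:
        # Assumption: strings that are not whole numbers are treated as unknown.
        return "Unknown"
    if total > 3600:
        return "Hard"
    if total > 1800:
        return "Medium"
    return "Easy"


# Spot checks, including the boundary total of 3600 and a missing value.
samples = [("60", "0"), ("3601", "3000"), ("1800", "1800"), (None, "10"), ("abc", "1")]
results = [classify_difficulty(cook, prep) for cook, prep in samples]
```

Once the plain function behaves as expected, it could be registered the same way as above, with StringType() as the return type.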

User

#2 · Edited 2024-05-14 10:44:55

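Have you tried passing cookTime and prepTime as literal values, like this? Note that lit is a function in pyspark.sql.functions, not a DataFrame method, and that cookTime and prepTime here are Python values rather than columns:

from pyspark.sql.functions import lit

new_emp_final_1 = new_emp_final_1.withColumn("difficulty", func_udf(lit(cookTime), lit(prepTime)))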
