UDF runs twice in PySpark


I have a simple Spark DataFrame with two columns, both strings: one called <code>id</code> and the other called <code>name</code>. I also have a Python function called <code>string_replacement</code> that does some string manipulation. I have defined a wrapper UDF that calls <code>string_replacement</code> and is applied to every row of the DataFrame. Only the <code>name</code> column is passed to the string manipulation function. Here is the code:

<pre><code># Import libraries
from pyspark.sql import *
import pyspark.sql.functions as f
from pyspark.sql.types import *

# Create Example Dataframe
# (assumes an active SparkSession named `spark`, as in the PySpark shell)
row1 = Row(id='123456', name='Computer Science')
df = spark.createDataFrame([row1])

# Print the dataframe
df.show()

# Define function that does some string operations
def string_replacement(input_string):
    string=input_string
    string=string.replace('Computer', 'Computer x')
    string=string.replace('Science', 'Science x')
    return string

# Define wrapper function to turn into UDF
def wrapper_func(row):
    temp=row[1]
    temp=string_replacement(temp)
    row[1]=temp
    return row

# Create the schema for the resulting data frame
output_schema = StructType([StructField('id', StringType(), True),
                            StructField('name', StringType(), True)])

# UDF to apply the wrapper function to the dataframe
new_udf=f.udf(lambda z: wrapper_func(z), output_schema)

cols=df.columns
new_df=df.select(new_udf(f.array(cols)).alias('results')).select(f.col('results.*'))
new_df.show(truncate = False)
</code></pre>

The function turns the word <code>Computer</code> into <code>Computer x</code>, and does the same for the word <code>Science</code>. The original DataFrame looks like this:

<pre><code>+------+----------------+
|    id|            name|
+------+----------------+
|123456|Computer Science|
+------+----------------+
</code></pre>

After applying the function, it looks like this:

<pre><code>+------+------------------------+
|id    |name                    |
+------+------------------------+
|123456|Computer x x Science x x|
+------+------------------------+
</code></pre>

As the <code>x x</code> shows, the function has been run twice, the second time on the output of the first run. How can I avoid this behaviour?

Interestingly, if I don't explode the resulting DataFrame, it looks fine:

<pre><code>new_df=df.select(new_udf(f.array(cols)).alias('results'))
</code></pre>

gives

<pre><code>+-----------------------------+
|results                      |
+-----------------------------+
|[123456,Computer x Science x]|
+-----------------------------+
</code></pre>
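One plausible explanation, which the post itself does not confirm: when <code>results.*</code> is expanded, Spark may evaluate the UDF once per extracted field, and if both evaluations receive the same Python list, the in-place assignment <code>row[1]=temp</code> makes the second call start from the first call's output. The pure-Python sketch below only simulates that situation, with no Spark involved. <code>wrapper_func_pure</code> is a hypothetical name introduced here for a wrapper that returns a new tuple instead of mutating its input.

```python
def string_replacement(input_string):
    # Same replacements as in the question.
    string = input_string.replace('Computer', 'Computer x')
    return string.replace('Science', 'Science x')


def wrapper_func(row):
    # Wrapper from the question: mutates its argument in place.
    row[1] = string_replacement(row[1])
    return row


def wrapper_func_pure(row):
    # Hypothetical alternative: builds a new tuple and leaves `row` untouched.
    return (row[0], string_replacement(row[1]))


# Simulate two evaluations that share one input object (an assumption
# about what happens inside the Python worker, not a documented guarantee).
shared = ['123456', 'Computer Science']
wrapper_func(shared)
mutated_twice = wrapper_func(shared)

shared = ['123456', 'Computer Science']
wrapper_func_pure(shared)
pure_twice = wrapper_func_pure(shared)

print(mutated_twice)  # ['123456', 'Computer x x Science x x']
print(pure_twice)     # ('123456', 'Computer x Science x')
```

If that assumption holds, passing <code>wrapper_func_pure</code> to <code>f.udf</code> with the same <code>output_schema</code> should give <code>Computer x Science x</code> even when the UDF is evaluated more than once. Marking the UDF with <code>asNondeterministic()</code> (available since PySpark 2.3) is another option that may discourage Spark from duplicating the call.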
