AWS Glue, Spark 2.4, Python 3, Glue version 2.0
After calling Column methods on a DataFrame many times, I get a StackOverflowException. For example:
import pyspark.sql.functions as F

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="database_name",
    table_name="table_name",
    transformation_ctx="datasource0",
)
df = datasource0.toDF()
df = df.withColumn('item_name', F.regexp_replace(F.col('item_name'), '^foo$', 'bar'))
df = df.withColumn('item_name', F.regexp_replace(F.col('item_name'), '^foo$', 'bar'))
df = df.withColumn('item_name', F.regexp_replace(F.col('item_name'), '^foo$', 'bar'))
df = df.withColumn('item_name', F.regexp_replace(F.col('item_name'), '^foo$', 'bar'))
... # and so on, called hundreds of times
The documentation for withColumn says:

"This method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select() with the multiple columns at once."
So I understand that I need to use select() with multiple columns at once, but I don't know how to write that code.
According to this: yes, withColumn can lead to memory-related problems, and they can be avoided by using select instead, as shown below:
If you want to apply the same transformation to multiple columns, it can be written like this: