在pyspark数据帧中执行动态条件

2024-04-26 14:47:08 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个数据帧,target_df有20列。在这20列中,有4列是必需的,也就是说,这4列应该有值,并且不应该为null,这些列中没有空格。 我想为这4列筛选包含null和空格的行

This can be done with below condition.
filtered_df=target_df.filter((trim(target_df['Col 1'])==' ') | (target_df['Col 1'].isNull()) |
(trim(target_df['Col 2'])==' ') | (target_df['Col 2'].isNull()) |
(trim(target_df['Col 3'])==' ') | (target_df['Col 3'].isNull()) |
(trim(target_df['Col 4'])==' ') | (target_df['Col 4'].isNull()))

I want to make this dynamic and based on list of columns, i want to generate the condition.

mandatory_col=['col 1', 'col 2', 'col 3', 'col 4']
ln=[]
for ele in mandatory_col:
    str1="(trim(target_df['{}'])==' ') | (target_df['{}'].isNull())".format(ele, ele)
    ln.append(str1)
  
condition=' | '.join(ln)

print(condition):

(trim(target_df['Col 1'])==' ') | (target_df['Col 1'].isNull()) |
(trim(target_df['Col 2'])==' ') | (target_df['Col 2'].isNull()) |
(trim(target_df['Col 3'])==' ') | (target_df['Col 3'].isNull()) |
(trim(target_df['Col 4'])==' ') | (target_df['Col 4'].isNull())

filtered_df=target_df.filter(condition)  
when I try to execute above condition,  it throws error

ParseException: 
mismatched input ')' expecting {'COLLECT', 'CONVERT', 'DELTA', 'HISTORY', 'MATCHED', 'MERGE', 'OPTIMIZE', 'SAMPLE', 'TIMESTAMP', 'UPDATE', 'VERSION',....., IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 77)

The reason being condition is string and df filter takes pyspark.sql.column.Column.

please suggest how can I achieve executing string expression.


Tags: totargetdfcolfilterconditionnullcan
1条回答
网友
1楼 · 发布于 2024-04-26 14:47:08

请使用expr使用此字符串操作。 我希望您的列名中没有空格。 请注意,您提出的问题是基于用例的。 阅读关于stackoverflow的提问指南。

from pyspark.sql.functions import expr

mandatory_col = ['col1', 'col2', 'col3', 'col4']
str1 = ''
for ele in mandatory_col:
    str1 = str1 + '''trim('{}')'''.format(ele) + ''' is not null or ''' + '''trim('{}')'''.format(ele) + ''' = ' '  or '''

print(str1[:-5])

trim('col1') is not null or trim('col1') = ' '  or trim('col2') is not null or trim('col2') = ' '  or trim('col3') is not null or trim('col3') = ' '  or trim('col4') is not null or trim('col4') = ' '

filtered_df = target_df.filter(expr(str1[:-5]))

相关问题 更多 >