PySpark: set a column's value when a value from a list appears in another column

Posted 2024-05-16 14:25:27


I have a DataFrame like this:

+-------+----------------+
|Name   |Source          |
+-------+----------------+
|Tom    |clientA-incoming|
|Dick   |clientB-incoming|
|Harry  |c-abc-incoming  |
+-------+----------------+

I want to add a slug column so that the DataFrame ends up like this:

+-------+----------------+--------+
|Name   |Source          |slug    |
+-------+----------------+--------+
|Tom    |clientA-incoming|clientA |
|Dick   |clientB-incoming|clientB |
|Harry  |c-abc-incoming  |c-abc   |
+-------+----------------+--------+

I have a list of values that contains the slugs:

slugs = ['clientA', 'clientB', 'c-abc']

Basically, I'm thinking of something like this pseudocode:

for i in slugs:
    if i in df['Source']:
        df['Slug'] = i

Can anyone help me get this over the finish line?

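Edit:

I want to fill the slug column with values from the slugs list. Which value goes into the slug column for a given row is determined by that row's Source column.

For example, since slugs[0] = 'clientA' and clientA is a substring of clientA-incoming, I want to set the slug column for that row to clientA.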


1 answer
User
#1 · Posted 2024-05-16 14:25:27

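Depending on your requirements, you can solve this with a left join or an inner join. A left join keeps rows whose Source matches no slug (their slug is null), while an inner join drops them: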

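from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Reuse the active session (e.g. in the pyspark shell) or create one.
spark = SparkSession.builder.getOrCreate()

# One-column DataFrame of slugs; the default column name "value" is renamed to "slug".
slugs = ['clientA', 'clientB', 'c-abc', 'f-gd']
sdf = spark.createDataFrame(slugs, "string").withColumnRenamed("value", "slug")

df = spark.createDataFrame([
  ["Tom", "clientA-incoming"],
  ["Dick", "clientB-incoming"],
  ["Harry", "c-abc-incoming"],
  ["Harry", "c-dgl-incoming"]
], ["Name", "Source"])

# Join each row to the slug that occurs as a substring of its Source.
df.join(broadcast(sdf), df["Source"].contains(sdf["slug"]), "left").show()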

# +-----+----------------+-------+
# | Name|          Source|   slug|
# +-----+----------------+-------+
# |  Tom|clientA-incoming|clientA|
# | Dick|clientB-incoming|clientB|
# |Harry|  c-abc-incoming|  c-abc|
# |Harry|  c-dgl-incoming|   null|
# +-----+----------------+-------+

Note that we broadcast the smaller DataFrame to avoid a shuffle.
