PySpark DataFrame: counting the unique values in one column that co-occur independently with values in other columns

+---------+------+------+
|Regulator|Target|Source|
+---------+------+------+
|        m|     A|     x|
|        m|     B|     x|
|        m|     C|     z|
|        n|     A|     y|
|        n|     C|     x|
|        n|     C|     z|
+---------+------+------+

+---------+------+------+----------+
|Regulator|Target|Source|No.sources|
+---------+------+------+----------+
|        m|     A|     x|         1|
|        m|     B|     x|         1|
|        m|     C|     z|         2|
|        n|     A|     y|         2|
|        n|     C|     x|         2|
|        n|     C|     z|         2|
+---------+------+------+----------+

1 answer

User

#1 · Posted 2024-05-14 14:11:55

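Here is one way to solve this. Create two new columns for each row:

Column 'RS': the set of sources for that row's 'Regulator'
Column 'TS': the set of sources for that row's 'Target'

The output you want is then the size of the intersection of these two sets.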
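
To make the idea concrete before bringing in Spark, here is a minimal plain-Python sketch of the same logic on the sample data. It is only an illustration: the helper name count_shared_sources and the variable names are made up for this example and are not part of the answer's PySpark code.

```python
from collections import defaultdict

# Sample rows as (Regulator, Target, Source), copied from the question.
rows = [
    ("m", "A", "x"),
    ("m", "B", "x"),
    ("m", "C", "z"),
    ("n", "A", "y"),
    ("n", "C", "x"),
    ("n", "C", "z"),
]


def count_shared_sources(rows):
    """For each row, count the sources shared by its Regulator and its Target."""
    regulator_sources = defaultdict(set)  # plays the role of column 'RS'
    target_sources = defaultdict(set)     # plays the role of column 'TS'
    for regulator, target, source in rows:
        regulator_sources[regulator].add(source)
        target_sources[target].add(source)
    # The count for a row is the size of the intersection of its two sets.
    return [
        (r, t, s, len(regulator_sources[r] & target_sources[t]))
        for r, t, s in rows
    ]


result = count_shared_sources(rows)
for row in result:
    print(row)
```

The last element of each printed tuple should match the 'No.sources' column of the desired output.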

Consider the following example:

Create the DataFrame

from pyspark.sql import Window
import pyspark.sql.functions as f
cols = ["Regulator", "Target", "Source"]
data = [
    ('m', 'A', 'x'),
    ('m', 'B', 'x'),
    ('m', 'C', 'z'),
    ('n', 'A', 'y'),
    ('n', 'C', 'x'),
    ('n', 'C', 'z')
]

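# sqlCtx is assumed to be an existing SQLContext (e.g. from the PySpark shell);
# with a SparkSession named spark, spark.createDataFrame(data, cols) works the same way
df = sqlCtx.createDataFrame(data, cols)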

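Create the new columns

Use pyspark.sql.functions.collect_set() and pyspark.sql.Window to collect the distinct values of the 'Source' column: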

df = df.withColumn(
    'RS',
    f.collect_set(f.col('Source')).over(Window.partitionBy('Regulator'))
)

df = df.withColumn(
    'TS',
    f.collect_set(f.col('Source')).over(Window.partitionBy('Target'))
)
df.sort('Regulator', 'Target', 'Source').show()
#+---------+------+------+------+---------+
#|Regulator|Target|Source|    TS|       RS|
#+---------+------+------+------+---------+
#|        m|     A|     x|[y, x]|   [z, x]|
#|        m|     B|     x|   [x]|   [z, x]|
#|        m|     C|     z|[z, x]|   [z, x]|
#|        n|     A|     y|[y, x]|[y, z, x]|
#|        n|     C|     x|[z, x]|[y, z, x]|
#|        n|     C|     z|[z, x]|[y, z, x]|
#+---------+------+------+------+---------+

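Compute the size of the intersection

Define a udf that returns the size of the intersection of two sets, and use it to compute the 'No_sources' column. (Note that I used _ instead of . in the column name, because that makes it easier to use with select().)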

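from pyspark.sql.types import IntegerType

# Wrap a plain Python set intersection in a udf that returns an integer
intersection_length_udf = f.udf(lambda u, v: len(set(u) & set(v)), IntegerType())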

df = df.withColumn('No_sources', intersection_length_udf(f.col('TS'), f.col('RS')))

df.select('Regulator', 'Target', 'Source', 'No_sources')\
    .sort('Regulator', 'Target', 'Source')\
    .show()
#+---------+------+------+----------+
#|Regulator|Target|Source|No_sources|
#+---------+------+------+----------+
#|        m|     A|     x|         1|
#|        m|     B|     x|         1|
#|        m|     C|     z|         2|
#|        n|     A|     y|         2|
#|        n|     C|     x|         2|
#|        n|     C|     z|         2|
#+---------+------+------+----------+
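
As a possible variation that is not part of the answer above: on Spark 2.4 or later, the built-in functions array_intersect and size should make the Python udf unnecessary, which usually avoids the cost of shipping rows to Python. The sketch below assumes the same three column names; the helper name add_no_sources is hypothetical.

```python
def add_no_sources(df):
    """Return df with a 'No_sources' column, computed without a Python udf.

    Assumptions: Spark 2.4+ (needed for array_intersect) and a DataFrame
    with the columns 'Regulator', 'Target' and 'Source'.
    """
    # Imported inside the function so the helper can be defined
    # even in an environment where PySpark is not installed.
    from pyspark.sql import Window
    import pyspark.sql.functions as f

    # Same two set columns as in the answer: sources per Regulator and per Target.
    with_sets = (
        df.withColumn('RS', f.collect_set('Source').over(Window.partitionBy('Regulator')))
          .withColumn('TS', f.collect_set('Source').over(Window.partitionBy('Target')))
    )
    # size(array_intersect(...)) replaces len(set(u) & set(v)) from the udf.
    return (
        with_sets
        .withColumn('No_sources', f.size(f.array_intersect('RS', 'TS')))
        .drop('RS', 'TS')
    )
```

With the DataFrame from the example, add_no_sources(df).sort('Regulator', 'Target', 'Source').show() would be the equivalent call.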
