I have a GraphFrame with vertices and edges, as shown below. I am running this in PySpark in a Jupyter notebook.
vertices = sqlContext.createDataFrame([
("12345", "Alice", "Employee"),
("15789", "Bob", "Employee"),
("13467", "Charlie", "Manager"),
("14890", "David", "Director"),
("17737", "Fanny", "CEO")], ["id", "name", "title"])
edges = sqlContext.createDataFrame([
("12345", "13467", "works"),
("15789", "13467", "works"),
("13467", "14890", "works"),
("14890", "17737", "works"),
], ["src", "dst", "relationship"])
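I need to find the hierarchical path from each emp_id up to the top level (the CEO in this example). I am trying the bfs method, and so far I have only managed to get the path for a single emp_id. Here is my code: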
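from graphframes import GraphFrame

g = GraphFrame(vertices, edges)
# Breadth-first search from one employee up to the CEO, following only "works" edges.
result = g.bfs(fromExpr="id == '12345'", toExpr="title == 'CEO'", edgeFilter="relationship == 'works'", maxPathLength=5)
result.show(5, False)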
Output:
+----------------------+-------------------+-----------------------+-------------------+----------------------+-------------------+-----------------+
|from |e0 |v1 |e1 |v2 |e2 |to |
+----------------------+-------------------+-----------------------+-------------------+----------------------+-------------------+-----------------+
|[12345,Alice,Employee]|[12345,13467,works]|[13467,Charlie,Manager]|[13467,14890,works]|[14890,David,Director]|[14890,17737,works]|[17737,Fanny,CEO]|
+----------------------+-------------------+-----------------------+-------------------+----------------------+-------------------+-----------------+
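I can store this result in a variable and extract it with collect(). I would like to loop over the ids of all vertices that have a path to the CEO and write the results to a DataFrame. If anyone is familiar with GraphFrames, could you help me? I have looked for other solutions, but none of them worked in my case.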
Expected output:
+-------+--------------------------+
|user_id|path |
+-------+--------------------------+
|12345 |12345->13467->14890->17737|
|15789 |15789->13467->14890->17737|
|13467 |13467->14890->17737 |
|14890 |14890->17737 |
|17737 |17737 |
+-------+--------------------------+
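Adapt this answer to your problem, then tidy up its result to get the desired output. Note that you need to swap 'src' and 'dst' in the edges DataFrame for that answer to work as written, but I think that when adapting the answer you can use the edges DataFrame in its original form.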
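As a rough illustration of the "tidy up" step, here is a minimal plain-Python sketch, not the referenced answer's code and not a GraphFrames API. It assumes the edge rows are small enough to bring to the driver (for example via collect()), that each vertex has at most one outgoing 'works' edge, and that every chain ends at the CEO. The name paths_to_top is hypothetical.

```python
def paths_to_top(edge_rows, vertex_ids, sep="->"):
    """Return (user_id, path) tuples by following src -> dst links upward.

    Hypothetical helper for illustration only.
    Assumes each vertex has at most one outgoing edge (one manager).
    """
    # Map every employee id to the id of the person they report to.
    manager_of = {src: dst for src, dst, _relationship in edge_rows}

    rows = []
    for vid in vertex_ids:
        path = [vid]
        seen = {vid}
        current = vid
        # Walk up the hierarchy until a vertex has no outgoing edge.
        while current in manager_of:
            current = manager_of[current]
            if current in seen:
                # Stop if the data contains a cycle, so the loop always ends.
                break
            seen.add(current)
            path.append(current)
        rows.append((vid, sep.join(path)))
    return rows


# Same data as the edges and vertices in the question, as plain tuples.
edge_rows = [
    ("12345", "13467", "works"),
    ("15789", "13467", "works"),
    ("13467", "14890", "works"),
    ("14890", "17737", "works"),
]
vertex_ids = ["12345", "15789", "13467", "14890", "17737"]

rows = paths_to_top(edge_rows, vertex_ids)
for user_id, path in rows:
    print(user_id, path)
```

The resulting list of tuples could then be passed to sqlContext.createDataFrame with the column names ["user_id", "path"]. For hierarchies too large for the driver, a distributed approach such as repeated self-joins on the edges DataFrame would be needed instead.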