How do I select the last row of a PySpark DataFrame, and how do I access its rows by index?

User

Answer 1 · edited 2024-04-20 00:06:09

from pyspark.sql import functions as F

expr = [F.last(col).alias(col) for col in df.columns]

df.agg(*expr)

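Just a tip: it looks like you still have the mindset of someone who works with pandas or R. Spark is a different paradigm for working with data. You no longer access data inside individual cells; you work with whole chunks of it. If you keep collecting data and running actions the way you just did, you lose the parallelism that Spark provides. Take a look at the concept of transformations versus actions in Spark.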

User

Answer 2 · edited 2024-04-20 00:06:09

How to get the last row.

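If you have a column that you can use to order the DataFrame, for example "index", then one easy way to get the last record is with SQL: 1) order the table in descending order, and 2) take the first value from that ordering.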

df.createOrReplaceTempView("table_df")
query_latest_rec = """SELECT * FROM table_df ORDER BY index DESC limit 1"""
# Assumes an active SparkSession named `spark` (the original called self.sqlContext from inside a class)
latest_rec = spark.sql(query_latest_rec)
latest_rec.show()

And how can I access the DataFrame rows by index, like row no. 12 or 200?

In a similar way, you can get the record at any row:

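row_number = 12
df.createOrReplaceTempView("table_df")
# Keep the first `row_number` rows in ascending order, then take the last of them
query_latest_rec = """SELECT * FROM (select * from table_df ORDER BY index ASC limit {0}) ord_lim ORDER BY index DESC limit 1"""
# Assumes an active SparkSession named `spark` (the original called self.sqlContext from inside a class)
latest_rec = spark.sql(query_latest_rec.format(row_number))
latest_rec.show()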

If you don't have an "index" column, you can create one with:

from pyspark.sql.functions import monotonically_increasing_id

df = df.withColumn("index", monotonically_increasing_id())
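One caveat worth illustrating: the IDs produced by `monotonically_increasing_id` are unique and increasing, but not consecutive. The Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The pure-Python sketch below mimics that layout; `sketch_monotonic_id` is a hypothetical helper written for this illustration, not a Spark function.

```python
def sketch_monotonic_id(partition_index: int, row_in_partition: int) -> int:
    """Hypothetical helper mimicking the documented bit layout:
    partition ID in the upper bits, record number in the lower 33 bits."""
    return (partition_index << 33) + row_in_partition

# Two partitions with three rows each.
ids = [sketch_monotonic_id(p, r) for p in range(2) for r in range(3)]

# The jump between partitions is 2**33, so the values are not 0, 1, 2, 3, ...
gap = ids[3] - ids[2]
```

Because of these gaps, a filter such as `index = 12` may match nothing, whereas the `ORDER BY index ... limit` queries above rely only on the ordering and are unaffected.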

User

Answer 3 · edited 2024-04-20 00:06:09

How to get the last row.

A long and ugly way, which assumes that all columns are orderable:

from pyspark.sql.functions import (
    col, max as max_, struct, monotonically_increasing_id
)

# Use the aliased Spark aggregate max_, not Python's built-in max
last_row = (df
    .withColumn("_id", monotonically_increasing_id())
    .select(max_(struct("_id", *df.columns)).alias("tmp"))
    .select(col("tmp.*"))
    .drop("_id"))

If not all columns can be ordered, you can try:

with_id = df.withColumn("_id", monotonically_increasing_id())
i = with_id.select(max_("_id")).first()[0]

with_id.where(col("_id") == i).drop("_id")

Note: there is a last function in pyspark.sql.functions / o.a.s.sql.functions, but considering the description of the corresponding expressions, it is not a good choice here.

how can I access the dataframe rows by index.like

You can't. A Spark DataFrame is not accessible by index. You can add indices and then filter on them. Just keep the cost of this operation in mind.
