Is it possible to subclass DataFrame in PySpark?

1 answer

User

#1 · Posted 2024-04-26 12:10:00

It depends on your goals.

Technically, it is possible. pyspark.sql.DataFrame is just a plain Python class. You can extend it or monkey-patch it if you need to.

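from pyspark.sql import DataFrame


class DataFrameWithZipWithIndex(DataFrame):
    def __init__(self, df):
        # Reuse the JVM DataFrame handle and SQL context of the wrapped DataFrame.
        # super() without arguments avoids the infinite recursion that
        # super(self.__class__, self) causes once this class is subclassed again.
        super().__init__(df._jdf, df.sql_ctx)

    def zipWithIndex(self):
        # Pair each row with its position, then move the index to the front as "_idx"
        return (self.rdd
            .zipWithIndex()
            .map(lambda row: (row[1], ) + row[0])
            .toDF(["_idx"] + self.columns))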

Usage example:

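# spark is an existing SparkSession
df = spark.createDataFrame([("a", 1)], ["foo", "bar"])
with_zipwithindex = DataFrameWithZipWithIndex(df)

isinstance(with_zipwithindex, DataFrame)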

True

with_zipwithindex.zipWithIndex().show()

+----+---+---+
|_idx|foo|bar|
+----+---+---+
|   0|  a|  1|
+----+---+---+

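In practice, there is not much you can do here. DataFrame is a thin wrapper around a JVM object. It does little beyond providing docstrings, converting arguments to the form required natively, calling JVM methods, and wrapping the results with Python adapters when necessary.
With plain Python code you cannot even get close to the DataFrame/Dataset internals or modify their core behavior. If you are looking for a standalone, Python-only Spark DataFrame implementation, that is not possible.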
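
To see how thin the wrapper is, one can compare a public method with the same call made straight on the underlying Java object. A minimal sketch: compare_counts is a hypothetical helper, and _jdf is the private attribute already used by the subclass in this answer, so it may change between Spark versions.

```python
def compare_counts(df):
    """Return (count via the Python method, count via the wrapped JVM object)."""
    python_side = df.count()    # public PySpark API
    jvm_side = df._jdf.count()  # same operation invoked directly on the Java object through py4j
    return python_side, jvm_side
```

Called on a real DataFrame, the two values should agree, since the Python method mostly forwards the call to the JVM.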
