用Sp将列转换为行

3条回答

网友

1楼 · 编辑于 2024-04-18 04:08:06

Spark局部线性代数库目前非常薄弱：而且它们不包括上述基本运算。

对于Spark 2.1，有一个JIRA可以解决这个问题，但这对你今天的工作没有帮助。

需要考虑的是：执行转置很可能需要完全洗牌数据。

现在您需要直接编写RDD代码。我用scala编写了transpose，但不是用python。这是scala版本：

 def transpose(mat: DMatrix) = {
    val nCols = mat(0).length
    val matT = mat
      .flatten
      .zipWithIndex
      .groupBy {
      _._2 % nCols
    }
      .toSeq.sortBy {
      _._1
    }
      .map(_._2)
      .map(_.map(_._1))
      .toArray
    matT
  }

所以你可以把它转换成python供你使用。在这个特殊的时刻，我没有足够的带宽来编写/测试它：如果您无法进行转换，请告诉我。

至少-以下内容很容易转换为python。

zipWithIndex-->；enumerate()（python等价物-记入@zero323）
map-->；[someOperation(x) for x in ..]
groupBy-->；itertools.groupBy()

以下是flatten的实现，它没有等效的python：

  def flatten(L):
        for item in L:
            try:
                for i in flatten(item):
                    yield i
            except TypeError:
                yield item

所以你应该可以把这些放在一起解决问题。

网友

2楼 · 编辑于 2024-04-18 04:08:06

使用基本的Spark SQL函数相对简单。

Python

from pyspark.sql.functions import array, col, explode, struct, lit

df = sc.parallelize([(1, 0.0, 0.6), (1, 0.6, 0.7)]).toDF(["A", "col_1", "col_2"])

def to_long(df, by):

    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"

    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
      struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")

    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])

to_long(df, ["A"])

斯卡拉：

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

val df = Seq((1, 0.0, 0.6), (1, 0.6, 0.7)).toDF("A", "col_1", "col_2")

def toLong(df: DataFrame, by: Seq[String]): DataFrame = {
  val (cols, types) = df.dtypes.filter{ case (c, _) => !by.contains(c)}.unzip
  require(types.distinct.size == 1, s"${types.distinct.toString}.length != 1")      

  val kvs = explode(array(
    cols.map(c => struct(lit(c).alias("key"), col(c).alias("val"))): _*
  ))

  val byExprs = by.map(col(_))

  df
    .select(byExprs :+ kvs.alias("_kvs"): _*)
    .select(byExprs ++ Seq($"_kvs.key", $"_kvs.val"): _*)
}

toLong(df, Seq("A"))

网友

3楼 · 编辑于 2024-04-18 04:08:06

使用平面图。下面这样的东西应该有用

from pyspark.sql import Row

def rowExpander(row):
    rowDict = row.asDict()
    valA = rowDict.pop('A')
    for k in rowDict:
        yield Row(**{'A': valA , 'colID': k, 'colValue': row[k]})

newDf = sqlContext.createDataFrame(df.rdd.flatMap(rowExpander))

相关问题更多 >

编程相关推荐

热门问题

热门文章

用Sp将列转换为行

相关问题 更多 >

编程相关推荐

热门问题

热门文章

相关问题更多 >