<p>You cannot add an arbitrary column to a <code>DataFrame</code> in Spark. New columns can be created only from literals (other literal types are described in <a href="https://stackoverflow.com/q/32788322">How to add a constant column in a Spark DataFrame?</a>):</p>
<pre><code>from pyspark.sql.functions import lit
df = sqlContext.createDataFrame(
[(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))
df_with_x4 = df.withColumn("x4", lit(0))
df_with_x4.show()
## +---+---+-----+---+
## | x1| x2|   x3| x4|
## +---+---+-----+---+
## |  1|  a| 23.0|  0|
## |  3|  B|-23.0|  0|
## +---+---+-----+---+
</code></pre>
<p>or by transforming an existing column:</p>
<pre><code>from pyspark.sql.functions import exp
df_with_x5 = df_with_x4.withColumn("x5", exp("x3"))
df_with_x5.show()
## +---+---+-----+---+--------------------+
## | x1| x2|   x3| x4|                  x5|
## +---+---+-----+---+--------------------+
## |  1|  a| 23.0|  0| 9.744803446248903E9|
## |  3|  B|-23.0|  0|1.026187963170189...|
## +---+---+-----+---+--------------------+
</code></pre>
<p>or by bringing one in with a <code>join</code>:</p>
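<pre><code>from pyspark.sql.functions import col  # col is used in the join condition below
lookup = sqlContext.createDataFrame([(1, "foo"), (2, "bar")], ("k", "v"))
# Left outer join keeps rows of df_with_x5 that have no match in lookup
df_with_x6 = (df_with_x5
    .join(lookup, col("x1") == col("k"), "leftouter")
    .drop("k")
    .withColumnRenamed("v", "x6"))
df_with_x6.show()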
## +---+---+-----+---+--------------------+----+
## | x1| x2|   x3| x4|                  x5|  x6|
## +---+---+-----+---+--------------------+----+
## |  1|  a| 23.0|  0| 9.744803446248903E9| foo|
## |  3|  B|-23.0|  0|1.026187963170189...|null|
## +---+---+-----+---+--------------------+----+
</code></pre>
<p>or by generating one with a function / udf:</p>
<pre><code>from pyspark.sql.functions import rand
df_with_x7 = df_with_x6.withColumn("x7", rand())
df_with_x7.show()
## +---+---+-----+---+--------------------+----+-------------------+
## | x1| x2|   x3| x4|                  x5|  x6|                 x7|
## +---+---+-----+---+--------------------+----+-------------------+
## |  1|  a| 23.0|  0| 9.744803446248903E9| foo|0.41930610446846617|
## |  3|  B|-23.0|  0|1.026187963170189...|null|0.37801881545497873|
## +---+---+-----+---+--------------------+----+-------------------+
</code></pre>
<p>Performance-wise, built-in functions (<code>pyspark.sql.functions</code>), which map to Catalyst expressions, are usually preferred over Python user-defined functions.</p>
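<p>To illustrate the difference, here is a minimal sketch that computes the same column both ways. It assumes Spark 2.0+ (a <code>SparkSession</code> instead of the <code>sqlContext</code> used above); the names <code>exp_udf</code>, <code>builtin</code> and <code>with_udf</code> are made up for this example.</p>
<pre><code>import math

from pyspark.sql import SparkSession
from pyspark.sql.functions import exp, udf
from pyspark.sql.types import DoubleType

# Assumption: a local session, only for the sake of a self-contained example.
spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))

# Built-in function: becomes a Catalyst expression evaluated inside the JVM.
builtin = df.withColumn("x5", exp("x3"))

# Python UDF: values are serialized to a Python worker and back,
# and the optimizer treats the function as a black box.
exp_udf = udf(lambda v: math.exp(v) if v is not None else None, DoubleType())
with_udf = df.withColumn("x5", exp_udf("x3"))
</code></pre>
<p>Both data frames should hold the same values up to floating point rounding; the UDF version is shown only for comparison and is the one to avoid when a built-in exists.</p>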
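<p>If you want to add the content of an arbitrary RDD as a column, you can:</p>
<ul>
<li>add <a href="https://stackoverflow.com/a/32761138/1560062">row numbers to the existing data frame</a></li>
<li>call <code>zipWithIndex</code> on the RDD and convert it to a data frame</li>
<li>join both using the index as the join key</li>
</ul>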
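<p>The three steps above could be sketched roughly as follows. This assumes Spark 2.0+ with a <code>SparkSession</code>, and that both the data frame and the RDD have a meaningful, matching order; <code>extra</code>, <code>idx</code> and <code>x8</code> are hypothetical names chosen for the example, and the row numbers are added here with <code>zipWithIndex</code> as one possible option.</p>
<pre><code>from pyspark.sql import SparkSession

# Assumption: a local session, only for the sake of a self-contained example.
spark = SparkSession.builder.master("local[1]").getOrCreate()
sc = spark.sparkContext

df = spark.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))
extra = sc.parallelize(["foo", "bar"])  # hypothetical RDD to attach as "x8"

# 1. Add row numbers to the existing data frame.
df_indexed = (df.rdd
    .zipWithIndex()
    .map(lambda pair: tuple(pair[0]) + (pair[1],))
    .toDF(df.columns + ["idx"]))

# 2. Index the RDD the same way and convert it to a data frame.
extra_indexed = (extra
    .zipWithIndex()
    .map(lambda pair: (pair[1], pair[0]))
    .toDF(["idx", "x8"]))

# 3. Join both on the index, then drop the helper column.
df_with_x8 = df_indexed.join(extra_indexed, "idx", "inner").drop("idx")
</code></pre>
<p>The join involves a shuffle, so this is noticeably more expensive than the <code>withColumn</code> variants above and is worth using only when the data really lives in a separate RDD.</p>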