Slicing all values of a column in a PySpark DataFrame

Posted 2024-04-26 07:27:28


I have a DataFrame and I want to slice every value in one of its columns, but I don't know how to do it.

My DataFrame:

+-------------+------+
|    studentID|gender|
+-------------+------+
|1901000200   |     M|
|1901000500   |     M|
|1901000500   |     M|
|1901000500   |     M|
|1901000500   |     M|
+-------------+------+

I have converted studentID to a string, but I can't remove the leading 190 from it. I want the output below:

+-------------+------+
|    studentID|gender|
+-------------+------+
|   1000200   |     M|
|   1000500   |     M|
|   1000500   |     M|
|   1000500   |     M|
|   1000500   |     M|
+-------------+------+

I tried the following, but it gives me an error:

students_data = students_data.withColumn('studentID',F.lit(students_data["studentID"][2:]))

TypeError: startPos and length must be the same type. Got <class 'int'> and <class 'NoneType'>, respectively.

1 Answer
User
#1 · Posted 2024-04-26 07:27:28
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Replicate the sample data from the question.
students_data = spark.createDataFrame(
[[1901000200,'M'],
[1901000500,'M'],
[1901000500,'M'],
[1901000500,'M'],
[1901000500,'M']],
["studentID", "gender"])

# Unlike Python list slicing, slicing a Column needs both bounds: col[a:b] is
# translated to col.substr(a, b), i.e. a 1-based start position and a length.
# If you aren't sure about the length, use an arbitrarily large number such as 10000.
students_data = students_data.withColumn(
  'studentID',
  students_data["studentID"].cast("string")[4:10000])

students_data.show()

Output:

+---------+------+
|studentID|gender|
+---------+------+
|  1000200|     M|
|  1000500|     M|
|  1000500|     M|
|  1000500|     M|
|  1000500|     M|
+---------+------+
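
As for why the attempt in the question fails: a slice on a PySpark Column is passed to `substr(startPos, length)`, so `[2:]` becomes `substr(2, None)`, and the two arguments have different types. The numbers in `[4:10000]` are therefore a 1-based start position and a length, not Python-style start and end indices. Below is a minimal plain-Python sketch of that behaviour. `spark_substr` is a hypothetical helper written only for illustration; it is not part of PySpark, and it only models a start position of 1 or more.

```python
def spark_substr(value: str, start_pos: int, length: int) -> str:
    """Plain-Python model of substr(startPos, length) for start_pos >= 1.

    Hypothetical helper for illustration only; not a PySpark function.
    """
    if start_pos < 1 or length < 0:
        raise ValueError("this sketch only models start_pos >= 1 and length >= 0")
    begin = start_pos - 1  # convert the 1-based position to a 0-based index
    return value[begin:begin + length]


student_id = "1901000200"

# Start at the 4th character and take up to 10000 characters.
print(spark_substr(student_id, 4, 10000))  # 1000200

# The second argument is a length, not an end position.
print(spark_substr(student_id, 2, 3))  # 901

# The equivalent plain Python slice uses a 0-based start index.
print(student_id[3:])  # 1000200
```

Because the start position is 1-based, a start of 4 drops exactly the first three characters ("190"), which matches the output requested in the question.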
