PySpark DataFrame: escaping &amp;

rdd = sc.textFile('/mnt/input/AMP test.csv')
rdd = rdd.map(lambda x: x.replace('&amp;', '&'))
rdd.coalesce(1).saveAsTextFile("/mnt/input/AMP test escaped.csv")
df = spark.read.option("delimiter", ";").option("header", "true").csv('/mnt/input/AMP test escaped.csv')
df.show()
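For context, here is a plain-Python sketch of why the raw file is awkward to parse: the entity `&amp;` itself ends in `;`, which is also the field delimiter used in the code above. The sample line is taken from the output shown in the answers below; no Spark is needed for this illustration.

```python
# A data line as it would appear in the raw file
# (sample row taken from the output shown in the answers).
raw_line = "2;Ross &amp; Monica;Geller"

# Splitting on the delimiter first cuts the entity in half:
# the row ends up with four fields instead of three.
naive_fields = raw_line.split(";")

# Replacing the entity before splitting yields the intended three fields.
fixed_fields = raw_line.replace("&amp;", "&").split(";")

print(naive_fields)
print(fixed_fields)
```

This is why the replacement has to happen on the raw text, before any CSV parsing takes place.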

2 Answers

User

Answer 1 · edited 2024-05-15 00:31:19

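I don't think there is a way to escape the multi-character sequence &amp; using spark.read.csv alone. The solution is a workaround like the one you already wrote:

rdd.map: this call already replaces &amp; with & in the values of every column.
There is no need to save the RDD to a temporary path; just pass it directly as the argument to csv: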

rdd = sc.textFile("your_path").map(lambda x: x.replace("&amp;", "&"))

df = spark.read.csv(rdd, header=True, sep=";")
df.show()

+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+
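The same approach can be wrapped in a small helper. This is only a sketch: `unescape_amp` and `read_csv_unescaped` are hypothetical names rather than part of Spark, and `spark` is assumed to be an existing SparkSession.

```python
def unescape_amp(line: str) -> str:
    """Replace the HTML entity &amp; with a literal & in one line of text."""
    return line.replace("&amp;", "&")


def read_csv_unescaped(spark, path, sep=";"):
    """Read a CSV file whose values contain &amp; (hypothetical helper).

    `spark` is an existing SparkSession; `path` points at the raw file.
    """
    # Clean every raw line before the CSV parser sees the delimiter.
    lines = spark.sparkContext.textFile(path).map(unescape_amp)
    # DataFrameReader.csv accepts an RDD of strings as well as a path.
    return spark.read.csv(lines, header=True, sep=sep)
```

It would be called as `df = read_csv_unescaped(spark, "/mnt/input/AMP test.csv")`. If the file might contain other HTML entities too, Python's standard `html.unescape` could be used in place of the `str.replace` call, though that goes beyond what the question asks for.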

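User

Answer 2 · edited 2024-05-15 00:31:19

You can do this directly with DataFrames. It helps if you know of at least one file that contains no &amp;, so that the schema can be inferred from it.

Assume such a file exists and its path is "valid.csv":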

from pyspark.sql import functions as F

# I acquire a valid file without the &amp; wrong data to get a nice schema
schm = spark.read.csv("valid.csv", header=True, inferSchema=True, sep=";").schema


df = spark.read.text("/mnt/input/AMP test.csv")

# I assume you have several files, so I remove all the headers.
# I do not need them as I already have my schema in schm.
header = df.first().value
df = df.where(F.col("value") != header)


# I replace "&amp;" with "&", and split the column
df = df.withColumn(
    "value", F.regexp_replace(F.col("value"), "&amp;", "&")
).withColumn(
    "value", F.split("value", ";")
)

# I explode the array in several columns and add types based on schm defined previously
df = df.select(
    *(
        F.col("value").getItem(i).cast(col.dataType).alias(col.name)
        for i, col in enumerate(schm)
    )
)

The result is:

df.show()
+---+-------------+--------+
| ID|    FirstName|LastName|
+---+-------------+--------+
|  1|     Chandler|    Bing|
|  2|Ross & Monica|  Geller|
+---+-------------+--------+

df.printSchema()
root
 |-- ID: integer (nullable = true)
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
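To make the per-row logic of this answer easier to follow, here is a plain-Python trace of the same steps (drop the header, replace, split, cast) without Spark. The sample lines are reconstructed from the output above, and the `schema` list is a simplified stand-in for `schm`, not a Spark object.

```python
# Raw lines as spark.read.text would return them, one string per row
# (sample data reconstructed from the output shown above).
lines = [
    "ID;FirstName;LastName",
    "1;Chandler;Bing",
    "2;Ross &amp; Monica;Geller",
]

# Stand-in for schm: (column name, cast function) pairs.
schema = [("ID", int), ("FirstName", str), ("LastName", str)]

header = lines[0]
rows = []
for line in lines:
    if line == header:
        # Drop every header line; the schema is already known.
        continue
    # Replace the entity first, then split on the delimiter.
    parts = line.replace("&amp;", "&").split(";")
    # Pick each item by position, cast it, and name it after its column.
    rows.append({name: cast(parts[i]) for i, (name, cast) in enumerate(schema)})

print(rows)
```

The order matters: doing the replacement before the split is what keeps "Ross & Monica" in a single column.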
