Apache Spark: spark.read not working as expected

Posted 2024-05-14 17:07:35


I am learning Apache Spark through an IBM course, using the HMP dataset. I followed the instructions in the tutorial, but the code does not work as expected. Here is my code:

!git clone https://github.com/wchill/HMP_Dataset

from pyspark.sql.types import StructType, StructField, IntegerType

schema = StructType([
    StructField("x",IntegerType(), True),
    StructField("y",IntegerType(), True),
    StructField("z",IntegerType(), True)
])

import os
file_list = os.listdir("HMP_Dataset")
file_list_filtered = [file for file in file_list if "_" in file]
from pyspark.sql.functions import lit

df = None  # accumulator for the union of all per-file DataFrames
for cat in file_list_filtered:
    data_files = os.listdir("HMP_Dataset/" + cat)

    for data_file in data_files:
        print(data_file)

        temp_df = spark.read.option("header","false").option( "delimeter" , " ").csv("HMP_Dataset/" + cat + "/" + data_file, schema=schema)

        temp_df = temp_df.withColumn("class",lit(cat))
        temp_df = temp_df.withColumn("source",lit(data_file))

        if df is None:
            df = temp_df
        else:
            df = df.union(temp_df)

When I call df.show(), the x, y, and z columns are all null. Here is the output:

+----+----+----+-----------+--------------------+
|   x|   y|   z|      class|              source|
+----+----+----+-----------+--------------------+
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
|null|null|null|Brush_teeth|Accelerometer-201...|
+----+----+----+-----------+--------------------+
only showing top 20 rows

The x, y, and z columns should contain numbers. What am I doing wrong? I used the exact code shown in the tutorial video, and I am running it on IBM Watson Studio. Link to the tutorial: https://www.coursera.org/learn/advanced-machine-learning-signal-processing/lecture/8cfiW/introduction-to-sparkml


1 Answer

It looks like you have a typo in the option name: you wrote "delimeter", but the correct option is "delimiter".

temp_df = spark.read.option("header","false").option( "delimeter" , " ").csv("HMP_Dataset/" + cat + "/" + data_file, schema=schema)

Correct:

temp_df = spark.read.option("header","false").option( "delimiter" , " ").csv("HMP_Dataset/" + cat + "/" + data_file, schema=schema)

Alternatively, you can pass the separator via the "sep" option. For more details, see spark-csv on GitHub or the Spark documentation: https://github.com/databricks/spark-csv
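The failure mode can be illustrated outside Spark with Python's standard-library csv module (a minimal sketch using a made-up sample line, not the actual HMP data): parsing a space-separated line with the wrong delimiter yields a single field that cannot be cast to an integer, which is exactly what Spark renders as null.

```python
import csv
import io

line = "22 49 35\n"  # hypothetical space-separated accelerometer sample


def to_int(s):
    """Mimic Spark's lenient cast: return None instead of raising."""
    try:
        return int(s)
    except ValueError:
        return None


# Wrong delimiter: the default comma leaves the whole line as one field,
# and casting "22 49 35" to an integer fails -> None (Spark's null).
row = next(csv.reader(io.StringIO(line)))
print([to_int(f) for f in row])  # [None]

# Correct delimiter: three separate fields, each a valid integer.
row = next(csv.reader(io.StringIO(line), delimiter=" "))
print([to_int(f) for f in row])  # [22, 49, 35]
```

The same logic applies in Spark: because the misspelled option "delimeter" is silently ignored, the reader falls back to the default comma and every row collapses into one uncastable string.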
