Reading a large JSON file (3K+) from S3 and selecting specific keys from an array

json_data_frame = spark.read.option("multiLine", True).option("mode", "PERMISSIVE").json("s3://" + args['destinationBucketName'] + "/" + args['s3SourcePath'])

final_data_frame_prep = json_data_frame.withColumn("name", json_data_frame["products"].getItem("name"))\
                                       .withColumn("ndc_product_code", json_data_frame["products"].getItem("ndc_product_code"))

final_data_frame = final_data_frame_prep.select("name","ndc_product_code")

final_data_frame.show(20,False)

+------------------+----------------------+
|name              |ndc_product_code      |
+------------------+----------------------+
|[Refludan]        |[50419-150]           |
|[Erbitux, Erbitux]|[66733-948, 66733-958]|
+------------------+----------------------+

from pyspark.sql.functions import explode

json_data_frame = spark.read.option("multiLine", True).option("mode", "PERMISSIVE").json("s3://" + args['destinationBucketName'] + "/" + args['s3SourcePath'])

final_data_frame_prep = json_data_frame.withColumn("name", explode(json_data_frame["products"].getItem("name")))\
                                       .withColumn("ndc_product_code", explode(json_data_frame["products"].getItem("ndc_product_code")))\
                                       .withColumn("dosage_form", explode(json_data_frame["products"].getItem("dosage_form")))\
                                       .withColumn("strength", explode(json_data_frame["products"].getItem("strength")))

final_data_frame = final_data_frame_prep.select("name","ndc_product_code","dosage_form","strength")

final_data_frame.show(20,False)

+----------+----------------+
|name      |ndc_product_code|
+----------+----------------+
|[Refludan]|[50419-150]     |
|[Erbitux] |[66733-948]     |
|[Erbitux] |[66733-958]     |
+----------+----------------+

1 answer

User

Reply #1 · posted 2024-05-17 16:16:37

I figured it out! This is the output I get now:

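+--------+----------------+-----------+---------+
|name    |ndc_product_code|dosage_form|strength |
+--------+----------------+-----------+---------+
|Refludan|50419-150       |Powder     |50 mg/1mL|
|Erbitux |66733-948       |Solution   |2 mg/1mL |
|Erbitux |66733-958       |Solution   |2 mg/1mL |
+--------+----------------+-----------+---------+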

The code is:

from pyspark.sql.functions import explode

# Read in the JSON files from S3
json_data_frame = spark.read.option("multiLine", True).option("mode", "PERMISSIVE").json("s3://" + args['destinationBucketName'] + "/" + args['s3SourcePath'])

# Explode the products array as a whole so that each product gets its own row
final_data_frame_prepprep = json_data_frame.withColumn("products_exp", explode(json_data_frame["products"]))

final_data_frame_prep = final_data_frame_prepprep.withColumn("name", final_data_frame_prepprep["products_exp"].getItem("name"))\
                                             .withColumn("ndc_product_code", final_data_frame_prepprep["products_exp"].getItem("ndc_product_code"))\
                                             .withColumn("dosage_form", final_data_frame_prepprep["products_exp"].getItem("dosage_form"))\
                                             .withColumn("strength", final_data_frame_prepprep["products_exp"].getItem("strength"))

final_data_frame = final_data_frame_prep.select("name","ndc_product_code","dosage_form","strength")

final_data_frame.show(20,False)

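The key is to explode the products array as a whole first, then get the items out of each exploded element, and then select what you want to keep. I hope this helps someone. Cheers!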
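To show why the order matters, here is a small plain-Python model (not Spark code) of the two approaches. The `record` dict and the helper names `explode_each_column` and `explode_once` are made up for illustration; the record shape is only assumed from the output shown above. Exploding each extracted array on its own pairs every value with every other value, while exploding the array once keeps the fields of each product together.

```python
from itertools import product

# Hypothetical record, assumed from the output above:
# one row whose "products" field is an array of structs.
record = {
    "products": [
        {"name": "Erbitux", "ndc_product_code": "66733-948"},
        {"name": "Erbitux", "ndc_product_code": "66733-958"},
    ]
}


def explode_each_column(rec, keys):
    """Model of exploding every extracted array separately (cross product)."""
    columns = [[p[k] for p in rec["products"]] for k in keys]
    return [tuple(row) for row in product(*columns)]


def explode_once(rec, keys):
    """Model of exploding the products array once, then reading its fields."""
    return [tuple(p[k] for k in keys) for p in rec["products"]]


keys = ["name", "ndc_product_code"]
per_column_rows = explode_each_column(record, keys)
whole_array_rows = explode_once(record, keys)

print(len(per_column_rows))  # every name paired with every code
print(whole_array_rows)      # one row per product
```

With two products and two keys, the per-column version yields four rows and the explode-once version yields two, which matches the one-row-per-product result in the answer.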
