Building nested JSON in a PySpark DataFrame

Published 2024-04-28 07:42:32


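I am having trouble structuring the data below and would appreciate help from people who know this topic.

I need to build a JSON structure in a PySpark DataFrame. I do not have the full schema, but the nested structure shown below does not change: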

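import http.client
import json

conn = http.client.HTTPSConnection("xxx")

payload = ""

conn.request("GET", "xxx", payload)

res = conn.getresponse()
data = res.read().decode("utf-8")

# Parse the response body into Python objects
json_obj = json.loads(data)

# Note: despite its name, df is a JSON-formatted string, not a DataFrame
df = json.dumps(json_obj, indent=2)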

This is the JSON:

{
  "car": {
    "top1": {
      "cl": [
        {
          "nm": "Setor A",
          "prc": "40,00 %",
          "tv": [
            {
              "logo": "https://www.test.com/ddd.jpg",
              "nm": "BDFG",
              "lk1": "https://www.test.com/ddd/BDFG/",
              "lk2": "https://www.test-ddd.com",
              "dta": [
                {
                  "nm": "PA",
                  "cp": "nl",
                  "vl": "$ 2,50"
                },
                {
                  "nm": "FVP",
                  "cp": "UV",
                  "vl": "No"
                }
              ],
              "prc": "30,00 %"
            },
            {
              "logo": "https://www.test.com/ccc.jpg",
              "nm": "BDFH",
              "lk1": "https://www.test.com/ddd/BDFH/",
              "lk2": "https://www.test-ddd.com",
              "dta": [
                {
                  "nm": "PA",
                  "cp": "nl",
                  "vl": "$ 2,50"
                },
                {
                  "nm": "FVP",
                  "cp": "UV",
                  "vl": "No"
                }
              ],
              "prc": "70,00 %"
            }
          ]
        },
        {
          "nm": "B",
          "prc": "60,00 %",
          "tv": [
            {
              "logo": "https://www.test.com/bomm.jpg",
              "nm": "BOOM",
              "lk1": "https://www.test.com/ddd/BDFH/",
              "lk2": "https://www.test-ddd.com",
              "dta": [
                {
                  "nm": "PA",
                  "cp": "nl",
                  "vl": "$ 2,50"
                },
                {
                  "nm": "FVP",
                  "cp": "UV",
                  "vl": "No"
                }
              ],
              "prc": "100,00 %"
            }
          ]
        }
      ]
    },
    "top2": {
      "cl": [{}]
    },
    "top3": {
      "cl": [{}]
    }
  }
}

Example of a json file

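I tried to organize my data like this, without success:

from pyspark.sql.types import ArrayType, StringType, StructField, StructType

schema = StructType(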
    [
      StructField("car", ArrayType(StructType([
        StructField("top1", ArrayType(StructType([
          StructField("cl", ArrayType(StructType([
            StructField("nm", StringType(),True),
            StructField("prc", StringType(),True),
            StructField("tv", ArrayType(StructType([
              StructField("logo", StringType(),True),
              StructField("nm", StringType(),True),
              StructField("lk1", StringType(),True),
              StructField("lk2", StringType(),True),
              StructField("dta", ArrayType(StructType([
                StructField("nm", StringType(),True),
                StructField("cp", StringType(),True),
                StructField("vl", StringType(),True)]))),
              StructField("prc", StringType(),True)])))])))]))),
        StructField("top2", ArrayType(StructType([
          StructField("cl", ArrayType(StructType([
            StructField("nm", StringType(),True),
            StructField("prc", StringType(),True),
            StructField("tv", ArrayType(StructType([
              StructField("logo", StringType(),True),
              StructField("nm", StringType(),True),
              StructField("lk1", StringType(),True),
              StructField("lk2", StringType(),True),
              StructField("dta", ArrayType(StructType([
                StructField("nm", StringType(),True),
                StructField("cp", StringType(),True),
                StructField("vl", StringType(),True)]))),
              StructField("prc", StringType(),True)])))])))]))),  
        StructField("top3", ArrayType(StructType([
          StructField("cl", ArrayType(StructType([
            StructField("nm", StringType(),True),
            StructField("prc", StringType(),True),
            StructField("tv", ArrayType(StructType([
              StructField("logo", StringType(),True),
              StructField("nm", StringType(),True),
              StructField("lk1", StringType(),True),
              StructField("lk2", StringType(),True),
              StructField("dta", ArrayType(StructType([
                StructField("nm", StringType(),True),
                StructField("cp", StringType(),True),
                StructField("vl", StringType(),True)]))),
              StructField("prc", StringType(),True)])))])))])))])))])


df2 = sqlContext.read.json(df, schema)
df2.printSchema()

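I get this message: error message

I would like to turn it into something like this:

example of dataframe

Is there a function that makes it easier to break this structure apart and build this data?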


1 answer
User
#1 · posted 2024-04-28 07:42:32

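You can pass either a JSON file path or an RDD to the json() method. Passing the JSON string itself does not work, because a plain string is treated as a path.

Since you already have the JSON as a string, use parallelize() to create an RDD from it, then pass that RDD to json():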

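import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
# json_obj is the parsed response from the question; one RDD element = one JSON document
rdd = spark.sparkContext.parallelize([json.dumps(json_obj, indent=2)])
# Schema will be inferred automatically. You can pass schema if you want.
json_df = spark.read.json(rdd)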
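
If you do want to pass a schema, note that in the sample JSON `car` and `top1`/`top2`/`top3` are objects, so they map to structs; only `cl`, `tv` and `dta` are arrays. Below is a minimal sketch of such a schema, written as a DDL string instead of nested StructType objects (my choice for brevity). The helper `read_car_json` and the constant names are hypothetical, and the sketch assumes all three `top` entries share the shape of `top1`:

```python
import json

# DDL fragments built from the leaves up, mirroring the sample JSON:
# JSON objects become struct<...>, JSON arrays become array<...>.
DTA = "struct<nm:string,cp:string,vl:string>"
TV = f"struct<logo:string,nm:string,lk1:string,lk2:string,dta:array<{DTA}>,prc:string>"
CL = f"struct<nm:string,prc:string,tv:array<{TV}>>"
TOP = f"struct<cl:array<{CL}>>"

# "car" and "top1".."top3" are JSON objects, so they are structs, not arrays.
CAR_DDL = "car struct<" + ",".join(f"top{i}:{TOP}" for i in (1, 2, 3)) + ">"


def read_car_json(spark, json_obj):
    """Load one parsed JSON document into a DataFrame using CAR_DDL (hypothetical helper)."""
    rdd = spark.sparkContext.parallelize([json.dumps(json_obj)])
    # spark.read.json accepts either a StructType or a DDL-formatted string as schema.
    return spark.read.json(rdd, schema=CAR_DDL)
```

With a live session, `read_car_json(spark, json_obj).printSchema()` should show `car` as a struct containing three struct fields.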

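As for breaking the structure apart: the screenshot of the desired DataFrame is not available, so the target layout below (one row per `dta` entry) is an assumption. Here is a plain-Python sketch of the flattening, with a hypothetical helper `flatten_car` and made-up column names:

```python
def flatten_car(doc):
    """Yield one flat dict per `dta` entry found under car -> topN -> cl -> tv."""
    for top_name, top in doc.get("car", {}).items():
        for cl in top.get("cl", []):
            for tv in cl.get("tv", []):
                for dta in tv.get("dta", []):
                    yield {
                        "top": top_name,
                        "cl_nm": cl.get("nm"),
                        "cl_prc": cl.get("prc"),
                        "tv_nm": tv.get("nm"),
                        "tv_prc": tv.get("prc"),
                        "dta_nm": dta.get("nm"),
                        "dta_cp": dta.get("cp"),
                        "dta_vl": dta.get("vl"),
                    }


# Trimmed-down copy of the sample document from the question.
sample = {
    "car": {
        "top1": {"cl": [{"nm": "Setor A", "prc": "40,00 %", "tv": [
            {"nm": "BDFG", "prc": "30,00 %", "dta": [
                {"nm": "PA", "cp": "nl", "vl": "$ 2,50"},
                {"nm": "FVP", "cp": "UV", "vl": "No"},
            ]},
        ]}]},
        "top2": {"cl": [{}]},  # empty entries simply produce no rows
    }
}

rows = list(flatten_car(sample))
for row in rows:
    print(row)
```

In Spark the same flattening is usually done with `pyspark.sql.functions.explode` applied to `cl`, `tv` and `dta` in turn; which columns to keep depends on the layout you actually need.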