Nested JSON from a REST API to a PySpark DataFrame

Posted 2024-04-24 14:36:57

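I am trying to build a data pipeline that requests data from a REST API. The output is a nested JSON document, which is fine. I want to read that JSON into a PySpark DataFrame. When I save the file locally and use the following code, this works: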

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("jsontest")\
    .getOrCreate()

raw_df = spark.read.json(r"my_json_path", multiLine=True)

However, when I try to build the PySpark DataFrame directly after making the API request, I get the following error:

[Image: error when trying to create a PySpark DataFrame]

I use the following code to call the REST API and convert the response to a PySpark DataFrame:

apiCallHeaders = {'Authorization': 'Bearer ' + bearer_token}
apiCallResponse = requests.get(data_url, headers=apiCallHeaders, verify=True)
json_rdd = spark.sparkContext.parallelize(apiCallResponse.text)
raw_df = spark.read.json(json_rdd)

Here is part of the response output:

{"networks":[{"href":"/v2/networks/velobike-moscow","id":"velobike-moscow","name":"Velobike"},{"href":"/v2/networks/bycyklen","id":"bycyklen","name":"Bycyklen"},{"href":"/v2/networks/nu-connect","id":"nu-connect","name":"Nu-Connect"},{"href":"/v2/networks/baerum-bysykkel","id":"baerum-bysykkel","name":"Bysykkel"},{"href":"/v2/networks/bysykkelen","id":"bysykkelen","name":"Bysykkelen"},{"href":"/v2/networks/onroll-a-rua","id":"onroll-a-rua","name":"Onroll"},{"href":"/v2/networks/onroll-albacete","id":"onroll-albacete","name":"Onroll"},{"href":"/v2/networks/onroll-alhama-de-murcia","id":"onroll-alhama-de-murcia","name":"Onroll"},{"href":"/v2/networks/onroll-almunecar","id":"onroll-almunecar","name":"Onroll"},{"href":"/v2/networks/onroll-antequera","id":"onroll-antequera","name":"Onroll"},{"href":"/v2/networks/onroll-aranda-de-duero","id":"onroll-aranda-de-duero","name":"Onroll"}

I hope my question makes sense and that someone can help me.

Thanks in advance.


1 answer
Anonymous user
#1 · Posted 2024-04-24 14:36:57

Following this answer, you can add the lines below so that the driver and the workers use the same Python interpreter:

import os
import sys

os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

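To make the code run, you must wrap the response text in [ ] here. parallelize expects a collection of elements, so a bare string is split into single characters, whereas a one-element list keeps the whole JSON document as one record: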

rdd = spark.sparkContext.parallelize([apiCallResponse.text])
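The difference can be seen without Spark at all. The following is a minimal sketch in plain Python, assuming only that parallelize walks over whatever collection it is given; `payload` is a hypothetical stand-in for `apiCallResponse.text`, shortened from the response shown in the question.

```python
# Hypothetical stand-in for apiCallResponse.text (shortened sample).
payload = '{"networks":[{"id":"bycyklen","name":"Bycyklen"}]}'

# Iterating over a str yields one-character strings, so a bare string
# would be handed over as many tiny "records".
as_chars = list(payload)

# Wrapping the string in a list yields exactly one element:
# the complete JSON document.
as_one_record = list([payload])

print(len(as_chars), as_chars[:3])
print(len(as_one_record))
```

With the bare string, the JSON reader would be given fragments such as `{` and `"` rather than a parseable document, which is why the list wrapper matters.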

Here is an example:

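import requests

# `spark` is the SparkSession created in the question.
response = requests.get('http://api.citybik.es/v2/networks?fields=id,name,href')
# Wrap the body in a list so the RDD holds one element: the whole JSON document.
rdd = spark.sparkContext.parallelize([response.text])

df = spark.read.json(rdd)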

df.printSchema()
# root
#  |-- networks: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- href: string (nullable = true)
#  |    |    |-- id: string (nullable = true)
#  |    |    |-- name: string (nullable = true)

(df
 .selectExpr('inline(networks)')
 .show(n=5, truncate=False))
# +----------------------------+---------------+----------+
# |href                        |id             |name      |
# +----------------------------+---------------+----------+
# |/v2/networks/velobike-moscow|velobike-moscow|Velobike  |
# |/v2/networks/bycyklen       |bycyklen       |Bycyklen  |
# |/v2/networks/nu-connect     |nu-connect     |Nu-Connect|
# |/v2/networks/baerum-bysykkel|baerum-bysykkel|Bysykkel  |
# |/v2/networks/bysykkelen     |bysykkelen     |Bysykkelen|
# +----------------------------+---------------+----------+
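
As an alternative to the RDD route above (not the method used in this answer), the response could be flattened on the driver first and then handed to `createDataFrame`. This is only a sketch under assumptions: `network_rows`, `to_dataframe` and `COLUMNS` are hypothetical names introduced here, `spark` is an existing SparkSession, and the payload is small enough to parse in driver memory.

```python
import json

# Column order for the resulting rows (hypothetical helper constant).
COLUMNS = ["href", "id", "name"]


def network_rows(text):
    """Parse the response body and return one tuple per entry under "networks"."""
    doc = json.loads(text)
    return [
        tuple(item.get(col) for col in COLUMNS)
        for item in doc.get("networks", [])
    ]


def to_dataframe(spark, text):
    """Build a DataFrame from the parsed rows; `spark` is an existing SparkSession."""
    return spark.createDataFrame(network_rows(text), schema=COLUMNS)


# Shortened sample taken from the response shown in the question.
sample = '{"networks":[{"href":"/v2/networks/bycyklen","id":"bycyklen","name":"Bycyklen"}]}'
rows = network_rows(sample)
print(rows)
```

Calling `to_dataframe(spark, response.text)` should give the same three columns as `inline(networks)`. Note that column types are inferred from the data here, so an empty result would need an explicit schema instead of a list of names.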
