将JSON对象转换为数据框

1 投票
2 回答
61 浏览
提问于 2025-04-12 01:16

我对如何把JSON对象转换成数据框(DataFrame)不太熟悉。希望能找到一些更有效的方法来进行这种转换。

当我遍历我的input_df的两行数据时,打印suggestion_string会得到两个JSON对象,具体如下:

这里插入图片描述

我期望的输出结果是一个数据框,内容如下:

建议1 原因1 建议2 原因2 建议3 原因3
自由文本 自由文本 自由文本 自由文本 自由文本 自由文本
自由文本 自由文本 自由文本 自由文本 自由文本 自由文本

我尝试过以下代码,确实有效;但我希望能找到一些更高效的转换方法。

suggestion_list = []

for i in range(len(input_df)):

    description = input_df.loc[i, 'description']
    
    query = fr"""
    sample text ... {description}?
    """
    
    suggestion_string = return_json_object(query)
    string = suggestion_string.replace("`",'')
    string = string.replace('json', '')
    str_list = string.split('\n')
    dict_str = ''.join(str_list)
    output_dict = json.loads(dict_str)
    suggestion_list.append(output_dict)

lists_of_dicts = []
for dict in suggestion_list:
    list_of_dicts.append(dict['suggestions'])

flat_data = []
for sublist in list_of_dicts:
    row_data = {}
    for i, item in enumerate(sublist):
        row_data[f'suggestion{i+1}'] = item['suggestion']
        row_data[f'reason{i+1}'] = item['reason']
    flat_data.append(row_data)

suggestion_df = pd.DataFrame(flat_data)

2 个回答

0

这里有一个函数和一个示例的json文件,可以用来测试这个函数。这个函数适用于任何类型的json,无论里面有没有嵌套的字段。

data = [
    {
        "id": 1,
        "name": "John Doe",
        "email": "john@example.com",
        "location": {
            "city": "New York",
            "country": "USA"
        },
        "skills": ["Python", "Data Analysis"]
    },
    {
        "id": 2,
        "name": "Jane Doe",
        "email": "jane@example.com",
        "location": {
            "city": "San Francisco",
            "country": "USA"
        },
        "skills": ["JavaScript", "React"]
    }
]

df = pd.DataFrame(data)

def flatten_nested_json_df(df):
    df = df.reset_index()
    s = (df.applymap(type) == list).all()
    list_columns = s[s].index.tolist()
    
    s = (df.applymap(type) == dict).all()
    dict_columns = s[s].index.tolist()

    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns) # inplace

        for col in list_columns:
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            new_columns.append(col)

        s = (df[new_columns].applymap(type) == list).all()
        list_columns = s[s].index.tolist()

        s = (df[new_columns].applymap(type) == dict).all()
        dict_columns = s[s].index.tolist()
    return df

如何使用它

flatten_nested_json_df(df)

返回结果

   index  id      name             email  location.city location.country  \
0      0   1  John Doe  john@example.com       New York              USA   
0      0   1  John Doe  john@example.com       New York              USA   
1      1   2  Jane Doe  jane@example.com  San Francisco              USA   
1      1   2  Jane Doe  jane@example.com  San Francisco              USA   

          skills  
0         Python  
0  Data Analysis  
1     JavaScript  
1          React
1

我创建了一些示例建议字符串——不过你可以通过修改下面的代码,结合你的 input_df 逻辑,继续你的循环。我为了简单起见,直接写死了这些建议字符串。

代码:

import pandas as pd
import json

suggestion_string_1 = '''
```
json
{
    "suggestions": [
        {"suggestion": "free text", "reason": "free text"},
        {"suggestion": "free text", "reason": "free text"},
        {"suggestion": "free text", "reason": "free text"}
    ]
}
```
'''

suggestion_string_2 = '''
```
json
{
    "suggestions": [
        {"suggestion": "free text", "reason": "free text"},
        {"suggestion": "free text", "reason": "free text"},
        {"suggestion": "free text", "reason": "free text"}
    ]
}
```
'''

suggestion_list = []
input_df = [suggestion_string_1, suggestion_string_2]

for i in range(len(input_df)):
    # description = input_df.loc[i, 'description']
    #
    # query = fr"""
    # sample text ... {description}?
    # """
    #
    # suggestion_string = return_json_object(query)

    # I have hard-coded the suggestion_string for brevity - 
    # Assuming you are getting the same strings that I have 
    # hard-coded as per the screenshot shared by you.
    suggestion_string = input_df[i] # So, you can just replace this line with your logic for getting the suggestion string.
    suggestions = json.loads(
        suggestion_string[suggestion_string.find('{'):suggestion_string.rfind('}') + 1]
    )['suggestions']
    suggestion_list.append(suggestions)

suggestion_df = pd.DataFrame()

for idx, suggestions in enumerate(suggestion_list):
    for i, suggestion in enumerate(suggestions):
        suggestion_df.loc[idx, f'suggestion{i+1}'] = suggestion['suggestion']
        suggestion_df.loc[idx, f'reason{i+1}'] = suggestion['reason']
        
print(suggestion_df)

输出:

建议1 原因1 建议2 原因2 建议3 原因3
自由文本 自由文本 自由文本 自由文本 自由文本 自由文本
自由文本 自由文本 自由文本 自由文本 自由文本 自由文本

撰写回答