访问JSON文件中的嵌套数据以构建多个数据帧

2024-04-23 12:05:22 发布

您现在位置:Python中文网/ 问答频道 /正文

我目前正在访问一个JSON文件,以收集有关公司、分支机构和员工的信息。在这个JSON中有多个公司,每个公司都有自己的分支机构,每个分支机构都有自己的员工。下面是我正在研究的JSON结构示例

{
  "company": [
    {
      "companyName": "MyCompany",
      "branches": [
        {
          "branchName": "My First Branch",
          "workers": [
            {
              "workerName": "John",
              "wage": "10"
            },
            {
              "workerName": "Lucas",
              "wage": "20"
            }
          ]
        },
        {
          "branchName": "My Second Branch",
          "workers": [
            {
              "workerName": "Mary",
              "wage": "30"
            },
            {
              "workerName": "Jack",
              "wage": "40"
            }
          ]
        }
      ]
    },
    {
      "companyName": "YourCompany",
      "branches": [
        {
          "branchName": "Your First Branch",
          "workers": [
            {
              "workerName": "George",
              "wage": "15"
            },
            {
              "workerName": "Harry",
              "wage": "25"
            }
          ]
        },
        {
          "branchName": "Your Second Branch",
          "workers": [
            {
              "workerName": "Wayne",
              "wage": "35"
            },
            {
              "workerName": "Rose",
              "wage": "45"
            }
          ]
        }
      ]
    }
  ]
}

我的目标是在一个数据框架中为每个公司、分支机构和员工收集有关所有公司、分支机构和员工的信息。为此,我目前正在使用循环,如下所示

Companies = my_json['Companies'] #accessing list of companies
Branches = None
Workers = None
for i in range(len(Companies)):
  company_branches = Companies[i]['Branches'] #accessing branches for that company
  if(Branches is None):
    Branches = pd.DataFrame(company_branches)
  else:
    Branches = pd.concat([Branches,pd.DataFrame(company_branches)])
  for j in range(len(company_branches)):
    branch_workers = company_branches[j]['Workers'] #accessing workers for that branch
    if(Workers is None):
      Workers = pd.DataFrame(branch_workers)
    else:
      Workers = pd.concat([Workers,pd.DataFrame(branch_workers)])
Companies = pd.DataFrame(Companies)

这解决了我的问题,创建了三个所需的数据帧(为公司和分支机构添加了一些额外的列,我仍然会删除这些列),但我目前遇到了性能问题。我一直试图在没有循环的情况下解决这个问题,但我无法正确创建数据帧。如果我尝试

Companies = pd.DataFrame(my_json['companies'])

它正确地创建了数据帧,但是如果我尝试

Branches = pd.DataFrame(Companies.branches.values)

它没有正确创建。它不会引发错误,但基本上是从companys数据框架复制列,为每个公司的分支创建一个包含JSON代码的列。在本例中,我想要的是与分支具有的属性数量一样多的列

关于如何有效地解决这个问题,有什么建议吗


Tags: 数据jsondataframe员工公司companypdcompanies
2条回答

您可能应该将所有数据保存在一个数据框中(整齐的数据)。如果需要,您可以获得公司或分支机构的数据框架,例如pd.DataFrame(df.company.unique(), columns='company name')

下面的方法使用嵌套列表理解来展平数据。它还将wage记录从字符串转换为浮点

df = pd.DataFrame(
    [(company.get('companyName'), branch.get('branchName'), 
      worker.get('workerName'), float(worker.get('wage', 0))) 
     for company in my_json['company']
     for branch in company['branches'] 
     for worker in branch.get('workers')
], columns=['company', 'branch', 'worker', 'wage'])

>>> df
       company              branch  worker  wage
0    MyCompany     My First Branch    John  10.0
1    MyCompany     My First Branch   Lucas  20.0
2    MyCompany    My Second Branch    Mary  30.0
3    MyCompany    My Second Branch    Jack  40.0
4  YourCompany   Your First Branch  George  15.0
5  YourCompany   Your First Branch   Harry  25.0
6  YourCompany  Your Second Branch   Wayne  35.0
7  YourCompany  Your Second Branch    Rose  45.0

这是我对这个问题的广义解决方案。假设我将您提供的整个json定义为data,那么:

to_df = {'companyName':[],'branchName':[],'workerName':[],'wage':[]}
for i in range(len(data['company'])):
    for j in range(len(data['company'][i]['branches'])):
        for k in range(len(data['company'][i]['branches'][j]['workers'])):
            to_df['companyName'].append(data['company'][i]['companyName'])
            to_df['branchName'].append(data['company'][i]['branches'][j]['branchName'])
            to_df['workerName'].append(data['company'][i]['branches'][j]['workers'][k]['workerName'])
            to_df['wage'].append(data['company'][i]['branches'][j]['workers'][k]['wage'])
df = pd.DataFrame(to_df)
print(df)

输出:

   companyName          branchName workerName wage
0    MyCompany     My First Branch       John   10
1    MyCompany     My First Branch      Lucas   20
2    MyCompany    My Second Branch       Mary   30
3    MyCompany    My Second Branch       Jack   40
4  YourCompany   Your First Branch     George   15
5  YourCompany   Your First Branch      Harry   25
6  YourCompany  Your Second Branch      Wayne   35
7  YourCompany  Your Second Branch       Rose   45

您可以检查这个答案,以获得有关嵌套JSON的一些额外信息:How to extract nested JSON data?

相关问题 更多 >