从JSON嵌套数组中获取数据并将其放入excel

2024-06-16 09:34:32 发布

您现在位置:Python中文网/ 问答频道 /正文

我有一个JSON文件,其中包含一组工作描述:

{
 "name": "Messenger_group",
 "id": 85648902334,
 "messages": [
  {
   "id": 20482,
   "type": "message",
   "date": "2020-12-04T16:34:40",
   "from": "IFTTT",
   "from_id": 4535011322,
   "text": [
    "Random job description.\n\n",
    {
     "type": "bold",
     "text": "Budget"
    },
    ": $500\n",
    {
     "type": "bold",
     "text": "Posted On"
    },
    ": December 04, 2020 13:28 UTC\n",
    {
     "type": "bold",
     "text": "Category"
    },
    ": UX/UI Design\n",
    {
     "type": "link",
     "text": "https://url.com/"
    }
   ]
  },
  {
   "id": 21144,
   "type": "message",
   "date": "2020-12-06T01:04:50",
   "from": "IFTTT",
   "from_id": 4535011322,
   "text": [
    "Random job description.\n\n",
    {
     "type": "bold",
     "text": "Hourly Range"
    },
    ": $13.00-$35.00\n",
    {
     "type": "bold",
     "text": "Posted On"
    },
    ": December 05, 2020 21:31 UTC\n",
    {
     "type": "bold",
     "text": "Country"
    },
    ": Serbia\n",
    {
     "type": "link",
     "text": "https://url.com"
    }
   ]
  }
 ]
}

我想对其进行解析,并从每条消息的“文本”数组中提取某些值。 然后将它们放入excel文件中

对我来说,最具挑战性的部分是如何从这些阵列内部连接项目。 例如,我需要获取“小时范围”部分,在excel中将其作为列标题,然后获取下一项“$13.00-$35.00\n”,并将其放在同一列的下面,依此类推。 如果这还不够简单,那么有些消息具有不同的值集。就像有些人有“小时工资”,有些人有“固定预算”,所以我也需要考虑一下

最后,我想要一张类似这样的桌子:

Example Table

我主要使用Python,所以我的问题是,我可以使用哪种方法或库来做类似的事情


Tags: 文件textfromidmessagedateontype
3条回答

您可能需要json来读取数据,csv或其他模块来写入数据,但其余的则不需要任何额外的模块,只需要for循环和dict/list函数


最后你有完整的工作示例


我使用json将JSON字符串转换为Python数据

import json

data = json.loads(text)

然后我可以使用for-loop来处理数据,并将其转换为行列表,每一行都将是字典:

{
   'Id': ..., 
   'Date': ..., 
   'Budget': ..., 
   'Category': ..., 
   'Country': ..., 
   'Posted On': ..., 
   'Hourly Range': ...
}

我使用for循环分别处理每个messages,并为列表中可能不存在的元素创建具有默认值的行text

all_rows = []

for msg in data['messages']:
    row = {
        'Id': msg['id'],
        'Date': msg['date'],
        'Budget': None,       # default value if it not in msg
        'Category': None,     # default value if it not in msg
        'Country': None,      # default value if it not in msg
        'Posted On': None,    # default value if it not in msg
        'Hourly Range': None, # default value if it not in msg
    }

我使用iter()将列表text转换为迭代器,然后我可以将它与zip(it, it)一起使用来创建像第一个元素{"type": "bold", "text": "Budget"}和第二个元素": $500\\n"这样的对。然后我可以创建pair"Budget"$500以及kee inrow

text = msg['text']

it = iter(text) # to create pairs with `zip` 
next(it)        # skip first element: "Random job description.\\n\\n"

for x, y in zip(it, it):  # work with pairs
    key = x['text']                      # ie. "Budget"
    value = y.strip().replace(': ', '')  # ie. "$500"
    row[key] = value

all_rows.append(row)  # keep this row on list

之后,我有一个包含所有行的列表:

[
 {'Id': 20482, 'Date': '2020-12-04T16:34:40', 'Budget': '$500', 'Category': 'UX/UI Design', 'Country': None, 'Posted On': 'December 04, 2020 13:28 UTC', 'Hourly Range': None}, 
 {'Id': 21144, 'Date': '2020-12-06T01:04:50', 'Budget': None, 'Category': None, 'Country': 'Serbia', 'Posted On': 'December 05, 2020 21:31 UTC', 'Hourly Range': '$13.00-$35.00'}
]

现在我可以使用csv将其写入文件CSV,我可以在ExcelLibreOffice或任何其他类似程序(或Python模块pandas)中读取该文件

import csv

headers = ['Id', 'Date', 'Budget', 'Category', 'Country', 'Posted On', 'Hourly Range']

with open('output.csv', 'w') as fh:
    csv_writer = csv.DictWriter(fh, headers)

    csv_writer.writeheader()
    csv_writer.writerows(all_rows)

最小工作示例

text = '''
{
 "name": "Messenger_group",
 "id": 85648902334,
 "messages": [
  {
   "id": 20482,
   "type": "message",
   "date": "2020-12-04T16:34:40",
   "from": "IFTTT",
   "from_id": 4535011322,
   "text": [
    "Random job description.\\n\\n",
    {
     "type": "bold",
     "text": "Budget"
    },
    ": $500\\n",
    {
     "type": "bold",
     "text": "Posted On"
    },
    ": December 04, 2020 13:28 UTC\\n",
    {
     "type": "bold",
     "text": "Category"
    },
    ": UX/UI Design\\n",
    {
     "type": "link",
     "text": "https://url.com/"
    }
   ]
  },
  {
   "id": 21144,
   "type": "message",
   "date": "2020-12-06T01:04:50",
   "from": "IFTTT",
   "from_id": 4535011322,
   "text": [
    "Random job description.\\n\\n",
    {
     "type": "bold",
     "text": "Hourly Range"
    },
    ": $13.00-$35.00\\n",
    {
     "type": "bold",
     "text": "Posted On"
    },
    ": December 05, 2020 21:31 UTC\\n",
    {
     "type": "bold",
     "text": "Country"
    },
    ": Serbia\\n",
    {
     "type": "link",
     "text": "https://url.com"
    }
   ]
  }
 ]
}
'''

#  - read  -

import json

data = json.loads(text)

#  - process  -

all_rows = []

headers = ['Id', 'Date', 'Budget', 'Category', 'Country', 'Posted On', 'Hourly Range']

for msg in data['messages']:
    row = {
        'Id': msg['id'],
        'Date': msg['date'],
        'Budget': None,       # default value if it not in msg
        'Category': None,     # default value if it not in msg
        'Country': None,      # default value if it not in msg
        'Posted On': None,    # default value if it not in msg
        'Hourly Range': None, # default value if it not in msg
    }

    text = msg['text']

    it = iter(text)  # to create pairs with `zip` 
    next(it)         # skip first element

    for x, y in zip(it, it):
        key = x['text']
        value = y.strip().replace(': ', '')
        row[key] = value

    all_rows.append(row)

    for key,value in row.items():
        print(key, ':', value)
    print(' -')

#  - write  -

import csv

with open('output.csv', 'w') as fh:
    csv_writer = csv.DictWriter(fh, headers)

    csv_writer.writeheader()
    csv_writer.writerows(all_rows)

下面是我在更新了一条非常有用的@furas评论后得出的结论

text = '''
{
 "name": "Messenger_group",
 "id": 85648902334,
 "messages": [
  {
   "id": 20482,
   "type": "message",
   "date": "2020-12-04T16:34:40",
   "from": "IFTTT",
   "from_id": 4535011322,
   "text": [
    "Random job description.\\n\\n",
    {
     "type": "bold",
     "text": "Budget"
    },
    ": $500\\n",
    {
     "type": "bold",
     "text": "Posted On"
    },
    ": December 04, 2020 13:28 UTC\\n",
    {
     "type": "bold",
     "text": "Category"
    },
    ": UX/UI Design\\n",
    {
     "type": "link",
     "text": "https://url.com/"
    }
   ]
  },
  {
   "id": 21144,
   "type": "message",
   "date": "2020-12-06T01:04:50",
   "from": "IFTTT",
   "from_id": 4535011322,
   "text": [
    "Random job description.\\n\\n",
    {
     "type": "bold",
     "text": "Hourly Range"
    },
    ": $13.00-$35.00\\n",
    {
     "type": "bold",
     "text": "Posted On"
    },
    ": December 05, 2020 21:31 UTC\\n",
    {
     "type": "bold",
     "text": "Country"
    },
    ": Serbia\\n",
    {
     "type": "link",
     "text": "https://url.com"
    }
   ]
  }
 ]
}
'''

import json
data = json.loads(text)

all_rows = []

include_params = [
    'Id',
    'Date',
    'Budget',
    'Category',
    'Country',
    'Posted On',
    'Hourly Range',
    'link',
]

for msg in data['messages']:
    row = {}

    if 'Id' in include_params:
        row['Id'] = msg['id']
    if 'Date' in include_params:
        row['Date'] = msg['date']

    text = msg['text']
    it = iter(text)
    for i in it:
        if type(i) == dict:
            if i['text'] in include_params:
                key = i['text']
                value = next(it).strip().replace(': ', '')
                row[key] = value
            if 'https://url' in i['text'] and 'link' in include_params:
                key = i['type']
                value = i['text']
                row[key] = value

    all_rows.append(row)

    for key,value in row.items():
        print(key, ':', value)
    print(' -')


import csv

headers = include_params

with open('output_test.csv', 'w') as fh:
    csv_writer = csv.DictWriter(fh, headers)

    csv_writer.writeheader()
    csv_writer.writerows(all_rows)

我在一次实习期间也遇到了类似的问题。JSON是嵌套的,我想把它转换成一个表,但它不是直接的。而且我的JSON文件很大

我使用了ijsonlibrary。它允许您以迭代方式解析json文件,并返回前缀、事件、值元组

例如,这是我解析JSON时的输出: ijson output

理解json结构后,您可以将前缀和事件值关联起来,并设计一种算法将某些特定值写入CSV行

不确定这是否是最理想的方法,但它确实对我有效

相关问题 更多 >