Unable to convert JSON to a DataFrame

Posted 2024-03-28 14:42:52


I am trying to convert a huge JSON file into a DataFrame so that I can preprocess and analyze it, but the conversion fails.

The problem occurs in pd.read_json.

import json
import pandas as pd

with open("/content/drive/My Drive/timeline_1.jsonl") as f:
    data = f.readlines()
    data_json_str = "[" + ','.join(data) + "]"
    data_df = pd.read_json(data_json_str)

ValueError: Unmatched ''"' when when decoding 'string'


2 Answers

Use pandas.io.json.json_normalize (available as pandas.json_normalize since pandas 1.0)

Data:

  • The data is given as a list of dicts in a file named test.json
[{
        "id": "99014576299056245",
        "created_at": "2017-11-16T14:28:53.919Z",
        "sensitive": false,
        "spoiler_text": "",
        "language": "en",
        "uri": "mastodon.gamedev.place/users/jaggy/statuses/99014576299056245",
        "instance": "mastodon.gamedev.place",
        "content": "<p>Coding a cheeky skill before bed. Not as much as I&apos;d like but had drinks with co-workers after work so shrug ^_^</p>",
        "account_id": "434",
        "tag_list": [],
        "media_attachments": [],
        "emojis": [],
        "mentions": []
    }, {
        "id": "99014544879467317",
        "created_at": "2017-11-16T14:20:54.462Z",
        "sensitive": false,
        "spoiler_text": "",
        "language": "en",
        "uri": "mastodon.gamedev.place/users/jaggy/statuses/99014544879467317",
        "instance": "mastodon.gamedev.place",
        "content": "<p>Coding a cheeky skill before bed. Not as much as I&apos;d like but had drinks with co-workers after work so shrug ^_^</p>",
        "account_id": "434",
        "tag_list": [],
        "media_attachments": [],
        "emojis": [],
        "mentions": []
    }
]

Code to read the data:

import json
from pathlib import Path

import pandas as pd

# path to file
p = Path(r'c:\some_directory_with_data\test.json')

# read the file in and load using the json module
with p.open('r', encoding='utf-8') as f:
    data = json.load(f)

# create a dataframe
# (pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize)
df = pd.json_normalize(data)

# dataframe view
                id                created_at  sensitive spoiler_text language                                                            uri                instance                                                                                                                       content account_id tag_list media_attachments emojis mentions
 99014576299056245  2017-11-16T14:28:53.919Z      False                    en  mastodon.gamedev.place/users/jaggy/statuses/99014576299056245  mastodon.gamedev.place  <p>Coding a cheeky skill before bed. Not as much as I&apos;d like but had drinks with co-workers after work so shrug ^_^</p>        434       []                []     []       []
 99014544879467317  2017-11-16T14:20:54.462Z      False                    en  mastodon.gamedev.place/users/jaggy/statuses/99014544879467317  mastodon.gamedev.place  <p>Coding a cheeky skill before bed. Not as much as I&apos;d like but had drinks with co-workers after work so shrug ^_^</p>        434       []                []     []       []

Option 2:

Data

  • The data is stored in a file as rows of dicts
    • not inside a list
    • separated by newlines
  • This is not a valid JSON file
{"id": "99014576299056245", "created_at": "2017-11-16T14:28:53.919Z", "sensitive": false, "spoiler_text": "", "language": "en", "uri": "mastodon.gamedev.place/users/jaggy/statuses/99014576299056245", "instance": "mastodon.gamedev.place", "content": "<p>Coding a cheeky skill before bed. Not as much as I&apos;d like but had drinks with co-workers after work so shrug ^_^</p>", "account_id": "434", "tag_list": [], "media_attachments": [], "emojis": [], "mentions": []}
{"id": "99014544879467317", "created_at": "2017-11-16T14:20:54.462Z", "sensitive": false, "spoiler_text": "", "language": "en", "uri": "mastodon.gamedev.place/users/jaggy/statuses/99014544879467317", "instance": "mastodon.gamedev.place", "content": "<p>Coding a cheeky skill before bed. Not as much as I&apos;d like but had drinks with co-workers after work so shrug ^_^</p>", "account_id": "434", "tag_list": [], "media_attachments": [], "emojis": [], "mentions": []}

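Code to read this data

  • Use the following code to read in the file
    • data will be a list of str, where each line of the file is one str in the list
    • Use ast.literal_eval to convert each str back into a dict
    • ast.literal_eval will not work if the str contains values that are not valid Python literals (e.g. false instead of False, true instead of True).
    • This results in ValueError: malformed node or string: <_ast.Name object at 0x000002B7240B7888>, which is not a particularly helpful error
  • I have added a try-except block that prints any line causing a problem; keep adding entries to the values_to_fix dict until every line parses.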
import pandas as pd
from pathlib import Path
from ast import literal_eval

# path to file
p = Path(r'c:\some_directory_with_data\test.json')

# JSON literals and their Python equivalents; extend as needed.
# Caution: str.replace also rewrites these words inside text fields.
values_to_fix = {'false': 'False',
                 'true': 'True',
                 'null': 'None'}

list_of_dicts = list()
with p.open('r', encoding='utf-8') as f:
    for x in f:
        for k, v in values_to_fix.items():
            x = x.replace(k, v)
        try:
            list_of_dicts.append(literal_eval(x))
        except (ValueError, SyntaxError) as e:
            # print the problem and the offending line
            print(e)
            print(x)

# pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize
df = pd.json_normalize(list_of_dicts)

# this output is the same as that shown above
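
As an aside that is not part of the answer above: because each line in this layout is a complete JSON document on its own, the json module can parse the lines directly, and it already understands false, true and null, so no string replacement is needed. Below is a minimal sketch of that idea. The function name read_json_lines and the sample text are illustrative assumptions, not something from the original post.

```python
import io
import json


def read_json_lines(handle):
    """Parse a file-like object holding one JSON object per line.

    Returns (records, errors); errors is a list of
    (line_number, message) tuples for lines that failed to parse.
    """
    records, errors = [], []
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:  # skip blank lines
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            errors.append((line_number, str(exc)))
    return records, errors


# Hypothetical sample standing in for the real file
sample = io.StringIO(
    '{"id": "1", "sensitive": false, "tag_list": []}\n'
    '{"id": "2", "sensitive": true, "spoiler_text": null}\n'
)
records, errors = read_json_lines(sample)
print(records)
print(errors)
```

The resulting list of dicts can be handed to json_normalize exactly as before. pandas can also read this layout directly with pd.read_json(path, lines=True).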

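Your data is probably corrupted in at least one place (possibly more).

One way to find such a place is to run your code not on the whole file, but on chunks of it.

For example, run the code on:

  • the first half of your file
  • the second half.

If a part runs fine, it contains no error.

The next step is to repeat the process on each "failing" chunk.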
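
The halving procedure can be sketched roughly as follows. This is only an illustration under my own assumptions: it validates each line with the standard json module instead of pd.read_json, and the names parses_ok and find_bad_lines as well as the sample lines are made up for the example.

```python
import json


def parses_ok(lines):
    """Return True if every line in the chunk is valid JSON."""
    try:
        for line in lines:
            json.loads(line)
        return True
    except json.JSONDecodeError:
        return False


def find_bad_lines(lines, offset=0):
    """Recursively halve the chunk; return 1-based numbers of bad lines."""
    if not lines or parses_ok(lines):
        return []
    if len(lines) == 1:
        return [offset + 1]
    mid = len(lines) // 2
    return (find_bad_lines(lines[:mid], offset)
            + find_bad_lines(lines[mid:], offset + mid))


# Hypothetical data: the third line has an unterminated string
lines = [
    '{"id": "1"}',
    '{"id": "2"}',
    '{"id": "3", "content": "broken}',
    '{"id": "4"}',
]
bad = find_bad_lines(lines)
print(bad)
```

Chunks that parse cleanly are discarded immediately, so only the failing halves are split further.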

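Another approach: look closely at the stack trace. Somewhere in it there may be a line number in the source file (not to be confused with a line number in the Python code).

Right now you join the whole text into a single line, so even if the stack trace contains such a number, it will most likely be just 1.

To make your investigation easier, change your code so that the source lines end up on separate lines of the joined text. Something like: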

data_json_str = "[" + ',\n'.join(data) + "]"

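Then run the code again and read the number shown (the position where the error occurred), which now equals the source line number.

Then look at that line, correct it, and your code should run without errors.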
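
To illustrate reading the line number from the error, here is a small sketch. It assumes the standard json module rather than pd.read_json, because json.JSONDecodeError exposes the position as lineno and colno attributes; the sample lines are hypothetical.

```python
import json

# Hypothetical source lines; the second one is missing its closing quote
data = [
    '{"id": "1", "language": "en"}\n',
    '{"id": "2", "language": "en}\n',
    '{"id": "3", "language": "en"}\n',
]

# Strip the trailing newlines so each source line occupies exactly one
# line of the joined text
joined = "[" + ",\n".join(line.rstrip("\n") for line in data) + "]"

try:
    json.loads(joined)
    bad_line = None
except json.JSONDecodeError as exc:
    bad_line = exc.lineno  # 1-based line number within `joined`
    print(f"line {exc.lineno}, column {exc.colno}: {exc.msg}")
```

Because the opening bracket shares a line with the first record, the reported line number matches the line number in the source file.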

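Edit after the comment with the source data

In your data I noticed that:

  • it contains two JSON objects (rows)
  • but there is no comma between them.

I made the following additions and changes:

  • added [ and ] at the beginning and end
  • added a comma after the first {…}.

So the input string is: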

data_json_str = '''[
{"id": "99014576299056245", "created_at": "2017-11-16T14:28:53.919Z",
 "sensitive": false, "spoiler_text": "", "language": "en",
 "uri": "mastodon.gamedev.place/users/jaggy/statuses/99014576299056245",
 "instance": "mastodon.gamedev.place",
 "content": "<p>Coding a cheeky skill before bed. Not as much as I&apos;d like but had drinks with co-workers after work so shrug ^_^</p>",
 "account_id": "434", "tag_list": [], "media_attachments": [], "emojis": [], "mentions": []},
{"id": "99014544879467317", "created_at": "2017-11-16T14:20:54.462Z", "sensitive": false}
]'''

Then I ran the statement that reads this string:

data_df = pd.read_json(data_json_str)

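and got a DataFrame with 2 rows (no error). At first I suspected that &apos; might be the source of the error, but read_json handled this case as well.

But when I removed the comma after the first {…}, I got an error: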

ValueError: Unexpected character found when decoding array value (2)

(a different error from yours).
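
The same experiment can be reproduced with the standard json module. This is a sketch under my own assumptions: the shortened records are made up, and the json module words its error differently from the pandas message quoted above.

```python
import json

# Two shortened, hypothetical records; the first contains &apos; on purpose
with_comma = (
    '[\n'
    '{"id": "1", "content": "I&apos;d like", "sensitive": false},\n'
    '{"id": "2", "sensitive": false}\n'
    ']'
)
# Same text with the comma between the two objects removed
without_comma = with_comma.replace("},", "}")

rows = json.loads(with_comma)
print(len(rows))

try:
    json.loads(without_comma)
    error_message = None
except json.JSONDecodeError as exc:
    error_message = exc.msg
    print(f"{exc.msg} (line {exc.lineno})")
```

The version with the comma parses even though it contains &apos;, while the version without it fails at the start of the second object.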

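I use Python 3.7.0 and Pandas 0.25. If you have older versions of Python or Pandas, maybe you should upgrade them?

The real problem may be related to some "weakness" in the JSON parser (I am not sure whether it is part of Python or of Pandas).

Before upgrading, run one more test: remove &apos; from the input string and try read_json again.

If you get a proper result this time, it will confirm my suspicion that your installed JSON parser is flawed, and it will be an important argument in favor of my suggestion to upgrade your software.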
