Parse all occurrences of a string in a file and generate key-value pairs in JSON

Posted 2024-05-15 10:36:09


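  1. I have a file (https://pastebin.com/STgtBRS8) in which I need to find every occurrence of the word "silencedetect".

  2. I then have to generate a JSON file containing the key-value pairs for "silence_start", "silence_end" and "silence_duration".

The JSON file should look like this: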

[
{
"id": 1,
"silence_start": -0.012381,
"silence_end": 2.2059,
"silence_duration": 2.21828
},
{
"id": 2,
"silence_start": 5.79261,
"silence_end": 6.91955,
"silence_duration": 1.12694
}
]

This is what I have tried:

with open('volume_data.csv', 'r') as myfile:
    data = myfile.read().replace('\n', '')

for line in data:
    if "silencedetect" in data:
        #read silence_start, silence_end, and silence_duration and put in json

I am unable to associate the three key-value pairs with each "silencedetect" occurrence. How can I parse the values and get them in JSON format?


3 answers

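Assuming your data is ordered (each silence_start line is followed by its silence_end line), you can simply stream-parse it line by line, with no need for regex or for loading the whole file into memory: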

import json

parsed = []  # a list to hold our parsed values
with open("entries.dat", "r") as f:  # open the file for reading
    current_id = 1  # holds our ID
    entry = None  # holds the current parsed entry
    for line in f:  # ... go through the file line by line
        if line[:14] == "[silencedetect":  # parse the lines starting with [silencedetect
            if entry:  # we already picked up silence_start
                index = line.find("silence_end:")  # find where silence_end starts
                value = line[index + 12:line.find("|", index)].strip()  # the number after it
                entry["silence_end"] = float(value)  # store the silence_end
                # the following step is optional, instead of parsing you can just calculate
                # the silence_duration yourself with:
                # entry["silence_duration"] = entry["silence_end"] - entry["silence_start"]
                index = line.find("silence_duration:")  # find where silence_duration starts
                value = line[index + 17:].strip()  # grab the number after it
                entry["silence_duration"] = float(value)  # store the silence_duration
                # and now that we have everything...
                parsed.append(entry)  # add the entry to our parsed list
                entry = None  # blank out the entry for the next step
            else:  # find silence_start first
                index = line.find("silence_start:")  # find where silence_start, well, starts
                value = line[index + 14:].strip()  # grab the number after it
                entry = {"id": current_id}  # store the current ID...
                entry["silence_start"] = float(value)  # ... and the silence_start
                current_id += 1  # increase our ID value for the next entry

# Now that we have our data, we can easily turn it into JSON and print it out if needed
your_json = json.dumps(parsed, indent=4)  # holds the JSON, pretty-printed
print(your_json)  # let's print it...

You will get:

[
    {
        "silence_end": 2.2059, 
        "silence_duration": 2.21828, 
        "id": 1, 
        "silence_start": -0.012381
    }, 
    {
        "silence_end": 6.91955, 
        "silence_duration": 1.12694, 
        "id": 2, 
        "silence_start": 5.79261
    }, 
    {
        "silence_end": 9.12544, 
        "silence_duration": 0.59288, 
        "id": 3, 
        "silence_start": 8.53256
    }, 
    {
        "silence_end": 10.7276, 
        "silence_duration": 1.0805, 
        "id": 4, 
        "silence_start": 9.64712
    }, 
    # 
    # etc.
    # 
    {
        "silence_end": 795.516, 
        "silence_duration": 0.68576, 
        "id": 189, 
        "silence_start": 794.83
    }
]

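Keep in mind that JSON does not prescribe any key order, and Python dicts only guarantee insertion order from version 3.7 onward (in CPython 3.6 it was an implementation detail), so on older interpreters the id will not necessarily appear first. The data is equally valid either way.

I deliberately separated the initial creation of entry so that you can swap in collections.OrderedDict (i.e. entry = collections.OrderedDict({"id": current_id})) to preserve the order if you want to.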
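To illustrate the ordering point, here is a minimal sketch that builds one entry with collections.OrderedDict and serializes it. The helper name make_entry is hypothetical and the values are taken from the first record of the output above; this is an illustration, not part of the answer's parser.

```python
import json
from collections import OrderedDict


def make_entry(entry_id, start, end, duration):
    """Build one record whose keys keep the order in which they were added."""
    entry = OrderedDict([("id", entry_id)])  # "id" is inserted first
    entry["silence_start"] = start
    entry["silence_end"] = end
    entry["silence_duration"] = duration
    return entry


# Values copied from the first record shown in the output above.
first_entry = make_entry(1, -0.012381, 2.2059, 2.21828)

# json.dumps walks the mapping in iteration order, so "id" comes out first.
ordered_json = json.dumps(first_entry)
print(ordered_json)
```

On Python 3.7 or later a plain dict filled in the same order should serialize identically, so OrderedDict mainly matters for older interpreters.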

A more involved solution using the re.findall and enumerate functions:

import re, json

with open('volume_data.txt', 'r') as f:
    result = []
    pat = re.compile(r'(silence_start: -?\d+\.\d+).+?(silence_end: -?\d+\.\d+).+?(silence_duration: -?\d+\.\d+)')
    silence_items = re.findall(pat, f.read().replace('\n', ''))
    for i,v in enumerate(silence_items):
        d = {'id': i+1}
        d.update({pair[:pair.find(':')]: float(pair[pair.find(':')+2:]) for pair in v})
        result.append(d)

    print(json.dumps(result, indent=4))

Output:

[
    {
        "id": 1,
        "silence_end": 2.2059,
        "silence_duration": 2.21828,
        "silence_start": -0.012381
    },
    {
        "id": 2,
        "silence_end": 6.91955,
        "silence_duration": 1.12694,
        "silence_start": 5.79261
    },
    {
        "id": 3,
        "silence_end": 9.12544,
        "silence_duration": 0.59288,
        "silence_start": 8.53256
    },
    {
        "id": 4,
        "silence_end": 10.7276,
        "silence_duration": 1.0805,
        "silence_start": 9.64712
    },
    {
        "id": 5,
        "silence_end": 13.6998,
        "silence_duration": 1.03406,
        "silence_start": 12.6657
    },
    {
        "id": 6,
        "silence_end": 20.1317,
        "silence_duration": 0.871519,
        "silence_start": 19.2602
    },
    {
        "id": 7,
        "silence_end": 22.4305,
        "silence_duration": 0.801859,
        "silence_start": 21.6286
    },
    ...
]
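The question asks for a JSON file rather than printed output. As a rough sketch, a list shaped like result above could be written to disk with json.dump. The function name write_silence_json, the file name silence.json and the temporary-directory location are assumptions chosen for illustration; the two sample records are copied from the output above.

```python
import json
import os
import tempfile


def write_silence_json(entries, path):
    """Write the parsed entries to `path` as pretty-printed JSON."""
    with open(path, "w", encoding="utf-8") as out_file:
        json.dump(entries, out_file, indent=4)


# Two records copied from the output above, standing in for the full list.
sample_entries = [
    {"id": 1, "silence_start": -0.012381, "silence_end": 2.2059, "silence_duration": 2.21828},
    {"id": 2, "silence_start": 5.79261, "silence_end": 6.91955, "silence_duration": 1.12694},
]

# Hypothetical destination; any writable path would do.
output_path = os.path.join(tempfile.gettempdir(), "silence.json")
write_silence_json(sample_entries, output_path)

# Read the file back to confirm it round-trips as valid JSON.
with open(output_path, encoding="utf-8") as in_file:
    reloaded = json.load(in_file)
```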

You can use a regular expression. This works for me:

import re

with open('volume_data.csv', 'r') as myfile:
    data = myfile.read()

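# Raw string so the backslashes reach the regex engine unchanged.
# \w{14} assumes the address after "@" is always 14 characters long.
d = re.findall(r'silence_start: (-?\d+\.\d+)\n.*?\n?\[silencedetect @ \w{14}\] silence_end: (-?\d+\.\d+) \| silence_duration: (-?\d+\.\d+)', data)
print(d)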

You can then build the JSON by running:

out = [{'id': i, 'start':a[0], 'end':a[1], 'duration':a[2]} for i, a in enumerate(d)]
import json
print(json.dumps(out))  # or write to file or... whatever

Output:

'[{"duration": "2.21828", "start": "-0.012381", "end": "2.2059", "id": 0}, {"duration": "1.12694", "start": "5.79261", "end": "6.91955", "id": 1}, {"duration": "0.59288", "start": "8.53256", "end": "9.12544", "id": 2}, {"duration": "1.0805", "start": "9.64712", "end": "10.7276", "id": 3}, {"duration": "1.03406", "start": "12.6657", "end": "13.6998", "id": 4}, {"duration": "0.871519", "start": "19.2602", "end": "20.1317", "id": 5}'

Edit: fixed a bug where some matches were missed because a frame=... line fell between the start and the end of a match.
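The output above uses string values, ids starting at 0 and shortened key names, which differs from the format requested in the question. A possible variant, sketched below, converts the captures to float, numbers entries from 1 and uses the silence_* keys. SAMPLE_LOG is an assumed excerpt modelled on the line layout that the regex above expects, not a copy of the linked file, and parse_silences is a hypothetical helper name. The pattern also drops the fixed-width address check and relies on re.DOTALL to skip any lines in between.

```python
import json
import re

# Assumed sample input, modelled on the line layout the answer's regex expects.
SAMPLE_LOG = """\
[silencedetect @ 0x7fbfbbc076a0] silence_start: -0.012381
frame= 1234 fps=0.0 q=-0.0 size=N/A time=00:00:02.00
[silencedetect @ 0x7fbfbbc076a0] silence_end: 2.2059 | silence_duration: 2.21828
[silencedetect @ 0x7fbfbbc076a0] silence_start: 5.79261
[silencedetect @ 0x7fbfbbc076a0] silence_end: 6.91955 | silence_duration: 1.12694
"""

# Lazy ".*?" with DOTALL skips whatever sits between a start and its end line.
PATTERN = re.compile(
    r"silence_start: (-?\d+(?:\.\d+)?)"
    r".*?silence_end: (-?\d+(?:\.\d+)?)"
    r" \| silence_duration: (-?\d+(?:\.\d+)?)",
    re.DOTALL,
)


def parse_silences(text):
    """Return one dict per silence interval, with float values and ids from 1."""
    return [
        {
            "id": index,
            "silence_start": float(start),
            "silence_end": float(end),
            "silence_duration": float(duration),
        }
        for index, (start, end, duration) in enumerate(PATTERN.findall(text), start=1)
    ]


entries = parse_silences(SAMPLE_LOG)
print(json.dumps(entries, indent=4))
```

Like the first answer, this assumes every silence_start is eventually followed by its own silence_end; a start without a matching end would be paired with the next end that appears.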
