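Every so often a chunk of JSON turns up that is a real challenge, and it takes hours to extract the information you need from it. I have the following JSON response generated by a speech-to-text API engine.
It shows the transcript, the utterance of each word with its timestamps, and the speaker label for each speaker, speaker 0 and speaker 2, in the conversation.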
{
"results": [
{
"alternatives": [
{
"timestamps": [
[
"the",
6.18,
6.63
],
[
"weather",
6.63,
6.95
],
[
"is",
6.95,
7.53
],
[
"sunny",
7.73,
8.11
],
[
"it's",
8.21,
8.5
],
[
"time",
8.5,
8.66
],
[
"to",
8.66,
8.81
],
[
"sip",
8.81,
8.99
],
[
"in",
8.99,
9.02
],
[
"some",
9.02,
9.25
],
[
"cold",
9.25,
9.32
],
[
"beer",
9.32,
9.68
]
],
"confidence": 0.812,
"transcript": "the weather is sunny it's time to sip in some cold beer "
}
],
"final": "True"
},
{
"alternatives": [
{
"timestamps": [
[
"sure",
10.52,
10.88
],
[
"that",
10.92,
11.19
],
[
"sounds",
11.68,
11.82
],
[
"like",
11.82,
12.11
],
[
"a",
12.32,
12.96
],
[
"plan",
12.99,
13.8
]
],
"confidence": 0.829,
"transcript": "sure that sounds like a plan"
}
],
"final": "True"
}
],
"result_index":0,
"speaker_labels": [
{
"from": 6.18,
"to": 6.63,
"speaker": 0,
"confidence": 0.475,
"final": "False"
},
{
"from": 6.63,
"to": 6.95,
"speaker": 0,
"confidence": 0.475,
"final": "False"
},
{
"from": 6.95,
"to": 7.53,
"speaker": 0,
"confidence": 0.475,
"final": "False"
},
{
"from": 7.73,
"to": 8.11,
"speaker": 0,
"confidence": 0.499,
"final": "False"
},
{
"from": 8.21,
"to": 8.5,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 8.5,
"to": 8.66,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 8.66,
"to": 8.81,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 8.81,
"to": 8.99,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 8.99,
"to": 9.02,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 9.02,
"to": 9.25,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 9.25,
"to": 9.32,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 9.32,
"to": 9.68,
"speaker": 0,
"confidence": 0.472,
"final": "False"
},
{
"from": 10.52,
"to": 10.88,
"speaker": 2,
"confidence": 0.441,
"final": "False"
},
{
"from": 10.92,
"to": 11.19,
"speaker": 2,
"confidence": 0.364,
"final": "False"
},
{
"from": 11.68,
"to": 11.82,
"speaker": 2,
"confidence": 0.372,
"final": "False"
},
{
"from": 11.82,
"to": 12.11,
"speaker": 2,
"confidence": 0.372,
"final": "False"
},
{
"from": 12.32,
"to": 12.96,
"speaker": 2,
"confidence": 0.383,
"final": "False"
},
{
"from": 12.99,
"to": 13.8,
"speaker": 2,
"confidence": 0.428,
"final": "False"
}
]
}
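Please forgive any indentation problems; the JSON is valid. I have been trying to map each transcript to its corresponding speaker label.
What I want is output that pairs each transcript with its speaker. The full JSON is about 20,000 lines long, and extracting the speaker labels from the timestamps and word utterances and putting them together with the transcript is a nightmare.
What I have tried so far:
The JSON data is stored in a file named example.json. I have been able to put each word, with its corresponding timestamps and speaker label, into a list of tuples (see the output below):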
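import json

# Load the speech-to-text response from example.json.
with open('example.json', 'r') as f:
    data = json.load(f)

l1 = []  # [word, start, end] entries from every result
l2 = []  # per-word speaker label dicts
l3 = []  # (word, speaker, start, end) tuples

for i in data['results']:
    for j in i['alternatives'][0]['timestamps']:
        l1.append(j)

for m in data['speaker_labels']:
    l2.append(m)

# Match each word to the speaker label that has the same start time.
for q in l1:
    for n in l2:
        if q[1] == n['from']:
            l3.append((q[0], n['speaker'], q[1], q[2]))

print(l3)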
This produces the following output:
[('the', 0, 6.18, 6.63),
('weather', 0, 6.63, 6.95),
('is', 0, 6.95, 7.53),
('sunny', 0, 7.73, 8.11),
("it's", 0, 8.21, 8.5),
('time', 0, 8.5, 8.66),
('to', 0, 8.66, 8.81),
('sip', 0, 8.81, 8.99),
('in', 0, 8.99, 9.02),
('some', 0, 9.02, 9.25),
('cold', 0, 9.25, 9.32),
('beer', 0, 9.32, 9.68),
('sure', 2, 10.52, 10.88),
('that', 2, 10.92, 11.19),
('sounds', 2, 11.68, 11.82),
('like', 2, 11.82, 12.11),
('a', 2, 12.32, 12.96),
('plan', 2, 12.99, 13.8)]
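But now I am not sure how to associate the words with one another by comparing timestamps, and how to "bucket" each group of words back together with its speaker label.
I have also managed to get the transcripts into a list, but how do I now extract the speaker label for each transcript from the list above? Unfortunately the speaker labels speaker 0 and speaker 2 are given per word; I would like them per transcript.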
l4 = []  # one transcript string per result
for i in data['results']:
    l4.append(i['alternatives'][0]['transcript'])
print(l4)
This produces the following output:
["the weather is sunny it's time to sip in some cold beer ",'sure that sounds like a plan']
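I have done my best to explain the problem, but I am open to any feedback and will make changes if necessary. I am also fairly sure there is a better way to solve this than building several lists; any help is much appreciated.
For a larger dataset, see the pastebin. I hope it is useful for benchmarking performance. I can provide an even larger dataset if needed.
Since I am dealing with large JSON data, performance is an important factor; accurately separating the speakers in overlapping transcriptions is another requirement.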
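Here is what I tried using JS; see whether it is similar to doing it in Python.
I put the words into a dict keyed by their timestamps, and then matched the words to the speakers.
It ran 1,000,000 times on the provided example in about 12.34 seconds, so hopefully it is fast enough for your needs.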
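A minimal sketch of the dictionary approach described above, assuming every speaker label has the same from/to pair as exactly one word. The function name speaker_turns and the cut-down data dictionary are hypothetical, introduced only for illustration; this is not the answerer's original code, so the timing quoted above does not apply to it.

```python
# Hypothetical cut-down version of the response shown in the question.
data = {
    "results": [
        {"alternatives": [{"timestamps": [["the", 6.18, 6.63],
                                          ["weather", 6.63, 6.95]],
                           "confidence": 0.812}]},
        {"alternatives": [{"timestamps": [["sure", 10.52, 10.88],
                                          ["that", 10.92, 11.19]],
                           "confidence": 0.829}]},
    ],
    "speaker_labels": [
        {"from": 6.18, "to": 6.63, "speaker": 0},
        {"from": 6.63, "to": 6.95, "speaker": 0},
        {"from": 10.52, "to": 10.88, "speaker": 2},
        {"from": 10.92, "to": 11.19, "speaker": 2},
    ],
}


def speaker_turns(data):
    """Return [(speaker, text), ...] with consecutive same-speaker words merged."""
    # Index every word by its (start, end) pair so each label lookup is O(1)
    # instead of a scan over all words.
    words = {}
    for result in data["results"]:
        for word, start, end in result["alternatives"][0]["timestamps"]:
            words[(start, end)] = word

    turns = []
    labels = sorted(data["speaker_labels"], key=lambda lab: (lab["from"], lab["to"]))
    for label in labels:
        word = words.get((label["from"], label["to"]))
        if word is None:
            continue  # no word with exactly these timestamps
        if turns and turns[-1][0] == label["speaker"]:
            turns[-1][1].append(word)  # same speaker keeps talking
        else:
            turns.append((label["speaker"], [word]))  # speaker changed
    return [(speaker, " ".join(ws)) for speaker, ws in turns]


print(speaker_turns(data))  # [(0, 'the weather'), (2, 'sure that')]
```

Building the index once makes the matching linear in the number of words plus labels, whereas the nested loops in the question compare every word with every label.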
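Here is how I just solved it using pandas.
Assume the data is stored in a dictionary named data.
After joining the speaker and word data, consecutive words from the same speaker need to be grouped together to derive the current speaker. For example, if the speaker array looks like [2,2,2,2,0,0,0,2,2,2,0,0,0], we need to group the first four 2s together, then the three 0s, then the three 2s, and then the remaining 0s.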
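The grouping in that example can be illustrated with the standard library alone. This is a small sketch using itertools.groupby, which is a swapped-in technique for illustration and not the pandas code of this answer; the variable names speakers and runs are hypothetical.

```python
from itertools import groupby

speakers = [2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 0, 0, 0]

# groupby starts a new group every time the value changes, so the two
# separate stretches of speaker 2 stay separate instead of being merged.
runs = [(speaker, len(list(group))) for speaker, group in groupby(speakers)]

print(runs)  # [(2, 4), (0, 3), (2, 3), (0, 3)]
```

The point is that the grouping key is "a run of the same speaker", not the speaker id itself; grouping by the id alone would merge all of speaker 2's words into one transcript.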
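Sort the data by ['from', 'to'], then set up a dummy variable called current_speaker that identifies each run of consecutive words from the same speaker.
From here, group by current_speaker, aggregate the words into a sentence, and convert to JSON. A little extra renaming fixes the keys of the output JSON. To add other data such as the start and end of each transcript, add the min/max of from/to to the groupby.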
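The steps described in this answer might look roughly like the following pandas sketch. It assumes a pandas version with named aggregation (0.25 or later) and that words and speaker labels share identical from/to values. The cut-down data dictionary, the frame names words, speakers, df and transcripts, and the output keys start and end are hypothetical choices made for illustration, not the answerer's original code.

```python
import pandas as pd

# Hypothetical cut-down version of the response shown in the question.
data = {
    "results": [
        {"alternatives": [{"timestamps": [["the", 6.18, 6.63],
                                          ["weather", 6.63, 6.95]]}]},
        {"alternatives": [{"timestamps": [["sure", 10.52, 10.88],
                                          ["that", 10.92, 11.19]]}]},
    ],
    "speaker_labels": [
        {"from": 6.18, "to": 6.63, "speaker": 0},
        {"from": 6.63, "to": 6.95, "speaker": 0},
        {"from": 10.52, "to": 10.88, "speaker": 2},
        {"from": 10.92, "to": 11.19, "speaker": 2},
    ],
}

# One row per word, taken from the first alternative of every result.
words = pd.DataFrame(
    [ts for r in data["results"] for ts in r["alternatives"][0]["timestamps"]],
    columns=["word", "from", "to"],
)
speakers = pd.DataFrame(data["speaker_labels"])[["from", "to", "speaker"]]

# Join words to speaker labels on the shared timestamps, then sort by time.
df = words.merge(speakers, on=["from", "to"], how="inner")
df = df.sort_values(["from", "to"])

# current_speaker increases by one each time the speaker changes, so every
# run of consecutive words from one speaker gets its own group id.
df["current_speaker"] = (df["speaker"] != df["speaker"].shift()).cumsum()

# Collapse each run into one row; named aggregation also sets the output keys.
transcripts = (
    df.groupby("current_speaker")
      .agg(speaker=("speaker", "first"),
           transcript=("word", " ".join),
           start=("from", "min"),
           end=("to", "max"))
      .reset_index(drop=True)
)

print(transcripts.to_json(orient="records"))
```

Comparing each row with the previous one via shift and then taking the cumulative sum is a common way to label runs in pandas; it avoids an explicit Python loop over the words.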
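Also (although it does not apply to this sample dataset), you should probably pick the alternative with the highest confidence for each time slice.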
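A minimal sketch of that selection, assuming each alternative carries a confidence value; alternatives without one are treated as 0.0 here, which is an assumption of this sketch. The helper name best_alternative is hypothetical, and the second alternative with its score is invented purely to show the selection.

```python
def best_alternative(result):
    """Return the alternative with the highest confidence in one result."""
    return max(result["alternatives"],
               key=lambda alt: alt.get("confidence", 0.0))


# Hypothetical result with two alternatives; the second entry is made up.
result = {
    "alternatives": [
        {"confidence": 0.829, "transcript": "sure that sounds like a plan"},
        {"confidence": 0.411, "transcript": "sure that sounds like a plane"},
    ]
}

print(best_alternative(result)["transcript"])  # sure that sounds like a plan
```

With a helper like this, best_alternative(r) could take the place of r['alternatives'][0] wherever the first alternative is taken.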