Running a speech model in TensorFlow: Python array modification

I am trying to run a model that was trained on MFCCs and the Google speech dataset. The model was trained using the first two Jupyter notebooks.

Now I am trying to get it running on a Raspberry Pi with TensorFlow 1.15.2 (note that it was trained in TF 1.15.2 as well). Once the model is loaded, I get the correct model.summary():

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 15, 15, 32)        160       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 7, 7, 32)          0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 6, 6, 32)          4128      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 3, 3, 32)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 2, 2, 64)          8256      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 1, 1, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 64)                0         
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
=================================================================
Total params: 16,769
Trainable params: 16,769
Non-trainable params: 0
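
As a sanity check, the expected input shape can be read off this summary: conv2d has 160 parameters, and with 32 filters over a single input channel that means 32*(k*k*1 + 1) = 160, so the kernel is 2x2; a 2x2 valid convolution producing (None, 15, 15, 32) therefore implies a (None, 16, 16, 1) input. A quick check of that arithmetic:

filters = 32
params = 160                       # conv2d's Param # from model.summary()
k_squared = params // filters - 1  # params = filters*(k*k*in_ch + 1), in_ch = 1 -> k*k = 4
out_h = 15                         # conv2d output height (valid padding, stride 1)
in_h = out_h + 2 - 1               # kernel size 2 -> input height 16
print(k_squared, in_h)             # 4 16: the model expects (None, 16, 16, 1)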

My program captures a 1-second strip of audio and writes it out as a wav file, then opens that file (I'm not sure how to use the data directly), converts it into a string tensor, and runs the model's prediction on it:

import os

import wave #Audio
import pyaudio #Audio

import time
import matplotlib.pyplot as plt
from math import ceil
import tensorflow as tf
import numpy as np

tf.compat.v1.enable_eager_execution() # Eager execution evaluates ops immediately (no tf.Session needed); it is opt-in on TF 1.15

# Load the trained Keras model from the SavedModel directory
path = '/home/pi/Desktop/tflite-speech-recognition-master/saved_model_stop'
#print(path)
model = tf.keras.models.load_model(path)
#print(model)
model.summary()


# Pi Hat Config 
RESPEAKER_RATE = 16000 #Hz
RESPEAKER_CHANNELS = 2 # The hat records 2-channel audio (decoded below as a 2-column array)
RESPEAKER_WIDTH = 2
RESPEAKER_INDEX = 2  # refer to input device id
CHUNK = 1024
RECORD_SECONDS = 1   # Change according to how many seconds to record for
WAVE_OUTPUT_FILENAME = "output.wav" #Temporary file name
WAVFILE = WAVE_OUTPUT_FILENAME #Clean up name

# Pyaudio
p = pyaudio.PyAudio() #To use pyaudio

#words = ["no","off","on","stop","_silence_","_unknown_","yes"] # Full label set of the original model
words = ["stop", "not stop"] # Labels for the binary stop/not-stop model

def WWpredict(input_file):
    decoded_audio = decode_audio(input_file)
    #tf.print(decoded_audio, summarize=-1) # print the full array
    print(decoded_audio)
    print(decoded_audio.shape)
    prediction = model.predict(decoded_audio, steps=None)
    # dense_1 has a single output unit, so argmax over it would always be 0;
    # threshold the score instead (assumes index 0 of words means "stop")
    guess = words[0] if prediction[0][0] > 0.5 else words[1]
    print(guess)

def decode_audio(input_file):
    if input_file in os.listdir():
        print("Audio file found:", input_file)

    input_data = tf.io.read_file(input_file)
    print(input_data)
    # decode_wav returns float32 samples in [-1, 1] plus the sample rate
    audio, _sample_rate = tf.audio.decode_wav(input_data, desired_channels=RESPEAKER_CHANNELS)
    print(audio)
    print(_sample_rate)
    return audio

def record(): # Record 1 second of audio; writes a wav file that is overwritten on each call
    
    stream = p.open(
            rate=RESPEAKER_RATE,
            format=p.get_format_from_width(RESPEAKER_WIDTH),
            channels=RESPEAKER_CHANNELS,
            input=True,
            input_device_index=RESPEAKER_INDEX)
 
    print("* recording")
 
    frames = []
 
    for i in range(0, ceil(RESPEAKER_RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)
 
    print("* done recording")
    
    #print(len(frames), "bit audio:")
    #print(frames)
    #print(int.from_bytes(frames[-1],byteorder="big",signed = True)) #Integer for the last frame
    
    stream.stop_stream()
    stream.close()
    
    wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
    wf.setnchannels(RESPEAKER_CHANNELS)
    wf.setsampwidth(p.get_sample_size(p.get_format_from_width(RESPEAKER_WIDTH)))
    wf.setframerate(RESPEAKER_RATE)
    wf.writeframes(b''.join(frames))
    wf.close()
    
while True:
    record()
    WWpredict(WAVFILE)
    time.sleep(1)

Now, when it actually runs, I initially get the following output:

tf.Tensor(
[[ 0.0000000e+00  0.0000000e+00]
 [ 0.0000000e+00  0.0000000e+00]
 [-3.0517578e-05 -3.0517578e-05]
 ...
 [ 2.2949219e-02  3.6926270e-03]
 [ 2.3315430e-02  3.3874512e-03]
 [ 2.2125244e-02  4.1198730e-03]], shape=(16384, 2), dtype=float32)
(16384, 2)
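
For reference, the 16384 rows come straight from the recording loop: it reads 16 chunks of 1024 samples (one second at 16 kHz is 15.625 chunks, rounded up), and the 2 columns are the two ReSpeaker channels:

from math import ceil
chunks = ceil(16000 / 1024 * 1) # RECORD_SECONDS = 1 -> 16 chunks
print(chunks * 1024)            # 16384 samples, hence the (16384, 2) tensor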

This much is expected, but my prediction will not work on it, because the model needs its input to have dimensions (None, 16, 16, 1). I have no clue how to take this (16384, 2) 2-D array, turn it into (16, 16), and then just add the None and 1 axes around it. If anyone knows how to do this, please let me know; 16384 is at least evenly divisible by 16. Thank you! (One possible approach is sketched after the traceback below.)

ValueError: Error when checking input: expected conv2d_input to have 4 dimensions, but got array with shape (16384, 2)
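
One possible approach, for what it's worth: since the model was trained on MFCCs, reshaping raw samples into (16, 16) is probably not what the network expects anyway; the training notebooks presumably extracted a 16x16 MFCC matrix from each 1-second clip. A minimal sketch of that pipeline, assuming python_speech_features is the MFCC library and using illustrative window parameters (both must be checked against the training notebooks):

import numpy as np
import python_speech_features # assumption: the MFCC library used in training

def wav_tensor_to_model_input(audio):
    # Downmix the (16384, 2) stereo tensor to a mono 1-D array
    mono = audio.numpy().mean(axis=1)
    # Trim to exactly 1 s (16000 samples) so the framing below yields 16 frames
    mono = mono[:RESPEAKER_RATE]
    # Illustrative parameters: a 0.256 s window stepped every 0.050 s over 1 s
    # of 16 kHz audio gives 16 frames, and numcep=16 gives 16 coefficients
    # per frame, i.e. a (16, 16) matrix
    mfccs = python_speech_features.mfcc(mono,
                                        samplerate=RESPEAKER_RATE,
                                        winlen=0.256,
                                        winstep=0.050,
                                        numcep=16,
                                        nfilt=26,
                                        nfft=4096)
    mfccs = mfccs.transpose() # only if training stored (coeffs, frames); drop otherwise
    # Add the batch and channel axes: (16, 16) -> (1, 16, 16, 1)
    return mfccs[np.newaxis, :, :, np.newaxis].astype(np.float32)

With that in place, WWpredict could call model.predict(wav_tensor_to_model_input(decoded_audio)) instead of passing the raw tensor. Whatever parameters the notebooks actually used (window length, step, numcep, nfft, transposition) have to be reproduced exactly here, otherwise the predictions will be meaningless even though the shapes match.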
