keras: how to write a custom loss function to aggregate frame-level predictions into song-level predictions


I am working on song genre classification (2 classes). For each song, I chop it up into small frames (5 s) and generate MFCCs as input features for a neural network; each frame carries the genre label of its song.

The data look like this:

 name         label   feature
 ....
 song_i_frame1 label   feature_vector_frame1
 song_i_frame2 label   feature_vector_frame2
 ...
 song_i_framek label   feature_vector_framek
 ...

I know I can randomly pick 80% of the songs (all of their frames) as training data and use the rest for testing. But as currently written, X_train is one row per frame, and the binary cross-entropy loss function is defined at the frame level. I would like to know how to customize the loss function so that it is minimized over an aggregation of the frame-level predictions (e.g. a majority vote over each song's per-frame predictions).
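
By aggregation I mean something like the following majority vote over per-frame predictions (a sketch with hypothetical column names; the vote itself is not differentiable, which I suspect is exactly the problem):

import pandas

frame_preds = pandas.DataFrame({
    'name': ['song_1', 'song_1', 'song_1', 'song_2'],
    'pred': [0, 1, 1, 1],  # per-frame predicted class
})

# Majority vote: the most frequent predicted class per song
song_preds = frame_preds.groupby('name')['pred'].agg(
    lambda p: p.value_counts().idxmax())
print(song_preds)  # song_1 -> 1, song_2 -> 1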

Currently, what I have is:

[code omitted in the original post]

Also, when I feed the training and testing data into Keras, the corresponding IDs (names) of the data are lost. Is keeping the data (name, label and feature) in a separate pandas DataFrame and matching Keras's predictions back against it a good practice? Or are there other good alternatives?
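
What I have in mind is something like this sketch (hypothetical names, random stand-in data), relying on predict returning rows in the same order as its input:

import numpy
import pandas

meta = pandas.DataFrame({
    'name': ['song_i_frame1', 'song_i_frame2'],
    'label': [0, 1],
})
X = numpy.random.rand(len(meta), 13)  # stand-in for the real feature matrix

# preds = model.predict(X)  # one row per row of X, same order
preds = numpy.random.rand(len(meta), 1)  # stand-in for model output
meta['pred'] = preds[:, 0]  # align by row position
print(meta)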

Thanks in advance!


1 Answer

Genre classification does not usually need a customized loss function. Splitting a song into multiple prediction windows is instead handled with Multiple Instance Learning (MIL).

MIL is a supervised learning approach where the label is not attached to each individual sample (instance) but to a "bag" (unordered set) of instances. In your case, an instance is the MFCC features for one 5-second window, and the bag is the whole song.
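
To make the bag/instance structure concrete, here is a small sketch (hypothetical shapes matching the question's table) that groups frame rows into one fixed-size bag per song; it assumes every song yields the same number of windows:

import numpy
import pandas

df = pandas.DataFrame({
    'name': ['song_1', 'song_1', 'song_2', 'song_2'],
    'label': [0, 0, 1, 1],
    'feature': [numpy.zeros((13, 216, 1))] * 4,  # one MFCC window per row
})

# One bag (stack of windows) per song, one label per bag
X = numpy.stack([numpy.stack(g.feature.values) for _, g in df.groupby('name')])
y = df.groupby('name')['label'].first().values
print(X.shape, y.shape)  # (2, 2, 13, 216, 1) (2,)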

In Keras we use the TimeDistributed layer to apply our window model across all windows. We then combine the results using GlobalAveragePooling1D, which effectively implements mean voting across the windows. This is more easily differentiable than a majority vote.

Here is a runnable example:

import math

import keras
import librosa
import pandas
import numpy
import sklearn

def window_model(n_bands, n_frames, n_classes, hidden=32):
    from keras.layers import Dense, Flatten, Conv2D, MaxPooling2D

    # Binary classification uses a single sigmoid unit,
    # multi-class uses one softmax unit per class
    out_units = 1 if n_classes == 2 else n_classes
    out_activation = 'sigmoid' if n_classes == 2 else 'softmax'

    shape = (n_bands, n_frames, 1)

    # Basic CNN model
    # An MLP could also be used, but may need to reshape on input and output
    model = keras.Sequential([
        Conv2D(16, (3,3), input_shape=shape),
        MaxPooling2D((2,3)),
        Conv2D(16, (3,3)),
        MaxPooling2D((2,2)),
        Flatten(),
        Dense(hidden, activation='relu'),
        Dense(hidden, activation='relu'),
        Dense(out_units, activation=out_activation),
    ])
    return model

def song_model(n_bands, n_frames, n_windows, n_classes=3):
    from keras.layers import TimeDistributed, GlobalAveragePooling1D

    # Create the frame-wise model, will be reused across all frames
    base = window_model(n_bands, n_frames, n_classes)
    # GlobalAveragePooling1D expects a 'channel' dimension at end
    shape = (n_windows, n_bands, n_frames, 1)

    print('Frame model')
    base.summary()

    model = keras.Sequential([
        TimeDistributed(base, input_shape=shape),
        GlobalAveragePooling1D(),
    ])

    print('Song model')
    model.summary()

    model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['acc'])
    return model


def extract_features(path, sample_rate, n_bands, hop_length, n_frames, window_length, song_length):
    # melspectrogram might perform better with CNNs
    from librosa.feature import mfcc

    # Load a fixed length section of sound
    # Might need to pad if some songs are too short
    y, sr = librosa.load(path, sr=sample_rate, offset=0, duration=song_length)
    assert sr == sample_rate, sr
    _song_length = len(y)/sample_rate

    assert _song_length == song_length, _song_length

    # Split into windows
    window_samples = int(sample_rate * window_length)
    window_hop = window_samples//2 # use 50% overlap
    windows = librosa.util.frame(y, frame_length=window_samples, hop_length=window_hop)

    # Calculate features for each window
    features = []
    for w in range(windows.shape[1]):
        win = windows[:, w]
        f = mfcc(y=win, sr=sample_rate, n_mfcc=n_bands,
                 hop_length=hop_length, n_fft=2*hop_length)
        f = numpy.expand_dims(f, -1) # add channels dimension 
        features.append(f)

    features = numpy.stack(features)
    return features

def main():

    # Settings for our model
    n_bands = 13 # MFCCs
    sample_rate = 22050
    hop_length = 512
    window_length = 5.0
    song_length_max = 1.0*60
    n_frames = math.ceil(window_length / (hop_length/sample_rate))
    n_windows = math.floor(song_length_max / (window_length/2))-1

    model = song_model(n_bands, n_frames, n_windows)

    # Generate some example data
    ex = librosa.util.example_audio_file()  # deprecated in newer librosa; librosa.example() is the replacement
    examples = 8
    numpy.random.seed(2)
    songs = pandas.DataFrame({
        'path': [ex] * examples,
        'genre': numpy.random.choice([ 'rock', 'metal', 'blues' ], size=examples),
    })
    assert len(songs.genre.unique()) == 3

    print('Song data')
    print(songs)

    def get_features(path):
        f = extract_features(path, sample_rate, n_bands,
                    hop_length, n_frames, window_length, song_length_max)
        return f

    from sklearn.preprocessing import LabelBinarizer

    binarizer = LabelBinarizer()
    y = binarizer.fit_transform(songs.genre.values)
    print('y', y.shape, y)

    features = numpy.stack([ get_features(p) for p in songs.path ])
    print('features', features.shape)

    model.fit(features, y) 


if __name__ == '__main__':
    main()

Example output of the inner and combined model summaries:

[model.summary() output not preserved in the original page]

And the shape of the feature vectors fed into the model:

features (8, 23, 13, 216, 1)

8 songs, 23 windows per song, 13 MFCC bands, 216 frames per window. The fifth dimension of size 1 is a channels dimension to keep Keras happy...
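
Since features was stacked in the same row order as the songs DataFrame, song-level predictions can be matched back by position, which also answers the bookkeeping part of the question. A minimal sketch that could go at the end of main():

    pred = model.predict(features)  # shape (8, n_classes), one row per song
    songs['predicted'] = binarizer.inverse_transform(pred)  # argmax back to genre labels
    print(songs[['genre', 'predicted']])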
