This project investigates an approach to speaker verification using a multilayer perceptron (MLP) neural network trained on Mel Frequency Cepstral Coefficients (MFCC) parametrizations of two classes: one of the client speaker's voice and the other of the voices of other speakers. The role of the trained MLP is to discriminate between these two classes by returning the percentage error of a given input with respect to each class. Many experiments were performed in order to find the best neural network configuration. The project name, A.VRI.L, stands for speAker VeRIfication Laboratory.
Introduction
A.VRI.L is a project aimed at creating a speaker verification laboratory based on a Multi-Layer Perceptron (MLP) neural network. Traditionally this task is performed using Hidden Markov Models (HMM) or an HMM-MLP combination. Speaker verification is divided into three parts: first, a model of the speaker (MFCC) is created using the HTK tool [ref-HTK]; then the MLP network is trained with this file together with a combination of several "non-client" speaker models (also called the "world"); finally, once the MLP is trained for a given speaker, its internal node configuration is saved to a file, which can later be used to perform verification. One of the difficulties of this approach is the modest amount of client data available compared to world data, which can be worked around by supplying the same client model several times.

It is important to clearly distinguish speaker verification from speaker identification: this project focuses on the former, that is, distinguishing between a client speaker and an impostor, not resolving the speaker's identity.

The development of this type of speaker verification technique could follow two paths. The first path, on which this project focuses, takes a global modeling and training approach: coding an MLP (using the Torch 3 library [ref-Torch]) and training it on global MFCC, where a global MFCC is intended as a global model of the speaker's voice. The second path would involve a more complex approach using one MLP per phonetic class: the voice model would have to be segmented according to phonetic classes, and the data of each class would then be used to train its own MLP; speaker verification would be the result of multiple verifications conducted on the different phonetic classes, followed by a score recombination. As said before, this project only follows the first path, meaning that only the global parametrization and discrimination approach has been studied and developed; extending the work to a segmental approach should not be very hard, considering that the most critical part, the development of the neural network software, is working and performing very well.
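To make the global approach concrete, the following is a minimal sketch, written in Python with NumPy rather than the project's actual Torch 3 / C++ code, of a one-hidden-layer MLP that discriminates client MFCC frames from world frames. All names and hyper-parameters here (hidden size, learning rate, number of epochs) are illustrative assumptions; in the project itself the MFCC matrices are produced by HTK and the network is built with Torch 3.

    import numpy as np

    def train_client_mlp(client_mfcc, world_mfcc, hidden=20, lr=0.01,
                         epochs=50, seed=0):
        """Train a one-hidden-layer MLP to output ~1 for client MFCC
        frames and ~0 for world (non-client) frames. Inputs are 2-D
        arrays of shape (n_frames, n_coefficients)."""
        rng = np.random.default_rng(seed)
        # Oversample the scarce client data so the two classes are
        # roughly balanced, mirroring the workaround described above
        # (supplying the same client model several times).
        reps = max(1, len(world_mfcc) // len(client_mfcc))
        X = np.vstack([np.tile(client_mfcc, (reps, 1)), world_mfcc])
        y = np.concatenate([np.ones(len(client_mfcc) * reps),
                            np.zeros(len(world_mfcc))])
        # Small random initial weights for both layers.
        W1 = rng.normal(0.0, 0.1, (X.shape[1], hidden)); b1 = np.zeros(hidden)
        W2 = rng.normal(0.0, 0.1, hidden); b2 = 0.0
        for _ in range(epochs):
            for i in rng.permutation(len(X)):   # plain stochastic gradient descent
                h = np.tanh(X[i] @ W1 + b1)               # hidden layer
                o = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid output in [0, 1]
                d_o = o - y[i]                  # cross-entropy gradient at the output
                d_h = d_o * W2 * (1.0 - h * h)  # backpropagated through tanh
                W2 -= lr * d_o * h;             b2 -= lr * d_o
                W1 -= lr * np.outer(X[i], d_h); b1 -= lr * d_h
        return W1, b1, W2, b2

    def verify(mfcc_frames, W1, b1, W2, b2, threshold=0.5):
        """Score an utterance as the mean client probability over its
        frames and accept it if the score exceeds the threshold."""
        h = np.tanh(mfcc_frames @ W1 + b1)
        scores = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
        return scores.mean() >= threshold, scores.mean()

The returned weights play the role of the saved internal node configuration of the trained MLP: they could, for instance, be stored with np.savez after training and reloaded whenever a verification for that speaker is requested.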