Bottleneck and Embedding Representation of Speech for DNN-Based Language and Speaker Recognition

  • Author: Alicia Lozano Díez
  • Thesis supervisors: Joaquín González Rodríguez (supervisor), Javier González Domínguez (supervisor)
  • Defence: Universidad Autónoma de Madrid (Spain), 2018
  • Language: English
  • Number of pages: 138
  • Thesis committee: Luis Alfonso Hernández Gómez (chair), Daniel Ramos Castro (secretary), Ascensión Gallardo Antolín (member), Lukas Burget (member), Mitchell McLaren (member)
  • Doctoral programme: Programa de Doctorado en Ingeniería Informática y de Telecomunicación, Universidad Autónoma de Madrid
  • Abstract
    • Automatic speech recognition has experienced breathtaking progress in the last few years, partially thanks to the introduction of deep neural networks. This evolution in speech recognition systems has spread to related areas such as language and speaker recognition, where deep neural networks have noticeably improved performance.

      In this PhD thesis, we have explored different approaches to the tasks of speaker and language recognition, focusing on systems where deep neural networks become part of traditional pipelines, replacing either some of their stages or the whole system.

      Specifically, in the first experimental block, we analyze end-to-end language recognition systems based on deep neural networks, where the network is used directly as a classifier: no separate backend is needed, and the language recognition task is performed from the scores (posterior probabilities) provided by the network. This work focuses on two architectures, convolutional neural networks (CNNs) and long short-term memory (LSTM) recurrent neural networks, which are less demanding in terms of computational resources thanks to their reduced number of free parameters compared with other deep neural networks. These systems thus constitute an alternative to classical i-vectors and achieve comparable results, especially when dealing with short utterances. In particular, we conducted experiments comparing a system based on convolutional neural networks with classical Factor Analysis GMM and i-vector reference systems, and evaluated them on two different tasks from the National Institute of Standards and Technology (NIST) Language Recognition Evaluation (LRE) 2009: one focused on language pairs and the other on multi-class language identification. Results showed comparable performance for the CNN-based approaches, and some improvements were achieved when fusing the classical and neural network approaches. We also present the experiments performed with LSTM recurrent neural networks, which have proven their ability to model time-dependent sequences. We evaluate our LSTM-based language recognition systems on different subsets of the NIST LRE 2009 and 2015, where the LSTM systems are able to outperform the reference i-vector system with a model that has fewer parameters, although it is more prone to overfitting and does not generalize as well as i-vectors on mismatched datasets.
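
      To make the end-to-end setup concrete, below is a minimal sketch of such a classifier in PyTorch: frame-level acoustic features go in, per-language posterior probabilities come out, and no separate backend is used. The layer sizes and number of languages are illustrative assumptions, not the configuration used in the thesis.

      import torch
      import torch.nn as nn

      class LSTMLanguageClassifier(nn.Module):
          def __init__(self, feat_dim=40, hidden=512, n_languages=8):
              super().__init__()
              # The recurrent layer models the temporal structure of speech.
              self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
              # A linear layer maps the last hidden state to language scores.
              self.output = nn.Linear(hidden, n_languages)

          def forward(self, x):
              # x: (batch, n_frames, feat_dim) sequence of acoustic features.
              _, (h_n, _) = self.lstm(x)
              # Posteriors over the language set, used directly as recognition
              # scores with no additional backend.
              return torch.softmax(self.output(h_n[-1]), dim=-1)

      # Example: score a 3-second utterance (300 frames of 40-dim features).
      model = LSTMLanguageClassifier()
      posteriors = model(torch.randn(1, 300, 40))  # shape (1, 8)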

      In the second experimental block of this Dissertation, we explore one of the most prominent applications of deep neural networks in speech processing: their use as feature extractors. In this kind of system, a deep neural network is used to obtain a frame-by-frame representation of the speech signal, the so-called bottleneck feature vector, which is learned directly by the network and is then used instead of traditional acoustic features as input to language and speaker recognition systems based on i-vectors. This approach revolutionized both fields, since it largely outperformed the classical systems (i-vectors based on acoustic features) that had been state-of-the-art for many years. Our analysis focuses on how different configurations of the neural network used as bottleneck feature extractor, which is trained for automatic speech recognition (ASR), influence the performance of the resulting features for language and speaker recognition. For language recognition, we compare bottleneck features from networks that vary in depth (number of hidden layers), in the position of the bottleneck layer, where the network compresses the information, and in the number of units (size) of that layer, all of which influence the representation obtained by the network. With the set of experiments performed on bottleneck features for speaker recognition, we analyze the influence of the type of features used to feed the network, their pre-processing and, in general, the optimization of the network for the task of feature extraction for speaker recognition, which might not coincide with the optimal configuration for ASR.
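
      As a rough illustration of the bottleneck idea, the sketch below shows a feed-forward network trained on ASR targets whose narrow middle layer is read out as a frame-level feature. All dimensions (stacked-frame input size, hidden width, bottleneck size, number of targets) are hypothetical choices for this example, not the configurations compared in the thesis.

      import torch
      import torch.nn as nn

      class BottleneckDNN(nn.Module):
          def __init__(self, feat_dim=440, hidden=1500, bottleneck=80, n_targets=3000):
              super().__init__()
              self.pre = nn.Sequential(                        # layers before the bottleneck
                  nn.Linear(feat_dim, hidden), nn.ReLU(),
                  nn.Linear(hidden, hidden), nn.ReLU(),
              )
              self.bottleneck = nn.Linear(hidden, bottleneck)  # the narrow layer
              self.post = nn.Sequential(                       # layers after the bottleneck
                  nn.ReLU(),
                  nn.Linear(bottleneck, hidden), nn.ReLU(),
                  nn.Linear(hidden, n_targets),                # ASR training targets
              )

          def forward(self, x):
              # Used during training: predict ASR targets from stacked frames.
              return self.post(self.bottleneck(self.pre(x)))

          def extract(self, x):
              # Used after training: one compact vector per frame, which then
              # replaces acoustic features at the input of an i-vector system.
              return self.bottleneck(self.pre(x))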

      Finally, the third experimental block of this Thesis proposes a novel approach for language recognition in which the neural network is used to extract a fixed-length, utterance-level representation of a speech segment, known as an embedding, which can replace the classical i-vector and overcomes the variable-length sequences of features provided by the bottleneck approach. This embedding-based approach had recently shown promising results for speaker verification tasks, and our proposed system was able to outperform a strong state-of-the-art reference i-vector system on the latest, challenging language recognition evaluations organized by NIST in 2015 and 2017. We therefore analyze language recognition systems based on embeddings, and explore different deep neural network architectures and data augmentation techniques to improve the results of our system. In general, these embeddings are a serious competitor to the well-established i-vector pipeline, allowing the whole i-vector model to be replaced by a deep neural network. Furthermore, the network is able to extract information complementary to that contained in the i-vectors, even from the same input features. All this makes us consider this contribution an interesting research line to explore in other fields.
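
      The sketch below illustrates an embedding extractor of this kind: frame-level layers followed by statistics pooling over time, yielding a fixed-length vector regardless of utterance duration. It loosely follows the x-vector style; the layer types and sizes are assumptions made for illustration only.

      import torch
      import torch.nn as nn

      class EmbeddingExtractor(nn.Module):
          def __init__(self, feat_dim=40, hidden=512, emb_dim=256, n_languages=14):
              super().__init__()
              self.frame_layers = nn.Sequential(      # frame-by-frame processing
                  nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
                  nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
              )
              # Mean and standard deviation are concatenated after pooling.
              self.embedding = nn.Linear(2 * hidden, emb_dim)
              self.classifier = nn.Linear(emb_dim, n_languages)  # training head

          def forward(self, x):
              # x: (batch, feat_dim, n_frames); n_frames may vary, since the
              # pooling step collapses the time axis.
              h = self.frame_layers(x)
              stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
              emb = self.embedding(stats)             # fixed-length embedding
              return self.classifier(emb), emb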

