
Dialnet


Speaker localization and orientation in multimodal smart environments

  • Author: Carlos Segura Perales
  • Thesis supervisor: Francisco Javier Hernando Pericás
  • Defended: Universitat Politècnica de Catalunya (UPC), Spain, 2011
  • Language: English
  • Thesis committee: Climent Nadeu Camprubí (chair), Antonio Bonafonte Cávez (secretary), Dusan Macho (member), Alberto Abad Gareta (member), Cristian Canton Ferrer (member)
  • Subjects:
  • Full text not available
  • Abstract
    • Significant research effort has been devoted to developing human-computer interfaces for intelligent environments that aim to support human tasks and activities. Knowing the position and orientation of the speakers present in a room is valuable information that allows a better understanding of user activities and human interactions in those environments, such as analyzing group dynamics or behaviors, deciding who is the active speaker among all those present, or determining who is talking to whom. Within this context, this thesis addresses the problem of speaker tracking and speaker orientation estimation in real scenarios, with a focus on seminars and meetings that take place in a smart room.

      This thesis describes the development of a robust speaker tracking system based on the audio signals captured by a set of distributed microphones. Generalized cross-correlation estimates between pairs of microphones, computed with one of the most successful state-of-the-art algorithms, are combined into a spatial likelihood function whose maximum gives the estimated position of the acoustic source. However, the location estimates produced by acoustic localization algorithms are usually contaminated by spurious measurements caused by noise or by reflections of the voice off adjacent objects. Two approaches are proposed to filter the noise-corrupted location estimates according to a motion model and obtain a reliable, smooth track of the acoustic sources: one based on the Kalman filter and one based on sequential Monte Carlo methods. These two tracking systems are then extended to track speakers using audiovisual information. The acoustic localization and tracking algorithms described in this thesis have also been adapted to other speech-related fields of research, resulting in contributions to the speaker identification, speaker diarization and acoustic event detection tasks.
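
      As an illustration of the cross-correlation step described above, the following minimal sketch estimates the time delay between one pair of microphones. It assumes the PHAT weighting, which the abstract does not name, and the function name gcc_phat and its parameters are hypothetical rather than taken from the thesis.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Estimate the delay of y relative to x, in seconds, with GCC-PHAT.

    x, y    : 1-D arrays with the signals of the two microphones
    fs      : sampling rate in Hz
    max_tau : optional bound on the absolute delay, in seconds
    """
    n = len(x) + len(y)                      # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = Y * np.conj(X)                   # cross-power spectrum
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)            # generalized cross-correlation
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Reorder so that lags run from -max_shift to +max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift   # lag of the correlation peak, in samples
    return shift / fs
```

      A positive return value means the sound reached the second microphone later than the first. In a full system, the correlation values of many microphone pairs would be accumulated over candidate positions to build the spatial likelihood function.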

      Regarding speaker orientation estimation, interest in addressing this problem with multiple-microphone recordings is so recent that very few works can be found in the speech-related literature; the problem has mostly been tackled using visual cues only. Moreover, the lack of public acoustic databases with accurate head orientation annotations has forced published works on speaker orientation estimation to rely on their own databases, which makes a fair comparison between methods difficult.

      A standalone energy-based head orientation estimation method is proposed that exploits the frequency dependence of the head radiation pattern.
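
      The idea behind an energy-based cue can be sketched as follows. This is not the method of the thesis: it is a simplified illustration that assumes a planar room, a known speaker position, and that microphones facing the speaker receive relatively more high-frequency energy. The function names, the 1000 Hz split frequency and the vector-sum rule are all assumptions made for the example, and distance attenuation is ignored.

```python
import numpy as np

def band_energy_ratio(signal, fs, split_hz=1000.0):
    """Ratio of spectral energy above split_hz to energy below it (hypothetical cue)."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    high = spectrum[freqs >= split_hz].sum()
    low = spectrum[freqs < split_hz].sum()
    return high / (low + 1e-12)

def estimate_orientation(speaker_xy, mic_xy, ratios):
    """Planar orientation estimate, in radians, from per-microphone ratios.

    Each microphone votes for the direction from the speaker to itself,
    weighted by its high/low band energy ratio.
    """
    speaker_xy = np.asarray(speaker_xy, dtype=float)
    mic_xy = np.asarray(mic_xy, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    directions = mic_xy - speaker_xy
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    vote = (ratios[:, None] * directions).sum(axis=0)
    return float(np.arctan2(vote[1], vote[0]))
```

      With four microphones placed around the speaker, the estimate points toward the microphone with the largest ratio when the remaining ratios are symmetric.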

      The proposed method is combined with a video head pose estimation approach in several multimodal fusion schemes at the data and decision levels, achieving a substantial reduction of the orientation estimation error relative to both the audio-only and video-only systems.
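
      A decision-level fusion of two orientation estimates can be illustrated with a short sketch. The abstract does not specify the fusion rules used, so the inverse-variance weighted circular mean below, the function name fuse_angles and its parameters are assumptions chosen only to show why angles should be averaged on the circle rather than arithmetically.

```python
import math

def fuse_angles(audio_deg, video_deg, audio_var, video_var):
    """Fuse two angle estimates (degrees) with inverse-variance weights.

    The weighted average is taken over unit vectors, so that estimates
    on both sides of the 0/360 wrap-around are combined correctly.
    Returns an angle in [0, 360).
    """
    w_audio = 1.0 / audio_var
    w_video = 1.0 / video_var
    a = math.radians(audio_deg)
    v = math.radians(video_deg)
    s = w_audio * math.sin(a) + w_video * math.sin(v)
    c = w_audio * math.cos(a) + w_video * math.cos(v)
    return math.degrees(math.atan2(s, c)) % 360.0
```

      For example, fusing 350 degrees and 10 degrees with equal variances yields an angle at the wrap-around point, whereas a plain arithmetic mean would give 180 degrees.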

      Additionally, the cross-correlation function between pairs of microphones is studied as a basis for speaker orientation estimation, and it is reported to depend strongly on both speaker orientation and frequency. A two-step algorithm that first estimates the position and then the speaker orientation is proposed, based on orientation cues derived from the cross-correlation. The experimental results show strong performance compared with other audio-based state-of-the-art approaches, achieving results similar to those of video algorithms.

      Finally, a particle filter approach for joint head position and 3D orientation estimation is developed. The proposed system takes advantage of the fact that speaker orientation information can be used to improve speaker localization precision and vice versa, by defining a joint state vector and joint likelihood functions in the filter's dynamical model. Experiments conducted on a purpose-recorded database with annotated head center position and 3D head rotation showed better performance for the joint approach than for the two-step algorithm in terms of localization and orientation angle precision.
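
      The joint state idea can be sketched with a basic bootstrap particle filter. This is a simplified illustration and not the system of the thesis: it assumes a planar state (x, y, yaw) instead of a 3D rotation, a random-walk motion model, and independent Gaussian and von Mises observation terms whose product forms the joint likelihood. The function name, the 5 m by 5 m room and all parameter values are hypothetical.

```python
import numpy as np

def joint_particle_filter(pos_obs, ang_obs, n_particles=500, pos_sigma=0.3,
                          ang_kappa=4.0, step_pos=0.05, step_ang=0.05, seed=0):
    """Track (x, y, yaw) from noisy position and yaw observations.

    pos_obs : array of shape (T, 2) with position measurements in meters
    ang_obs : array of shape (T,) with yaw measurements in radians
    Returns an array of shape (T, 3) with the filtered estimates.
    """
    pos_obs = np.asarray(pos_obs, dtype=float)
    ang_obs = np.asarray(ang_obs, dtype=float)
    rng = np.random.default_rng(seed)
    # Joint state vector per particle: columns are x, y, yaw
    particles = np.column_stack((
        rng.uniform(0.0, 5.0, n_particles),      # assumed room width
        rng.uniform(0.0, 5.0, n_particles),      # assumed room depth
        rng.uniform(-np.pi, np.pi, n_particles),
    ))
    estimates = np.zeros((len(pos_obs), 3))
    for t in range(len(pos_obs)):
        # Prediction: random-walk motion model on position and yaw
        particles[:, :2] += rng.normal(0.0, step_pos, (n_particles, 2))
        particles[:, 2] += rng.normal(0.0, step_ang, n_particles)
        # Update: joint log-likelihood (Gaussian position term + von Mises yaw term)
        d2 = np.sum((particles[:, :2] - pos_obs[t]) ** 2, axis=1)
        log_w = -0.5 * d2 / pos_sigma ** 2
        log_w += ang_kappa * np.cos(particles[:, 2] - ang_obs[t])
        w = np.exp(log_w - log_w.max())          # subtract max for numerical stability
        w /= w.sum()
        # Estimate: weighted mean position and weighted circular mean yaw
        estimates[t, :2] = w @ particles[:, :2]
        estimates[t, 2] = np.arctan2(w @ np.sin(particles[:, 2]),
                                     w @ np.cos(particles[:, 2]))
        # Resampling: multinomial, proportional to the weights
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
    return estimates
```

      In this sketch the two observation terms are independent, so the coupling between position and orientation that the thesis exploits is not modeled; a joint likelihood in which the expected acoustic measurements depend on both quantities would replace the two separate terms.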

