Statistical distances and probability metrics for multivariate data, ensembles and probability distributions

Gabriel Martos Venturini

Author: Gabriel Martos Venturini
Thesis supervisor: Alberto Muñoz García
Defense: Universidad Carlos III de Madrid (Spain), 2015
Language: English
Thesis examination committee: Santiago Velilla Cerdán (chair), Veronica Vinciotti Vinciotti (secretary), Emilio Carrizosa Priego (member)
Subjects:
- Mathematics
Links
- Open-access thesis at: e-Archivo
Abstract
- Distance measures are of fundamental importance in Statistics for solving practical problems such as hypothesis testing, tests of independence, goodness-of-fit tests, classification tasks, outlier detection and density estimation, to name just a few. The Mahalanobis distance was originally developed to compute the distance from a point to the center of a distribution while taking into account the distribution of the data, in this case the normal distribution. This is the only distance measure in the statistical literature that takes into account the probabilistic information of the data. In this thesis we study different distance measures that share a fundamental characteristic: all the proposed distances incorporate probabilistic information. The thesis is organized as follows. In Chapter 1 we motivate the problems addressed in this thesis. In Chapter 2 we present the usual definitions and properties of the distance measures for multivariate data and for probability distributions treated in the statistical literature. In Chapter 3 we propose a distance that generalizes the Mahalanobis distance to the case where the distribution of the data is not Gaussian. To this end, we introduce a Mercer kernel based on the distribution of the data at hand. The Mercer kernel induces distances from a point to the center of a distribution. In this chapter we also present a plug-in estimator of the distance that allows us to solve classification and outlier detection problems efficiently. In Chapter 4 we present two new distance measures for multivariate data that incorporate the probabilistic information contained in the sample. In this chapter we also introduce two estimation methods for the proposed distances and study their convergence empirically.
In the experimental section of Chapter 4 we solve classification problems and obtain better results than several standard classification methods from the discriminant analysis literature. In Chapter 5 we propose a new family of probability metrics and study its theoretical properties. We introduce an estimation method for the proposed distances that is based on the estimation of level sets, thereby avoiding the difficult task of density estimation. In this chapter we show that the proposed distance can be used for hypothesis testing and classification problems in general contexts, obtaining better results than other standard methods in statistics. In Chapter 6 we introduce a new distance for sets of points. To this end, we define a dissimilarity measure for points by using a Mercer kernel, which is later extended to a Mercer kernel for sets of points. In this way, we induce a dissimilarity index for sets of points that is used as an input for an adaptive k-means clustering algorithm. The proposed clustering algorithm considers an alignment of the sets of points by taking into account a wide range of possible warping functions. This chapter presents an application to clustering neuronal spike trains, a relevant problem in neural coding. Finally, in Chapter 7, we present the general conclusions of this thesis and future lines of research.
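For readers unfamiliar with the classical Mahalanobis distance that the abstract takes as its starting point, the following is a minimal sketch and not code from the thesis. It computes the distance from a point to the sample mean, scaled by the sample covariance. The function name `mahalanobis_to_center`, the synthetic Gaussian sample and the two test points are assumptions introduced only for illustration, and the sketch assumes data with at least two dimensions.

```python
import numpy as np

def mahalanobis_to_center(x, sample):
    """Distance from point x to the mean of `sample`, scaled by the sample covariance.

    Hypothetical helper for illustration; assumes `sample` has shape (n, d) with d >= 2.
    """
    sample = np.asarray(sample, dtype=float)
    mu = sample.mean(axis=0)
    cov = np.cov(sample, rowvar=False)  # rows are observations, columns are variables
    diff = np.asarray(x, dtype=float) - mu
    # Solve cov @ z = diff rather than forming the inverse covariance explicitly.
    z = np.linalg.solve(cov, diff)
    return float(np.sqrt(diff @ z))

rng = np.random.default_rng(0)
# Synthetic, strongly correlated Gaussian sample (illustrative data only).
sample = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.9], [0.9, 1.0]], size=2000)

along = np.array([2.0, 2.0])    # lies along the main axis of the point cloud
across = np.array([2.0, -2.0])  # same Euclidean norm, but in a low-density direction

d_along = mahalanobis_to_center(along, sample)
d_across = mahalanobis_to_center(across, sample)
```

Both points are equally far from the origin in the Euclidean sense, yet `d_across` comes out larger than `d_along` because the second point lies in a direction where the data have little spread. This is the sense in which the distance uses probabilistic information about the data.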
