Advanced Probabilistic Models for Clustering and Projection

Probabilistic modeling for data mining and machine learning problems is a fundamental research area. The general approach is to assume a generative model underlying the observed data and to estimate the model parameters via likelihood maximization. It builds on probability theory as its mathematical foundation and draws on a large body of methods from statistical learning, sampling theory and Bayesian statistics. In this thesis we study several advanced probabilistic models for data clustering and feature projection, two important unsupervised learning problems. The goal of clustering is to group similar data points together to uncover the clusters in the data. While numerous methods exist for various clustering tasks, one important question remains: how to determine the number of clusters automatically. The first part of the thesis answers this question from a mixture modeling perspective. A finite mixture model is first introduced for clustering, in which each mixture component is assumed to be an exponential family distribution for generality. The model is then extended to an infinite mixture model, and its strong connection to the Dirichlet process (DP), a non-parametric Bayesian framework, is uncovered. A variational Bayesian algorithm called VBDMA is derived from this insight to learn the number of clusters automatically, and empirical studies on several 2D data sets and an image data set verify the effectiveness of the algorithm. In feature projection, we are interested in dimensionality reduction and aim to find a low-dimensional feature representation of the data. We first review the well-known principal component analysis (PCA) and its probabilistic interpretation (PPCA), and then generalize PPCA to a novel probabilistic model that can handle the non-linear projection known as kernel PCA. An expectation-maximization (EM) algorithm is derived for kernel PCA that is fast and applicable to large data sets.
Then we propose a novel supervised projection method called MORP, which takes the output information into account in a supervised learning context. Empirical studies on various data sets show much better results than unsupervised projection and other supervised projection methods. We further generalize MORP probabilistically to propose SPPCA for supervised projection, and naturally extend the model to S2PPCA, a semi-supervised projection method. This allows us to incorporate both the label information and the unlabeled data into the projection process. In the third part of the thesis, we introduce a unified probabilistic model that handles data clustering and feature projection jointly. The model can be viewed both as a clustering model with projected features and as a projection model with structured documents. A variational Bayesian learning algorithm is derived, which turns out to alternate between clustering operations and projection operations until convergence. Superior performance is obtained for both clustering and projection.
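
For orientation, PPCA, which the abstract takes as the starting point for its projection models, can be fitted with a standard EM iteration. The following is a minimal NumPy sketch of that textbook procedure. It is not code from the thesis and does not implement the kernel PCA, MORP, SPPCA or S2PPCA algorithms derived there; the function name `ppca_em`, its parameters and the initial values are assumptions made for illustration.

```python
import numpy as np


def ppca_em(X, q, n_iter=200, seed=0):
    """Fit probabilistic PCA by EM (textbook updates; illustrative only).

    X is an (N, d) data matrix and q the latent dimension.
    Returns the loading matrix W (d, q), the noise variance sigma2,
    and the data mean mu (d,).
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu                      # the ML estimate of the mean is the sample mean
    W = rng.normal(size=(d, q))      # random initial loadings (assumption)
    sigma2 = 1.0                     # initial noise variance (assumption)
    total_var = np.sum(Xc ** 2)

    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables z_n.
        M = W.T @ W + sigma2 * np.eye(q)
        M_inv = np.linalg.inv(M)
        Ez = Xc @ W @ M_inv                       # rows are E[z_n]
        sum_Ezz = N * sigma2 * M_inv + Ez.T @ Ez  # sum over n of E[z_n z_n^T]

        # M-step: re-estimate the loadings and the noise variance.
        W = (Xc.T @ Ez) @ np.linalg.inv(sum_Ezz)
        sigma2 = (
            total_var
            - 2.0 * np.sum((Xc @ W) * Ez)
            + np.trace(sum_Ezz @ W.T @ W)
        ) / (N * d)

    return W, sigma2, mu
```

At convergence the columns of `W` span the principal subspace of the sample covariance, and a low-dimensional representation of a point `x` can be taken as the posterior mean of its latent variable, `inv(W.T @ W + sigma2 * I) @ W.T @ (x - mu)`.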

Probabilistic modeling, Clustering, Projection, Mixture modeling, Dirichlet process, Dimensionality reduction, Feature transformation

Yu, Shipeng

29. Sep. 2006

2006

English

Universitätsbibliothek der Ludwig-Maximilians-Universität München

https://nbn-resolving.org/urn:nbn:de:bvb:19-58840

Yu, Shipeng (2006): Advanced Probabilistic Models for Clustering and Projection. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik

PDF
Yu_Shipeng.pdf
10MB

DOI: 10.5282/edoc.5884

URN: urn:nbn:de:bvb:19-58840

Document type:	Dissertations (Dissertation, LMU München)
Keywords:	Probabilistic modeling, Clustering, Projection, Mixture modeling, Dirichlet process, Dimensionality reduction, Feature transformation
Subject areas:	500 Natural sciences and mathematics; 500 Natural sciences and mathematics > 510 Mathematics
Faculties:	Fakultät für Mathematik, Informatik und Statistik
Language of the thesis:	English
Date of oral examination:	29 September 2006
First reviewer:	Kriegel, Hans-Peter
MD5 checksum of the PDF file:	85e5058373a8b2b128c5230ce64042cb
Shelf mark of the printed edition:	0001/UMC 15691
ID Code:	5884
Deposited on:	12 Oct. 2006
Last modified:	24 Oct. 2020 09:12
