Kailing, Karin (2004): New Techniques for Clustering Complex Objects. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik 

PDF
kailing_karin.pdf 3MB 
Abstract
The tremendous amount of data produced nowadays in various application domains such as molecular biology or geography can only be fully exploited by efficient and effective data mining tools. One of the primary data mining tasks is clustering, which is the task of partitioning points of a data set into distinct groups (clusters) such that two points from one cluster are similar to each other whereas two points from distinct clusters are not. Due to modern database technology, e.g.object relational databases, a huge amount of complex objects from scientific, engineering or multimedia applications is stored in database systems. Modelling such complex data often results in very highdimensional vector data ("feature vectors"). In the context of clustering, this causes a lot of fundamental problems, commonly subsumed under the term "Curse of Dimensionality". As a result, traditional clustering algorithms often fail to generate meaningful results, because in such highdimensional feature spaces data does not cluster anymore. But usually, there are clusters embedded in lower dimensional subspaces, i.e. meaningful clusters can be found if only a certain subset of features is regarded for clustering. The subset of features may even be different for varying clusters. In this thesis, we present original extensions and enhancements of the densitybased clustering notion to cope with highdimensional data. In particular, we propose an algorithm called SUBCLU (densityconnected Subspace Clustering) that extends DBSCAN (DensityBased Spatial Clustering of Applications with Noise) to the problem of subspace clustering. SUBCLU efficiently computes all clusters of arbitrary shape and size that would have been found if DBSCAN were applied to all possible subspaces of the feature space. Two subspace selection techniques called RIS (Ranking Interesting Subspaces) and SURFING (SUbspaces Relevant For clusterING) are proposed. They do not compute the subspace clusters directly, but generate a list of subspaces ranked by their clustering characteristics. A hierarchical clustering algorithm can be applied to these interesting subspaces in order to compute a hierarchical (subspace) clustering. In addition, we propose the algorithm 4C (Computing Correlation Connected Clusters) that extends the concepts of DBSCAN to compute densitybased correlation clusters. 4C searches for groups of objects which exhibit an arbitrary but uniform correlation. Often, the traditional approach of modelling data as highdimensional feature vectors is no longer able to capture the intuitive notion of similarity between complex objects. Thus, objects like chemical compounds, CAD drawings, XML data or color images are often modelled by using more complex representations like graphs or trees. If a metric distance function like the edit distance for graphs and trees is used as similarity measure, traditional clustering approaches like densitybased clustering are applicable to those data. However, we face the problem that a single distance calculation can be very expensive. As clustering performs a lot of distance calculations, approaches like filter and refinement and metric indices get important. The second part of this thesis deals with special approaches for clustering in application domains with complex similarity models. We show, how appropriate filters can be used to enhance the performance of query processing and, thus, clustering of hierarchical objects. Furthermore, we describe how the two paradigms of filtering and metric indexing can be combined. As complex objects can often be represented by using different similarity models, a new clustering approach is presented that is able to cluster objects that provide several different complex representations.
Dokumententyp:  Dissertation (Dissertation, LMU München) 

Keywords:  data mining, densitybased clustering, subspace clustering, correlation clustering, complex objects 
Themengebiete:  500 Naturwissenschaften und Mathematik
500 Naturwissenschaften und Mathematik > 510 Mathematik 
Fakultäten:  Fakultät für Mathematik, Informatik und Statistik 
Sprache der Hochschulschrift:  Englisch 
Datum der mündlichen Prüfung:  15. November 2004 
1. Berichterstatter/in:  Kriegel, HansPeter 
MD5 Prüfsumme der PDFDatei:  254efdf810403ab2dd8e916201ee7759 
Signatur der gedruckten Ausgabe:  0001/UMC 14149 
ID Code:  2840 
Eingestellt am:  24. Nov. 2004 
Letzte Änderungen:  19. Jul. 2016 16:17 