He, Xiao (2014): Multipurpose exploratory mining of complex data. Dissertation, LMU München: Faculty of Mathematics, Computer Science and Statistics 

PDF
He_Xiao.pdf 5MB 
Abstract
Due to the increasing power of data acquisition and data storage technologies, a large number of data sets with complex structure are being collected in the era of data explosion. Instead of simple representations by low-dimensional numerical features, such data sources range from high-dimensional feature spaces to graph data describing relationships among objects. Many techniques exist in the literature for mining simple numerical data, but only a few approaches address the growing challenge of mining complex data, such as high-dimensional vectors of non-numerical data type, time series data, graphs, and multi-instance data, where each object is represented by a finite set of feature vectors. In addition, there are many important data mining tasks for high-dimensional data, such as clustering, outlier detection, dimensionality reduction, similarity search, classification, prediction and result interpretation. Many algorithms have been proposed to solve these tasks separately, although in some cases the tasks are closely related. Detecting and exploiting the relationships among them is another important challenge. This thesis aims to address these challenges in order to gain new knowledge from complex high-dimensional data. We propose several new algorithms that combine different data mining tasks: ROCAT (Relevant Overlapping Subspace Clusters on Categorical Data) automatically detects the most relevant overlapping subspace clusters on categorical data. It integrates clustering, feature selection and pattern mining in an information-theoretic way, without any input parameters. The next algorithm, MSS (Multiple Subspace Selection), finds multiple low-dimensional subspaces for moderately high-dimensional data, each exhibiting an interesting cluster structure. For better interpretation of the results, MSS visualizes the clusters in multiple low-dimensional subspaces in a hierarchical way.
SCMiner (Summarization-Compression Miner) focuses on bipartite graph data. It integrates co-clustering, graph summarization, link prediction, and the discovery of the hidden structure of a bipartite graph on the basis of data compression. Finally, we propose a novel similarity measure for multi-instance data. The Probabilistic Integral Metric (PIM) is based on a probabilistic generative model requiring few assumptions. Experiments demonstrate the effectiveness and efficiency of PIM for similarity search (multi-instance data indexing with the M-tree), exploratory data analysis and data mining (multi-instance classification). To sum up, we propose algorithms that combine different data mining tasks for complex data with various data types and data structures, in order to discover novel knowledge hidden in complex data.
Item Type:  Thesis (Dissertation, LMU Munich) 

Keywords:  Exploratory Data Mining, Subspace Clustering, Minimum Description Length, Multi-instance Indexing 
Subjects:  000 Computers, Information and General Reference; 000 Computers, Information and General Reference > 004 Data processing computer science 
Faculties:  Faculty of Mathematics, Computer Science and Statistics 
Language:  English 
Date of oral examination:  5. November 2014 
1. Referee:  Böhm, Christian 
MD5 Checksum of the PDF file:  d3c168f29b690c57bf68f4c1c95ce6e8 
Signature of the printed copy:  0001/UMC 22484 
ID Code:  17598 
Deposited On:  10. Nov 2014 08:52 
Last Modified:  20. Jul 2016 10:37 