Wackersreuther, Bianca (2011): Efficient Knowledge Extraction from Structured Data. Dissertation, LMU München: Faculty of Mathematics, Computer Science and Statistics 

PDF
Wackersreuther_Bianca.pdf 4MB 
Abstract
Knowledge extraction from structured data aims for identifying valid, novel, potentially useful, and ultimately understandable patterns in the data. The core step of this process is the application of a data mining algorithm in order to produce an enumeration of particular patterns and relationships in large databases. Clustering is one of the major data mining tasks and aims at grouping the data objects into meaningful classes (clusters) such that the similarity of objects within clusters is maximized, and the similarity of objects from different clusters is minimized. In this thesis, we advance the stateoftheart data mining algorithms for analyzing structured data types. We describe the development of innovative solutions for hierarchical data mining. The EMbased hierarchical clustering method ITCH (InformationTheoretic Cluster Hierarchies) is designed to propose solid solutions for four different challenges. (1) to guide the hierarchical clustering algorithm to identify only meaningful and valid clusters. (2) to represent each cluster content in the hierarchy by an intuitive description with e.g. a probability density function. (3) to consistently handle outliers. (4) to avoid difficult parameter settings. ITCH is built on a hierarchical variant of the informationtheoretic principle of Minimum Description Length (MDL). Interpreting the hierarchical cluster structure as a statistical model of the dataset, it can be used for effective data compression by Huffman coding. Thus, the achievable compression rate induces a natural objective function for clustering, which automatically satisfies all four above mentioned goals. The geneticbased hierarchical clustering algorithm GACH (Genetic Algorithm for finding Cluster Hierarchies) overcomes the problem of getting stuck in a local optimum by a beneficial combination of genetic algorithms, information theory and modelbased clustering. Besides hierarchical data mining, we also made contributions to more complex data structures, namely objects that consist of mixed type attributes and skyline objects. The algorithm INTEGRATE performs integrative mining of heterogeneous data, which is one of the major challenges in the next decade, by a unified view on numerical and categorical information in clustering. Once more, supported by the MDL principle, INTEGRATE guarantees the usability on real world data. For skyline objects we developed SkyDist, a similarity measure for comparing different skyline objects, which is therefore a first step towards performing data mining on this kind of data structure. Applied in a recommender system, for example SkyDist can be used for pointing the user to alternative car types, exhibiting a similar price/mileage behavior like in his original query. For mining graphstructured data, we developed different approaches that have the ability to detect patterns in static as well as in dynamic networks. We confirmed the practical feasibility of our novel approaches on large realworld case studies ranging from medical brain data to biological yeast networks. In the second part of this thesis, we focused on boosting the knowledge extraction process. We achieved this objective by an intelligent adoption of Graphics Processing Units (GPUs). The GPUs have evolved from simple devices for the display signal preparation into powerful coprocessors that do not only support typical computer graphics tasks but can also be used for general numeric and symbolic computations. As major advantage, GPUs provide extreme parallelism combined with a high bandwidth in memory transfer at low cost. In this thesis, we propose algorithms for computationally expensive data mining tasks like similarity search and different clustering paradigms which are designed for the highly parallel environment of a GPU, called CUDADClust and CUDAkmeans. We define a multidimensional index structure which is particularly suited to support similarity queries under the restricted programming model of a GPU. We demonstrate the superiority of our algorithms running on GPU over their conventional counterparts on CPU in terms of efficiency.
Item Type:  Thesis (Dissertation, LMU Munich) 

Keywords:  Data Mining, KDD, GPU 
Subjects:  000 Computers, Information and General Reference > 004 Data processing computer science 000 Computers, Information and General Reference 
Faculties:  Faculty of Mathematics, Computer Science and Statistics 
Language:  English 
Date Accepted:  15. December 2011 
1. Referee:  Böhm, Christian 
Persistent Identifier (URN):  urn:nbn:de:bvb:19138079 
MD5 Checksum of the PDFfile:  e753b46c03569fcc79d9dad38a21b323 
Signature of the printed copy:  0001/UMC 19986 
ID Code:  13807 
Deposited On:  10. Jan 2012 10:21 
Last Modified:  16. Oct 2012 08:56 