Schubert, Matthias (2004): Advanced Data Mining Techniques for Compound Objects. Dissertation, LMU München: Faculty of Mathematics, Computer Science and Statistics 

PDF
Schubert_Matthias.pdf 4MB 
Abstract
Knowledge Discovery in Databases (KDD) is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large data collections. The most important step within the KDD process is data mining, which is concerned with the extraction of the valid patterns. KDD is necessary to analyze the steadily growing amount of data caused by the enhanced performance of modern computer systems. However, as the amount of data grows, the complexity of data objects increases as well. Modern KDD methods should therefore examine more complex objects than simple feature vectors to solve real-world KDD applications adequately. Multi-instance and multi-represented objects are two important types of object representations for complex objects. Multi-instance objects consist of a set of object representations that all belong to the same feature space. Multi-represented objects are constructed as a tuple of feature representations where each feature representation belongs to a different feature space. The contribution of this thesis is the development of new KDD methods for the classification and clustering of complex objects. To this end, the thesis introduces solutions for real-world applications that are based on multi-instance and multi-represented object representations. On the basis of these solutions, it is shown that a more general object representation often provides better results for many relevant KDD applications. The first part of the thesis is concerned with two KDD problems for which employing multi-instance objects provides efficient and effective solutions. The first is data mining in CAD parts, e.g. the use of hierarchical clustering for the automatic construction of product hierarchies. The introduced solution decomposes a single part into a set of feature vectors and compares these sets by using a metric on multi-instance objects. 
Furthermore, multi-step query processing using a novel filter step is employed, enabling the user to efficiently process similarity queries. On the basis of this similarity search system, it is possible to run several distance-based data mining algorithms, such as the hierarchical clustering algorithm OPTICS, to derive product hierarchies. The second important application is the classification of and search for complete websites in the World Wide Web (WWW). A website is a set of HTML documents that is published by the same person, group or organization and usually serves a common purpose. To perform data mining for websites, the thesis presents several methods to classify websites. After introducing naive methods that model websites as webpages, two more sophisticated approaches to website classification are introduced. The first approach uses a preprocessing step that maps single HTML documents within each website to so-called page classes. The second approach directly compares websites as sets of word vectors and uses nearest neighbor classification. To search the WWW for new, relevant websites, a focused crawler is introduced that efficiently retrieves relevant websites. This crawler minimizes the number of HTML documents and increases the accuracy of website retrieval. The second part of the thesis is concerned with data mining in multi-represented objects. An important example of this kind of complex object is a protein, which can be represented as a tuple of a protein sequence and a text annotation. To analyze multi-represented objects, a clustering method is introduced that is based on the density-based clustering algorithm DBSCAN. This method uses all representations that are provided to find a global clustering of the given data objects. However, in many applications there already exists a sophisticated class ontology for the given data objects, e.g. proteins. 
To map new objects into an ontology, a new method for the hierarchical classification of multi-represented objects is described. The system employs the hierarchical structure of the ontology to efficiently classify new proteins, using support vector machines.
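As an illustration of what "a metric on multi-instance objects" can look like, the following minimal sketch compares two objects that are each given as a set of feature vectors. It uses the Hausdorff distance, a standard metric on finite point sets, purely as a stand-in: the abstract does not name the metric used in the thesis, and the function name `hausdorff` and the example vectors are assumptions made for this sketch.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def hausdorff(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Hausdorff distance between two non-empty sets of feature vectors.

    Illustrative stand-in only; not necessarily the metric used in the thesis.
    """
    if not a or not b:
        raise ValueError("both multi-instance objects must be non-empty")

    def directed(p_set: Sequence[Vector], q_set: Sequence[Vector]) -> float:
        # For each vector in p_set, find its nearest neighbor in q_set,
        # then take the worst (largest) of these nearest-neighbor distances.
        return max(min(math.dist(p, q) for q in q_set) for p in p_set)

    # Symmetrize the directed distance so that d(a, b) == d(b, a).
    return max(directed(a, b), directed(b, a))


# Two hypothetical CAD parts, each decomposed into 2-D feature vectors.
part_a = [(0.0, 0.0), (1.0, 0.0)]
part_b = [(0.0, 0.0), (4.0, 0.0)]

print(hausdorff(part_a, part_b))  # 3.0
```

Any function of this kind that satisfies the metric axioms can be plugged into distance-based algorithms such as OPTICS or nearest neighbor classification, which is the role the abstract assigns to the multi-instance metric.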
Item Type:  Thesis (Dissertation, LMU Munich) 

Keywords:  Data Mining, Multi-Represented Objects, Multi-Instance Objects, Website Mining, Website Crawler, Knowledge Discovery, Protein Classification, Clustering of CAD Parts 
Subjects:  600 Natural sciences and mathematics; 600 Natural sciences and mathematics > 510 Mathematics 
Faculties:  Faculty of Mathematics, Computer Science and Statistics 
Language:  English 
Date Accepted:  9. November 2004 
1. Referee:  Kriegel, Hans-Peter 
Persistent Identifier (URN):  urn:nbn:de:bvb:19-27981 
MD5 Checksum of the PDF file:  d436170782aa66033e499ece9ea01282 
Signature of the printed copy:  0001/UMC 14125 
ID Code:  2798 
Deposited On:  23. Nov 2004 
Last Modified:  16. Oct 2012 07:44 