Yu, Kai (2004): Statistical Learning Approaches to Information Filtering. Dissertation, LMU München: Faculty of Mathematics, Computer Science and Statistics 

PDF:  yu_kai.pdf (3 MB)
Abstract
Enabling computer systems to understand human thinking and behavior has long been an exciting challenge for computer scientists. In recent years one such topic, information filtering, has emerged to help users find desired information items (e.g. movies, books, news) in large amounts of available data, and it has become crucial in many applications, such as product recommendation, image retrieval, spam email filtering, news filtering, and web navigation. An information filtering system must be able to understand users' information needs. Existing approaches either infer a user's profile by exploring his/her connections to other users, i.e. collaborative filtering (CF), or by analyzing the content descriptions of liked or disliked examples annotated by the user, i.e. content-based filtering (CBF). These methods work well to some extent, but face difficulties due to a lack of insight into the problem. This thesis studies a wide range of information filtering technologies in depth. Novel and principled machine learning methods are proposed to model users' information needs. The work demonstrates that the uncertainty of user profiles and the connections between them can be effectively modelled using probability theory and Bayes' rule. As one major contribution, the thesis clarifies the "structure" of information filtering and gives rise to principled solutions. In summary, the work of this thesis mainly covers the following three aspects:
Collaborative filtering: We develop a probabilistic model for memory-based collaborative filtering (PMCF), which has clear links with classical memory-based CF. Various heuristics to improve memory-based CF have been proposed in the literature. In contrast, extensions based on PMCF can be made in a principled probabilistic way. With PMCF, we describe a CF paradigm that interacts with users, instead of passively receiving data from them as in conventional CF, and actively chooses the most informative patterns to learn, thereby greatly reducing user effort and computational cost.
Content-based filtering: One major problem for CBF is the deficiency and high dimensionality of content-descriptive features. Information items (e.g. images or articles) are typically described by high-dimensional features with mixed types of attributes, which appear to have been developed independently but are intrinsically related. We derive a generalized principal component analysis to merge high-dimensional and heterogeneous content features into a low-dimensional continuous latent space. The derived features bring great convenience to CBF, because most existing algorithms cope easily with low-dimensional and continuous data and, more importantly, the extracted data highlight the intrinsic semantics of the original content features.
Hybrid filtering: How to combine CF and CBF in a "smart" way remains one of the most challenging problems in information filtering. Little principled work exists so far. This thesis reveals that people's information needs can be modelled naturally within a hierarchical Bayesian framework, in which each individual's data are generated from his/her own profile model, which is itself a sample from a common distribution over the population of user profiles. Users are thus connected to each other via this common distribution. Because such a distribution is complex in real-world applications, the parametric models usually applied are too restrictive, and we therefore introduce a nonparametric hierarchical Bayesian model using a Dirichlet process. We derive effective and efficient algorithms to learn the described model. In particular, the resulting hybrid filtering methods are surprisingly simple and intuitively understandable, offering clear insights into previous work on pure CF, pure CBF, and hybrid filtering.
Item Type:  Thesis (Dissertation, LMU Munich) 

Keywords:  information filtering, information retrieval, machine learning, Bayesian modelling 
Subjects:  500 Natural sciences and mathematics; 500 Natural sciences and mathematics > 510 Mathematics
Faculties:  Faculty of Mathematics, Computer Science and Statistics 
Language:  English 
Date of oral examination:  20. July 2004 
1. Referee:  Kriegel, Hans-Peter
MD5 Checksum of the PDF file:  cb0bbd96a807203af1b7c4436265e3f0
Signature of the printed copy:  0001/UMC 13963 
ID Code:  2512 
Deposited On:  20. Sep 2004 
Last Modified:  19. Jul 2016 16:16 