Logo Logo
Switch language to English
Yu, Kai (2004): Statistical Learning Approaches to Information Filtering. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik



Enabling computer systems to understand human thinking or behaviors has ever been an exciting challenge to computer scientists. In recent years one such a topic, information filtering, emerges to help users find desired information items (e.g.~movies, books, news) from large amount of available data, and has become crucial in many applications, like product recommendation, image retrieval, spam email filtering, news filtering, and web navigation etc.. An information filtering system must be able to understand users' information needs. Existing approaches either infer a user's profile by exploring his/her connections to other users, i.e.~collaborative filtering (CF), or analyzing the content descriptions of liked or disliked examples annotated by the user, ~i.e.~content-based filtering (CBF). Those methods work well to some extent, but are facing difficulties due to lack of insights into the problem. This thesis intensively studies a wide scope of information filtering technologies. Novel and principled machine learning methods are proposed to model users' information needs. The work demonstrates that the uncertainty of user profiles and the connections between them can be effectively modelled by using probability theory and Bayes rule. As one major contribution of this thesis, the work clarifies the ``structure'' of information filtering and gives rise to principled solutions. In summary, the work of this thesis mainly covers the following three aspects: Collaborative filtering: We develop a probabilistic model for memory-based collaborative filtering (PMCF), which has clear links with classical memory-based CF. Various heuristics to improve memory-based CF have been proposed in the literature. In contrast, extensions based on PMCF can be made in a principled probabilistic way. With PMCF, we describe a CF paradigm that involves interactions with users, instead of passively receiving data from users in conventional CF, and actively chooses the most informative patterns to learn, thereby greatly reduce user efforts and computational costs. Content-based filtering: One major problem for CBF is the deficiency and high dimensionality of content-descriptive features. Information items (e.g.~images or articles) are typically described by high-dimensional features with mixed types of attributes, that seem to be developed independently but intrinsically related. We derive a generalized principle component analysis to merge high-dimensional and heterogenous content features into a low-dimensional continuous latent space. The derived features brings great conveniences to CBF, because most existing algorithms easily cope with low-dimensional and continuous data, and more importantly, the extracted data highlight the intrinsic semantics of original content features. Hybrid filtering: How to combine CF and CBF in an ``smart'' way remains one of the most challenging problems in information filtering. Little principled work exists so far. This thesis reveals that people's information needs can be naturally modelled with a hierarchical Bayesian thinking, where each individual's data are generated based on his/her own profile model, which itself is a sample from a common distribution of the population of user profiles. Users are thus connected to each other via this common distribution. Due to the complexity of such a distribution in real-world applications, usually applied parametric models are too restrictive, and we thus introduce a nonparametric hierarchical Bayesian model using Dirichlet process. We derive effective and efficient algorithms to learn the described model. In particular, the finally achieved hybrid filtering methods are surprisingly simple and intuitively understandable, offering clear insights to previous work on pure CF, pure CBF, and hybrid filtering.