Xu, Zhao (2007): Statistical relational learning with nonparametric Bayesian models. Dissertation, LMU München: Faculty of Mathematics, Computer Science and Statistics

PDF
xu_zhao.pdf 1MB 
Abstract
Statistical relational learning analyzes the probabilistic constraints among entities, their attributes, and their relationships. It is an area of growing interest in modern data mining, and many approaches have been proposed with promising results. However, there is no easily applicable recipe for turning a relational domain (e.g. a database) into a probabilistic model. There are two main reasons. First, structural learning in relational models is even more complex than structural learning in (nonrelational) Bayesian networks, because an attribute might depend on exponentially many other attributes. Second, it can be difficult and expensive to obtain reliable prior knowledge for the domains of interest. To remove these constraints, this thesis applies nonparametric Bayesian analysis to relational learning and proposes two models: Dirichlet enhanced relational learning and infinite hidden relational learning.

Dirichlet enhanced relational learning (DERL) extends nonparametric hierarchical Bayesian modeling to relational data. In existing relational models, the model parameters are global, which means that the conditional probability distributions are the same for each entity and that the relationships are independent of each other. To address these limitations, we introduce a hierarchical Bayesian (HB) framework to relational learning, such that model parameters can be personalized, i.e. owned by entities or relationships, and are coupled via common prior distributions. Nonparametric HB modeling adds further flexibility, so that the learned knowledge can be represented faithfully. For inference, we develop an efficient variational method motivated by the Pólya urn representation of the Dirichlet process (DP). DERL is demonstrated in a medical domain, where we form a nonparametric HB model for entities involving hospitals, patients, procedures and diagnoses. The experiments show that the additional flexibility introduced by nonparametric HB modeling yields a model that represents the dependencies between different types of relationships more accurately and gives significantly improved prediction performance for unknown relationships.

In the infinite hidden relational model (IHRM), we apply nonparametric mixture modeling to relational data. This extends the expressiveness of a relational model by introducing, for each entity, an infinite-dimensional hidden variable as part of a DP mixture model. There are three main advantages. First, this reduces the need for extensive structural learning, which is particularly difficult in relational models due to the huge number of potential probabilistic parents. Second, information can propagate globally in the ground network defined by the relational structure. Third, the number of mixture components for each entity class can be optimized by the model itself based on the data. IHRM can be applied to entity clustering and relationship/attribute prediction, two important tasks in relational data mining. For inference in IHRM, we develop four algorithms: collapsed Gibbs sampling with the Chinese restaurant process, blocked Gibbs sampling with the truncated stick-breaking construction (SBC), mean-field inference with truncated SBC, and an empirical approximation. IHRM is evaluated in three different domains: a recommendation system based on the MovieLens data set, prediction of the functions of yeast genes/proteins on the KDD Cup 2001 data set, and medical data analysis. The experimental results show that IHRM gives significantly improved estimates of attributes/relationships and highly interpretable entity clusters in complex relational data.
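As a rough illustration of two Dirichlet process representations named in the abstract, the following Python sketch draws mixture weights from a truncated stick-breaking construction and assigns entities to clusters with the Chinese restaurant process. It is a minimal sketch of the generic textbook constructions, not the inference algorithms developed in the thesis; the function names, the concentration parameter `alpha` and the truncation level are assumptions chosen for illustration.

```python
import random


def truncated_stick_breaking(alpha, truncation, rng):
    """Draw mixture weights from a stick-breaking prior cut off at `truncation` components.

    Hypothetical helper for illustration; `alpha` is the DP concentration parameter.
    """
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)  # fraction of the remaining stick to break off
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # last component takes the rest, so the weights sum to 1
    return weights


def chinese_restaurant_process(n, alpha, rng):
    """Sequentially assign n entities to clusters (hypothetical helper).

    Entity i joins existing cluster k with probability counts[k] / (i + alpha)
    and opens a new cluster with probability alpha / (i + alpha).
    """
    assignments = []
    counts = []
    for i in range(n):
        r = rng.uniform(0.0, i + alpha)
        cumulative = 0.0
        chosen = len(counts)  # default: open a new cluster
        for k, c in enumerate(counts):
            cumulative += c
            if r < cumulative:
                chosen = k
                break
        if chosen == len(counts):
            counts.append(0)
        counts[chosen] += 1
        assignments.append(chosen)
    return assignments, counts


rng = random.Random(0)
weights = truncated_stick_breaking(alpha=1.0, truncation=10, rng=rng)
assignments, counts = chinese_restaurant_process(n=50, alpha=1.0, rng=rng)
```

In both constructions the number of occupied clusters is not fixed in advance but grows with the data, which is the property the abstract relies on when it says that the number of mixture components can be determined by the model itself.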
Document type:  Dissertation (Dissertation, LMU München)

Keywords:  Statistical relational learning, relationship uncertainty, link prediction, entity clustering, nonparametric Bayesian analysis, hierarchical Bayesian models, mixture models, Dirichlet process, variational inference, MCMC sampling 
Subject areas:  500 Natural sciences and mathematics > 510 Mathematics
500 Natural sciences and mathematics
Faculties:  Faculty of Mathematics, Computer Science and Statistics
Language of the thesis:  English
Date of oral examination:  25 July 2007
First referee:  Kriegel, Hans-Peter
MD5 checksum of the PDF file:  13dff99ddd9b77079cf4124a3216fa60
Shelf mark of the printed edition:  0001/UMC 16607
ID Code:  7619 
Deposited on:  05 Nov. 2007
Last modified:  19 Jul. 2016 16:23