On the detection of latent structures in categorical data
With the growing availability of huge amounts of data, it is increasingly important to uncover the underlying data-generating structures. The present work focuses on the detection of latent structures in categorical data, which have been treated less intensively in the literature. In regression models, categorical variables are either the responses or part of the covariates. Alternative strategies have to be used to detect the underlying structures. The first part of this thesis is dedicated to regression models with an excessive number of parameters. More concretely, we consider models with various categorical covariates and a potentially large number of categories. In addition, it is investigated how fixed effects models can be used to model the heterogeneity in longitudinal and cross-sectional data. One interesting aspect is to identify the categories or units that have to be distinguished with respect to their effect on the response. The objective is to detect "latent groups" that share the same effect on the response variable. A novel approach to the clustering of categorical predictors or fixed effects is introduced, which is based on recursive partitioning techniques. In contrast to competing methods that use specific penalties, the proposed algorithm also works in high-dimensional settings. The second part of this thesis deals with item response models, which can be considered as regression models that aim at measuring "latent abilities" of persons. In item response theory, one uses indicators such as the answers of persons to a collection of items to infer the underlying abilities. When developing psychometric tests, one has to be aware of the phenomenon of Differential Item Functioning (DIF). An item response model is affected by DIF if the difficulty of an item among equally able persons depends on characteristics of the persons, such as membership of a racial or ethnic subgroup.
A general tree-based method is proposed that simultaneously detects the items and the subgroups of persons that carry DIF, and that can include a set of variables on different scales. Compared to classical approaches, a main advantage is that the proposed method automatically identifies the regions of the covariate space that are responsible for DIF, so that these do not have to be prespecified. In addition, extensions to the detection of non-uniform DIF are developed. The last part of the thesis addresses regression models for rating scale data, which are frequently used in behavioural research. Heterogeneity among respondents caused by "latent response styles" can lead to biased estimates and can affect the conclusions drawn from the observed ratings. The focus is on symmetric response categories and a specific form of response style, namely the tendency towards the middle or the extreme categories. In ordinal regression models, a stronger or weaker concentration in the middle can also be interpreted as varying dispersion. The strength of the proposed models is that they can be embedded into the framework of generalized linear models, so that inference techniques and asymptotic results for this class of models are available. In addition, a visualization tool is developed that makes the interpretation of effects easily accessible.
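To make the notion of uniform DIF described above concrete, the following is a minimal sketch and not the method developed in the thesis. It assumes a Rasch-type model in which the probability of a correct answer depends on the difference between person ability and item difficulty, and in which DIF shifts the difficulty of an item for one subgroup. The function name `rasch_prob` and the parameters `theta`, `beta`, `gamma` and `group` are hypothetical names chosen for illustration.

```python
import math


def rasch_prob(theta, beta, gamma=0.0, group=0):
    """Probability of a correct answer under a Rasch-type model.

    theta: person ability
    beta:  item difficulty
    gamma: assumed difficulty shift for persons with group == 1
           (gamma == 0 means the item carries no DIF)
    group: binary subgroup indicator (0 or 1)
    """
    eta = theta - (beta + gamma * group)
    return 1.0 / (1.0 + math.exp(-eta))


# Two equally able persons from different subgroups.
theta = 0.5
p_reference = rasch_prob(theta, beta=0.0, gamma=0.8, group=0)
p_focal = rasch_prob(theta, beta=0.0, gamma=0.8, group=1)

# Without DIF, group membership does not matter.
p_no_dif = rasch_prob(theta, beta=0.0, gamma=0.0, group=1)

print(p_reference, p_focal, p_no_dif)
```

With a positive `gamma`, the item is harder for the second subgroup even though both persons have the same ability, which is the situation the abstract describes as DIF. With `gamma` set to zero, the two probabilities coincide.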
regression models, categorical data, latent structures
Berger, Moritz
2016
English
Universitätsbibliothek der Ludwig-Maximilians-Universität München
Berger, Moritz (2016): On the detection of latent structures in categorical data. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik