Addressing the challenges of uncertainty in regression models for high dimensional and heterogeneous data from observational studies
Klau, Simon
2020
English
Universitätsbibliothek der Ludwig-Maximilians-Universität München
Klau, Simon (2020): Addressing the challenges of uncertainty in regression models for high dimensional and heterogeneous data from observational studies. Dissertation, LMU München: Faculty of Mathematics, Computer Science and Statistics

Abstract

The lack of replicability of research findings across scientific disciplines has gained wide attention in recent years and has led to extensive discussion. In this 'replication crisis', different types of uncertainty, which arise at different stages of data collection and statistical analysis, play an important role. Nevertheless, their consequences are often ignored in current research practice, at the risk of low credibility and reliability of research findings. To analyze this problem and develop solutions, we define measurement uncertainty, sampling uncertainty, data pre-processing uncertainty, method uncertainty, and model uncertainty, and investigate them in particular in the context of regression analyses. To this end, we consider data from observational studies, focusing on high dimensionality and heterogeneous variables, two characteristics of growing importance. High dimensional data, i.e., data with more variables than observations, play an important role in medical research, where large amounts of molecular data (omics data) can be collected with ever decreasing expense and effort. Where several types of omics data are available, we additionally face heterogeneity. Moreover, heterogeneous data are found in many observational studies in which data originate from different sources or variables of different types are collected. This work comprises four contributions, each with a different approach to this topic and a different focus of investigation. Contribution 1 serves as a practical example illustrating data pre-processing and method uncertainty in the context of prediction and variable selection from high dimensional and heterogeneous data. In the first part of this paper, we introduce priority-Lasso, a hierarchical method for prediction using multi-omics data. 
Priority-Lasso is based on the standard Lasso and assumes a pre-specified priority order of blocks of data. The idea is to fit Lasso models on these blocks successively, in order of priority, and to use the linear predictor from each fit as an offset when fitting the block with the next lowest priority. In the second part, we apply this method in a current study of acute myeloid leukemia (AML) and compare its performance to that of the standard Lasso. We illustrate data pre-processing and method uncertainty caused by different choices of variable definitions and different settings in the application of the method. These choices result in different effect estimates and thus in different prediction performances and selected variables. In the second contribution, we compare method uncertainty with sampling uncertainty in the context of variable selection and ranking of omics biomarkers. For this purpose, we develop a user-friendly and versatile framework. We apply this framework to data from AML patients with high dimensional and heterogeneous characteristics and explore three scenarios: first, variable selection in multivariable regression based on multi-omics data; second, variable ranking based on variable importance measures from random forests; and third, identification of genes based on differential gene expression analysis. In contributions 3 and 4, we apply the vibration of effects framework, which was initially used to analyze model uncertainty in a large epidemiological study (NHANES), to assess and compare different types of uncertainty. Both contributions address in depth the methodological extension of this framework to different types of uncertainty. In contribution 3, we describe the extension of the vibration of effects framework to sampling and data pre-processing uncertainty. 
As a practical illustration, we take a large data set from psychological research with a heterogeneous variable structure (SAPA project) and examine sampling, model, and data pre-processing uncertainty in the context of logistic regression for varying sample sizes. Beyond the comparison of single types of uncertainty, we introduce a strategy that quantifies cumulative model and data pre-processing uncertainty and analyzes their relative contributions to the total uncertainty with a variance decomposition. Finally, in contribution 4, we extend the vibration of effects framework to measurement uncertainty. In a practical example, we conduct a comparison study of sampling, model, and measurement uncertainty on the NHANES data set in the context of survival analysis. We focus on different scenarios of measurement uncertainty, which differ in the choice of variables considered to be measured with error. Moreover, we analyze the behavior of the different types of uncertainty with increasing sample sizes in a large simulation study.
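
The block-wise fitting idea behind priority-Lasso, as described in the abstract, can be illustrated with a minimal sketch. This is not the implementation from the thesis. It assumes a Gaussian (linear) response, for which using the previous linear predictor as an offset amounts to fitting the next block on the residuals. It also assumes no intercept and a fixed, hand-picked penalty per block. The function names (`lasso_cd`, `priority_lasso`), the toy data, and the penalty values are hypothetical and chosen only for illustration.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, r, lam, n_iter=200):
    """Lasso by cyclic coordinate descent (hypothetical helper).

    Minimises (1 / (2 n)) * ||r - X b||^2 + lam * ||b||_1 without intercept.
    X is a list of rows, r the response (here: response minus offset).
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(r)  # current residual r - X beta
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def priority_lasso(blocks, y, lams):
    """Fit blocks in priority order (highest first), carrying the offset.

    Assumption: Gaussian response, so the offset is simply subtracted
    from y before the next block is fitted.
    """
    n = len(y)
    offset = [0.0] * n
    coefs = []
    for X, lam in zip(blocks, lams):
        target = [y[i] - offset[i] for i in range(n)]
        beta = lasso_cd(X, target, lam)
        coefs.append(beta)
        # Accumulate the linear predictor to serve as the next offset
        offset = [
            offset[i] + sum(X[i][j] * beta[j] for j in range(len(beta)))
            for i in range(n)
        ]
    return coefs, offset


# Toy data (hypothetical): block 1 has priority over block 2.
block1 = [[1.0], [-1.0], [1.0], [-1.0], [1.0], [-1.0]]
block2 = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
y = [2.5, -1.5, 1.5, -2.5, 2.0, -2.0]  # equals 2 * block1 + 0.5 * first column of block2

coefs, linear_predictor = priority_lasso([block1, block2], y, lams=[0.1, 0.1])
print([[round(b, 3) for b in block] for block in coefs])
```

In this toy setting the high-priority block is fitted first and explains most of the response; the second block only sees what the first left unexplained, and its uninformative second variable is dropped. For non-Gaussian outcomes such as survival times, the offset would enter the model's linear predictor instead of being subtracted from the response, which this sketch does not cover.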