Handling of realistic missing data scenarios in clinical trials using machine learning techniques
Missing data, that is, data that are needed for the main analyses but were not collected, are a common challenge in the design and analysis of clinical trials. If missing data are not properly imputed or handled, they may reduce the statistical power of important analyses, bias or confound the estimate of the treatment effect, and lead to an underestimation of the variability of the target variable. Rubin’s 1976 paper defines three types of missingness. (1) MCAR (missing completely at random): when data are MCAR, “the probability of missingness does not depend on observed or unobserved measurements”; for example, subjects drop out of the trial for reasons unrelated to their health status. (2) MAR (missing at random): when data are MAR, “the probability of missingness depends only on observed measurements conditional on the covariates in the model”; for example, younger subjects, who may consider themselves healthier and see no need to have their blood pressure measured, may be more likely to have missing blood pressure values. (3) MNAR (missing not at random): when data are MNAR, “the probability of missingness depends on unobserved measurements”; for example, subjects leave the trial because of “lack of efficacy” (i.e., they are not convinced of the effectiveness of the study drug and hence drop out of the trial). Although all three types of missing data are well defined, it is very difficult to determine the association between missingness and unobserved outcomes in real-world data; in other words, it is very difficult to justify the MAR assumption in any realistic situation. As the EMA suggested in 2010, a combined strategy can be used, e.g., treating discontinuations due to “lack of efficacy” as MNAR data and discontinuations due to “lost to follow-up” as MAR data.
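The three mechanisms described above can be illustrated with a small simulation. The following is a minimal sketch, not taken from the thesis: the variable names, the logistic missingness models, and all numeric constants are assumptions chosen only to make the difference between MCAR, MAR, and MNAR visible in the observed means.

```python
import math
import random


def simulate_missingness(n=2000, seed=0):
    """Simulate one outcome and a missingness flag for each mechanism.

    All distributions and constants are illustrative assumptions.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 80)                # covariate that is always observed
        bp = 100 + 0.5 * age + rng.gauss(0, 10)  # outcome (blood pressure)

        # MCAR: constant probability, unrelated to any measurement.
        p_mcar = 0.2
        # MAR: depends only on the observed covariate (younger -> more missing).
        p_mar = 1 / (1 + math.exp((age - 40) / 8))
        # MNAR: depends on the outcome value itself, which is unobserved
        # exactly when it is missing.
        p_mnar = 1 / (1 + math.exp(-(bp - 135) / 5))

        rows.append({
            "age": age,
            "bp": bp,
            "miss_mcar": rng.random() < p_mcar,
            "miss_mar": rng.random() < p_mar,
            "miss_mnar": rng.random() < p_mnar,
        })
    return rows


def observed_mean(rows, flag):
    """Mean outcome among the records that are NOT missing under `flag`."""
    vals = [r["bp"] for r in rows if not r[flag]]
    return sum(vals) / len(vals)


if __name__ == "__main__":
    rows = simulate_missingness()
    full = sum(r["bp"] for r in rows) / len(rows)
    print("full data mean:", round(full, 2))
    for flag in ("miss_mcar", "miss_mar", "miss_mnar"):
        print(flag, "observed mean:", round(observed_mean(rows, flag), 2))
```

In this sketch the complete-case mean stays close to the full-data mean under MCAR but shifts under MAR and MNAR. Under MAR the shift could in principle be corrected using the observed covariate, whereas under MNAR the information needed for the correction is itself missing.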
Many statistical methods have been developed to handle missing data under the prerequisite assumption of either MNAR or MAR. In the real world, however, missing data often arise from a mixture of different missingness mechanisms. This violates the basic assumption of these methods (i.e., either MNAR or MAR) and degrades their performance (Enders, 2010). To handle the missing data problem in real-life situations (e.g., MNAR and MAR mixed together in the same dataset), we propose a missing data prediction framework based on machine learning techniques. As Breiman pointed out in his 2001 paper, in statistical (machine) learning “the goal is not interpretability, but accurate information”. Along this line of thought, our methods handle MNAR data by focusing on (giving more sample weight to) the missing part, and handle MAR data by looking for precise individual (subject-level) information. The MNAR problem is treated as an imbalanced machine learning exercise, i.e., minority cases are oversampled to compensate for the data that are MNAR in a certain region.
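The idea of oversampling minority cases can be sketched with plain random oversampling. This is a generic illustration of imbalanced learning, not the specific procedure developed in the thesis; the function name `random_oversample` and the class labels are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class is as
    frequent as the largest class.

    Generic random oversampling; an assumption for illustration only.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        # All original samples belonging to this class.
        pool = [s for s, y in zip(samples, labels) if y == cls]
        # Draw with replacement until the class reaches the target size.
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels


if __name__ == "__main__":
    # Hypothetical example: few observed subjects fall in the region where
    # most values are missing, so that region is the minority class.
    x = list(range(10))
    y = ["majority"] * 8 + ["minority"] * 2
    x_bal, y_bal = random_oversample(x, y)
    print(Counter(y_bal))
```

An alternative with a similar effect is to keep the data unchanged and pass larger sample weights for the minority cases to the learner, which matches the "giving more sample weight" wording above.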
clinical trials, missing data, machine learning, imbalanced learning, clustering
Haliduola, Halimuniyazi
2023
English
Universitätsbibliothek der Ludwig-Maximilians-Universität München
Haliduola, Halimuniyazi (2023): Handling of realistic missing data scenarios in clinical trials using machine learning techniques. Dissertation, LMU München: Medizinische Fakultät
