Analysis of missing data with random forests

Random Forests are widely used for data prediction and interpretation purposes. They show many appealing characteristics, such as the ability to deal with high dimensional data, complex interactions and correlations. Furthermore, missing values can easily be processed by the built-in procedure of surrogate splits. However, little is known about the properties of recursive partitioning in missing data situations. Therefore, extensive simulation studies and empirical evaluations have been conducted to gain deeper insight. In addition, new methods have been developed to enhance methodology and solve current issues of data interpretation, prediction and variable selection.

A variable's relevance in a Random Forest can be assessed by means of importance measures. Unfortunately, existing methods cannot be applied when the data contain missing values. Thus, one of the most appreciated properties of Random Forests, its ability to handle missing values, gets lost for the computation of such measures. This work presents a new approach that is designed to deal with missing values in an intuitive and straightforward way, yet retains widely appreciated qualities of existing methods. Results indicate that it meets sensible requirements and shows good variable ranking properties.

Random Forests provide variable selection that is usually based on importance measures. An extensive review of corresponding literature led to the development of a new approach that is based on a profound theoretical framework and meets important statistical properties. A comparison with eight other popular methods showed that it controls the test-wise and family-wise error rate, provides a higher power to distinguish relevant from non-relevant variables and leads to models located among the best performing ones.

Alternative ways to handle missing values are the application of imputation methods and complete case analysis. Yet it is unknown to what extent these approaches are able to provide sensible variable rankings and meaningful variable selections. Investigations showed that complete case analysis leads to inaccurate variable selection as it may inappropriately penalize the importance of fully observed variables. By contrast, the new importance measure decreases for variables with missing values and therefore causes selections that accurately reflect the information given in actual data situations. Multiple imputation leads to an assessment of a variable's importance and to selection frequencies that would be expected for data that were completely observed. In several performance evaluations the best prediction accuracy emerged from multiple imputation, closely followed by the application of surrogate splits. Complete case analysis clearly performed worst.
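The difference between complete case analysis and imputation as data preparation steps can be illustrated with a minimal sketch. It is not taken from the thesis: the toy data and the helper names `complete_cases` and `mean_impute` are hypothetical. Single mean imputation is used only as a simple stand-in; it is neither the multiple imputation nor the surrogate split procedure evaluated in the work.

```python
from statistics import mean

# Hypothetical toy data: each row is (x1, x2, y); None marks a missing value.
rows = [
    (1.0, 2.0, 0),
    (2.0, None, 1),
    (None, 1.5, 0),
    (4.0, 3.0, 1),
    (5.0, None, 1),
]


def complete_cases(data):
    """Complete case analysis: keep only rows without any missing value."""
    return [r for r in data if all(v is not None for v in r)]


def mean_impute(data, n_features=2):
    """Single mean imputation (a simplification, not multiple imputation):
    replace each missing predictor value by the mean of the observed values
    in the same column. The response in the last position is left untouched."""
    col_means = [
        mean(r[j] for r in data if r[j] is not None) for j in range(n_features)
    ]
    return [
        tuple(col_means[j] if r[j] is None else r[j] for j in range(n_features))
        + r[n_features:]
        for r in data
    ]


cc = complete_cases(rows)
imputed = mean_impute(rows)
print(len(rows), len(cc), len(imputed))
```

In this toy example complete case analysis discards every row with a missing entry, including the observed values of the fully observed response, whereas imputation keeps all rows available for model fitting.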

Hapfelmeier, Alexander

12. Oct. 2012

2012

English

Universitätsbibliothek der Ludwig-Maximilians-Universität München

https://nbn-resolving.org/urn:nbn:de:bvb:19-150588

Hapfelmeier, Alexander (2012): Analysis of missing data with random forests. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik

PDF
Hapfelmeier_Alexander.pdf
4MB

DOI: 10.5282/edoc.15058

URN: urn:nbn:de:bvb:19-150588

Abstract

Document type:	Dissertations (Dissertation, LMU München)
Subjects:	000 General, computer science, information science; 000 General, computer science, information science > 004 Computer science
Faculties:	Fakultät für Mathematik, Informatik und Statistik
Language of the thesis:	English
Date of oral examination:	12 October 2012
First referee:	Ulm, Kurt
MD5 checksum of the PDF file:	3027907191593a4387b624804c301d51
Shelf mark of the printed edition:	0001/UMC 20823
ID Code:	15058
Deposited on:	13 Dec. 2012 14:46
Last modified:	24 Oct. 2020 01:43
