Statistical Issues in Machine Learning

Statistical Issues in Machine Learning. Towards Reliable Split Selection and Variable Importance Measures

Recursive partitioning methods from machine learning are widely applied in many scientific fields, such as genetics and bioinformatics. The present work is concerned, from a statistical point of view, with the two main problems that arise in recursive partitioning: instability and biased variable selection. With respect to the first issue, instability, this work covers the entire range of methods, from standard classification trees through robustified classification trees to ensemble methods such as TWIX, bagging and random forests. While ensemble methods prove to be much more stable than single trees, they also lose most of their interpretability. Therefore, an adaptive cutpoint selection scheme is suggested with which a TWIX ensemble reduces to a single tree if the partition is sufficiently stable. With respect to the second issue, variable selection bias, the statistical sources of this artifact in single trees are investigated, along with a new form of bias inherent in ensemble methods based on bootstrap samples. For single trees, one unbiased split selection criterion is evaluated and another is newly introduced here. Based on the results for single trees and further findings on the effects of bootstrap sampling on association measures, it is shown that, in addition to using an unbiased split selection criterion, subsampling instead of bootstrap sampling should be employed in ensemble methods in order to reliably compare the variable importance scores of predictor variables of different types. The statistical properties and the null hypothesis of a test for the random forest variable importance are critically investigated. Finally, a new, conditional importance measure is suggested that allows for a fair comparison in the case of correlated predictor variables and better reflects the null hypothesis of interest.
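The variable selection bias mentioned in the abstract can be illustrated with a small simulation. The following is a minimal sketch and not code from the dissertation: all function and variable names are hypothetical, the response is generated independently of both predictors, and the ten-level predictor is treated as ordered so that only threshold splits are searched. Under these assumptions, a split criterion without selection bias would pick each predictor about equally often, whereas the maximally selected Gini gain tends to favor the predictor that offers more candidate cutpoints.

```python
import random


def gini(labels):
    """Gini impurity of a list of binary (0/1) class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_gini_gain(x, y):
    """Largest impurity reduction over all binary splits of the form x <= c."""
    n = len(y)
    parent = gini(y)
    best = 0.0
    # Every observed value except the largest is a candidate cutpoint.
    for c in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= c]
        right = [yi for xi, yi in zip(x, y) if xi > c]
        gain = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        best = max(best, gain)
    return best


def selection_frequency(n_obs=100, n_rep=500, seed=1):
    """Share of null replications in which the ten-level predictor wins."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_rep):
        # Null scenario: y is independent of both predictors.
        y = [rng.randint(0, 1) for _ in range(n_obs)]
        x_binary = [rng.randint(0, 1) for _ in range(n_obs)]  # 1 cutpoint
        x_many = [rng.randint(0, 9) for _ in range(n_obs)]    # up to 9 cutpoints
        if best_gini_gain(x_many, y) > best_gini_gain(x_binary, y):
            wins += 1
    return wins / n_rep


freq_many = selection_frequency()
print(f"x_many selected in {freq_many:.0%} of null replications")
```

Because neither predictor carries any information about the response here, any selection frequency well above one half reflects the number of candidate cutpoints rather than predictive value.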

CART, bagging, random forest, Gini index, variable importance

Strobl, Carolin

02. Jul. 2008

2008

English

Universitätsbibliothek der Ludwig-Maximilians-Universität München

https://nbn-resolving.org/urn:nbn:de:bvb:19-89043

Strobl, Carolin (2008): Statistical Issues in Machine Learning: Towards Reliable Split Selection and Variable Importance Measures. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik

PDF: Strobl_Carolin.pdf (1 MB)

DOI: 10.5282/edoc.8904

URN: urn:nbn:de:bvb:19-89043

Document type:	Dissertations (Dissertation, LMU München)
Keywords:	CART, bagging, random forest, Gini index, variable importance
Subject areas:	500 Natural sciences and mathematics > 510 Mathematics; 500 Natural sciences and mathematics
Faculties:	Fakultät für Mathematik, Informatik und Statistik
Language of the thesis:	English
Date of oral examination:	2 July 2008
1st referee:	Augustin, Thomas
MD5 checksum of the PDF file:	a6da7495ee244509a2c5c60a14addbe5
Shelf mark of the printed edition:	0001/UMC 17190
ID Code:	8904
Deposited on:	27 Aug. 2008 12:54
Last modified:	24 Oct. 2020 07:08