eScholarship Repository, California Digital Library

Center for Bioinformatics & Molecular Biostatistics
University of California, San Francisco

Machine Learning Benchmarks and Random Forest Regression
Mark R. Segal, University of California, San Francisco

Download the Paper (631 K, PDF file) - April 14, 2004

ABSTRACT:

Breiman (2001a,b) has recently developed an ensemble classification and regression approach that displayed outstanding performance with regard to prediction error on a suite of benchmark datasets. As the base constituents of the ensemble are tree-structured predictors, and since each of these is constructed using an injection of randomness, the method is called ‘random forests’. That the exceptional performance is attained with seemingly only a single tuning parameter, to which sensitivity is minimal, makes the methodology all the more remarkable. The individual trees comprising the forest are all grown to maximal depth. While this helps with regard to bias, there is the familiar tradeoff with variance. However, these variability concerns were potentially obscured because of an interesting feature of those benchmarking datasets extracted from the UCI machine learning repository for testing: all these datasets are hard to overfit using tree-structured methods. This raises issues about the scope of the repository.

With this as motivation, and coupled with experience from boosting methods, we revisit the formulation of random forests and investigate prediction performance on real-world and simulated datasets for which maximally sized trees do overfit. These explorations reveal that gains can be realized by additional tuning to regulate tree size via limiting the number of splits and/or the size of nodes for which splitting is allowed. Nonetheless, even in these settings, good performance for random forests can be attained by using larger (than default) primary tuning parameter values.
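
The two tuning controls described above can be illustrated with a minimal sketch. It is not the paper's code: it is a toy, pure-Python regression forest written under stated assumptions. The names `fit_forest`, `mtry`, and `min_split_size` are hypothetical. The "primary tuning parameter" is assumed here to be the number of candidate features sampled at each split, and tree size is regulated only through the smallest node size for which splitting is allowed (a cap on the number of splits is not shown). The simulated data are likewise an illustrative assumption.

```python
import random


def avg(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean."""
    m = avg(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(X, y, rows, mtry, min_split_size, rng):
    """Grow a regression tree on row indices `rows`.

    A leaf is a float (mean response); an internal node is a tuple
    (feature index, cut point, left subtree, right subtree).
    """
    ys = [y[i] for i in rows]
    # Stop when the node is too small to split or the response is constant.
    if len(rows) < min_split_size or min(ys) == max(ys):
        return avg(ys)
    best = None
    # Injected randomness: only `mtry` randomly chosen features are tried.
    for j in rng.sample(range(len(X[0])), mtry):
        values = sorted({X[i][j] for i in rows})
        for cut in values[:-1]:  # split as x <= cut versus x > cut
            left = [i for i in rows if X[i][j] <= cut]
            right = [i for i in rows if X[i][j] > cut]
            score = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or score < best[0]:
                best = (score, j, cut, left, right)
    if best is None:  # every sampled feature is constant in this node
        return avg(ys)
    _, j, cut, left, right = best
    return (j, cut,
            grow_tree(X, y, left, mtry, min_split_size, rng),
            grow_tree(X, y, right, mtry, min_split_size, rng))


def predict_tree(node, x):
    while isinstance(node, tuple):
        j, cut, left, right = node
        node = left if x[j] <= cut else right
    return node


def fit_forest(X, y, n_trees, mtry, min_split_size, seed=0):
    """Fit `n_trees` trees, each on a bootstrap sample of the rows."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        forest.append(grow_tree(X, y, rows, mtry, min_split_size, rng))
    return forest


def predict(forest, x):
    return avg([predict_tree(tree, x) for tree in forest])


def demo():
    # Simulated noisy data (an assumption for illustration only):
    # one informative feature, three noise features, unit-variance noise.
    rng = random.Random(1)
    p, n = 4, 80

    def draw(m):
        X = [[rng.random() for _ in range(p)] for _ in range(m)]
        y = [2.0 * x[0] + rng.gauss(0.0, 1.0) for x in X]
        return X, y

    X_train, y_train = draw(n)
    X_test, y_test = draw(n)
    for mtry in (1, p):
        for min_split_size in (2, 20):  # 2 gives maximally grown trees
            forest = fit_forest(X_train, y_train, 15, mtry, min_split_size)
            mse = avg([(predict(forest, x) - t) ** 2
                       for x, t in zip(X_test, y_test)])
            print(f"mtry={mtry} min_split_size={min_split_size} "
                  f"test MSE={mse:.3f}")


if __name__ == "__main__":
    demo()
```

Setting `min_split_size=2` grows each tree until its nodes cannot be split further, which corresponds to the maximal-depth setting discussed in the abstract. Larger values stop splitting earlier and yield smaller trees. Which configuration predicts best depends on the data, so the printed errors from this toy example should not be read as reproducing the paper's findings.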

SUGGESTED CITATION:
Mark R. Segal, "Machine Learning Benchmarks and Random Forest Regression" (April 14, 2004). Center for Bioinformatics & Molecular Biostatistics. Paper bench_rf_regn.
http://repositories.cdlib.org/cbmb/bench_rf_regn

 
eScholarship is a service of the California Digital Library