Created: 20 Nov 2014 | Modified: 30 Jun 2016
I didn’t find any problems with random forests, but the time needed to do a properly tuned model fit using caret and a reasonable grid of mtry values, on 100K, 200K, or 800K data points, was pretty prohibitive, especially if you’re doing repeated CV to find the best tuned model on the training set. A model fit and tuning run was taking 24 to 36 hours, even when parallelized on 4 cores. Since I’m doing many analyses, the options seemed to boil down to (1) running every analysis on large EC2 instances with 16 or 32 cores, hoping to shave the analytic time down to something reasonable, or (2) finding a classifier with high accuracy but a shorter runtime.
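To make the setup concrete, here is a sketch of the kind of random forest tuning run described above. The data set name, grid values, and fold counts are illustrative rather than the ones used in the actual analyses.

```r
library(caret)
library(doParallel)

registerDoParallel(cores = 4)          # parallelize the resampling loop

## repeated cross-validation for model tuning
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

## training_data is a hypothetical data frame with a factor column "class"
rf_fit <- train(class ~ ., data = training_data,
                method    = "rf",
                trControl = ctrl,
                tuneGrid  = expand.grid(mtry = c(2, 4, 8, 16)),
                ntree     = 500)       # passed through to randomForest
```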
Boosted classification trees are consistently among the top classifier algorithms, and in Friedman’s gradient boosting version they achieve very high accuracy while mostly avoiding overfitting and providing measures of predictor importance. Gradient boosting machines, like random forests, break the classic tradeoff between bias and variance, allowing us (within limits) to optimize both (Natekin and Knoll 2013; Hastie et al. 2009). They do so by combining several approaches to randomization and aggregation.
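The core boosting idea is simple enough to sketch directly: each small tree is fit to the current residuals (the negative gradient of the loss) and added to the ensemble with a shrinkage factor. The sketch below is only a skeleton for squared-error loss, not what gbm implements in full; stochastic subsampling, other loss functions, and importance measures are omitted.

```r
library(rpart)

boost_sketch <- function(x, y, n_trees = 100, shrinkage = 0.1, depth = 3) {
  pred  <- rep(mean(y), length(y))     # start from the constant fit
  trees <- vector("list", n_trees)
  for (m in seq_len(n_trees)) {
    resid <- y - pred                  # negative gradient of squared error
    fit <- rpart(resid ~ ., data = data.frame(x, resid),
                 control = rpart.control(maxdepth = depth, cp = 0))
    trees[[m]] <- fit
    pred <- pred + shrinkage * predict(fit, data.frame(x))
  }
  list(trees = trees, shrinkage = shrinkage, init = mean(y))
}
```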
Using the superb caret package from Max Kuhn allows an easy switch between classifiers without changing most of the analysis code, so the change was simple. By default, the gbm package through caret uses bootstrap resampling with a small number of boosting iterations, which is sufficient for seeing how the package works, but not for a real analysis. Currently, I am using a grid of tuning parameters along the lines sketched below.
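The caret interface to gbm tunes over the number of boosting iterations, the interaction depth, the shrinkage (learning rate), and the minimum node size. A representative setup looks roughly like the following; the specific grid values are placeholders rather than the exact ones used in these analyses.

```r
library(caret)

## candidate tuning values for the gbm model
gbm_grid <- expand.grid(n.trees           = c(100, 250, 500),
                        interaction.depth = c(4, 8, 16),
                        shrinkage         = 0.1,
                        n.minobsinnode    = 10)

## repeated cross-validation instead of caret's default bootstrap resampling
gbm_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
```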
In my initial tests, plots of CV accuracy tend to plateau after 100 iterations with larger interaction depths, so technically I don’t really need to include smaller values in the tuning grid. But caret optimizes the gbm fit by training the largest model in the grid and deriving the smaller ones from that full model without recalculation, so the execution time is largely controlled by the largest values of the tuning parameters and by the number of CV folds and repeats.
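Putting the pieces together, the full tuning run is a single train() call; because caret evaluates the smaller iteration counts from the largest fitted model, the wall-clock time is dominated by the biggest grid values and the number of folds and repeats. Variable names here follow the sketches above and are assumptions, not the actual analysis code.

```r
library(doParallel)
registerDoParallel(cores = 4)

gbm_fit <- train(class ~ ., data = training_data,   # training_data is hypothetical
                 method    = "gbm",
                 trControl = gbm_ctrl,
                 tuneGrid  = gbm_grid,
                 verbose   = FALSE)                  # silence gbm's per-iteration output

gbm_fit$bestTune   # tuning combination with the best resampled accuracy
```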
In most cases, a model fit to 100K data points, with 20% held out as a strict test set, takes between 80 and 90 minutes with 4 cores on a Retina MBP. That’s acceptable, although I’ve put some effort into ensuring that all of my analyses can be run on the command line via a Makefile, automatically storing their results in .Rdata files for later interpretation and use. That way, I can do final runs on a server instance.
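The tail end of such a script might look like the sketch below; the object names, the test_data holdout, and the output path are hypothetical. A Makefile rule can then invoke the script with Rscript and leave the fitted objects on disk for later inspection.

```r
## test_data: hypothetical data frame holding the 20% holdout set
test_preds <- predict(gbm_fit, newdata = test_data)

results <- list(fit        = gbm_fit,
                test_preds = test_preds,
                confusion  = confusionMatrix(test_preds, test_data$class))

## store everything in an .RData file for later interpretation and use
save(results, file = "results/gbm-classification.RData")
```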
The classification accuracy is comparable to, and even slightly higher than, that of the random forest classifiers on these data. The difference is very small and likely not significant, so for present purposes I would consider GBM and RF to be equivalent and acceptable methods for attacking equifinality issues between cultural transmission models.
Natekin, Alexey, and Alois Knoll. 2013. “Gradient Boosting Machines, a Tutorial.” Frontiers in Neurorobotics 7: 21. doi:10.3389/fnbot.2013.00021.
Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning. 2nd ed. Springer.