---
title: 'Lecture 9: Tree-based Methods'
author: '[Dr. Aijun Zhang](http://www.statsoft.org) - [STAT3612 Data Mining](http://www.statsoft.org/teaching/stat3612/)'
date: "Mar 30 & Apr 3, 2017"
output:
html_document:
highlight: tango
number_sections: yes
theme: paper
toc: yes
toc_depth: 3
html_notebook: default
---
```{r setup, include=FALSE}
options(width=100)
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
```{r echo=FALSE, fig.align="center", out.width = "200px", eval=T, fig.cap=c("Leo Breiman (1928 - 2005)")}
knitr::include_graphics(c('LeoBreiman.png'))
```
* **Classification and Regression Trees** by BFOS (1984), 31K+ citations
* **Bagging** by Breiman (1996), 16K+ citations
* **Random Forests** by Breiman (2001), 26K+ citations
Quote from Breiman (2001), "Statistical Modeling: The Two Cultures", *Statistical Science*, **16**, 199-231:
> The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
***
# Decision Trees
**Key Idea:** recursively partition the input space by recursive binary splitting. This is a greedy strategy.
## Regression tree
```{r fig.align="center", fig.width=10, fig.asp=0.55}
DataX = data.frame(x1=iris$Sepal.Length, x2=iris$Sepal.Width, y=iris$Petal.Width)
library(tree)
fit_rtree = tree(y~x1+x2, DataX) # response being numeric: regression case
print(fit_rtree)
par(mfrow=c(1,2))
plot(fit_rtree, type="uniform")
text(fit_rtree)
title("Regression tree (Recursive Binary Partitioning)")
tmpfun = function(x,y) predict(fit_rtree, newdata=data.frame(x1=x, x2=y))
xgrid = seq(min(DataX$x1),max(DataX$x1),length=201)
ygrid = seq(min(DataX$x2),max(DataX$x2),length=201)
zgrid = outer(xgrid, ygrid, tmpfun)
image(xgrid, ygrid, zgrid, col=cm.colors(10),
xlab="Sepal.Length", ylab="Sepal.Width",
main="Regression tree for Iris Petal.Width Prediction")
```
* For a partitioned region $R$, predict the response with the sample mean $\hat{\mu}_R = \mbox{avg}(y_i|\mathbf{x}_i\in R)$
* Find the best splitting variable $j$ and split point $s$ by minimizing the RSS (residual sum of squares):
$$
\min_{j,s}\left[\sum_{\mathbf{x}_i \in R_1} (y_i - \hat{\mu}_{R_1})^2 + \sum_{\mathbf{x}_i \in R_2} (y_i - \hat\mu_{R_2})^2 \right]
$$
where the resulting split regions are $R_1 = \{\mathbf{x}_i | x_{ij} \leq s\}$ and $R_2 = \{\mathbf{x}_i | x_{ij} > s\}$.
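As a minimal sketch of this criterion for a single numeric variable (the helper `best_split` below is illustrative, not from any package): scan the candidate split points and keep the one with the smallest two-region RSS.

```{r}
# Illustrative one-variable split search: for each candidate split point s,
# compute the two-region RSS and keep the minimizer.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  s_cand <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between distinct values
  rss <- sapply(s_cand, function(s) {
    yL <- y[x <= s]; yR <- y[x > s]             # R1: x <= s, R2: x > s
    sum((yL - mean(yL))^2) + sum((yR - mean(yR))^2)
  })
  list(s = s_cand[which.min(rss)], rss = min(rss))
}
best_split(iris$Sepal.Length, iris$Petal.Width)
```

A full tree grower simply applies this search over all variables within each region, then recurses on the two children.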
* Cost-complexity pruning criterion for a tree with $|T|$ number of terminal nodes:
$$
\sum_{m=1}^{|T|} \sum_{\mathbf{x}_i\in R_m} (y_i - \hat{\mu}_{R_m})^2 + \alpha |T|
$$
where the tuning parameter $\alpha$ is chosen by cross-validation.
## Classification tree
```{r fig.align="center", fig.width=6, fig.asp=1.1}
DataX = data.frame(x1=iris$Petal.Length, x2=iris$Petal.Width, y=iris$Species)
fit_ctree = tree(y~x1+x2, DataX) # response being categorical: classification case
plot(fit_ctree, type="uniform")
text(fit_ctree)
tmpfun = function(x,y) as.numeric(predict(fit_ctree, newdata=data.frame(x1=x, x2=y),
type = "class"))
xgrid = seq(min(DataX$x1),max(DataX$x1),length=201)
ygrid = seq(min(DataX$x2),max(DataX$x2),length=201)
zgrid = outer(xgrid, ygrid, tmpfun)
FrontMap = c(rgb(1,0,0,1), rgb(0,1,0,1), rgb(0,0,1,1))
BackMap = c(rgb(1,0,0,.2), rgb(0,1,0,0.2), rgb(0,0,1,.2))
image(xgrid, ygrid, zgrid, col=BackMap,
xlab="Petal.Length", ylab="Petal.Width",
main="Classification tree with Iris dataset")
points(jitter(DataX$x1,2), jitter(DataX$x2, 2), col=FrontMap[DataX$y], pch=19)
```
```{r}
print(fit_ctree)
summary(fit_ctree)
```
**Classification purity measures:** For terminal node $m$, let $p_{mk}$ be the proportion of class-$k$ observations within the node, and let $n_{mk}$ be the number of class-$k$ observations. Then,
$$
\begin{aligned}
\mbox{Deviance}_m & = -2 \sum_k n_{mk}\log p_{mk}\\
\mbox{Entropy}_m & = - \sum_k p_{mk}\log p_{mk}\\
\mbox{Gini-index}_m & = 1- \sum_k p_{mk}^2
\end{aligned}
$$
These measures are minimized when all members of the node are of the same class.
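For concreteness, the three measures can be computed directly for a hypothetical node with class counts $n_{mk} = (7, 2, 1)$:

```{r}
# Purity measures for a toy node with 10 observations in 3 classes
n_mk <- c(7, 2, 1)
p_mk <- n_mk / sum(n_mk)
deviance_m <- -2 * sum(n_mk * log(p_mk))   # equals 2 * n_m * entropy here
entropy_m  <- -sum(p_mk * log(p_mk))
gini_m     <- 1 - sum(p_mk^2)
c(deviance = deviance_m, entropy = entropy_m, gini = gini_m)
```

A pure node (all observations in one class) gives 0 for all three measures, using the convention $0\log 0 = 0$.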
## Tree Pruning
* Control the number of terminal nodes
```{r fig.align="center", fig.width=10, fig.asp=0.55}
library(ElemStatLearn)
DataX = data.frame(x1=mixture.example$x[,1],
x2=mixture.example$x[,2],
y=mixture.example$y)
xgrid = seq(min(DataX$x1),max(DataX$x1),length=201)
ygrid = seq(min(DataX$x2),max(DataX$x2),length=201)
ColMap1 = c(rgb(1,0,0,1), rgb(0,0,1,1))
ColMap2 = c(rgb(1,0,0,.2), rgb(0,0,1,.2))
library(tree)
fit_tree1 = tree(as.factor(y)~x1+x2, data = round(DataX,2))
pred_tree = function(x,y) as.numeric(predict(fit_tree1, newdata=data.frame(x1=x,x2=y), type="class"))
zgrid = outer(xgrid, ygrid, pred_tree)
par(mfrow=c(1,2))
image(xgrid, ygrid, zgrid, col=ColMap2,
xlab = "x1",ylab = "x2")
points(DataX$x1, DataX$x2, col=ColMap1[DataX$y+1], pch=19)
plot(fit_tree1); text(fit_tree1, cex=0.6)
```
Use the `cv.tree` function to choose the tree complexity via cross-validation:
```{r fig.align="center", fig.width=6, fig.asp=0.8}
set.seed(2017)
tmp = cv.tree(fit_tree1, FUN=prune.misclass)
plot(tmp$size, tmp$dev, type="b",
xlab="Tree Size", ylab="Deviance",
main="Cross-Validation Results")
```
```{r fig.align="center", fig.width=10, fig.asp=0.55}
fit_tree2 = prune.misclass(fit_tree1, best=tmp$size[which.min(tmp$dev)])
pred_tree = function(x,y) as.numeric(predict(fit_tree2, newdata=data.frame(x1=x,x2=y), type="class"))
zgrid = outer(xgrid, ygrid, pred_tree)
par(mfrow=c(1,2))
image(xgrid, ygrid, zgrid, col=ColMap2,
xlab = "x1",ylab = "x2")
points(DataX$x1, DataX$x2, col=ColMap1[DataX$y+1], pch=19)
title(main = "Classification Tree")
plot(fit_tree2); text(fit_tree2, cex=0.6)
```
In summary, decision trees
* automatically select variables that are used to define the splits
* are easy to interpret for small-size trees (not so easy for large trees)
* recursively partition the input space as a divide-and-conquer operation
* may handle both numeric and categorical features seamlessly
* may deal with missing data
* often suffer from high variance and therefore poor generalization performance
# Tree Ensembles
Tree-based ensemble learners use trees as building blocks to construct powerful prediction models. The key is to reduce the variance by averaging.
## Bagging
Bagging stands for "bootstrap aggregation" and takes the form
$$
\hat{f}_{\rm bag}(\mathbf{x}) = \frac{1}{B}\sum_{b=1}^B \hat{f}_b(\mathbf{x})
$$
where $\hat{f}_b(\mathbf{x})$ is trained on the $b$th bootstrap sample and $B$ is the total number of trained trees.
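This average can be hand-rolled in a few lines before turning to `randomForest` (a sketch only, assuming the `tree` package from above; `DataR` recreates the iris regression data used in the regression-tree section):

```{r}
# Bagging by hand: fit B regression trees on bootstrap samples and average them
library(tree)
set.seed(1)
DataR = data.frame(x1=iris$Sepal.Length, x2=iris$Sepal.Width, y=iris$Petal.Width)
B = 25
preds = replicate(B, {
  idx = sample(nrow(DataR), replace = TRUE)   # b-th bootstrap sample
  fit_b = tree(y ~ x1 + x2, data = DataR[idx, ])
  predict(fit_b, newdata = DataR)             # predictions of the b-th tree
})
f_bag = rowMeans(preds)                       # average over the B trees
head(f_bag)
```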
```{r fig.align="center", fig.width=6, fig.asp=1.05}
library(randomForest)
fit_bag = randomForest(as.factor(y)~x1+x2, data=DataX, mtry=2, ntree=300)
tmpfun = function(x,y) as.numeric(predict(fit_bag, newdata=data.frame(x1=x,x2=y)))
zgrid = outer(xgrid, ygrid, tmpfun)
image(xgrid, ygrid, zgrid, col=ColMap2,
xlab = "x1",ylab = "x2", main="Bagging")
points(DataX$x1, DataX$x2, col=ColMap1[DataX$y+1], pch=19)
```
```{r}
fit_bag
```
**Out-Of-Bag (OOB) Error Estimate:** Recall that bootstrap resampling draws with replacement. For each of the $n$ observations, the probability of not being covered by the $b$th bootstrap sample is
$$
\mbox{Pr}(\mathbf{x}_i \notin \mathbb{X}_b) = \left(1-\frac{1}{n}\right)^n \approx e^{-1} \approx 0.368.
$$
That is, around one-third of the observations are left out when fitting each bagged tree; these are called the Out-Of-Bag (OOB) observations. Evaluating the prediction performance on the OOB observations yields the OOB error estimate, which is similar to the cross-validation estimate of the test error.
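The limit above is easy to check numerically:

```{r}
# (1 - 1/n)^n approaches exp(-1) = 0.3679 as n grows
n = c(10, 100, 1000, 10000)
round(rbind(n = n, p_oob = (1 - 1/n)^n), 4)
```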
## Random Forest
Random forests improve the bagging algorithm by **decorrelating** the trees via split-variable randomization: at each split, only $m$ out of the $p$ predictors are considered as candidate split variables. Typical values of $m$ are $\sqrt{p}$ (classification case) and $p/3$ (regression case).
Therefore, in building a random forest, each tree is trained not only on resampled observations (via the bootstrap) but also with a randomly selected subset of variables at each split. Trees trained in this way are less correlated, which makes the averaged predictor less variable and more reliable.
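These typical values of $m$ mirror the `mtry` defaults used by the `randomForest` package, which can be sketched as:

```{r}
# mtry defaults in the spirit of randomForest (here for p = 16 predictors)
p = 16
m_class = floor(sqrt(p))        # classification: floor(sqrt(p))
m_reg   = max(floor(p/3), 1)    # regression: floor(p/3), at least 1
c(classification = m_class, regression = m_reg)
```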
```{r fig.align="center", fig.width=6, fig.asp=1.05}
bb = seq(5, 1000, by=5) # 5~1000
err = NULL
for (b in bb) {
fit1 = randomForest(as.factor(y)~x1+x2, data=DataX, mtry=2, ntree=b)
fit2 = randomForest(as.factor(y)~x1+x2, data=DataX, ntree=b)
err = rbind(err, c(mean(fit1$confusion[,"class.error"]),
mean(fit2$confusion[,"class.error"])))
}
matplot(bb, err, type="l", lty=1, col=c(2,4),
xlab="Number of trees", ylab="OOB Error Estimate",
main="Bagging vs. Random Forest")
legend("topright", c("Bagging", "Random Forest"), lty=1, col=c(2,4))
```
**Variable Importance:** Bagging and random forests act like a black box: they give good predictions, but the underlying fitted model is hard to interpret. The `R:randomForest` package provides a variable-importance plot to assess the relative importance of the variables.
```{r fig.align="center", fig.width=6, fig.asp=0.7}
library(randomForest)
fit_rf = randomForest(Species~., data=iris)
fit_rf
varImpPlot(fit_rf, main="Variable-Importance Plot")
```
## Boosting
Boosting is another ensemble learning approach based on decision trees. Unlike bagging or random forests, which fit trees in parallel on resampled observations, boosting fits trees sequentially, each one using information from the previously fitted trees.
Here we discuss only the regression tree as in Chapter 8 of ISLR2013 book.
```{r echo=FALSE, fig.align="center", out.width = "600px", eval=T, fig.cap=c("Boosting Tree for Regression (Source: ISLR2013)")}
knitr::include_graphics(c('BoostingTree.png'))
```
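The algorithm in the figure can be hand-rolled in a few lines (a sketch only, assuming the `tree` package; `lambda` is the shrinkage parameter, and each step fits a small tree to the current residuals):

```{r}
# Boosting for regression by hand: repeatedly fit a small tree to the residuals,
# add a shrunken copy of it to the fit, and update the residuals.
library(tree)
set.seed(1)
df = data.frame(x1=iris$Sepal.Length, x2=iris$Sepal.Width)
y = iris$Petal.Width
lambda = 0.1; B = 100
f_hat = rep(0, length(y))
r = y                                    # initialize residuals r = y
for (b in 1:B) {
  df$r = r
  fit_b = tree(r ~ x1 + x2, data = df,
               control = tree.control(nrow(df), mindev = 0.05))  # small tree
  upd = lambda * predict(fit_b, newdata = df)
  f_hat = f_hat + upd                    # f <- f + lambda * f_b
  r = r - upd                            # r <- r - lambda * f_b
}
mean((y - f_hat)^2)                      # training MSE decreases as B grows
```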
An example using `gbm` for regression based on the `Boston` house price data is provided at the Lab section of ISLR2013 book. Here we provide another quick example of using `gbm` for classification based on the `iris` dataset.
```{r fig.align="center", fig.width=6, fig.asp=1.0}
library(gbm)
fit_gbm = gbm(Species~., data=iris, distribution="multinomial")
summary(fit_gbm)
```
**Remarks:** There exist different versions of boosting algorithms for regression and classification. Interested students may read Chapter 10 of Hastie, Tibshirani and Friedman (2009, 2nd ed.). The `R:gbm` package may work equally well on classification problems. You may also refer to the `R:xgboost` package, which offers fast and scalable distributed computing.
***
**Read the ISLR2013 book (Chapter 8) and practise the lab codes with the Carseats and Boston datasets!**