---
title: "Tutorial 9: Decision Tree, Ensemble Learning & Evaluation"
author: "[Simon & Zebin]"
date: "4/3/2018"
header-includes:
- \usepackage{amsmath}
output:
html_document:
fig_width: 10
fig_height: 6
highlight: tango
number_sections: yes
theme: paper
toc: yes
toc_depth: 3
html_notebook: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# The UCI Machine Learning Repository
* The UCI Machine Learning Repository [link](http://archive.ics.uci.edu/ml/index.php)
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets.
```{r echo=FALSE, fig.align="center", out.width = "800px", eval=T}
knitr::include_graphics(c('UCI.png'))
```
* The breast cancer dataset
The breast cancer database was obtained from the University of Wisconsin Hospitals, Madison. Attributes 2 through 10 are used to represent instances, and each instance belongs to one of two possible classes: benign (not harmful) or malignant (serious).
The input variables are as follows:
1. - Sample code number: id number
2. - Clump Thickness: 1 - 10
3. - Uniformity of Cell Size: 1 - 10
4. - Uniformity of Cell Shape: 1 - 10
5. - Marginal Adhesion: 1 - 10
6. - Single Epithelial Cell Size: 1 - 10
7. - Bare Nuclei: 1 - 10
8. - Bland Chromatin: 1 - 10
9. - Normal Nucleoli: 1 - 10
10. - Mitoses: 1 - 10
Output variable:
11. - Class (2 for benign, 4 for malignant)
Download the data file
```{r}
loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
ds <- "breast-cancer-wisconsin/breast-cancer-wisconsin.data"
url <- paste(loc, ds, sep="")
breast <- read.table(url, sep=",", header=FALSE, na.strings="?")
names(breast) <- c("ID", "clumpThickness", "sizeUniformity",
"shapeUniformity", "maginalAdhesion",
"singleEpithelialCellSize", "bareNuclei",
"blandChromatin", "normalNucleoli", "mitosis", "class")
```
Check the number of missing entries
```{r}
sum(is.na(breast))
```
Ignore the ID variable
```{r}
df <- breast[-1]
df$class <- factor(df$class, levels=c(2,4),labels=c("benign", "malignant"))
```
Partition the data into training and testing sets
```{r}
set.seed(1234) # fix the random seed so the split is reproducible
sample_index <- sample(nrow(df), 0.7*nrow(df))
df.train <- df[sample_index,]
df.validate <- df[-sample_index,]
```
Frequency of class variable
```{r}
table(df.train$class)
table(df.validate$class)
```
# Decision Tree
## Introduction
The decision tree model partitions the space of predictors into regions. This is done as follows:
1. - Splitting the space by one predictor at a time.
2. - Given the current partition with $\vert T\vert$ regions,
a. - Select a region $R_{k}$, a predictor $X_{j}$, and a splitting point $s$, such that splitting $R_{k}$ with a criterion $X_{j} < s$ produces the largest decrease in
$$RSS = \sum_{m=1}^{\vert T\vert }\sum_{i\in R_{m}}(y_{i}-\overline{y}_{R_{m}})^{2}$$
b. - Redefine the regions with the additional split.
c. - Terminate when either there are 5 observations or fewer in a region or RSS does not drop by more than a threshold with any cut.
3. - The prediction for a point in region $R_{i}$ is the average of the training points in $R_{i}$.
4. - Since the set of splitting rules used to partition the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods.
5. - The leaf nodes of the tree are the set of regions which partition the predictor space.
6. - The points along the tree where the predictor space is split are referred to as internal nodes.
```{r echo=FALSE, fig.align="center", out.width = "400px", eval=T}
knitr::include_graphics(c('trees.jpg'))
```
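The split search in step 2a can be sketched for a single region and one numeric predictor (a minimal illustration, not rpart's actual implementation):

```{r}
# Find the split point s on one numeric predictor x that minimizes the RSS
# of the two resulting half-regions (step 2a above, for a single region)
best_split <- function(x, y) {
  candidates <- sort(unique(x))
  rss <- sapply(candidates, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  candidates[which.min(rss)]
}
best_split(c(1, 2, 3, 10, 11, 12), c(0, 0, 0, 5, 5, 5)) # splits between the clusters, at s = 10
```

In practice the tree scans every predictor, region, and candidate split in this way and greedily takes the best one.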
However, because a good partition may only become available after a bad one, it makes sense to grow a large tree with the binary splitting method above and then prune it from the leaves back toward the root.
Weakest link pruning:
* - Starting with the initial full tree $T_{0}$, replace a subtree with a leaf node to obtain a new tree $T_{1}$. Select the subtree to prune by minimizing
$$\frac{RSS(T_{1})-RSS(T_{0})}{\vert T_{1}\vert -\vert T_{0}\vert }.$$
* - Iterate this pruning to obtain a sequence $T_{0}, T_{1},\ldots, T_{m}$, where $T_{m}$ is the tree with a single leaf node, the root.
* - Select the optimal tree $T_{i}$ by cross validation.
An equivalent procedure is to find the optimal $\alpha$ that minimizes
$$RSS(\alpha) = \sum_{m=1}^{\vert T\vert }\sum_{i\in R_{m}}(y_{i}-\overline{y}_{R_{m}})^{2} + \alpha\vert T\vert$$
When $y$'s are categorical variables, other minimization functions are usually chosen:
1. - The Gini Index
$$\sum_{m=1}^{\vert T\vert }q_{m}\sum_{k=1}^{K}p_{mk}(1-p_{mk}),$$
where $p_{mk}$ is the proportion of class $k$ within $R_{m}$ and $q_{m}$ is the proportion of samples in $R_{m}$.
2. - The cross entropy
$$-\sum_{m=1}^{\vert T\vert }q_{m}\sum_{k=1}^{K}p_{mk}\log(p_{mk}).$$
3. - The mis-classification rate
$$\sum_{m=1}^{\vert T\vert }\frac{q_{m}}{n_{m}}\sum_{i\in R_{m}}I( y_{i}\neq\widehat{y}_{R_{m}}),$$
where $n_{m}$ is the number of training points in $R_{m}$ and $\widehat{y}_{R_{m}}$ is the majority class of points in $R_{m}$.
4. - It is customary to use the Gini index or the cross entropy to grow the tree, and the mis-classification rate to prune it.
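For a single region, these impurity measures can be computed directly (a minimal sketch; `rpart` computes them internally):

```{r}
# Impurity of one region R_m, given the class labels y of the points in it
gini    <- function(y) { p <- table(y) / length(y); sum(p * (1 - p)) }
entropy <- function(y) { p <- table(y) / length(y); -sum(p * log(p)) }
gini(c("a", "a", "b", "b"))    # maximal two-class impurity: 0.5
entropy(rep("a", 5))           # a pure region has zero impurity
```

Both measures are zero for a pure region and largest when the classes are evenly mixed, which is why they reward splits that separate the classes.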
## Experiments
```{r, results='hide', warning=FALSE, message=FALSE}
library(rpart)
library(rpart.plot)
options(warn=-1) #suppress warnings
```
Use cross entropy as the splitting index (the default is the Gini index) to grow the tree
```{r}
dtree <- rpart(class ~ ., data=df.train, method="class", parms=list(split="information"))
#summary(dtree)
```
Summary of the splitting steps. rpart stops splitting once a further split fails to decrease the overall lack of fit by a factor of the complexity parameter (cp = 0.01 by default).
```{r}
#printcp(dtree)
```
```{r}
dtree$cptable
```
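A common way to choose the pruning threshold from this table is to take the `cp` value with the smallest cross-validated error (`xerror`); a sketch of that rule (in this tutorial we instead pick `cp = 0.0125` by inspecting the plot):

```{r}
# Pick the cp with the smallest cross-validated error and prune with it
cp_best <- dtree$cptable[which.min(dtree$cptable[, "xerror"]), "CP"]
dtree.auto <- prune(dtree, cp = cp_best)
```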
Plot the classification error
```{r}
plotcp(dtree)
```
Pruning
```{r}
dtree.pruned <- prune(dtree, cp=.0125)
```
Plot the decision tree
```{r}
prp(dtree.pruned, type = 2, extra = 104,fallen.leaves = TRUE, main="Decision Tree")
```
Prediction of the test set
```{r}
dtree.pred <- predict(dtree.pruned, df.validate, type="class")
```
The confusion table, dnn = dimension names
```{r}
dtree.perf <- table(df.validate$class, dtree.pred, dnn=c("Actual", "Predicted"))
dtree.perf
```
# Ensemble Learning
Ensemble learning improves your models with little modification to the underlying learners.
```{r echo=FALSE, fig.align="center", out.width = "800px", eval=T}
knitr::include_graphics(c('Bias_Variance.png'))
```
(From http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture)
## Bagging
1. - In the bootstrap step, we replicate our data set by sampling with replacement:
from the original data set `x = c(x1, x2, ..., x100)` we draw bootstrap samples
`boot1 = sample(x, 100, replace=TRUE)`, ..., `bootB = sample(x, 100, replace=TRUE)`.
2. - In bagging, we average the predictions of a model fit to many bootstrap samples.
```{r echo=FALSE, fig.align="center", out.width = "800px", eval=T}
knitr::include_graphics(c('Bagging.png'))
```
(From https://prachimjoshi.wordpress.com/)
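A quick check of what a bootstrap sample looks like: on average each draw contains about 63% of the original observations ($1 - (1 - 1/n)^n \approx 1 - 1/e$), with the rest appearing as duplicates.

```{r}
set.seed(1)
x <- 1:100
boot <- sample(x, 100, replace = TRUE)   # one bootstrap sample
length(unique(boot))                     # roughly 63 distinct originals
sum(!(x %in% boot))                      # the remainder are left "out-of-bag"
```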
```{r}
library(foreach)
bagging <- function(training, testing, length_divisor=5, iterations=1000) {
  predictions <- foreach(m=1:iterations, .combine=cbind) %do% {
    training_positions <- sample(nrow(training), size=floor(nrow(training)/length_divisor))
    train_pos <- 1:nrow(training) %in% training_positions
    fit <- rpart(class ~ ., data=training[train_pos,], method="class",
                 parms=list(split="information"))
    predict(fit, testing, type="prob")[, "malignant"] # this tree's probability of "malignant"
  }
  # average the probabilities across trees and take a majority vote
  factor(ifelse(rowMeans(predictions) > 0.5, "malignant", "benign"),
         levels=c("benign", "malignant"))
}
pred <- bagging(df.train,df.validate)
bagging.perf <- table(df.validate$class, pred, dnn=c("Actual", "Predicted"))
bagging.perf
```
As you can see, the improvement may be small: trees grown on different bootstrap samples can be very similar.
## Random Forests
Very deep decision trees may overfit their training sets, i.e. have low bias, but very high variance. Random forests are a way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance in the final model.
* Random decision forests ("feature bagging"): an ensemble learning method that reduces the correlation between estimators by training each of them on a random subset of the features instead of the entire feature set;
* Random forests: random decision forests + bagging. The procedure for implementing random forests is:
1. - Perform bagging sampling;
2. - For each bagging sample, randomly select $m$ out of the $p$ variables ($m < p$; the optimal $m$ is usually around $\sqrt{p}$);
3. - Fit a basic learner (decision tree, linear classifiers, support vector machines, nearest neighbours and any other type of classifiers or regressors) for different samples;
4. - Repeat steps 1) to 3) a predetermined number of times, and average the predictions of the trees.
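Steps 1 to 3 can be sketched by hand (a simplified illustration with `rpart` as the base learner; the `randomForest` package used below re-draws the feature subset at every split rather than once per tree). It assumes the last column of `data` is the `class` label.

```{r}
library(rpart)
# Grow B trees, each on a bootstrap sample, using a random subset of
# m = floor(sqrt(p)) predictors chosen once per tree (for simplicity)
grow_forest <- function(data, B = 100) {
  p <- ncol(data) - 1                    # predictors; class assumed last column
  m <- max(1, floor(sqrt(p)))
  lapply(1:B, function(b) {
    rows <- sample(nrow(data), replace = TRUE)   # step 1: bagging sample
    cols <- sample(p, m)                         # step 2: random feature subset
    rpart(class ~ ., data = data[rows, c(cols, p + 1)], method = "class")
  })
}
```

Prediction would then average (or majority-vote) the B trees' outputs, as in the bagging section above.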
```{r echo=FALSE, fig.align="center", out.width = "800px", eval=T}
knitr::include_graphics(c('RF.png'))
```
With reference to:
1. https://en.wikipedia.org/wiki/Random_forest;
2. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2008). The Elements of Statistical Learning (2nd ed.). Springer. ISBN 0-387-95284-5;
3. https://cdn-images-1.medium.com/max/1600/0*tG-IWcxL1jg7RkT0.png.
```{r}
suppressMessages(library(randomForest))
df.validate <- df.validate[complete.cases(df.validate),] # prediction with randomForest does not allow missing observations
fit.rf <- randomForest(class ~ ., data=df.train, importance=TRUE, proximity=TRUE, na.action=na.omit, ntree=500)
print(fit.rf) # printing the out-of-bag confusion matrix
```
Explainability: graphically display the importance of predictors
```{r}
round(importance(fit.rf),2)
varImpPlot(fit.rf)
```
Plotting the error rate against the number of trees
```{r}
plot(fit.rf)
```
Prediction on the testing set
```{r}
Prediction <- predict(fit.rf, df.validate)
fit.rf.perf <- table(df.validate$class, Prediction, dnn=c("Actual", "Predicted"))
fit.rf.perf
```
## Boosting
Boosting is a machine learning ensemble meta-algorithm that primarily reduces bias, and to a lesser extent variance.
* Weighting: at each iteration of training, the sample weights are updated so that currently misclassified samples receive larger weights;
* Training: at each iteration, a weak learner is trained on the reweighted sample;
* Combination: the strong classifier is built as a linear combination of the weak classifiers, weighted by their accuracy.
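The weighting step can be sketched with AdaBoost's update rule (a generic sketch; the `gbm` and `adabag` packages used below implement complete boosting algorithms):

```{r}
# One AdaBoost round: upweight misclassified samples, downweight the rest.
# w: current sample weights; miss: logical, TRUE where the weak learner erred
reweight <- function(w, miss) {
  err   <- sum(w[miss]) / sum(w)          # weighted error of this weak learner
  alpha <- 0.5 * log((1 - err) / err)     # its weight in the final combination
  w_new <- w * exp(ifelse(miss, alpha, -alpha))
  list(weights = w_new / sum(w_new), alpha = alpha)
}
reweight(rep(0.25, 4), c(TRUE, FALSE, FALSE, FALSE))$weights # the missed sample's weight rises to 0.5
```

Accurate learners (small `err`) get a large `alpha`, so they dominate the combination while the samples they missed dominate the next round.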
An example with linear classifier:
```{r echo=FALSE, fig.align="center", out.width = "800px", eval=T}
knitr::include_graphics(c('Boosting.png'))
```
(From http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture)
```{r}
suppressMessages(library(gbm))
boost.fit <- gbm((unclass(class)-1)~.,data=df.train,distribution="bernoulli",n.trees=5000,interaction.depth=4)
pred.boost<-predict(boost.fit,newdata=df.validate,n.trees=5000,type="response")
```
pred.boost gives the probability that class = "malignant" (the class coded as 1 by `unclass(class)-1`).
```{r}
suppressMessages(library(adabag))
adaboost<-boosting(class ~., data=df.train, boos=TRUE, mfinal=20,coeflearn='Breiman')
```
To show the importance of predictors
```{r}
adaboost$importance
```
To show the relative errors
```{r}
errorevol(adaboost,df.train)
```
To plot the resulting tree
```{r}
t1<-adaboost$trees[[1]]
suppressMessages(library(tree))
plot(t1)
text(t1,pretty=0)
```
# Classification Accuracy Evaluation
$$
\begin{array}
[c]{c|c|c}%
\text{Actual}\backslash\text{Predicted} & \widehat{Y}=1 & \widehat{Y}=0\\
Y=1 & A & B\\
Y=0 & C & D
\end{array}
$$
1. Correct prediction rate $\frac{A+D}{A+B+C+D}$
2. Sensitivity $P( \widehat{Y}=1|Y=1) = \frac{A}{A+B}$
3. Specificity $P( \widehat{Y}=0|Y=0) = \frac{D}{C+D}$
4. False positive rate $P( \widehat{Y}=1|Y=0) = \frac{C}{C+D} = 1 - \text{Specificity}$
5. False negative rate $P( \widehat{Y}=0|Y=1) = \frac{B}{A+B} = 1 - \text{Sensitivity}$
```{r}
perf.evaluation <- function(table, n=2){
if(!all(dim(table) == c(2,2))) stop("Must be a 2 x 2 table")
tp = table[1,1] #A
fn = table[1,2] #B
fp = table[2,1] #C
tn = table[2,2] #D
sensitivity = tp/(tp+fn)
specificity = tn/(tn+fp)
ppp = tp/(tp+fp)
npp = tn/(tn+fn)
hitrate = (tp+tn)/(tp+tn+fp+fn)
result <- paste("Sensitivity = ", round(sensitivity, n) ,
"\nSpecificity = ", round(specificity, n),
"\nPositive Predictive Value = ", round(ppp, n),
"\nNegative Predictive Value = ", round(npp, n),
"\nAccuracy = ", round(hitrate, n), "\n", sep="")
cat(result)
}
perf.evaluation(dtree.perf)
```
## Confusion Matrix
To show the confusion matrix
```{r}
predict_class = pred.boost > 0.5
predict_class = factor(predict_class, levels=c(FALSE,TRUE), labels=c("benign", "malignant"))
boost.perf = table(df.validate$class, predict_class, dnn=c("Actual", "Predicted"))
boost.perf
```
Plotting the confusion matrix
```{r}
library(ggplot2)
TClass <- factor(c("benign", "malignant", "benign", "malignant")) # actual class, in the table's column-major order
PClass <- factor(c("benign", "benign", "malignant", "malignant")) # predicted class
Y <- c(boost.perf)
cm <- data.frame(TClass, PClass, Y) # a new name, so the data set df is not overwritten
ggplot(data = cm, mapping = aes(x = TClass, y = PClass)) +
geom_tile(aes(fill = Y), colour = "white") +
geom_text(aes(label = sprintf("%1.0f", Y)), vjust = 1) +
scale_fill_gradient(low = "blue", high = "red") +
theme_bw() + theme(legend.position = "none")
```
## ROC Curve
```{r}
suppressMessages(library(ROCR))
pred1 <- prediction(pred.boost, df.validate$class)
perf1 <- performance(pred1,"tpr","fpr")
plot(perf1)
```
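The area under this curve (AUC) summarizes the ROC plot in a single number, and ROCR can read it off the same `prediction` object:

```{r}
# AUC: 1.0 ranks every malignant case above every benign one, 0.5 is random
performance(pred1, "auc")@y.values[[1]]
```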
(For more details: https://stats.stackexchange.com/questions/105501/understanding-roc-curve)