1 The UCI Machine Learning Repository

The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. The archive was created as an ftp archive in 1987 by David Aha and fellow graduate students at UC Irvine. Since that time, it has been widely used by students, educators, and researchers all over the world as a primary source of machine learning data sets.

The breast cancer databases was obtained from the University of Wisconsin Hospitals, Madison. Attributes 2 through 10 have been used to represent instances and each instance has one of 2 possible classes: benign (not harmful) or malignant (serious).

The input variables are given in the followings:

    • Sample code numbe id number
    • Clump Thickness 1 - 10
    • Uniformity of Cell Size 1 - 10
    • Uniformity of Cell Shape 1 - 10
    • Marginal Adhesion 1 - 10
    • Single Epithelial Cell Size 1 - 10
    • Bland Chromatin 1 - 10
    • Normal Nucleoli 1 - 10
    • Mitoses 1 - 10

Output variable: 11. - Class (2 for benign, 4 for malignant)

Download the data file

loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
ds <- "breast-cancer-wisconsin/breast-cancer-wisconsin.data"
url <- paste(loc, ds, sep="")
breast <- read.table(url, sep=",", header=FALSE, na.strings="?")
names(breast) <- c("ID", "clumpThickness", "sizeUniformity",
 "shapeUniformity", "maginalAdhesion",
 "singleEpithelialCellSize", "bareNuclei",
 "blandChromatin", "normalNucleoli", "mitosis", "class")

Check the number of missing entries

sum(is.na(breast))
## [1] 16

Ignore the ID variable

df <- breast[-1]
df$class <- factor(df$class, levels=c(2,4),labels=c("benign", "malignant"))

Parition the data into training and testing

sample_index <- sample(nrow(df), 0.7*nrow(df))
df.train <- df[sample_index,]
df.validate <- df[-sample_index,]

Frequency of class variable

table(df.train$class)
## 
##    benign malignant 
##       316       173
table(df.validate$class)
## 
##    benign malignant 
##       142        68

2 Decision Tree

2.1 Introduction

The Decision Tree Model partitions the space of predictors into regions. This is done in the following ways:

    • Splitting the space by one predictor at a time.
    • Given the current partition with \(\vert T\vert\) regions,
        • Select a region \(R_{k}\), a predictor \(X_{j}\), and a splitting point \(s\), such that splitting \(R_{k}\) with a criterion \(X_{j} < s\) produces the largest decrease in \[RSS = \sum_{m=1}^{\vert T\vert }\sum_{i\in R_{m}}(y_{i}-\overline{y}_{R_{m}})^{2}\]
        • Redefine the regions with the additional split.
        • Terminate when either there are 5 observations or fewer in a region or RSS does not drop by more than a threshold with any cut.
    • The prediction for a point in region \(R_{i}\) is the average of the training points in \(R_{i}\).
    • Since the set of splitting rules used to partition the predictor space can be summarized in a tree, these types of approaches are known as decision tree methods.
    • The leave nodes of the tree are the set of regions which partitions the predictor space.
    • The points along the tree where the predictor space is split are referred to as internal nodes.