1 Discriminant Analysis

Discriminant analysis addresses a classification problem: two or more groups (also called clusters or populations) are known a priori, and one or more new observations are classified into one of the known groups based on their measured characteristics.

Discriminant analysis works by creating one or more linear combinations of the predictors, yielding a new latent variable for each combination. These combinations are called discriminant functions. The first function maximizes the differences between groups on that function. The second function also maximizes between-group differences, but under the constraint that it be uncorrelated with the first. Each subsequent function is constructed the same way, subject to being uncorrelated with all of the previous functions.
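
As a minimal sketch of these ideas (using R's built-in iris data purely for illustration), lda() from the MASS package returns the coefficients of each discriminant function, and the scores on different functions are uncorrelated:

library(MASS)

# Three groups and four predictors give min(3-1, 4) = 2 discriminant
# functions, LD1 and LD2.
fit = lda(Species ~ ., data=iris)
fit$scaling                          # coefficients a_1, ..., a_p of each function

# Scores on the two functions are uncorrelated (zero up to rounding).
scores = predict(fit)$x
cor(scores[,"LD1"], scores[,"LD2"])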

Two common variants of discriminant analysis, distinguished by their assumptions about the group covariances, are described below.

1.1 Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a classification method originally developed in 1936 by R. A. Fisher. It is simple, mathematically robust, and often produces models whose accuracy is as good as that of more complex methods.

We can think of discriminant analysis as drawing lines in space to (in some sense) optimally separate groups. In the linear case with two groups, we would have a discriminant function \[z = a_1x_1 + a_2x_2 + ... + a_px_p\] The vector \(a = (a_1, a_2, ..., a_p)'\) maximizes the ratio of the between-groups variance of \(z\) to its within-groups variance. LDA carries the following assumptions:

  • The data from each group have multivariate normal distributions.

  • The covariance matrices of the groups are all the same.

For two groups, and assuming both groups are equally probable (i.e., have equal prior probabilities), we can compute \(z_i\) for the \(i\)th observation and compare it with \[z_c = \frac{\bar{z}_1 + \bar{z}_2}{2}\]

If group 1 has the smaller mean \(z\) value, a new observation is classified as being in group 1 if \(z_i < z_c\) and in group 2 otherwise. In the case of more than two groups, there would be multiple (linear, in the case of LDA) discriminant functions partitioning the space.
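
This midpoint rule can be spelled out in a few lines of R. The sketch below uses two iris species purely for illustration; since the two groups are equally sized, the default priors are equal and the rule reproduces lda()'s own classifications:

library(MASS)

two = droplevels(subset(iris, Species != "virginica"))
fit = lda(Species ~ Sepal.Length + Sepal.Width, data=two)

z      = predict(fit)$x[,1]               # discriminant scores z_i
z_bars = tapply(z, two$Species, mean)     # group means of z
z_c    = mean(z_bars)                     # cutoff (z_bar_1 + z_bar_2)/2

# Classify by which side of z_c each score falls on.
pred = ifelse(z < z_c, names(which.min(z_bars)), names(which.max(z_bars)))
table(pred, predict(fit)$class)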

1.2 Quadratic Discriminant Analysis (QDA)

In quadratic discriminant analysis (QDA), we still assume multivariate normality, but the within-group covariances are no longer assumed to be equal, and the discriminant functions are quadratic rather than linear. QDA therefore carries the following assumptions (an informal way to probe the covariance assumption is sketched after the list):

  • The data from each group have multivariate normal distributions.

  • The covariance matrices of the groups are not assumed to be all the same.
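
A quick, informal check of the covariance assumption is to compare the per-group sample covariance matrices; the sketch below uses the iris data purely for illustration:

# If the per-group covariance matrices differ substantially, QDA's
# group-specific covariances may fit better than LDA's pooled covariance.
by(iris[, 1:4], iris$Species, cov)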

2 k-Nearest Neighbors

The KNN or k-nearest neighbors algorithm is one of the simplest machine learning algorithms and is an example of instance-based learning, where new data are classified based on stored, labeled instances. More specifically, the distance between the stored data and the new instance is calculated by means of a distance or similarity measure, typically the Euclidean distance, the Manhattan distance, or cosine similarity. In other words, for any new data point entered into the system, its similarity to the data already in the system is calculated. This similarity value is then used for predictive modeling: either classification, assigning a label or class to the new instance, or regression, assigning a value to the new instance. Which of the two you do depends, of course, on how you compose your model with KNN.
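
As a concrete illustration of such distance measures (with two hypothetical stored points and one new observation):

stored = rbind(c(1, 2), c(4, 6))   # two stored instances
new    = c(2, 3)                   # a new observation

d = sweep(stored, 2, new)          # subtract the new point from each row
sqrt(rowSums(d^2))                 # Euclidean distances
rowSums(abs(d))                    # Manhattan distances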

The k-nearest neighbor algorithm builds on this idea: after the distances from the new point to all stored data points have been calculated, the distance values are sorted and the k nearest neighbors are determined. The labels of these neighbors are gathered, and a majority vote or weighted vote is used for classification or regression purposes. In other words, the more of the k neighbors that carry a given label, the more likely the new instance is to receive that label. In the case of regression, the value assigned to the new data point is the mean of its k nearest neighbors.
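
A bare-bones version of this procedure fits in a few lines of R; knn_one below is a hypothetical helper for illustration (the class package used later does the real work):

knn_one = function(train, labels, new, k = 3) {
  d    = sqrt(rowSums(sweep(train, 2, new)^2))  # Euclidean distances
  near = labels[order(d)[1:k]]                  # labels of the k nearest
  names(which.max(table(near)))                 # majority vote
}

# Toy usage on iris, with sepal measurements only.
knn_one(as.matrix(iris[, 1:2]), iris$Species, c(6.0, 3.0), k = 5)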

The choice of k is essential in building the KNN model; it is one of the most important factors and can strongly influence the quality of predictions. One useful way to look at the number of nearest neighbors k is as a smoothing parameter. For any given problem, a small value of k will lead to a large variance in predictions, while setting k to a large value may lead to a large model bias. Thus k should be set large enough to keep the variance of the predictions down, yet small enough (relative to the number of cases in the training sample) that the k nearest points are still close to the query point. Like any smoothing parameter, k has an optimal value that achieves the right trade-off between the bias and the variance of the model, and this value is typically estimated by cross-validation.
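
One way to do this is leave-one-out cross-validation with knn.cv() from the class package, which classifies each training point using all the remaining points; a sketch on the iris data (used purely for illustration):

library(class)

X  = scale(iris[, 1:4])            # kNN is distance-based, so scale first
ks = 1:20
err = sapply(ks, function(k)
  mean(knn.cv(X, iris$Species, k = k) != iris$Species))
ks[which.min(err)]                 # k with the smallest LOOCV error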

############################################################################ 
# South African Heart Disease Data 
############################################################################ 
# The Coronary Risk-Factor Study data involve 462 males between the ages of 15 and 64 from
# a heart-disease high-risk region of the Western Cape, South Africa.
# The response is "chd", the presence (chd=1) or absence (chd=0) of coronary heart disease.
# There are 9 covariates:
# sbp:      systolic blood pressure
# tobacco:  cumulative tobacco (kg)
# ldl:      low density lipoprotein cholesterol
# adiposity
# famhist:  family history of heart disease (Present, Absent)
# typea:    type-A behavior
# obesity
# alcohol:  current alcohol consumption
# age:      age at onset

Heart = read.table('https://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data',
                   sep=",", head=T, row.names=1)
dim(Heart)
## [1] 462  10
head(Heart)
##   sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
## 1 160   12.00 5.73     23.11 Present    49   25.30   97.20  52   1
## 2 144    0.01 4.41     28.61  Absent    55   28.87    2.06  63   1
## 3 118    0.08 3.48     32.28 Present    52   29.14    3.81  46   0
## 4 170    7.50 6.41     38.03 Present    51   31.99   24.26  58   1
## 5 134   13.60 3.50     27.78 Present    60   25.99   57.34  49   1
## 6 132    6.20 6.47     36.21 Present    62   30.77   14.14  45   0
Heart$chd = as.factor(Heart$chd)

set.seed(1)  # arbitrary seed, just to make the train/test split reproducible
Num_train = round(dim(Heart)[1]*0.8)   # 80% of the rows for training
Index_train = sample(nrow(Heart), Num_train)
Heart_train = Heart[Index_train,]
Heart_test = Heart[-Index_train,]
HeartFull = glm(chd~., data=Heart_train, family=binomial)
summary(HeartFull)
## 
## Call:
## glm(formula = chd ~ ., family = binomial, data = Heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8360  -0.8041  -0.4356   0.8536   2.5328  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.627964   1.500123  -4.418 9.95e-06 ***
## sbp             0.012319   0.006516   1.891 0.058668 .  
## tobacco         0.092421   0.030004   3.080 0.002068 ** 
## ldl             0.194224   0.066003   2.943 0.003254 ** 
## adiposity       0.030069   0.032798   0.917 0.359255    
## famhistPresent  0.986346   0.256582   3.844 0.000121 ***
## typea           0.043229   0.013649   3.167 0.001540 ** 
## obesity        -0.083934   0.048969  -1.714 0.086523 .  
## alcohol        -0.001928   0.004860  -0.397 0.691607    
## age             0.035942   0.013504   2.662 0.007776 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 478.49  on 369  degrees of freedom
## Residual deviance: 373.42  on 360  degrees of freedom
## AIC: 393.42
## 
## Number of Fisher Scoring iterations: 5
HeartStep = step(HeartFull, scope=list(upper=~., lower=~1), k=2, trace=0)  # stepwise selection by AIC
summary(HeartStep)
## 
## Call:
## glm(formula = chd ~ sbp + tobacco + ldl + famhist + typea + obesity + 
##     age, family = binomial, data = Heart_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9079  -0.8098  -0.4351   0.8751   2.5622  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -7.070551   1.419165  -4.982 6.29e-07 ***
## sbp             0.012583   0.006407   1.964 0.049544 *  
## tobacco         0.090811   0.029558   3.072 0.002124 ** 
## ldl             0.209208   0.064446   3.246 0.001169 ** 
## famhistPresent  0.976077   0.255085   3.826 0.000130 ***
## typea           0.042194   0.013575   3.108 0.001881 ** 
## obesity        -0.051441   0.032901  -1.564 0.117932    
## age             0.042571   0.011631   3.660 0.000252 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 478.49  on 369  degrees of freedom
## Residual deviance: 374.37  on 362  degrees of freedom
## AIC: 390.37
## 
## Number of Fisher Scoring iterations: 5
train_phat = HeartStep$fitted.values   # fitted probabilities on the training set
logistic_train_pred = (train_phat > 0.5)
table(Heart_train$chd, logistic_train_pred)
##    logistic_train_pred
##     FALSE TRUE
##   0   206   35
##   1    58   71
logistic_train_err = (35+58)/nrow(Heart_train)   # off-diagonal counts from the table above
logistic_train_err
## [1] 0.2513514
test_phat = predict(HeartStep, newdata=Heart_test, type="response")  # probabilities, not log-odds
logistic_test_pred = (test_phat>0.5)
table(Heart_test$chd, logistic_test_pred)
##    logistic_test_pred
##     FALSE TRUE
##   0    55    6
##   1    21   10
logistic_test_err = (6+21)/nrow(Heart_test)   # off-diagonal counts from the table above
logistic_test_err
## [1] 0.2934783
library(MASS)
LDA = lda(chd ~. ,data=Heart_train)
LDA_train_pred = predict(LDA, Heart_train[,-10])$class   # column 10 is the response chd
table(Heart_train$chd, LDA_train_pred)
##    LDA_train_pred
##       0   1
##   0 209  32
##   1  59  70
LDA_train_err = mean(LDA_train_pred != Heart_train$chd)
LDA_train_err
## [1] 0.2459459
LDA_test_pred=predict(LDA, Heart_test[,-10])$class
table(Heart_test$chd, LDA_test_pred)
##    LDA_test_pred
##      0  1
##   0 50 11
##   1 16 15
LDA_test_err = mean(LDA_test_pred != Heart_test$chd)
LDA_test_err
## [1] 0.2934783
QDA = qda(chd ~. ,data=Heart_train)
QDA_train_pred=predict(QDA, Heart_train[,-10])$class
table(Heart_train$chd, QDA_train_pred)
##    QDA_train_pred
##       0   1
##   0 210  31
##   1  52  77
QDA_train_err = mean(QDA_train_pred != Heart_train$chd)
QDA_train_err
## [1] 0.2243243
QDA_test_pred=predict(QDA, Heart_test[,-10])$class
table(Heart_test$chd, QDA_test_pred)
##    QDA_test_pred
##      0  1
##   0 49 12
##   1 17 14
QDA_test_err = mean(QDA_test_pred != Heart_test$chd)
QDA_test_err
## [1] 0.3152174
library(class)
# Drop famhist (a factor, column 5) and the response chd (column 10):
# knn() needs an all-numeric predictor matrix.
knn_test_pred <- knn(Heart_train[,-c(5,10)], Heart_test[,-c(5,10)], Heart_train$chd, k=20)
table(knn_test_pred, Heart_test$chd)
##              
## knn_test_pred  0  1
##             0 52 23
##             1  9  8
knn_test_err = mean(knn_test_pred != Heart_test$chd)
knn_test_err
## [1] 0.3478261
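
For a side-by-side view, the test error rates computed above can be collected into a single vector (the exact values depend on the random train/test split):

sort(c(logistic = logistic_test_err, LDA = LDA_test_err,
       QDA = QDA_test_err, kNN = knn_test_err))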