---
title: "STAT3612_2312_DATAMINING_8TH_TUTORIAL"
author: "YOU Jia"
date: "3/20/2017"
output:
  html_document:
    highlight: tango
    number_sections: yes
    theme: paper
    toc: yes
    toc_depth: 3
  html_notebook: default
---
# Discriminant Analysis
Discriminant analysis addresses a classification problem in which two or more groups (clusters or populations) are known a priori, and one or more new observations are assigned to one of these known populations based on their measured characteristics.
Discriminant analysis works by forming one or more linear combinations of the predictors. Each combination defines a new latent variable and is called a discriminant function. The first function maximizes the differences between groups. The second maximizes the remaining differences subject to being uncorrelated with the first, and each subsequent function is likewise required to be uncorrelated with all previous functions.
A few things to keep in mind for discriminant analysis:
* We have data from known groups.
* We want to determine one or more functions of the explanatory variables that separate those groups.
* The classification is based on group means.
* Our goal is to classify new data into one of the known groups using the discriminant function(s).
## Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a classification method originally developed in 1936 by R. A. Fisher. It is simple, mathematically robust, and often produces models whose accuracy is as good as that of more complex methods.
We can think of discriminant analysis as drawing lines in space to (in some sense) optimally separate groups. In the linear case with two groups, we would have a discriminant function
$$z = a_1x_1 + a_2x_2 + \cdots + a_px_p$$
The vector $a = (a_1, a_2, \ldots, a_p)'$ maximizes the ratio of the between-groups variance of $z$ to its within-groups variance. LDA carries the following assumptions:
* The data from each group have multivariate normal distributions.
* All groups share the same covariance matrix.
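The variance ratio that $a$ maximizes can be written out explicitly. The notation below is an assumption of this note rather than part of the tutorial: $\bar{x}_1$ and $\bar{x}_2$ denote the group mean vectors and $W$ the pooled within-group covariance matrix. For two groups, the criterion is
$$\lambda(a) = \frac{\left(a'\bar{x}_1 - a'\bar{x}_2\right)^2}{a'Wa}$$
and, by a standard result, it is maximized (up to an arbitrary scale factor) by
$$a \propto W^{-1}(\bar{x}_1 - \bar{x}_2)$$
The numerator is the squared distance between the two group means of $z$, and the denominator is the within-group variance of $z$.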
For two groups with equal prior probabilities (i.e. both groups are equally probable), we can compute $z_i$ for the $i$th observation and compare it with
$$z_c = \frac{\bar{z}_1 + \bar{z}_2}{2}$$
If group 1 has the smaller mean $z$ value, a new observation is classified into group 1 if $z_i < z_c$ and into group 2 otherwise. With more than two groups, there are multiple discriminant functions (linear in the case of LDA) partitioning the space.
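The two-group rule above can be sketched in a few lines of base R. This is a minimal illustration on simulated data, not the tutorial's own analysis: the data, the seed, and the helper name `classify` are all assumptions, and the direction $a$ is taken proportional to the inverse pooled covariance times the difference in group means, which is the standard two-group solution.

```{r}
set.seed(1)  # arbitrary seed, for reproducibility of the simulated data
# Simulated data (an assumption): two groups with a common covariance matrix
X1 <- matrix(rnorm(200, mean = 0), ncol = 2)  # group 1, 100 observations
X2 <- matrix(rnorm(200, mean = 2), ncol = 2)  # group 2, 100 observations

# Pooled within-group covariance matrix
n1 <- nrow(X1); n2 <- nrow(X2)
S_pooled <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)

# Discriminant direction: a is proportional to S^{-1} (mean of group 2 - mean of group 1)
a <- solve(S_pooled, colMeans(X2) - colMeans(X1))

# Discriminant scores and the midpoint cut-off z_c
z1 <- drop(X1 %*% a)
z2 <- drop(X2 %*% a)
z_c <- (mean(z1) + mean(z2)) / 2

# With this orientation group 1 has the smaller mean score,
# so scores below z_c are assigned to group 1
classify <- function(x_new) ifelse(drop(x_new %*% a) < z_c, 1, 2)

# Training error of the midpoint rule
train_err <- mean(c(classify(X1) != 1, classify(X2) != 2))
train_err
```

`classify()` accepts either a single observation, such as `classify(c(0.5, 1))`, or a matrix with one observation per row.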
## Quadratic Discriminant Analysis (QDA)
In quadratic discriminant analysis, we still assume multivariate normality but within-group covariances are no longer assumed to be equal and the discriminant functions are no longer linear, so QDA carries the following assumptions:
* The data from each group have multivariate normal distributions.
* The covariance matrices are allowed to differ between groups.
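To see where the quadratic term comes from, the sketch below scores a point under each group's own normal distribution and picks the larger score. It is an illustration under stated assumptions: the data are simulated, the priors are set to 0.5, and `qda_score` and `classify_qda` are hypothetical helper names. The score used is the usual Gaussian log-density plus log-prior, with constants common to all groups dropped.

```{r}
set.seed(2)  # arbitrary seed, for reproducibility of the simulated data
# Simulated data (an assumption): the two groups have different covariances
X1 <- matrix(rnorm(200, mean = 0, sd = 1), ncol = 2)  # group 1: tight cloud
X2 <- matrix(rnorm(200, mean = 2, sd = 3), ncol = 2)  # group 2: diffuse cloud

# Quadratic discriminant score of x for one group:
# -0.5 * log|Sigma| - 0.5 * (x - mu)' Sigma^{-1} (x - mu) + log(prior)
qda_score <- function(x, mu, Sigma, prior) {
  d <- x - mu
  -0.5 * log(det(Sigma)) - 0.5 * drop(t(d) %*% solve(Sigma, d)) + log(prior)
}

# Assign x to the group with the larger score, using each group's own covariance
classify_qda <- function(x) {
  s1 <- qda_score(x, colMeans(X1), cov(X1), prior = 0.5)
  s2 <- qda_score(x, colMeans(X2), cov(X2), prior = 0.5)
  if (s1 > s2) 1 else 2
}

classify_qda(c(0, 0))   # a point near the centre of group 1
classify_qda(c(8, -4))  # a point far from the tight group 1 cloud
```

Because each group contributes its own covariance matrix, the quadratic terms in `x` no longer cancel when two scores are compared, which is why the decision boundary is curved rather than linear.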
# k-Nearest Neighbors
The k-nearest neighbors (KNN) algorithm is one of the simplest machine learning algorithms and is an example of instance-based learning, where new data are classified based on stored, labeled instances. More specifically, the closeness of each stored instance to the new instance is quantified by a measure such as the Euclidean distance, the Manhattan distance or the cosine similarity. In other words, for any new data point you input into the system, its similarity to the data already in the system is calculated. You then use these similarity values for predictive modeling, which is either classification (assigning a label or class to the new instance) or regression (assigning a value to the new instance). Whether you classify or assign a value depends on how you set up your KNN model.
Once the distances from the new point to all stored data points have been calculated, the k-nearest neighbors algorithm sorts them and selects the k nearest neighbors. The labels of these neighbors are gathered, and a majority vote or a weighted vote decides the outcome. In other words, the more of the nearest neighbors that share a given class (or, under a weighted vote, the closer they are), the more likely the new instance is to receive that class. In the case of regression, the value assigned to the new data point is the mean of the values of its k nearest neighbors.
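The steps just described (compute distances, sort, take the k nearest, vote) can be sketched directly in base R. The helper `knn_vote` and the toy data are hypothetical and only meant to make the procedure concrete. The sketch assumes Euclidean distance and an unweighted majority vote.

```{r}
# Hypothetical helper: classify one query point by majority vote
# among its k nearest stored points
knn_vote <- function(train_x, train_y, query, k) {
  # Euclidean distance from the query to every stored point
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # Indices of the k smallest distances
  nearest <- order(d)[seq_len(k)]
  # Majority vote among the neighbours' labels
  # (a tie goes to the label that comes first in the table)
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy data (an assumption): two well-separated clusters labelled "A" and "B"
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(6, 6), c(6, 7), c(7, 6))
train_y <- c("A", "A", "A", "B", "B", "B")

knn_vote(train_x, train_y, query = c(1.5, 1.5), k = 3)
knn_vote(train_x, train_y, query = c(6.5, 6.5), k = 3)
```

For regression with a numeric `train_y`, the vote would be replaced by `mean(train_y[nearest])`.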
The choice of k is essential in building the KNN model. In fact, k can be regarded as one of the most important factors of the model, one that can strongly influence the quality of predictions. One appropriate way to look at the number of nearest neighbors k is to think of it as a smoothing parameter. For any given problem, a small value of k leads to a large variance in predictions, whereas a large value of k may lead to a large model bias. Thus, k should be large enough to minimize the probability of misclassification and small enough (relative to the number of cases in the sample) that the k nearest points are close to the query point. Like any smoothing parameter, k has an optimal value that achieves the right trade-off between the bias and the variance of the model. A suitable value of k can be estimated by cross-validation.
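One way to carry out this selection is leave-one-out cross-validation with `knn.cv()` from the `class` package. The sketch below is an illustration only: it uses R's built-in `iris` data rather than the heart-disease data, and the grid of candidate k values and the seed are arbitrary assumptions.

```{r}
library(class)

set.seed(3)                # arbitrary seed; ties in the vote are broken at random
X <- scale(iris[, 1:4])    # standardise so that no variable dominates the distance
y <- iris$Species

# Leave-one-out cross-validated misclassification rate for each candidate k
k_grid <- seq(1, 25, by = 2)
cv_err <- sapply(k_grid, function(k) mean(knn.cv(X, y, k = k) != y))

data.frame(k = k_grid, cv_err = cv_err)

# Candidate k with the smallest cross-validated error
best_k <- k_grid[which.min(cv_err)]
best_k
```

Very small k typically gives a noisy error estimate and very large k a biased one, so the minimum of `cv_err` is one reasonable compromise. When several values of k are tied, `which.min()` returns the smallest of them.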
```{r}
############################################################################
# South African Heart Disease Data
############################################################################
# The Coronary Risk‐Factor Study data involve 462 males between the ages of 15 and 64 from
# a heart‐disease high‐risk region of the Western Cape, South Africa.
# The response is "chd", the presence (chd=1) or absence (chd=0) of coronary heart disease.
# There are 9 covariates:
# sbp: systolic blood pressure
# tobacco: cumulative tobacco (kg)
# ldl: low density lipoprotein cholesterol
# adiposity
# famhist: family history of heart disease (Present, Absent)
# typea: type‐A behavior
# obesity
# alcohol: current alcohol consumption
# age: age at onset
Heart = read.table('https://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data',
                   sep=",", header=TRUE, row.names=1)
# attach(Heart) is not needed: every later call refers to the data frame explicitly
dim(Heart)
head(Heart)
Heart$chd = as.factor(Heart$chd)
set.seed(3612)  # arbitrary seed so that the random 80/20 split is reproducible
Num_train = round(nrow(Heart)*0.8)
Index_train = sample(nrow(Heart), Num_train)
Heart_train = Heart[Index_train,]
Heart_test = Heart[-Index_train,]
HeartFull = glm(chd~., data=Heart_train, family=binomial)
summary(HeartFull)
```
```{r, cache=FALSE}
HeartStep = step(HeartFull, scope=list(upper=~., lower=~1), k=2, trace=0)
summary(HeartStep)
```
```{r}
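train_phat = HeartStep$fitted.values   # fitted probabilities P(chd = 1)
logistic_train_pred = (train_phat > 0.5)
table(Heart_train$chd, logistic_train_pred)
# Compute the error from the predictions instead of hard-coding counts,
# because the confusion table depends on the random train/test split
logistic_train_err = mean(logistic_train_pred != (Heart_train$chd == "1"))
logistic_train_err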
```
```{r}
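# type = "response" returns probabilities; the default returns log-odds,
# for which 0.5 is not the right threshold
test_phat = predict(HeartStep, newdata = Heart_test, type = "response")
logistic_test_pred = (test_phat > 0.5)
table(Heart_test$chd, logistic_test_pred)
# Computed from the predictions rather than from hard-coded counts
logistic_test_err = mean(logistic_test_pred != (Heart_test$chd == "1"))
logistic_test_err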
```
```{r}
library(MASS)
LDA = lda(chd ~. ,data=Heart_train)
LDA_train_pred=predict(LDA, Heart_train[,-10])$class
table(Heart_train$chd, LDA_train_pred)
LDA_train_err = mean(LDA_train_pred != Heart_train$chd)
LDA_train_err
```
```{r}
LDA_test_pred=predict(LDA, Heart_test[,-10])$class
table(Heart_test$chd, LDA_test_pred)
LDA_test_err = mean(LDA_test_pred != Heart_test$chd)
LDA_test_err
```
```{r}
QDA = qda(chd ~. ,data=Heart_train)
QDA_train_pred=predict(QDA, Heart_train[,-10])$class
table(Heart_train$chd, QDA_train_pred)
QDA_train_err = mean(QDA_train_pred != Heart_train$chd)
QDA_train_err
```
```{r}
QDA_test_pred=predict(QDA, Heart_test[,-10])$class
table(Heart_test$chd, QDA_test_pred)
QDA_test_err = mean(QDA_test_pred != Heart_test$chd)
QDA_test_err
```
```{r}
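library(class)
# Drop famhist (column 5, a factor) and the response chd (column 10) from the predictors:
# knn() needs numeric inputs and must not see the label it is predicting.
# Note: the predictors are left unscaled here, so variables with large ranges dominate the distance.
knn_test_pred <- knn(Heart_train[,-c(5,10)], Heart_test[,-c(5,10)], Heart_train$chd, k=20)
table(Heart_test$chd, knn_test_pred)
knn_test_err = mean(knn_test_pred != Heart_test$chd)
knn_test_err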
```