---
title: "STAT3612_2312_DATAMINING_7TH_TUTORIAL"
author: "YOU Jia"
date: "3/20/2017"
output:
  html_document:
    highlight: tango
    number_sections: yes
    theme: paper
    toc: yes
    toc_depth: 3
  html_notebook: default
---
# Linear regression
In linear regression, we consider models of the form
$$y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i}+ ... + \beta_px_{pi} + \epsilon_i,$$
where $\epsilon_i \sim N(0, \sigma^2)$.
Alternatively, in terms of the fitted model we can write
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{1i} + \hat{\beta}_2x_{2i}+ ... + \hat{\beta}_px_{pi},$$
where each $Y_i$ is normally distributed with mean estimated by $\hat{y}_i$ and variance estimated by $\hat{\sigma}^2$.
In linear regression, each response is assumed to come from a normal distribution with mean equal to its expected value and a common variance $\sigma^2$, estimated by $\hat{\sigma}^2$. The response is continuous and can potentially take on any real value.
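As a quick sketch of the model above, we can fit a linear model to simulated data (hypothetical, not part of the tutorial's dataset) and check that `lm()` recovers the coefficients and that the residual standard error estimates $\sigma$:

```{r}
# Simulated illustration: fit a linear model and check that the
# residual standard error estimates the true sigma.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2*x1 - 0.5*x2 + rnorm(n, sd = 1.5)  # true sigma = 1.5
fit <- lm(y ~ x1 + x2)
coef(fit)    # estimates of beta_0, beta_1, beta_2
sigma(fit)   # estimate of sigma, should be close to 1.5
```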
In many situations the assumptions of the linear regression model will not be appropriate. One such case is in modeling a dichotomous response (a response that can only take on two possible values).
# Logistic Regression
In such cases we would instead prefer to model the probability of observing one of the possible outcomes. We can code our dichotomous response variable as 0's and 1's and model the probability $\pi$ of being in case 1 (or case 0 if we prefer) given the predictors. Our binary response is then distributed according to a Bernoulli distribution,
$$Y_i \sim Bernoulli(\pi_i),$$
and the expected value of $Y_i$ is $E[Y_i] = \pi_i$.
The odds of an event is the probability that the event happens divided by the probability that it doesn't happen. In the logistic regression model, we model the log odds as a linear function of the predictors:
$$\log\big(\frac{\hat{\pi}_i}{1-\hat{\pi}_i}\big) = \hat{\beta}_0 + \hat{\beta}_1x_{1i} + \hat{\beta}_2x_{2i}+ ... + \hat{\beta}_px_{pi}$$
The expected odds would then be given by
$$\frac{\hat{\pi}_i}{1-\hat{\pi}_i} = \exp\big(\hat{\beta}_0 + \hat{\beta}_1x_{1i} + \hat{\beta}_2x_{2i}+ ... + \hat{\beta}_px_{pi}\big)$$
Thus the estimated probability is
$$\hat{\pi}_i = \frac{\exp\big(\hat{\beta}_0 + \hat{\beta}_1x_{1i} + \hat{\beta}_2x_{2i}+ ... + \hat{\beta}_px_{pi}\big)}{1+\exp\big(\hat{\beta}_0 + \hat{\beta}_1x_{1i} + \hat{\beta}_2x_{2i}+ ... + \hat{\beta}_px_{pi}\big)}$$
where $\exp$ is the exponential function.
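Base R provides this pair of maps directly: `qlogis()` computes the log odds (the logit) and `plogis()` is its inverse, the function shown above. A minimal sketch:

```{r}
# plogis(eta) = exp(eta) / (1 + exp(eta)); qlogis() is its inverse, the log-odds
eta    <- c(-5, 0, 5)          # example linear-predictor values
pi_hat <- plogis(eta)          # probabilities, strictly between 0 and 1
pi_hat
all.equal(qlogis(pi_hat), eta) # TRUE: the two functions are inverses
```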
Several things to note:
* We no longer have an assumption of normality
* Estimates $\hat{\beta}_j$ come from maximum likelihood estimation, assuming Bernoulli (or binomial) distributed responses
* The $\hat{\pi}_i$ values will fall between 0 and 1
* Holding all other predictor variables constant, an increase of 1 in the $j$th predictor multiplies the odds by a factor of $\exp(\hat{\beta}_j)$
* The odds ratio between two sets of conditions (values of the predictor variables) tells us how much greater (or smaller) the odds of an event is under one set of conditions than under another (e.g. how many times greater the odds of a specific event is for females than for males, or how much the odds changes when a continuous predictor increases by 1)
* Confidence intervals for the $\hat{\beta}_j$ will be symmetric, but confidence intervals for the odds and odds ratios will not be symmetric because they are based on exponentials of the symmetric intervals.
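The last point can be checked numerically. The sketch below uses simulated data (hypothetical, not the heart-disease data loaded later): Wald intervals from `confint.default()` are symmetric about $\hat{\beta}_j$, while the exponentiated intervals for the odds ratio are not symmetric about $\exp(\hat{\beta}_j)$.

```{r}
# Simulated illustration: symmetric CI for beta, asymmetric CI for the odds ratio
set.seed(2)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.8*x))
fit <- glm(y ~ x, family = binomial)
b  <- coef(fit)["x"]
ci <- confint.default(fit)["x", ]            # Wald interval, symmetric about b
c(b - ci[1], ci[2] - b)                      # equal half-widths
c(exp(b) - exp(ci[1]), exp(ci[2]) - exp(b))  # unequal half-widths
```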
If we have counts of events out of a possible number of trials (e.g. if we count the number of heads out of $n$ coin tosses), our counts follow binomial distributions instead of Bernoulli (recall that a Bernoulli is a binomial with $n=1$), and we now have
$$Y_i \sim Bin(n_i,\pi_i),$$
and the expected value of $Y_i$ is $E[Y_i] = n_i\pi_i$.
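A sketch of the binomial-counts case on simulated data (hypothetical `heads` out of `n_i` tosses, not from the tutorial's dataset): `glm` accepts a two-column response of successes and failures.

```{r}
# Simulated illustration: logistic regression for binomial counts
set.seed(3)
x     <- seq(-2, 2, length.out = 20)
n_i   <- rep(50, 20)                               # trials per observation
heads <- rbinom(20, size = n_i, prob = plogis(0.3 + 1.2*x))
fit   <- glm(cbind(heads, n_i - heads) ~ x, family = binomial)
coef(fit)                # estimates of beta_0 and beta_1
head(fitted(fit) * n_i)  # expected counts n_i * pi_hat_i
```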
```{r}
############################################################################
# South African Heart Disease Data
############################################################################
# The Coronary Risk‐Factor Study data involve 462 males between the ages of 15 and 64 from
# a heart‐disease high‐risk region of the Western Cape, South Africa.
# The response is "chd", the presence (chd=1) or absence (chd=0) of coronary heart disease.
# There are 9 covariates:
# sbp: systolic blood pressure
# tobacco: cumulative tobacco (kg)
# ldl: low density lipoprotein cholesterol
# adiposity
# famhist: family history of heart disease (Present, Absent)
# typea: type‐A behavior
# obesity
# alcohol: current alcohol consumption
# age: age at onset
Heart = read.table('https://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data',
                   sep=",", header=TRUE, row.names=1)
head(Heart)
HeartFull = glm(chd~., data=Heart, family=binomial)
summary(HeartFull)
```
```{r, cache=FALSE}
HeartStep = step(HeartFull, scope=list(upper=~., lower=~1), k=2)
summary(HeartStep)
```
```{r}
phat = HeartStep$fitted.values
mypred = (phat > 0.5)
table(Heart$chd, mypred)
# training misclassification rate: proportion of off-diagonal entries
error = mean(mypred != Heart$chd)
error
```