---
title: "Tutorial 7: Logistic Regression"
author: "[Simon & Zebin]"
date: "3/20/2018 & 3/22/2018"
header-includes:
- \usepackage{amsmath}
output:
html_document:
fig_width: 10
fig_height: 6
highlight: tango
number_sections: yes
theme: paper
toc: yes
toc_depth: 3
html_notebook: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
First, let's start with the logistic function. A logistic function takes any real input $t$ and outputs a value between zero and one, so its output can be interpreted as a probability.
$$\sigma(t) =\frac{1}{1+e^{-t}}$$
Next, replace $t$ with a linear function of an explanatory variable $x$, e.g., $t=\beta _{0}+\beta _{1}x$, and we obtain:
$$p(x)={\frac {1}{1+e^{-(\beta _{0}+\beta _{1}x)}}}$$
where $p(x) = P(Y_{i}=1\mid x)$ and $e^{\beta _{0}+\beta _{1}x}$ is the odds of the event. By definition, the odds is the probability that an event happens divided by the probability that it doesn't happen. In logistic regression, we model the log odds as a linear function:
$$\log\big(\frac{p(x)}{1-p(x)}\big) = \beta _0 + \beta _{1}x$$
The model extends naturally to multiple explanatory variables:
$$p(x)={\frac {1}{1+e^{-(\beta _{0}+\beta _{1}x_{1}+\beta _{2}x_{2}+\cdots +\beta _{m}x_{m})}}}$$
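The definitions above can be checked numerically in a few lines of base R; the coefficient values `beta0` and `beta1` here are made up purely for illustration:

```{r}
# Logistic (sigmoid) function: maps any real t to (0, 1)
sigmoid <- function(t) 1 / (1 + exp(-t))

# Hypothetical coefficients for illustration
beta0 <- -1.5
beta1 <- 0.8
x <- 2

p <- sigmoid(beta0 + beta1 * x)   # P(Y = 1 | x), always in (0, 1)
log_odds <- log(p / (1 - p))      # log odds recover the linear predictor

# The log odds equal beta0 + beta1 * x, confirming the identity above
all.equal(log_odds, beta0 + beta1 * x)
```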
The difference between linear regression and logistic regression:
* In linear regression, we deal with a continuous response, and the error term is assumed to be zero-mean normal. (However, this assumption may not be appropriate in some problems.)
* In logistic regression, we model the probability of a binary response, which follows a Bernoulli distribution.
```{r echo=FALSE, fig.align="center", out.width = "800px", eval=T}
knitr::include_graphics(c('linear_vs_logistic_regression.jpg'))
```
(Image from: www.machinelearningplus.com/logistic-regression-tutorial-examples-r/)
Several things to notice:
* We no longer have an assumption of normality;
* Estimates $\hat{\beta}_j$ come from maximum likelihood estimation assuming Bernoulli (or binomial) distributed responses;
* Holding all other predictor variables constant, a one-unit increase in the $j$-th predictor multiplies the odds by $\exp(\hat{\beta}_j)$;
* The odds ratio between two sets of conditions (values of the predictor variables) tells us how much greater (or smaller) the odds of an event is under one set of conditions than under the other (e.g., how many times greater the odds of a specific event is for females than for males, or how much the odds change when a continuous predictor increases by one unit);
* Confidence intervals for the $\hat{\beta}_j$ will be symmetric, but confidence intervals for the odds and odds ratios will not, because they are based on exponentials of the symmetric intervals.
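The odds interpretation can be made concrete with a small sketch; the coefficients are made-up illustration values. Whatever the starting value of $x$, a one-unit increase multiplies the odds by $\exp(\beta_1)$:

```{r}
sigmoid <- function(t) 1 / (1 + exp(-t))

beta0 <- -1; beta1 <- 0.5   # hypothetical coefficients
odds  <- function(x) { p <- sigmoid(beta0 + beta1 * x); p / (1 - p) }

# Odds ratio for a one-unit increase in x
or <- odds(3) / odds(2)
all.equal(or, exp(beta1))   # the ratio equals exp(beta1) regardless of x
```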
Some extensions:
* Assuming binomial distributions instead of Bernoulli (recall that Bernoulli is a binomial with n=1);
* Softmax regression as a multi-class extension of logistic regression.
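As a brief sketch of the softmax extension (not used in the examples below): each of the $c$ classes gets its own linear score, and the probabilities are normalized exponentials of those scores.

```{r}
# Softmax: generalizes the logistic function to c classes
softmax <- function(z) { e <- exp(z - max(z)); e / sum(e) }  # shift by max(z) for numerical stability

scores <- c(1.0, 0.5, -0.2)   # hypothetical linear scores for 3 classes
probs  <- softmax(scores)
sum(probs)                    # probabilities are positive and sum to 1
```

With $c=2$ classes, softmax reduces to the logistic function applied to the difference of the two scores.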
# Example 1: Logistic regression for Bank Marketing Prediction
A Bank Marketing Dataset for predicting whether a client will subscribe to a term deposit (binary: "yes", "no").
With accurate prediction, this model could be used to select target clients and reduce cost.
## Data Introduction
Client Basic Information:
* 1 - age (numeric)
* 2 - job: (categorical: "blue-collar", "entrepreneur", "student", "services", "unemployed", ...)
* 3 - marital: marital status (categorical: "divorced", "married", "single", "unknown")
* 4 - education (categorical: "basic9y", "high.school", "university.degree", "illiterate", ...)
* 5 - default: has credit in default? (categorical: "no","yes","unknown")
* 6 - housing: has housing loan? (categorical: "no","yes","unknown")
* 7 - loan: has personal loan? (categorical: "no","yes","unknown")
Last Contact Information:
* 8 - contact: contact communication type (categorical: "cellular","telephone")
* 9 - month: last contact month of year (categorical: "Jan", "Feb", "Mar", ...)
* 10 - day_of_week: last contact day of the week (categorical: "Mon","Tue","Wed","Thu","Fri")
* 11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
* 12 - campaign: number of contacts performed during this campaign and for this client
* 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign
(numeric; 999 means client was not previously contacted)
* 14 - previous: number of contacts performed before this campaign and for this client (numeric)
* 15 - poutcome: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")
Social and Economic Context Attributes
* 16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
* 17 - cons.price.idx: consumer price index - monthly indicator (numeric)
* 18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
* 19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
* 20 - nr.employed: number of employees - quarterly indicator (numeric)
## Experiments
```{r, results='hide', warning=FALSE, message=FALSE}
library(dplyr)
library(dummies)
require(foreign)
require(ggplot2)
require(MASS)
require(Hmisc)
require(reshape2)
library(glmnet)
options(warn=-1) #suppress warnings
```
```{r}
bank_additional_full <- read.csv("bank-additional-full.csv",header=TRUE)
colnames(bank_additional_full)
```
Divide the data into training and testing sets
```{r}
set.seed(123) # fix the random seed so the split is reproducible
sample_index = sample(1:nrow(bank_additional_full),floor(nrow(bank_additional_full)*0.80))
bank_train = bank_additional_full[sample_index,]
bank_test = bank_additional_full[-sample_index,]
```
Form Data Matrix
```{r}
x <- data.matrix(bank_train[,-ncol(bank_train)])
y <- data.matrix(bank_train[,ncol(bank_train)])
```
Use `cv.glmnet` to fit an L1-regularized (lasso) logistic regression, selecting the penalty by cross-validation
```{r}
logmodel <- cv.glmnet(x,y, alpha=1,family="binomial")
coef <- as.matrix(coef(logmodel,s=logmodel$lambda.min))
```
Construct the predicted probability of "yes" for each test observation
```{r}
tx <- data.matrix(bank_test[,-ncol(bank_test)])
p <- vector("numeric",length=0)
for (i in 1:nrow(tx)) {
s <- sum(tx[i,]*coef[2:length(coef)])
p[i] <- exp(coef[1] + s)/(1+exp(coef[1] + s))
}
bank_test$Prob <- p
```
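The manual loop above can also be replaced by glmnet's own `predict` method. A sketch on synthetic data (so it runs standalone), assuming the `glmnet` package is installed, comparing the two approaches:

```{r}
library(glmnet)

set.seed(1)
x <- matrix(rnorm(200 * 5), 200, 5)                 # synthetic predictors
y <- rbinom(200, 1, plogis(x[, 1] - 0.5 * x[, 2]))  # synthetic binary response

fit  <- cv.glmnet(x, y, alpha = 1, family = "binomial")
beta <- as.matrix(coef(fit, s = fit$lambda.min))

# Manual probabilities, computed as in the loop above (vectorized)
eta      <- beta[1] + x %*% beta[-1]
p_manual <- as.vector(1 / (1 + exp(-eta)))

# Built-in equivalent
p_pred <- as.vector(predict(fit, newx = x, s = "lambda.min", type = "response"))

max(abs(p_manual - p_pred))   # agree up to numerical precision
```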
Display a portion of the testing dataset
```{r}
tail(bank_test,5)
```
Evaluate the overall accuracy of predicting whether a customer subscribed to a term deposit, using a 0.5 probability threshold
```{r}
bank_test$Pred <- ifelse(bank_test$Prob > 0.5,"yes","no")
length(which(bank_test$Pred == bank_test$y))/nrow(bank_test)
```
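Overall accuracy can hide class imbalance (most clients say "no"), so it is also worth cross-tabulating predictions against the truth. A minimal sketch with made-up label vectors:

```{r}
# Hypothetical predicted and true labels
truth <- c("no", "no", "yes", "no", "yes", "no")
pred  <- c("no", "no", "no",  "no", "yes", "yes")

cm <- table(Predicted = pred, Actual = truth)
cm

accuracy <- sum(diag(cm)) / sum(cm)
accuracy   # 4 of 6 correct
```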
# Example 2: Logistic Regression on Ordinal Data
This section introduces logistic regression techniques for ordinal data.
An ordinal variable is a categorical variable whose levels have a natural order.
## Data Introduction
The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine, and the task is to predict the quality of wines on a score between 0 and 10.
The input variables in the two datasets are the following:
* 1 - fixed acidity
* 2 - volatile acidity
* 3 - citric acid
* 4 - residual sugar
* 5 - chlorides
* 6 - free sulfur dioxide
* 7 - total sulfur dioxide
* 8 - density
* 9 - pH
* 10 - sulphates
* 11 - alcohol
```{r}
red_wines <- read.csv("winequality-red.csv",header=TRUE,stringsAsFactors=FALSE)
white_wines <- read.csv("winequality-white.csv",header=TRUE,stringsAsFactors=FALSE)
red_wines$type <- "red"
white_wines$type <- "white"
wines <- rbind(red_wines,white_wines)
wines$type <- factor(wines$type)
head(wines,5)
```
Summary of the Data
```{r}
table(factor(wines$quality))
```
```{r}
sapply(wines[, 1:11], summary)
```
```{r}
sapply(wines[, 1:11], sd)
```
Display of Boxplots
```{r}
ggplot(wines, aes(x = factor(quality), y = pH)) + # quality must be a factor to get one box per level
geom_boxplot(size = .75) +
geom_jitter(alpha = .5) +
facet_grid(~ type, margins = TRUE) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
```
## Experiments
Cumulative Logit Model with proportional odds
* $Y$ - Categorical variable with $c$ categories
* $X$ - Explanatory variables
$$P\left( Y\leq j\right) , j=1,2,\ldots,c-1$$
$$\mathrm{logit}\left( P\left( Y\leq j\right) \right) =\log\left[ \frac{P\left( Y\leq j\right)}{P\left( Y>j\right)} \right] =\alpha_{j}+\beta^{\prime}X$$
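To see how the cutpoints $\alpha_j$ and a single shared slope $\beta$ generate class probabilities, a small numeric sketch (the $\alpha$ and $\beta$ values are made up; note that `polr` below parameterizes the model as $\alpha_j - \beta^{\prime}x$):

```{r}
sigmoid <- function(t) 1 / (1 + exp(-t))

alpha <- c(-1, 1)   # cutpoints alpha_1 < alpha_2 for c = 3 classes
beta  <- 0.7        # shared slope (proportional odds)
x     <- 0.5

# Cumulative probabilities P(Y <= j), increasing in j
cum_p <- sigmoid(alpha + beta * x)

# Individual class probabilities from successive differences
p <- diff(c(0, cum_p, 1))
p
sum(p)   # sums to 1
```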
check the proportional odds assumption by fitting separate binary logistic regressions at different cutpoints and comparing the coefficient for `type`
```{r}
glm(I(quality >= 4) ~ type, family="binomial", data = wines)
```
```{r}
glm(I(quality >= 6) ~ type, family="binomial", data = wines)
```
Since the assumption appears to hold only at certain cutpoints, it may be better to group some of the quality levels.
```{r}
wines$quality2 <- ifelse(wines$quality %in% c(3,4,5),1,ifelse(wines$quality %in% c(7,8,9),3,2))
```
recheck the proportional odds assumption
```{r}
glm(I(quality2 > 1) ~ type, family="binomial", data = wines)
```
```{r}
glm(I(quality2 > 2) ~ type, family="binomial", data = wines)
```
The coefficient for `type` is now similar across the two cutpoints, so the assumption appears valid.
Divide the data into training and testing sets
```{r}
wines$quality2 <- factor(wines$quality2)
set.seed(123) # fix the random seed so the split is reproducible
sample_index = sample(1:nrow(wines),floor(nrow(wines)*0.80))
wines_train = wines[sample_index,]
wines_test = wines[-sample_index,]
```
fit ordered logit model and store results in 'm'
```{r}
vlist <- paste(colnames(wines)[-c(12,14)],collapse=" + ")
m <- polr(paste("quality2 ~",vlist), data = wines_train, Hess=TRUE)
```
view a summary of the model
```{r}
summary(m)
```
store the coefficient table
```{r}
ctable <- coef(summary(m))
```
calculate and store the p values
```{r}
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
```
combine the table with the p values
```{r}
ctable <- cbind(ctable, "p value" = p)
```
Profile Confidence Interval
```{r, eval=FALSE}
ci <- confint(m) # default method gives profiled CIs
```
Confidence Interval assuming Normality
```{r}
ci <- confint.default(m) # CIs assuming normality
```
Calculating the odds ratio
```{r}
exp(coef(m))
```
OR and CI
```{r}
exp(cbind(OR = coef(m), ci))
```
Obtaining the predicted class probabilities
```{r}
testdat <- cbind(wines_test, predict(m, wines_test, type = "probs"))
```
show first few rows
```{r}
head(testdat,5)
```
Use the probability-weighted average class (expected class) to predict class membership
```{r}
s <- vector("numeric",length=0)
for (i in 1:nrow(testdat)) {
s[i] <- round(sum(testdat[i,c("1","2","3")]*c(1:3)))
}
testdat$pred_quality = s
```
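The loop above can be vectorized with a matrix product. A sketch on a small made-up probability matrix (rows are observations, columns are the class probabilities for classes 1 to 3):

```{r}
# Hypothetical class probabilities (each row sums to 1)
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.2, 0.7), nrow = 2, byrow = TRUE)

# Expected class = probability-weighted average of class labels, then rounded
expected_class <- round(as.vector(probs %*% (1:3)))
expected_class   # 1 for the first row, 3 for the second
```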
The accuracy of using the average class
```{r}
length(which(testdat$quality2 == testdat$pred_quality))/nrow(testdat)
```
Use max probability to predict class membership
```{r}
s <- vector("numeric",length=0)
for (i in 1:nrow(testdat)) {
s[i] <- which(testdat[i,c("1","2","3")]==max(testdat[i,c("1","2","3")]))
}
testdat$pred_quality_max = s
```
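The argmax loop can likewise be written with base R's `max.col`, again sketched on a made-up probability matrix:

```{r}
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.2, 0.7), nrow = 2, byrow = TRUE)

# Index of the most probable class per row (ties broken by first occurrence)
pred_class <- max.col(probs, ties.method = "first")
pred_class   # 1 for the first row, 3 for the second
```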
The accuracy of using max probability
```{r}
length(which(testdat$quality2 == testdat$pred_quality_max))/nrow(testdat)
```