First, let’s start with the logistic function. A logistic function takes any real input \(t\) and outputs a value between zero and one; the output can therefore be interpreted as a probability.

\[\sigma(t) =\frac{1}{1+e^{-t}}\]

Next, replacing \(t\) with a linear function of an explanatory variable \(x\), e.g., \(t=\beta _{0}+\beta _{1}x\), we obtain:

\[p(x)={\frac {1}{1+e^{-(\beta _{0}+\beta _{1}x)}}}\]

where \(p(x) = P(Y_{i}=1\mid x)\) and \(e^{\beta _{0}+\beta _{1}x}\) is the odds of the event. By definition, the odds of an event is the probability that it happens divided by the probability that it does not. In logistic regression, we model the log odds as a linear function:

\[\log\Big(\frac{p(x)}{1-p(x)}\Big) = \beta _0 + \beta _{1}x\]
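Exponentiating the log odds and solving for \(p(x)\) shows that the two forms are equivalent:

\[\frac{p(x)}{1-p(x)} = e^{\beta _{0}+\beta _{1}x} \quad\Rightarrow\quad p(x) = \frac{e^{\beta _{0}+\beta _{1}x}}{1+e^{\beta _{0}+\beta _{1}x}} = \frac{1}{1+e^{-(\beta _{0}+\beta _{1}x)}}\]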

The model also extends to multiple explanatory variables:

\[p(x)={\frac {1}{1+e^{-(\beta _{0}+\beta _{1}x_{1}+\beta _{2}x_{2}+\cdots +\beta _{m}x_{m})}}}\]
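As a quick sanity check, the multivariate form can be evaluated directly in R (`plogis` is base R’s logistic function; the coefficients below are made up purely for illustration):

```
# hypothetical coefficients: intercept and two slopes
beta <- c(-1.5, 0.8, 0.3)
x    <- c(1.2, -0.5)                    # one observation with two predictors
eta  <- beta[1] + sum(beta[-1] * x)     # linear predictor
p1   <- 1 / (1 + exp(-eta))             # logistic function by hand
p2   <- plogis(eta)                     # base R equivalent
c(p1, p2)                               # identical, and always in (0, 1)
```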

The differences between linear regression and logistic regression:

- In linear regression, we deal with a continuous response, and the error term is assumed to follow a zero-mean normal distribution. (This assumption may not be appropriate for some problems.)
- In logistic regression, we model the probability of a binary response, which follows a Bernoulli distribution.

(Image from: www.machinelearningplus.com/logistic-regression-tutorial-examples-r/)
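The contrast can be seen directly in R with simulated data (all names and numbers below are invented for illustration):

```
set.seed(42)
x <- rnorm(100)
y_cont <- 2 + 3 * x + rnorm(100)                  # continuous response, normal errors
y_bin  <- rbinom(100, 1, plogis(-0.5 + 1.5 * x))  # Bernoulli response

fit_lm  <- lm(y_cont ~ x)                     # linear regression
fit_glm <- glm(y_bin ~ x, family = binomial)  # logistic regression
coef(fit_lm); coef(fit_glm)
```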

Several things to note:

- We no longer have an assumption of normality;
- Estimates \(\hat{\beta}_j\) come from maximum likelihood estimation, assuming Bernoulli (or binomial) distributed responses;
- Holding all other predictor variables constant, a change of 1 in the \(j\)-th predictor multiplies the odds by \(\exp(\hat{\beta}_j)\);
- The odds ratio between two sets of conditions (values of the predictor variables) tells us how much greater (or smaller) the odds of an event is under one set of conditions than under another (e.g., how many times greater the odds of a specific event is for females than for males, or how much the odds change when a continuous predictor increases);
- Confidence intervals for the \(\hat{\beta}_j\) are symmetric, but confidence intervals for the odds and odds ratios are not, because they are obtained by exponentiating the symmetric intervals.
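A small numerical illustration of the last two points, with made-up estimates:

```
beta_hat <- 0.7; se <- 0.2                    # hypothetical estimate and standard error
ci_beta <- beta_hat + c(-1, 1) * 1.96 * se    # symmetric 95% CI for the coefficient
or      <- exp(beta_hat)                      # odds ratio per 1-unit change
ci_or   <- exp(ci_beta)                       # CI for the odds ratio
or - ci_or[1]; ci_or[2] - or                  # half-widths differ: the interval is asymmetric
```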

Some extensions:

- Assuming binomial distributions instead of Bernoulli (recall that Bernoulli is a binomial with n=1);
- Softmax regression as a multi-class extension of logistic regression.
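A minimal sketch of the softmax function underlying the multi-class extension; note that it reduces to the logistic function in the two-class case:

```
softmax <- function(z) {
  z <- z - max(z)          # subtract the max for numerical stability
  exp(z) / sum(exp(z))
}
p <- softmax(c(2, 1, 0.5))
sum(p)                     # the probabilities sum to 1
# two-class case matches the sigmoid: softmax(c(t, 0))[1] equals plogis(t)
softmax(c(1.3, 0))[1]; plogis(1.3)
```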

A Bank Marketing Dataset is used to predict whether a client will subscribe to a term deposit (binary: “yes”, “no”). With accurate predictions, the model could be used to select target clients and reduce marketing cost.

Client Basic Information:

- 1 - age (numeric)
- 2 - job: type of job (categorical: “blue-collar”, “entrepreneur”, “student”, “services”, “unemployed”, …)
- 3 - marital: marital status (categorical: “divorced”, “married”, “single”, “unknown”)
- 4 - education (categorical: “basic.9y”, “high.school”, “university.degree”, “illiterate”, …)
- 5 - default: has credit in default? (categorical: “no”,“yes”,“unknown”)
- 6 - housing: has housing loan? (categorical: “no”,“yes”,“unknown”)
- 7 - loan: has personal loan? (categorical: “no”,“yes”,“unknown”)

Last Contact Information:

- 8 - contact: contact communication type (categorical: “cellular”,“telephone”)
- 9 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …)
- 10 - day_of_week: last contact day of the week (categorical: “mon”, “tue”, “wed”, “thu”, “fri”)
- 11 - duration: last contact duration, in seconds (numeric). Important note: this attribute strongly affects the output target (e.g., if duration=0 then y=“no”), yet the duration is not known before a call is performed, and after the call ends y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to build a realistic predictive model.
- 12 - campaign: number of contacts performed during this campaign and for this client (numeric)
- 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means the client was not previously contacted)
- 14 - previous: number of contacts performed before this campaign and for this client (numeric)
- 15 - poutcome: outcome of the previous marketing campaign (categorical: “failure”, “nonexistent”, “success”)

Social and Economic Context Attributes:

- 16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
- 17 - cons.price.idx: consumer price index - monthly indicator (numeric)
- 18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
- 19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
- 20 - nr.employed: number of employees - quarterly indicator (numeric)
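Two of the notes above suggest preprocessing before modeling: drop `duration` for a realistic model, and recode the `999` sentinel in `pdays`. A sketch on a toy data frame (the column names follow the dataset; the values are invented):

```
toy <- data.frame(duration = c(120, 0, 305),
                  pdays    = c(999, 6, 999))
toy$duration <- NULL                    # not known before the call is made
toy$prev_contacted <- toy$pdays != 999  # indicator: previously contacted?
toy$pdays[toy$pdays == 999] <- NA       # 999 is a sentinel, not a real count
toy
```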

```
# Load required packages
library(dplyr)
library(dummies)
library(foreign)
library(ggplot2)
library(MASS)
library(Hmisc)
library(reshape2)
library(glmnet)
options(warn = -1)  # suppress warnings
```

```
# the UCI bank-additional-full.csv file is semicolon-delimited
bank_additional_full <- read.csv("bank-additional-full.csv", header = TRUE, sep = ";")
colnames(bank_additional_full)
```

```
## [1] "age" "job" "marital" "education"
## [5] "default" "housing" "loan" "contact"
## [9] "month" "day_of_week" "duration" "campaign"
## [13] "pdays" "previous" "poutcome" "emp.var.rate"
## [17] "cons.price.idx" "cons.conf.idx" "euribor3m" "nr.employed"
## [21] "y"
```

Divide the data into training and testing sets

```
set.seed(1)  # for a reproducible split
sample_index <- sample(1:nrow(bank_additional_full), floor(nrow(bank_additional_full) * 0.80))
bank_train <- bank_additional_full[sample_index, ]
bank_test  <- bank_additional_full[-sample_index, ]
```

Form the data matrix

```
# data.matrix converts factor columns to their integer codes
x <- data.matrix(bank_train[, -ncol(bank_train)])
y <- data.matrix(bank_train[, ncol(bank_train)])
```

Use glmnet to perform regularization and find parameter estimates

```
logmodel <- cv.glmnet(x, y, alpha = 1, family = "binomial")  # alpha = 1: lasso penalty
coef <- as.matrix(coef(logmodel, s = logmodel$lambda.min))   # coefficients at the CV-optimal lambda
```

Construct the predicted probability of “yes”

```
tx <- data.matrix(bank_test[, -ncol(bank_test)])
p <- vector("numeric", length = nrow(tx))
for (i in 1:nrow(tx)) {
  s <- sum(tx[i, ] * coef[2:length(coef)])           # linear predictor without intercept
  p[i] <- exp(coef[1] + s) / (1 + exp(coef[1] + s))  # logistic function
}
bank_test$Prob <- p
```

Display a portion of the testing dataset

`tail(bank_test,5)`

```
## age job marital education default housing loan
## 41168 32 admin. married university.degree no yes no
## 41170 62 services married high.school no yes no
## 41172 33 student married professional.course no yes no
## 41179 62 retired married university.degree no no no
## 41183 29 unemployed single basic.4y no yes no
## contact month day_of_week duration campaign pdays previous
## 41168 cellular nov wed 236 3 999 0
## 41170 cellular nov wed 154 5 999 0
## 41172 telephone nov thu 112 1 999 0
## 41179 cellular nov thu 483 2 6 3
## 41183 cellular nov fri 112 1 9 1
## poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m
## 41168 nonexistent -1.1 94.767 -50.8 1.030
## 41170 nonexistent -1.1 94.767 -50.8 1.030
## 41172 nonexistent -1.1 94.767 -50.8 1.031
## 41179 success -1.1 94.767 -50.8 1.031
## 41183 success -1.1 94.767 -50.8 1.028
## nr.employed y Prob
## 41168 4963.6 no 0.4207892
## 41170 4963.6 no 0.3402788
## 41172 4963.6 yes 0.1583401
## 41179 4963.6 yes 0.9007417
## 41183 4963.6 no 0.5333138
```

Evaluate the overall accuracy of predicting whether a customer subscribed to a term deposit

```
bank_test$Pred <- ifelse(bank_test$Prob > 0.5, "yes", "no")
mean(bank_test$Pred == bank_test$y)  # proportion of correct predictions
```

`## [1] 0.9042243`
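Overall accuracy can be misleading when “yes” is rare; a confusion matrix separates the two error types. A sketch on invented labels (with the real data, `table(bank_test$Pred, bank_test$y)` gives the same layout):

```
truth <- c("no", "no", "no", "yes", "yes", "no", "yes", "no")
pred  <- c("no", "no", "yes", "yes", "no", "no", "yes", "no")
cm <- table(Predicted = pred, Actual = truth)
cm
sum(diag(cm)) / sum(cm)            # overall accuracy
cm["yes", "yes"] / sum(cm[, "yes"])  # recall on the "yes" class
```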

This session introduces logistic regression techniques for ordinal data. An ordinal variable is a categorical variable whose levels have a natural ordering.
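One standard tool for an ordered response is proportional-odds logistic regression, available as `polr` in the MASS package. A sketch on simulated data (the variable names are invented):

```
library(MASS)                           # provides polr
set.seed(1)
x <- rnorm(200)
z <- 1.5 * x + rlogis(200)              # latent scale with logistic noise
quality <- cut(z, breaks = c(-Inf, -1, 1, Inf),
               labels = c("low", "medium", "high"),
               ordered_result = TRUE)   # ordered factor, as polr requires
fit <- polr(quality ~ x)                # proportional-odds logistic regression
coef(fit)                               # slope on the latent log-odds scale
```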

The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine, and the task is to predict the quality of the wines as a score between 0 and 10. The input variables in the two datasets are as follows:

- 1 - fixed acidity
- 2 - volatile acidity
- 3 - citric acid
- 4 - residual sugar
- 5 - chlorides
- 6 - free sulfur dioxide
- 7 - total sulfur dioxide
- 8 - density
- 9 - pH
- 10 - sulphates
- 11 - alcohol

```
# the UCI wine-quality files are semicolon-delimited
red_wines <- read.csv("winequality-red.csv", header = TRUE, sep = ";", stringsAsFactors = FALSE)
white_wines <- read.csv("winequality-white.csv", header = TRUE, sep = ";", stringsAsFactors = FALSE)
red_wines$type <- "red"
white_wines$type <- "white"
wines <- rbind(red_wines,white_wines)
wines$type <- factor(wines$type)
head(wines,5)
```

```
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## quality type
## 1 5 red
## 2 5 red
## 3 5 red
## 4 6 red
## 5 5 red
```

Summary of the Data

`table(factor(wines$quality))`

```
##
## 3 4 5 6 7 8 9
## 30 216 2138 2836 1079 193 5
```

`sapply(wines[, 1:11], summary)`

```
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. 3.800 0.0800 0.0000 0.600
## 1st Qu. 6.400 0.2300 0.2500 1.800
## Median 7.000 0.2900 0.3100 3.000
## Mean 7.215 0.3397 0.3186 5.443
## 3rd Qu. 7.700 0.4000 0.3900 8.100
## Max. 15.900 1.5800 1.6600 65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## Min. 0.00900 1.00 6.0 0.9871 2.720
## 1st Qu. 0.03800 17.00 77.0 0.9923 3.110
## Median 0.04700 29.00 118.0 0.9949 3.210
## Mean 0.05603 30.53 115.7 0.9947 3.219
## 3rd Qu. 0.06500 41.00 156.0 0.9970 3.320
## Max. 0.61100 289.00 440.0 1.0390 4.010
## sulphates alcohol
## Min. 0.2200 8.00
## 1st Qu. 0.4300 9.50
## Median 0.5100 10.30
## Mean 0.5313 10.49
## 3rd Qu. 0.6000 11.30
## Max. 2.0000 14.90
```

`sapply(wines[, 1:11], sd)`

```
## fixed.acidity volatile.acidity citric.acid
## 1.296433758 0.164636474 0.145317865
## residual.sugar chlorides free.sulfur.dioxide
## 4.757803743 0.035033601 17.749399772
## total.sulfur.dioxide density pH
## 56.521854523 0.002998673 0.160787202
## sulphates alcohol
## 0.148805874 1.192711749
```

Display of Boxplots

```
ggplot(wines, aes(x = factor(quality), y = pH)) +  # quality is numeric; treat it as a factor for boxplots
  geom_boxplot(size = .75) +
  geom_jitter(alpha = .5) +
  facet_grid(~ type, margins = TRUE) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
```