# 1 Introduction

First, let’s start with the logistic function. A logistic function takes any real input $$t$$ and outputs a value between zero and one. Thus, its output can be interpreted as a probability.

$\sigma(t) =\frac{1}{1+e^{-t}}$

Next, replacing $$t$$ with a linear function of an explanatory variable $$x$$, e.g., $$t=\beta _{0}+\beta _{1}x$$, we obtain:

$p(x)={\frac {1}{1+e^{-(\beta _{0}+\beta _{1}x)}}}$

where $$p(x) = P(Y_{i}=1\mid x)$$ and $$e^{\beta _{0}+\beta _{1}x}$$ is the odds of the event. By definition, the odds is the probability that an event happens divided by the probability that it does not happen. In logistic regression, we model the log odds as a linear function:

$\log\big(\frac{p(x)}{1-p(x)}\big) = \beta _0 + \beta _{1}x$
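These two formulas can be checked numerically; a minimal R sketch (the function names `sigmoid` and `logit` are ours, not from any package):

```r
# Logistic (sigmoid) function: maps any real t into (0, 1)
sigmoid <- function(t) 1 / (1 + exp(-t))

sigmoid(0)  # 0.5: a log odds of 0 corresponds to even odds

# The log odds (logit) is the inverse of the sigmoid
logit <- function(p) log(p / (1 - p))
logit(sigmoid(2))  # recovers 2
```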

The model also extends to multiple explanatory variables:

$p(x)={\frac {1}{1+e^{-(\beta _{0}+\beta _{1}x_{1}+\beta _{2}x_{2}+\cdots +\beta _{m}x_{m})}}}$

The differences between linear regression and logistic regression:

• In linear regression, we model a continuous response, and the error term is assumed to be normally distributed with zero mean. (This assumption may not be appropriate in some problems.)
• In logistic regression, we model the probability of a binary response, which follows a Bernoulli distribution.

(Image from: www.machinelearningplus.com/logistic-regression-tutorial-examples-r/)

Several things to note:

• We no longer have an assumption of normality;

• Estimation for the $$\hat{\beta}_j$$ will come from maximum likelihood estimation assuming Bernoulli (or binomial) distributed responses;

• Holding all other predictor variables constant, a one-unit increase in the $$j$$-th predictor multiplies the odds by $$\exp(\hat{\beta}_j)$$;

• The odds ratio between two sets of conditions (values of the predictor variables) tells us how much greater (or smaller) the odds of an event are under one set of conditions than under another (e.g., how many times greater the odds are for females to encounter a specific event than for males, or how much the odds change when a continuous predictor increases by one unit);

• Confidence intervals for the $$\hat{\beta}_j$$ will be symmetric, but confidence intervals for the odds and odds ratios will not be, because they are obtained by exponentiating the endpoints of the symmetric intervals.
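As a quick illustration of the odds interpretation (using R’s built-in mtcars data, not the bank data analyzed below):

```r
# Probability that a car has a manual transmission (am = 1)
# as a function of weight, via unregularized logistic regression
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Each additional unit of weight multiplies the odds of a manual
# transmission by exp(beta_wt) -- a factor below 1 here, since
# heavier cars in this data tend to be automatic
exp(coef(fit)["wt"])
```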

Some extensions:

• Assuming binomial distributions instead of Bernoulli (recall that Bernoulli is a binomial with n=1);
• Softmax regression as a multi-class extension of logistic regression.
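Softmax replaces the two-class sigmoid with a vector of $$K$$ class probabilities; a minimal sketch (our own helper, not from a package):

```r
# Softmax: generalizes the logistic function to K classes;
# subtracting max(z) improves numerical stability
softmax <- function(z) {
  e <- exp(z - max(z))
  e / sum(e)
}

softmax(c(1, 2, 3))  # three class probabilities summing to 1
```

With $$K=2$$, `softmax(c(0, t))[2]` reduces to the logistic function of $$t$$.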

# 2 Example 1: Logistic regression for Bank Marketing Prediction

We use a Bank Marketing Dataset to predict whether a client will subscribe to a term deposit (binary: “yes”/“no”). With accurate predictions, the model can be used to select target clients and reduce marketing costs.

## 2.1 Data Introduction

Client Basic Information:

• 1 - age (numeric)
• 2 - job: type of job (categorical: “blue-collar”, “entrepreneur”, “student”, “services”, “unemployed”, …)
• 3 - marital: marital status (categorical: “divorced”, “married”, “single”, “unknown”)
• 4 - education (categorical: “basic9y”, “high.school”, “university.degree”, “illiterate”, …)
• 5 - default: has credit in default? (categorical: “no”,“yes”,“unknown”)
• 6 - housing: has housing loan? (categorical: “no”,“yes”,“unknown”)
• 7 - loan: has personal loan? (categorical: “no”,“yes”,“unknown”)

Last Contact Information:

• 8 - contact: contact communication type (categorical: “cellular”,“telephone”)
• 9 - month: last contact month of year (categorical: “jan”, “feb”, “mar”, …)
• 10 - day_of_week: last contact day of the week (categorical: “mon”, “tue”, “wed”, “thu”, “fri”)
• 11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

• 12 - campaign: number of contacts performed during this campaign and for this client (numeric)
• 13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
• 14 - previous: number of contacts performed before this campaign and for this client (numeric)
• 15 - poutcome: outcome of the previous marketing campaign (categorical: “failure”,“nonexistent”,“success”)

Social and Economic Context Attributes

• 16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
• 17 - cons.price.idx: consumer price index - monthly indicator (numeric)
• 18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
• 19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
• 20 - nr.employed: number of employees - quarterly indicator (numeric)

## 2.2 Experiments

library(dplyr)
library(dummies)
require(foreign)
require(ggplot2)
require(MASS)
require(Hmisc)
require(reshape2)
library(glmnet)
options(warn=-1) #suppress warnings
bank_additional_full <- read.csv("bank-additional-full.csv",header=TRUE)
colnames(bank_additional_full)
##  [1] "age"            "job"            "marital"        "education"
##  [5] "default"        "housing"        "loan"           "contact"
##  [9] "month"          "day_of_week"    "duration"       "campaign"
## [13] "pdays"          "previous"       "poutcome"       "emp.var.rate"
## [17] "cons.price.idx" "cons.conf.idx"  "euribor3m"      "nr.employed"
## [21] "y"

Divide into training and testing sets

sample_index = sample(1:nrow(bank_additional_full),floor(nrow(bank_additional_full)*0.80))
bank_train = bank_additional_full[sample_index,]
bank_test = bank_additional_full[-sample_index,]

Form the data matrices

x <- data.matrix(bank_train[,-ncol(bank_train)])
y <- data.matrix(bank_train[,ncol(bank_train)])
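Note that data.matrix encodes factor columns as their integer level codes (shown here on a toy data frame, not the bank data); the dummies package loaded earlier offers one-hot encoding as an alternative.

```r
# data.matrix() replaces each factor column with its integer level codes
toy <- data.frame(job = factor(c("admin.", "student", "admin.")))
data.matrix(toy)  # "admin." -> 1, "student" -> 2
```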

Use glmnet to fit an L1-regularized (lasso, alpha=1) logistic regression, with the penalty chosen by cross-validation

logmodel <- cv.glmnet(x,y, alpha=1,family="binomial")
coef <- as.matrix(coef(logmodel,s=logmodel$lambda.min))

Construct the predicted probability of “yes” for each test observation

tx <- data.matrix(bank_test[,-ncol(bank_test)])
p <- vector("numeric",length=0)
for (i in 1:nrow(tx)) {
  s <- sum(tx[i,]*coef[2:length(coef)])
  p[i] <- exp(coef[1] + s)/(1+exp(coef[1] + s))
}
bank_test$Prob <- p
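The loop above reproduces what glmnet’s built-in predict method computes; assuming the same `logmodel` and `tx` objects:

```r
# Equivalent via predict(): type = "response" returns P(y = "yes")
# directly on the probability scale
p2 <- predict(logmodel, newx = tx, s = "lambda.min", type = "response")
```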

Display a portion of the testing dataset

tail(bank_test,5)
##       age        job marital           education default housing loan
## 41168  32     admin. married   university.degree      no     yes   no
## 41170  62   services married         high.school      no     yes   no
## 41172  33    student married professional.course      no     yes   no
## 41179  62    retired married   university.degree      no      no   no
## 41183  29 unemployed  single            basic.4y      no     yes   no
##         contact month day_of_week duration campaign pdays previous
## 41168  cellular   nov         wed      236        3   999        0
## 41170  cellular   nov         wed      154        5   999        0
## 41172 telephone   nov         thu      112        1   999        0
## 41179  cellular   nov         thu      483        2     6        3
## 41183  cellular   nov         fri      112        1     9        1
##          poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m
## 41168 nonexistent         -1.1         94.767         -50.8     1.030
## 41170 nonexistent         -1.1         94.767         -50.8     1.030
## 41172 nonexistent         -1.1         94.767         -50.8     1.031
## 41179     success         -1.1         94.767         -50.8     1.031
## 41183     success         -1.1         94.767         -50.8     1.028
##       nr.employed   y      Prob
## 41168      4963.6  no 0.4207892
## 41170      4963.6  no 0.3402788
## 41172      4963.6 yes 0.1583401
## 41179      4963.6 yes 0.9007417
## 41183      4963.6  no 0.5333138

Evaluate the prediction accuracy, i.e., the proportion of test clients whose subscription outcome is correctly predicted at a 0.5 probability threshold

bank_test$Pred <- ifelse(bank_test$Prob > 0.5,"yes","no")
length(which(bank_test$Pred == bank_test$y))/nrow(bank_test)
## [1] 0.9042243
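Since “no” responses heavily outnumber “yes” in this data, overall accuracy can be optimistic; a confusion matrix on the same predictions gives more detail:

```r
# Rows: predicted class; columns: actual class
table(Predicted = bank_test$Pred, Actual = bank_test$y)
```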

# 3 Example 2: Logistic Regression on Ordinal Data

This section introduces logistic regression techniques for ordinal data. An ordinal variable is a categorical variable whose levels have a natural ordering.
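In R, ordinal variables are represented as ordered factors; a toy sketch (the levels here are hypothetical, not from the wine data):

```r
# An ordered factor: categorical levels with a natural ordering,
# so comparison operators are meaningful
rating <- factor(c("low", "high", "medium"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)
rating[1] < rating[3]  # TRUE: "low" < "medium"
```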

## 3.1 Data Introduction

The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine, and the task is to predict the quality of the wines, a score between 0 and 10. The input variables in the two datasets are listed below:

• 1 - fixed acidity
• 2 - volatile acidity
• 3 - citric acid
• 4 - residual sugar
• 5 - chlorides
• 6 - free sulfur dioxide
• 7 - total sulfur dioxide
• 8 - density
• 9 - pH
• 10 - sulphates
• 11 - alcohol

red_wines <- read.csv("winequality-red.csv",header=TRUE,stringsAsFactors=FALSE)
white_wines <- read.csv("winequality-white.csv",header=TRUE,stringsAsFactors=FALSE)

red_wines$type <- "red"
white_wines$type <- "white"

wines <- rbind(red_wines,white_wines)
wines$type <- factor(wines$type)
head(wines,5)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
##   quality type
## 1       5  red
## 2       5  red
## 3       5  red
## 4       6  red
## 5       5  red

Summary of the Data

table(factor(wines$quality))
##
##    3    4    5    6    7    8    9
##   30  216 2138 2836 1079  193    5
sapply(wines[, 1:11], summary)
##         fixed.acidity volatile.acidity citric.acid residual.sugar
## Min.            3.800           0.0800      0.0000          0.600
## 1st Qu.         6.400           0.2300      0.2500          1.800
## Median          7.000           0.2900      0.3100          3.000
## Mean            7.215           0.3397      0.3186          5.443
## 3rd Qu.         7.700           0.4000      0.3900          8.100
## Max.           15.900           1.5800      1.6600         65.800
##         chlorides free.sulfur.dioxide total.sulfur.dioxide density    pH
## Min.      0.00900                1.00                  6.0  0.9871 2.720
## 1st Qu.   0.03800               17.00                 77.0  0.9923 3.110
## Median    0.04700               29.00                118.0  0.9949 3.210
## Mean      0.05603               30.53                115.7  0.9947 3.219
## 3rd Qu.   0.06500               41.00                156.0  0.9970 3.320
## Max.      0.61100              289.00                440.0  1.0390 4.010
##         sulphates alcohol
## Min.       0.2200    8.00
## 1st Qu.    0.4300    9.50
## Median     0.5100   10.30
## Mean       0.5313   10.49
## 3rd Qu.    0.6000   11.30
## Max.       2.0000   14.90
sapply(wines[, 1:11], sd)
##        fixed.acidity     volatile.acidity          citric.acid
##          1.296433758          0.164636474          0.145317865
##       residual.sugar            chlorides  free.sulfur.dioxide
##          4.757803743          0.035033601         17.749399772
## total.sulfur.dioxide              density                   pH
##         56.521854523          0.002998673          0.160787202
##            sulphates              alcohol
##          0.148805874          1.192711749

Display of Boxplots

ggplot(wines, aes(x = factor(quality), y = pH)) +  # factor() gives one box per score
  geom_boxplot(size = .75) +
  geom_jitter(alpha = .5) +
  facet_grid(~ type, margins = TRUE) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))