What’s covered in this lecture?


1 Classification Problems

When the qualitative/categorical response is binary/dichotomous: 0/1, T/F, Yes/No.

rm(list = ls())
library(ElemStatLearn)
DataX = data.frame(x1=mixture.example$x[,1], x2=mixture.example$x[,2], 
                   y=mixture.example$y)
summary(DataX)
##        x1                 x2                 y      
##  Min.   :-2.52082   Min.   :-1.99985   Min.   :0.0  
##  1st Qu.:-0.07147   1st Qu.: 0.09555   1st Qu.:0.0  
##  Median : 0.85970   Median : 0.86139   Median :0.5  
##  Mean   : 0.78467   Mean   : 0.75602   Mean   :0.5  
##  3rd Qu.: 1.54344   3rd Qu.: 1.43527   3rd Qu.:1.0  
##  Max.   : 4.17075   Max.   : 2.85581   Max.   :1.0
ColMap1 = c(rgb(1,0,0,1), rgb(0,0,1,1))
ColMap2 = c(rgb(1,0,0,.2), rgb(0,0,1,.2))
plot(DataX$x1, DataX$x2, col=ColMap1[DataX$y+1], pch=19,
     xlab="x1", ylab="x2",
     main = "Two-class toy problem from ESLBook")

Multi-class examples: R/G/B, Iris species, 0~9 digits …

FrontMap = c(rgb(1,0,0,1), rgb(0,1,0,1), rgb(0,0,1,1))
BackMap = c(rgb(1,0,0,.2), rgb(0,1,0,0.2), rgb(0,0,1,.2))
plot(iris$Sepal.Length, iris$Sepal.Width, 
     pch=19, col=FrontMap[iris$Species],
     xlab="Sepal Length", ylab="Sepal Width", 
     main="Multi-class Problem from Iris dataset")

2 Logistic Regression

The logistic regression model takes a logit link function to link the probabilities with the linear predictor: \[ \log\left(\frac{p(\mathbf{x})}{1-p(\mathbf{x})}\right) \equiv \eta(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1} \equiv \mathbf{X}\boldsymbol{\beta} \] where \(p(\mathbf{x}) = \mbox{Pr}(Y=1|\mathbf{x})\) and \(\eta(\mathbf{x})\) here represents the log-odds. Other examples of link functions are the probit \(\eta = \Phi^{-1}(p)\), using the inverse normal CDF, and the cloglog (complementary log-log) \(\eta = \log(-\log(1-p))\). Conversely, we have \[ p = \frac{1}{1+e^{-\eta}} ~(\mbox{logit}), ~~ p = \Phi(\eta) ~(\mbox{probit}),~~ p = 1-e^{-e^{\eta}} ~(\mbox{cloglog}). \]

x = seq(-5, 5, length.out = 50)
matplot(x, cbind(1/(1+exp(-x)), pnorm(x), 1-exp(-exp(x))), 
        type='l', lty=1, col=c(2,3,4), lwd=2,
        xlab=expression(eta), ylab="p", 
        main="Link functions for binomial responses")
legend("topleft", c("logit", "probit", "cloglog"), 
       lty=1, col=c(2,3,4), lwd=2)
abline(h=0.5, lty=3)

The R:glm function can be used to fit binomial regression models; it uses the logit link by default.

fit_glm = glm(y~x1+x2, data = DataX, family = "binomial")
summary(fit_glm)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial", data = DataX)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.28489  -0.86579   0.05965   0.90614   1.88232  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.9780     0.2945  -3.321 0.000897 ***
## x1           -0.1344     0.1372  -0.980 0.327272    
## x2            1.3981     0.2316   6.035 1.59e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 277.26  on 199  degrees of freedom
## Residual deviance: 209.54  on 197  degrees of freedom
## AIC: 215.54
## 
## Number of Fisher Scoring iterations: 4
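
As a side comparison (a small sketch not in the original output), the probit and cloglog links introduced above can be requested explicitly through the family argument; the coefficients sit on different scales across links, but the fitted probabilities tend to be close to those from the logit fit:

# Refit the same model with the probit and cloglog links for comparison
fit_probit  = glm(y ~ x1 + x2, data = DataX, family = binomial(link = "probit"))
fit_cloglog = glm(y ~ x1 + x2, data = DataX, family = binomial(link = "cloglog"))
cor(cbind(logit = fitted(fit_glm), probit = fitted(fit_probit),
          cloglog = fitted(fit_cloglog)))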

2.1 Decision boundary

When \(p(\mathbf{x})=0.5\), the log odds \(\eta(\mathbf{x}) = 0\), so the decision boundary for logistic regression is given by \[\hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_{p-1} x_{p-1} = 0.\] The decision boundary in two-dimensional space can be visualized in two ways:

plot(DataX$x1, DataX$x2, col=ColMap1[DataX$y+1], pch=19,
     xlab="x1", ylab="x2",
     main = "Logistic Regression: Decision Boundary")
abline(-coef(fit_glm)[1]/coef(fit_glm)[3], -coef(fit_glm)[2]/coef(fit_glm)[3],
       lwd=2, lty=3, col=1)

xgrid = seq(min(DataX$x1),max(DataX$x1),length=301)
ygrid = seq(min(DataX$x2),max(DataX$x2),length=301)
pred_glm = function(x,y) round(predict(fit_glm, newdata=data.frame(x1=x,x2=y), type = "response"))
zgrid = outer(xgrid, ygrid, pred_glm)
image(xgrid, ygrid, zgrid, col=ColMap2,
      xlab = "x1",ylab = "x2")
points(DataX$x1, DataX$x2, col=ColMap1[DataX$y+1], pch=19)
title(main = "Logistic Regression for Classification")
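
As a quick follow-up sketch, a training confusion table (thresholding the fitted probabilities at 0.5) summarizes how well this linear boundary separates the two classes:

# Confusion table on the training data: observed vs. predicted class labels
table(observed = DataX$y, predicted = as.numeric(fitted(fit_glm) > 0.5))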

2.2 Parameter estimation

The logistic regression model parameters are estimated by the method of maximum likelihood estimation (MLE). Given the data of independent observations \(\{(\mathbf{x}_i, y_i)\}_{i=1}^n\), the likelihood function \(L(\boldsymbol\beta)\) and the negative log-likelihood function \(l(\boldsymbol\beta)\) for a model with parameter \(\boldsymbol\beta\) are defined as \[ \begin{aligned} L(\boldsymbol\beta) &= \prod_{i=1}^n p(y_i|\boldsymbol\beta)\\ l(\boldsymbol\beta) &= -\log L(\boldsymbol\beta) = -\sum_{i=1}^n \log p(y_i|\boldsymbol\beta) \equiv \sum_{i=1}^n l(y_i|\boldsymbol\beta). \end{aligned} \] The method of MLE estimates \(\boldsymbol\beta\) by maximizing \(L(\boldsymbol\beta)\), or equivalently minimizing \(l(\boldsymbol\beta)\).

For binomial responses \(y_i\) and the logistic model \(p_i=1/(1+e^{-\eta_i})\), we provide two negative log-likelihood formulations:

a). For each \(y_i\in\{0,1\}\) as in standard settings, it is a Bernoulli trial with success probability \(p_i\), so \[ \begin{aligned} l(y_i) & = -\log p_i^{y_i} (1-p_i)^{1-y_i} \\ & = -y_i\log p_i - (1-y_i)\log(1-p_i) \\ & = y_i\log(1+e^{-\eta_i}) + (1-y_i)\log(1+e^{\eta_i})\\ & = \begin{cases} \log(1+e^{-\eta_i}) & \mbox{if } y_i=1\\ \log(1+e^{\eta_i}) & \mbox{if } y_i=0 \end{cases}\\ & = \log(1+e^{\eta_i}) - y_i\eta_i \end{aligned} \] b). For each \(y_i\in\{-1,1\}\) as in machine learning settings,
\[ \begin{aligned} l(y_i) & = -\log p_i^{(1+y_i)/2} (1-p_i)^{(1-y_i)/2} = \cdots\\ & = \begin{cases} \log(1+e^{-\eta_i}) & \mbox{if } y_i=1\\ \log(1+e^{\eta_i}) & \mbox{if } y_i=-1 \end{cases}\\ & = \log(1+e^{-y_i\eta_i}) \end{aligned} \] where \(y_i\eta_i\) is widely used in margin-based machine learning techniques.

Exercise: complete the derivation in b).
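
As a quick numerical check of formulation a) (a small sketch using the fitted model above), summing \(\log(1+e^{\eta_i}) - y_i\eta_i\) over the observations should reproduce minus the log-likelihood reported by logLik and half the residual deviance:

# Negative log-likelihood of the fitted logistic model via formulation a)
eta_hat = predict(fit_glm, type = "link")                # linear predictors eta_i
nll = sum(log(1 + exp(eta_hat)) - DataX$y * eta_hat)     # sum of l(y_i)
c(nll = nll, minus.logLik = -as.numeric(logLik(fit_glm)),
  half.deviance = deviance(fit_glm)/2)                   # should all agree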

[Optional] Newton-Raphson algorithm: Setting to zero the derivatives of the negative log-likelihood with respect to the unknown parameters \(\boldsymbol{\beta}\), we obtain the score equations \[ \frac{\partial l(\boldsymbol\beta)}{\partial \boldsymbol\beta} = \sum_{i=1}^n\frac{-y_ie^{-y_i\eta_i}}{1+e^{-y_i\eta_i}}\frac{\partial \eta_i}{\partial \boldsymbol\beta} = -\sum_{i=1}^n y_i \big(1-\mbox{Pr}(Y_i=y_i|\mathbf{x}_i)\big)\mathbf{x}_i = \mathbf{0}. \] This can be solved through the Newton-Raphson algorithm, an iterative method with the updating rule \[ \boldsymbol\beta^{\rm new} = \boldsymbol\beta^{\rm old} -\left(\frac{\partial^2 l(\boldsymbol\beta)}{\partial \boldsymbol\beta\partial \boldsymbol\beta ^T} \right)^{-1} \frac{\partial l(\boldsymbol\beta)}{\partial \boldsymbol\beta} \] with the derivatives evaluated at \(\boldsymbol\beta^{\rm old}\).

One may reformulate the above Newton-Raphson steps as iteratively reweighted least squares (IRLS) problems. The optimization can also be carried out by stochastic gradient descent or coordinate descent methods. Such topics are often discussed in the context of optimization and are not covered here.
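
For illustration (a rough sketch, not the internal implementation of glm), the Newton-Raphson/IRLS iterations for the toy data can be coded in a few lines using the 0/1 formulation; the resulting estimates should match coef(fit_glm):

# A minimal Newton-Raphson (IRLS) sketch for the logistic model on DataX
Xmat = cbind(1, DataX$x1, DataX$x2)        # design matrix with intercept
yvec = DataX$y
beta = rep(0, ncol(Xmat))                  # starting value
for (iter in 1:25) {
  eta = drop(Xmat %*% beta)
  p = 1/(1 + exp(-eta))                    # fitted probabilities
  score = crossprod(Xmat, yvec - p)        # gradient of the log-likelihood
  hess = crossprod(Xmat, p*(1 - p) * Xmat) # negative Hessian of the log-likelihood
  step = solve(hess, score)
  beta = beta + drop(step)                 # Newton-Raphson update
  if (max(abs(step)) < 1e-8) break         # stop once the update is negligible
}
cbind(NewtonRaphson = beta, glm = coef(fit_glm))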

2.3 Goodness-of-fit

The GLM uses deviance as a measure of goodness-of-fit. It is twice the minimized negative log-likelihood, up to a constant. In the R output above,

  • Residual deviance measures the lack of fit of the current model; smaller is better.
  • Null deviance measures the lack of fit of the reduced model with only the intercept.

In addition, the AIC is another goodness-of-fit measure that also takes model complexity into account.
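
These quantities can all be extracted from the fitted glm object. As a small sketch (added here as a check), for 0/1 binomial data the saturated log-likelihood is zero, so the AIC equals the residual deviance plus twice the number of estimated coefficients:

# Deviance and AIC from the fitted logistic model
c(null.deviance = fit_glm$null.deviance,
  residual.deviance = deviance(fit_glm),
  AIC = AIC(fit_glm),
  deviance.plus.2k = deviance(fit_glm) + 2*length(coef(fit_glm)))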

3 Softmax Regression

Softmax regression is a multi-class extension of logistic regression. It is also known as the multinomial logit model for nominal responses (as opposed to ordinal responses, which are not discussed here). Taking class \(j=1\) as the baseline, \[ \begin{aligned} \mbox{Pr}(Y_i = j|\mathbf{x}_i) & = \frac{\exp(\eta_{ij})}{1+\sum_{k=2}^J\exp(\eta_{ik})}, \ \ j=2,\ldots,J\\ \mbox{Pr}(Y_i = 1|\mathbf{x}_i) & = 1-\sum_{j=2}^J \mbox{Pr}(Y_i = j|\mathbf{x}_i) \end{aligned} \] where the linear predictor \(\eta_{ij}=\mathbf{x}_i^T\boldsymbol\beta_j\).

To fit such a softmax regression, we may use the multinom function in the R:nnet package.

library(nnet)
fit_mnom = multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
## # weights:  12 (6 variable)
## initial  value 164.791843 
## iter  10 value 62.715967
## iter  20 value 59.808291
## iter  30 value 55.445984
## iter  40 value 55.375704
## iter  50 value 55.346472
## iter  60 value 55.301707
## iter  70 value 55.253532
## iter  80 value 55.243230
## iter  90 value 55.230241
## iter 100 value 55.212479
## final  value 55.212479 
## stopped after 100 iterations
summary(fit_mnom)
## Call:
## multinom(formula = Species ~ Sepal.Length + Sepal.Width, data = iris)
## 
## Coefficients:
##            (Intercept) Sepal.Length Sepal.Width
## versicolor   -92.09924     40.40326   -40.58755
## virginica   -105.10096     42.30094   -40.18799
## 
## Std. Errors:
##            (Intercept) Sepal.Length Sepal.Width
## versicolor    26.27831     9.142717    27.77772
## virginica     26.37025     9.131119    27.78875
## 
## Residual Deviance: 110.425 
## AIC: 122.425
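
To connect the fitted coefficients back to the softmax formula above, a small check (added here as a sketch) recomputes the class probabilities by hand, with setosa as the baseline class, and compares them with predict(fit_mnom, type = "probs"):

# Recompute softmax probabilities from the fitted coefficients (baseline: setosa)
B = coef(fit_mnom)                                   # 2 x 3: versicolor, virginica rows
Xiris = cbind(1, iris$Sepal.Length, iris$Sepal.Width)
eta_iris = Xiris %*% t(B)                            # linear predictors eta_ij, j = 2, 3
prob_manual = cbind(1, exp(eta_iris)) / (1 + rowSums(exp(eta_iris)))
colnames(prob_manual) = levels(iris$Species)
max(abs(prob_manual - predict(fit_mnom, type = "probs")))  # should be essentially zero
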
irisxgrid = seq(min(iris$Sepal.Length),max(iris$Sepal.Length),length=201)
irisygrid = seq(min(iris$Sepal.Width),max(iris$Sepal.Width),length=201)
tmpfun = function(x,y) as.numeric(predict(fit_mnom, newdata=data.frame(Sepal.Length=x,Sepal.Width=y)))

iriszgrid = outer(irisxgrid, irisygrid, tmpfun)
image(irisxgrid, irisygrid, iriszgrid, col=BackMap,
      xlab = "Sepal.Length",ylab = "Sepal.Width")
points(iris$Sepal.Length, iris$Sepal.Width, col=FrontMap[iris$Species], pch=19)
title(main = "Softmax Classifier for Iris Dataset")

4 Two Other Methods

4.1 Discriminant Analysis

For multi-class observations, suppose each class of observations follows a multivariate Gaussian density \[ f_j(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^p| \Sigma_j|}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\mu_j)^T \Sigma_j^{-1}(\mathbf{x}-\mu_j)\right\} \] Using the Bayes theorem and prior probabilities \(\pi_j\), we obtain the following decision functions:

  1. Linear discriminant function, assuming all classes share a common covariance matrix \(\Sigma_j = \Sigma\): \[ \delta_j(\mathbf{x}) = \mathbf{x}^T\Sigma^{-1}\mu_j - \frac{1}{2}\mu_j^T\Sigma^{-1}\mu_j + \log\pi_j \]
  2. Quadratic discriminant function, without assuming equal covariance matrices across classes:
    \[ \delta_j(\mathbf{x}) = - \frac{1}{2}(\mathbf{x}-\mu_j)^T\Sigma_j^{-1}(\mathbf{x}-\mu_j) -\frac{1}{2}\log|\Sigma_j| + \log\pi_j \]

Exercise: Compare the two discriminant functions in terms of their polynomial orders in \(\mathbf{x}\).

library(MASS)
fit_lda = lda(y~x1+x2, data = DataX)
pred_lda = function(x,y) as.numeric(predict(fit_lda, newdata=data.frame(x1=x,x2=y))$class)
fit_qda = qda(y~x1+x2, data = DataX)
pred_qda = function(x,y) as.numeric(predict(fit_qda, newdata=data.frame(x1=x,x2=y))$class)

par(mfrow=c(1,2))
zgrid = outer(xgrid, ygrid, pred_lda)
image(xgrid, ygrid, zgrid, col=ColMap2,
      xlab = "x1",ylab = "x2")
points(DataX$x1, DataX$x2, col=ColMap1[DataX$y+1], pch=19)
title(main = "LDA Classifier")

zgrid = outer(xgrid, ygrid, pred_qda)
image(xgrid, ygrid, zgrid, col=ColMap2,
      xlab = "x1",ylab = "x2")
points(DataX$x1, DataX$x2, col=ColMap1[DataX$y+1], pch=19)
title(main = "QDA Classifier")
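
As a sanity check on the linear discriminant formula above (a small sketch using the sample class means, pooled covariance, and empirical priors; MASS::lda parameterizes things differently internally, but the induced classifier is the same), the hand-computed scores \(\delta_j(\mathbf{x})\) should assign the same labels as predict(fit_lda):

# Hand-compute the linear discriminant scores delta_j(x) for the two classes of
# DataX and compare the implied labels with those from MASS::lda above.
Xm = as.matrix(DataX[, c("x1","x2")])
mu0 = colMeans(Xm[DataX$y == 0, ]); mu1 = colMeans(Xm[DataX$y == 1, ])   # class means
n0 = sum(DataX$y == 0); n1 = sum(DataX$y == 1)
Spool = (cov(Xm[DataX$y == 0, ])*(n0-1) + cov(Xm[DataX$y == 1, ])*(n1-1)) / (n0+n1-2)
prior0 = n0/(n0+n1); prior1 = n1/(n0+n1)                                 # empirical priors
d0 = drop(Xm %*% solve(Spool, mu0)) - 0.5*drop(mu0 %*% solve(Spool, mu0)) + log(prior0)
d1 = drop(Xm %*% solve(Spool, mu1)) - 0.5*drop(mu1 %*% solve(Spool, mu1)) + log(prior1)
table(manual = as.numeric(d1 > d0), lda = predict(fit_lda)$class)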