What’s covered in this lecture?

• Classification problems (with examples)
• Logistic regression and Softmax regression …
• Other classifiers: LDA/QDA, KNN …
• Classification errors and measures …

# 1 Classification Problems

A classification problem arises when the qualitative/categorical response is binary/dichotomous: 0/1, T/F, Yes/No

rm(list = ls())
library(ElemStatLearn)
## Warning: package 'ElemStatLearn' was built under R version 3.4.3
DataX = data.frame(x1=mixture.example$x[,1], x2=mixture.example$x[,2],
                   y=mixture.example$y)
summary(DataX)
##        x1                 x2                y      
##  Min.   :-2.52082   Min.   :-1.99985   Min.   :0.0  
##  1st Qu.:-0.07147   1st Qu.: 0.09555   1st Qu.:0.0  
##  Median : 0.85970   Median : 0.86139   Median :0.5  
##  Mean   : 0.78467   Mean   : 0.75602   Mean   :0.5  
##  3rd Qu.: 1.54344   3rd Qu.: 1.43527   3rd Qu.:1.0  
##  Max.   : 4.17075   Max.   : 2.85581   Max.   :1.0  
ColMap1 = c(rgb(1,0,0,1), rgb(0,0,1,1))
ColMap2 = c(rgb(1,0,0,.2), rgb(0,0,1,.2))
plot(DataX$x1, DataX$x2, col=ColMap1[DataX$y+1], pch=19,
     xlab="x1", ylab="x2",
     main = "Two-class toy problem from ESLBook")

Multi-class examples: R/G/B, Iris species, 0~9 digits …

FrontMap = c(rgb(1,0,0,1), rgb(0,1,0,1), rgb(0,0,1,1))
BackMap = c(rgb(1,0,0,.2), rgb(0,1,0,0.2), rgb(0,0,1,.2))
plot(iris$Sepal.Length, iris$Sepal.Width,
     pch=19, col=FrontMap[iris$Species],
     xlab="Sepal Length", ylab="Sepal Width",
     main="Multi-class Problem from Iris dataset")

# 2 Logistic Regression

The logistic regression model takes a logit link function to link the probabilities with the linear predictor: $\log\left(\frac{p(\mathbf{x})}{1-p(\mathbf{x})}\right) \equiv \eta(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_{p-1} x_{p-1} \equiv \mathbf{X}\boldsymbol{\beta}$ where $$p(\mathbf{x}) = \mbox{Pr}(Y=1|\mathbf{x})$$ and $$\eta(\mathbf{x})$$ here represents the log-odds. Other examples of link functions are the probit $$\eta = \Phi^{-1}(p)$$, using the inverse normal CDF, and the cloglog (complementary log-log) $$\eta = \log(-\log(1-p))$$. Conversely, we have $p = \frac{1}{1+e^{-\eta}} ~(\mbox{logit}), ~~ p = \Phi(\eta) ~(\mbox{probit}),~~ p = 1-e^{-e^{\eta}} ~(\mbox{cloglog}).$

x = seq(-5, 5, length.out = 50)
matplot(x, cbind(1/(1+exp(-x)), pnorm(x), 1-exp(-exp(x))), type='l',
        lty=1, col=c(2,3,4), lwd=2, xlab=expression(eta), ylab="p",
        main="Link functions for binomial responses")
legend("topleft", c("logit", "probit", "cloglog"), lty=1, col=c(2,3,4), lwd=2)
abline(h=0.5, lty=3)

The R:glm function can be used for fitting binomial regression models; it uses the logit link by default.

fit_glm = glm(y~x1+x2, data = DataX, family = "binomial")
summary(fit_glm)
## 
## Call:
## glm(formula = y ~ x1 + x2, family = "binomial", data = DataX)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.28489  -0.86579   0.05965   0.90614   1.88232  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.9780     0.2945  -3.321 0.000897 ***
## x1           -0.1344     0.1372  -0.980 0.327272    
## x2            1.3981     0.2316   6.035 1.59e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 277.26  on 199  degrees of freedom
## Residual deviance: 209.54  on 197  degrees of freedom
## AIC: 215.54
## 
## Number of Fisher Scoring iterations: 4

## 2.1 Decision boundary

When $$p(\mathbf{x})=0.5$$, the log-odds $$\eta(\mathbf{x}) = 0$$, so the decision boundary for logistic regression is given by $\hat\beta_0 + \hat\beta_1 x_1 + \cdots + \hat\beta_{p-1} x_{p-1} = 0.$ The decision boundary in two-dimensional space can be visualized in two ways:

plot(DataX$x1, DataX$x2, col=ColMap1[DataX$y+1], pch=19,
     xlab="x1", ylab="x2",
     main = "Logistic Regression: Decision Boundary")
abline(-coef(fit_glm)[1]/coef(fit_glm)[3], -coef(fit_glm)[2]/coef(fit_glm)[3],
lwd=2, lty=3, col=1)
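To see where the abline arguments come from: solving $$\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 = 0$$ for $$x_2$$ gives intercept $$-\hat\beta_0/\hat\beta_2$$ and slope $$-\hat\beta_1/\hat\beta_2$$. A minimal sketch on simulated data (the dataset and fit here are hypothetical; any fitted binomial glm works the same way):

```r
# Points on the line x2 = -b0/b2 - (b1/b2)*x1 satisfy eta(x) = 0,
# hence have fitted probability exactly 0.5.
set.seed(2)
d = data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y = rbinom(200, 1, 1/(1 + exp(-(1 + 2*d$x1 - 3*d$x2))))
fit = glm(y ~ x1 + x2, data = d, family = binomial)
b = coef(fit)
grid = data.frame(x1 = seq(-2, 2, length.out = 5))
grid$x2 = -b[1]/b[3] - (b[2]/b[3])*grid$x1      # points on the fitted boundary
predict(fit, newdata = grid, type = "response")  # all equal to 0.5
```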

xgrid = seq(min(DataX$x1),max(DataX$x1),length=301)
ygrid = seq(min(DataX$x2),max(DataX$x2),length=301)
pred_glm = function(x,y) round(predict(fit_glm, newdata=data.frame(x1=x,x2=y), type = "response"))
zgrid = outer(xgrid, ygrid, pred_glm)
image(xgrid, ygrid, zgrid, col=ColMap2,
xlab = "x1",ylab = "x2")
points(DataX$x1, DataX$x2, col=ColMap1[DataX$y+1], pch=19)
title(main = "Logistic Regression for Classification")

## 2.2 Parameter estimation

The logistic regression model parameters are estimated by the method of maximum likelihood estimation (MLE). Given the data of independent observations $$\{(\mathbf{x}_i, y_i)\}_{i=1}^n$$, the likelihood function $$L(\boldsymbol\beta)$$ and the negative log-likelihood function $$l(\boldsymbol\beta)$$ for a model with parameter $$\boldsymbol\beta$$ are defined as \begin{aligned} L(\boldsymbol\beta) &= \prod_{i=1}^n p(y_i|\boldsymbol\beta)\\ l(\boldsymbol\beta) &= -\log L(\boldsymbol\beta) = -\sum_{i=1}^n \log p(y_i|\boldsymbol\beta) \equiv \sum_{i=1}^n l(y_i|\boldsymbol\beta). \end{aligned} The method of MLE is to estimate $$\boldsymbol\beta$$ by maximizing $$L(\boldsymbol\beta)$$, or equivalently minimizing $$l(\boldsymbol\beta)$$. For binomial responses $$y_i$$ and the logistic model $$p_i=1/(1+e^{-\eta_i})$$, we provide two formulations of the negative log-likelihood:

a). For each $$y_i\in\{0,1\}$$ as in standard settings, it is a Bernoulli trial with success probability $$p_i$$, so \begin{aligned} l(y_i) & = -\log p_i^{y_i} (1-p_i)^{1-y_i} \\ & = -y_i\log p_i - (1-y_i)\log(1-p_i) \\ & = y_i\log(1+e^{-\eta_i}) + (1-y_i)\log(1+e^{\eta_i})\\ & = \begin{cases} \log(1+e^{-\eta_i}) & \mbox{if } y_i=1\\ \log(1+e^{\eta_i}) & \mbox{if } y_i=0 \end{cases}\\ & = \log(1+e^{\eta_i}) - y_i\eta_i \end{aligned}

b). For each $$y_i\in\{-1,1\}$$ as in machine learning settings, \begin{aligned} l(y_i) & = -\log p_i^{(1+y_i)/2} (1-p_i)^{(1-y_i)/2} = \cdots\\ & = \begin{cases} \log(1+e^{-\eta_i}) & \mbox{if } y_i=1\\ \log(1+e^{\eta_i}) & \mbox{if } y_i=-1 \end{cases}\\ & = \log(1+e^{-y_i\eta_i}) \end{aligned} where the margin $$y_i\eta_i$$ is widely used in margin-based machine learning techniques.

Exercise: complete the derivation in b).
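The two formulations can be checked numerically: evaluated at the MLE, both should equal $$-\log L(\hat{\boldsymbol\beta})$$ and half the residual deviance. A sketch using a logistic fit on the built-in mtcars data (a stand-in dataset so the snippet is self-contained; any binomial glm works):

```r
# Both negative log-likelihood formulations evaluated at the fitted model
# should equal -logLik(fit), and twice that equals the residual deviance.
fit = glm(am ~ wt + hp, data = mtcars, family = binomial)
eta = predict(fit, type = "link")        # eta_i = x_i' beta-hat
y   = mtcars$am                          # labels in {0, 1}
nll_a = sum(log(1 + exp(eta)) - y*eta)   # formulation a)
y_pm  = 2*y - 1                          # recode labels to {-1, +1}
nll_b = sum(log(1 + exp(-y_pm*eta)))     # formulation b)
c(nll_a, nll_b, -as.numeric(logLik(fit)), deviance(fit)/2)  # all four agree
```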
[Optional] Newton-Raphson algorithm: Setting to zero the derivatives of the negative log-likelihood with respect to the unknown parameters $$\boldsymbol{\beta}$$, we obtain the score equations $\frac{\partial l(\boldsymbol\beta)}{\partial \boldsymbol\beta} = \sum_{i=1}^n\frac{-y_ie^{-y_i\eta_i}}{1+e^{-y_i\eta_i}}\frac{\partial \eta_i}{\partial \boldsymbol\beta} = -\sum_{i=1}^n y_i (1-p(\mathbf{x}_i))\mathbf{x}_i = 0.$ These can be solved through the Newton-Raphson algorithm, an iterative method with the updating rule $\boldsymbol\beta^{\rm new} = \boldsymbol\beta^{\rm old} -\left(\frac{\partial^2 l(\boldsymbol\beta)}{\partial \boldsymbol\beta\partial \boldsymbol\beta ^T} \right)^{-1} \frac{\partial l(\boldsymbol\beta)}{\partial \boldsymbol\beta}$ with the derivatives evaluated at $$\boldsymbol\beta^{\rm old}$$. One may reformulate the above Newton-Raphson steps as iteratively reweighted least squares (IRLS) problems. The score equations can also be solved through stochastic gradient descent and coordinate descent methods. Such topics are often discussed in the context of optimization and are not covered here.

## 2.3 Goodness-of-fit

The GLM uses deviance as a measure of goodness-of-fit. It is twice the minimized negative log-likelihood, up to a constant. In the R output above,

• Residual deviance is a lack-of-fit measure for the current model; smaller is better.
• Null deviance is a lack-of-fit measure for the reduced model with only the intercept.

Besides, the AIC is another goodness-of-fit measure that takes the model complexity into account.

# 3 Softmax Regression

Softmax regression is a multi-class extension of logistic regression. It is also known as the multinomial logit model for nominal responses (as opposed to ordinal responses, which are not discussed here).
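The Newton-Raphson update above can be sketched in a few lines. A minimal implementation on simulated data (all variable names hypothetical), compared against glm() for verification:

```r
# Newton-Raphson for logistic regression: for the negative log-likelihood,
# the gradient is -X'(y - p) and the Hessian is X'WX with W = diag(p(1-p)),
# so each step is beta <- beta + (X'WX)^{-1} X'(y - p).
set.seed(1)
n = 500
X = cbind(1, rnorm(n), rnorm(n))        # design matrix with intercept
beta_true = c(-1, 2, 0.5)
y = rbinom(n, 1, 1/(1 + exp(-X %*% beta_true)))
beta = rep(0, ncol(X))
for (iter in 1:25) {
  eta = as.vector(X %*% beta)
  p   = 1/(1 + exp(-eta))
  W   = p * (1 - p)                     # IRLS weights from the Hessian
  step = solve(t(X) %*% (W * X), t(X) %*% (y - p))
  beta = beta + as.vector(step)
  if (max(abs(step)) < 1e-10) break     # stop when the update is negligible
}
fit = glm(y ~ X - 1, family = binomial)
max(abs(beta - coef(fit)))              # agrees with glm to numerical precision
```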
By taking the class $$j=1$$ as the baseline, \begin{aligned} \mbox{Pr}(Y_i = j|\mathbf{x}_i) & = \frac{\exp(\eta_{ij})}{1+\sum_{k=2}^J\exp(\eta_{ik})}, \ \ j=2,\ldots,J\\ \mbox{Pr}(Y_i = 1|\mathbf{x}_i) & = 1-\sum_{j=2}^J \mbox{Pr}(Y_i = j|\mathbf{x}_i) \end{aligned} where the linear predictor $$\eta_{ij}=\mathbf{x}_i^T\boldsymbol\beta_j$$. To fit such a softmax regression, we may use the multinom function in the R:nnet package.

library(nnet)
## Warning: package 'nnet' was built under R version 3.4.3
fit_mnom = multinom(Species ~ Sepal.Length + Sepal.Width, data = iris)
## # weights:  12 (6 variable)
## initial  value 164.791843 
## iter  10 value 62.715967
## iter  20 value 59.808291
## iter  30 value 55.445984
## iter  40 value 55.375704
## iter  50 value 55.346472
## iter  60 value 55.301707
## iter  70 value 55.253532
## iter  80 value 55.243230
## iter  90 value 55.230241
## iter 100 value 55.212479
## final  value 55.212479 
## stopped after 100 iterations
summary(fit_mnom)
## Call:
## multinom(formula = Species ~ Sepal.Length + Sepal.Width, data = iris)
## 
## Coefficients:
##            (Intercept) Sepal.Length Sepal.Width
## versicolor   -92.09924     40.40326   -40.58755
## virginica   -105.10096     42.30094   -40.18799
## 
## Std. Errors:
##            (Intercept) Sepal.Length Sepal.Width
## versicolor    26.27831     9.142717    27.77772
## virginica     26.37025     9.131119    27.78875
## 
## Residual Deviance: 110.425 
## AIC: 122.425

irisxgrid = seq(min(iris$Sepal.Length),max(iris$Sepal.Length),length=301)
irisygrid = seq(min(iris$Sepal.Width),max(iris$Sepal.Width),length=301)
tmpfun = function(x,y) as.numeric(predict(fit_mnom, newdata=data.frame(Sepal.Length=x,Sepal.Width=y)))
iriszgrid = outer(irisxgrid, irisygrid, tmpfun)
image(irisxgrid, irisygrid, iriszgrid, col=BackMap,
      xlab = "Sepal.Length",ylab = "Sepal.Width")
points(iris$Sepal.Length, iris$Sepal.Width, col=FrontMap[iris$Species], pch=19)
title(main = "Softmax Classifier for Iris Dataset")
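The softmax probability formulas can be verified against the fitted model: recomputing $$\mbox{Pr}(Y_i=j|\mathbf{x}_i)$$ directly from the estimated coefficients should reproduce the fitted probabilities. A sketch (the model is refitted here, with trace suppressed, so the snippet is self-contained):

```r
# Rebuild the softmax probabilities from the multinom coefficients and
# compare with fitted(fit): both use setosa (j = 1) as the baseline class.
library(nnet)
fit = multinom(Species ~ Sepal.Length + Sepal.Width, data = iris, trace = FALSE)
B = coef(fit)                        # 2 x 3: rows versicolor, virginica
X = cbind(1, iris$Sepal.Length, iris$Sepal.Width)
eta = X %*% t(B)                     # eta_ij for j = 2, 3
denom = 1 + rowSums(exp(eta))        # 1 + sum_k exp(eta_ik)
P = cbind(1/denom, exp(eta)/denom)   # Pr(setosa), Pr(versicolor), Pr(virginica)
max(abs(P - fitted(fit)))            # agrees with the fitted probabilities
```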

# 4 Two Other Methods

## 4.1 Discriminant Analysis

For multi-class observations, suppose each class of observations follows a multivariate Gaussian density $f_j(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^p| \Sigma_j|}}\exp\left\{-\frac{1}{2}(\mathbf{x}-\mu_j)^T \Sigma_j^{-1}(\mathbf{x}-\mu_j)\right\}.$ Using the Bayes theorem and prior probabilities $$\pi_j$$, we may derive the following decision functions

1. Linear discriminant function, when assuming all classes have equal covariance matrices $$\Sigma_j = \Sigma$$: $\delta_j(\mathbf{x}) = \mathbf{x}^T\Sigma^{-1}\mu_j - \frac{1}{2}\mu_j^T\Sigma^{-1}\mu_j + \log\pi_j$
2. Quadratic discriminant function, without assuming equal covariance matrices across classes:
$\delta_j(\mathbf{x}) = - \frac{1}{2}(\mathbf{x}-\mu_j)^T\Sigma_j^{-1}(\mathbf{x}-\mu_j) -\frac{1}{2}\log|\boldsymbol\Sigma_j| + \log\pi_j$

Exercise: Compare the two discriminant functions in terms of their polynomial orders.
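The linear discriminant function can also be computed by hand from class means, a pooled covariance matrix, and priors, and checked against MASS::lda. A sketch on the iris data (so the snippet is self-contained; variable names hypothetical):

```r
# Hand computation of delta_j(x) = x' S^{-1} mu_j - 0.5 mu_j' S^{-1} mu_j + log pi_j
library(MASS)
X = as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])
y = iris$Species
mus = t(sapply(levels(y), function(j) colMeans(X[y == j, ])))   # class means
# Pooled within-class covariance (the equal-covariance assumption), divisor n - J
S = Reduce(`+`, lapply(levels(y), function(j) {
  Xj = scale(X[y == j, ], center = TRUE, scale = FALSE)
  t(Xj) %*% Xj
})) / (nrow(X) - nlevels(y))
Sinv = solve(S)
pis = as.numeric(table(y)) / length(y)                          # priors
delta = sapply(1:nlevels(y), function(j)
  X %*% Sinv %*% mus[j, ] - 0.5*drop(mus[j, ] %*% Sinv %*% mus[j, ]) + log(pis[j]))
pred_manual = levels(y)[max.col(delta, ties.method = "first")]  # argmax_j delta_j
fit_lda = lda(Species ~ Sepal.Length + Sepal.Width, data = iris)
mean(pred_manual == as.character(predict(fit_lda)$class))       # should be 1
```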

library(MASS)
## Warning: package 'MASS' was built under R version 3.4.3
fit_lda = lda(y~x1+x2, data = DataX)
pred_lda = function(x,y) as.numeric(predict(fit_lda, newdata=data.frame(x1=x,x2=y))$class)
fit_qda = qda(y~x1+x2, data = DataX)
pred_qda = function(x,y) as.numeric(predict(fit_qda, newdata=data.frame(x1=x,x2=y))$class)

par(mfrow=c(1,2))
zgrid = outer(xgrid, ygrid, pred_lda)
image(xgrid, ygrid, zgrid, col=ColMap2,
xlab = "x1",ylab = "x2")
points(DataX$x1, DataX$x2, col=ColMap1[DataX$y+1], pch=19)
title(main = "LDA Classifier")

zgrid = outer(xgrid, ygrid, pred_qda)
image(xgrid, ygrid, zgrid, col=ColMap2,
      xlab = "x1",ylab = "x2")
points(DataX$x1, DataX$x2, col=ColMap1[DataX$y+1], pch=19)
title(main = "QDA Classifier")