---
title: 'Lecture 6: Regularization'
author: '[Dr. Aijun Zhang](http://www.statsoft.org) - [STAT3612 Data Mining](http://www.statsoft.org/teaching/stat3612/)'
date: "Feb 20 & 23, 2017"
output:
  html_document:
    highlight: tango
    number_sections: yes
    theme: paper
    toc: yes
    toc_depth: 3
  html_notebook: default
---
```{r setup, include=FALSE}
options(width=100)
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
**Regularization, a journey from smoothness to sparsity ... **
```{r echo=FALSE, fig.align="center", out.width = "180px", eval=T, fig.cap=c("Grace Wahba: Spline Models for Observational Data (1990)")}
knitr::include_graphics(c('GraceWahba.jpg'))
```
```{r echo=FALSE, fig.align="center", out.width = "600px", eval=T, fig.cap=c("Trevor Hastie, Robert Tibshirani, Jerome Friedman: The Elements of Statistical Learning (2001; 2009)")}
knitr::include_graphics(c('HastieTibshiraniFriedman.png'))
```
**Regularization**, in mathematics and statistics, and particularly in the fields of machine learning and inverse problems, refers to the process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting. From [Wikipedia](https://en.wikipedia.org/wiki/Regularization_(mathematics)).
The general form of regularization problem:
$$
\min_{f\in \mathcal{H}}\sum_{i=1}^n L(y_i, f(x_i)) + \lambda \Omega(f)
$$
with the loss function $L(y,f(x))$ (e.g. square loss), the penalty functional $\Omega(f)$ (typically on the model complexity, e.g. smoothness/sparsity), and the hyperparameter (a.k.a. smoothing parameter) $\lambda\geq 0$ that controls the importance of the regularization term (trading off goodness-of-fit against smoothness).
***
# Smoothing Spline
```{r fig.width=8, fig.asp=0.7, fig.align="center"}
ffun = function(x) exp(-(x-3)^2)
x = seq(0.1,4,by=0.1)
set.seed(2017)
y = ffun(x) + 0.1*rnorm(length(x))
plot(x,y, pch=19, main="Smoothing Spline")
lines(smooth.spline(x, y, spar = 0.95), col=4)
lines(smooth.spline(x, y, spar = 0.05), col=3)
tmp = smooth.spline(x, y, cv = TRUE)
lines(tmp, col=2)
legend("topleft", paste("spar =", c(0.95, 0.05, round(tmp$spar,2)), ":",
                        c("Underfitting", "Overfitting", "Cross-Validated")),
       col = c(4,3,2), lty=1)
```
The smoothing spline has become a basic tool for nonparametric smoothing. Unlike regression splines with knot selection (controlling the number of basis functions), the smoothing spline uses many basis functions by taking each distinct observation as a knot, then controls the degree of smoothness by penalizing the roughness/curvature of the target function:
$$
\min_{f\in\mathcal{H}} \sum_{i=1}^n[y_i - f(x_i)]^2+ \lambda \int |f''(u)|^2du
$$
* When $\lambda = 0$, there is no smoothing and the problem reduces to interpolation;
* When $\lambda = \infty$, the penalty forces $f''(x)=0$, so the minimizer is a straight line (the least squares fit).
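The $\lambda = \infty$ limit can be checked numerically: requesting two effective degrees of freedom from `smooth.spline` (via its `df` argument) should roughly reproduce the least squares line. A small sketch on the simulated data above:

```r
# As lambda -> infinity the smoothing spline tends to the least squares line;
# df = 2 effective degrees of freedom corresponds to that limit.
set.seed(2017)
x = seq(0.1, 4, by = 0.1)
y = exp(-(x - 3)^2) + 0.1 * rnorm(length(x))
fit_ss = smooth.spline(x, y, df = 2)
fit_lm = lm(y ~ x)
max(abs(predict(fit_ss, x)$y - fitted(fit_lm)))  # small discrepancy
```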
It is known that the unique minimizer is a natural cubic spline with knots at the unique $x_i$, so we may write $f(x) = \boldsymbol\beta^T\boldsymbol{\phi}(x)$ through use of B-spline bases. Then, the regularized optimization problem becomes
$$
\min_{\boldsymbol\beta} (\mathbf{y}-\boldsymbol\Phi\boldsymbol\beta)^T(\mathbf{y}-\boldsymbol\Phi\boldsymbol\beta) +
\lambda \boldsymbol\beta^T\boldsymbol\Omega\boldsymbol\beta
$$
where $\Omega_{ij} = \int \ddot{\phi}_i(x)\ddot{\phi}_j(x)dx$. The solution is a generalized ridge estimator (to be discussed below):
$$
\boldsymbol{\hat\beta} = (\boldsymbol\Phi^T\boldsymbol\Phi + \lambda \boldsymbol\Omega)^{-1}\boldsymbol\Phi^T\mathbf{y}
$$
and the smooth prediction is given by
$$
\hat{\mathbf{f}} = \boldsymbol\Phi(\boldsymbol\Phi^T\boldsymbol\Phi + \lambda \boldsymbol\Omega)^{-1}\boldsymbol\Phi^T\mathbf{y} \equiv \mathbf{S}_\lambda \mathbf{y}
$$
Here, $\mathbf{S}_\lambda = \boldsymbol\Phi(\boldsymbol\Phi^T\boldsymbol\Phi + \lambda \boldsymbol\Omega)^{-1}\boldsymbol\Phi^T$ is called the smoothing matrix. Compare it to the hat matrix in linear regression.
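We can check this numerically: `smooth.spline` returns the diagonal entries of $\mathbf{S}_\lambda$ as the leverages `$lev` (see `?smooth.spline`), and their sum $\mathrm{tr}(\mathbf{S}_\lambda)$ should match the reported effective degrees of freedom `$df`. A quick sketch:

```r
# tr(S_lambda) = sum of leverages = effective degrees of freedom of the fit
set.seed(2017)
x = seq(0.1, 4, by = 0.1)
y = exp(-(x - 3)^2) + 0.1 * rnorm(length(x))
fit = smooth.spline(x, y, spar = 0.5)
c(df = fit$df, trace = sum(fit$lev))  # the two entries should agree
```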
***
## Regularization Paths
For different choices of the smoothing parameter, the smoothing spline solutions are all natural cubic splines with the same knots (thus the same basis expansion $\Phi$), so we may trace how the basis coefficient estimates behave against the regularization parameter. This is the concept of the "regularization path", which plays a key role in model regularization.
```{r fig.width=8, fig.asp=0.7, fig.align="center"}
names(smooth.spline(x, y, spar = 0.5)$fit)
ss = seq(0, 1, by=0.02)
bb = NULL
for (k in 1:length(ss)) {
tmp = smooth.spline(x, y, spar=ss[k])
bb = cbind(bb, tmp$fit$coef)
}
matplot(ss, t(bb), type='l', lty=1, lwd = 1, col=4,
xlab = "Smoothing parameter", ylab = "Coefficients",
main = 'Smoothing Spline: Regularization Path')
abline(v= smooth.spline(x, y, cv = TRUE)$spar, col=2, lty=3,lwd=1)
abline(h=0, lty=3, col=1)
```
Understanding the regularization paths is critical for hyperparameter selection. Although this is not so obvious here in nonparametric smoothing, it is more important for the sparse linear modeling below.
***
## Hyperparameter Selection
The smoothing parameter is usually chosen by cross-validation, a method we will discuss in the chapter on model assessment.
```{r echo=FALSE, fig.align="center", out.width = "500px", eval=T, fig.cap=c("")}
knitr::include_graphics(c('CV5fold.png'))
```
As a preview, the procedure is to

1. Split the data into K (e.g. 5, 10, or n) folds;
2. For each fold k = 1, ..., K,
    i) fit the model on the remaining K-1 folds of data;
    i) evaluate the fitted model on the left-out fold;
3. Take the average risk (e.g. the MSE) as the cross-validation score.

The method of cross-validation then searches for the hyperparameter value that minimizes this score.
For smoothing splines, one usually adopts the leave-one-out scheme (i.e. n-fold cross-validation). For $i=1,\ldots,n$, let $\hat{f}^{[i]}(x_i)$ denote the prediction at $x_i$ based on the leave-one-out sample $\{(x_j, y_j)\}_{j\neq i}$; then the LOOCV (leave-one-out cross-validation) score is defined as
$$
{\rm LOOCV}(\lambda) = \frac{1}{n}\sum_{i=1}^n \Big (y_i - \hat{f}^{[i]}(x_i)\Big)^2
$$
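For a fixed smoothing parameter, the LOOCV score can be computed literally from this definition by refitting $n$ times. A brute-force sketch (slow, but it matches the definition):

```r
# Brute-force LOOCV for smooth.spline at a fixed spar value.
set.seed(2017)
x = seq(0.1, 4, by = 0.1)
y = exp(-(x - 3)^2) + 0.1 * rnorm(length(x))
n = length(x)
sse = 0
for (i in 1:n) {
  fit_i = smooth.spline(x[-i], y[-i], spar = 0.5)  # leave observation i out
  sse = sse + (y[i] - predict(fit_i, x[i])$y)^2
}
loocv = sse / n  # LOOCV score at spar = 0.5
loocv
```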
For a linear smoother one has the identity $y_i - \hat{f}^{[i]}(x_i) = (y_i - \hat{f}(x_i))/(1 - \{\mathbf{S}_\lambda\}_{ii})$; replacing each diagonal entry $\{\mathbf{S}_\lambda\}_{ii}$ by its average $\mathrm{tr}(\mathbf{S}_\lambda)/n$ yields the so-called **generalized cross-validation (GCV)** score
$$
{\rm GCV}(\lambda) = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat{f}(x_i)}{1-{\rm tr}(\mathbf{S}_\lambda)/n}\right)^2
$$
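With $\mathrm{tr}(\mathbf{S}_\lambda)$ available as the `$df` component of `smooth.spline`, the GCV score can be computed by hand; up to the package's internal weighting conventions, it should track the `cv.crit` value that `smooth.spline` reports (an assumption worth checking). A sketch:

```r
# GCV computed directly from the formula, using tr(S_lambda) = fit$df.
set.seed(2017)
x = seq(0.1, 4, by = 0.1)
y = exp(-(x - 3)^2) + 0.1 * rnorm(length(x))
n = length(x)
fit = smooth.spline(x, y, spar = 0.5)
yhat = predict(fit, x)$y
gcv = mean(((y - yhat) / (1 - fit$df / n))^2)
c(by_hand = gcv, reported = fit$cv.crit)
```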
Exercise: compare the computational complexity of these two scores.
```{r fig.align="center"}
ss = seq(0, 1, by=0.02)
gcv = NULL
for (k in 1:length(ss)) {
tmp = smooth.spline(x, y, spar=ss[k])
gcv = c(gcv, tmp$cv.crit)
}
plot(ss, gcv, type='b',
     xlab="Smoothing parameter (spar)", ylab="GCV",
     main="GCV Selection of Smoothing Parameter")
abline(v=ss[which.min(gcv)],col=2,lty=3,lwd=1)
```
***
# Shrinkage Methods
Now let's consider the multiple linear regression model. When there are many predictors, the model becomes complex and its interpretation can be difficult due to collinearity among the predictors. The prediction performance can also degrade when too many predictors are used. In this chapter we study regularization as a way to reduce the complexity of the model
$$
y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + \varepsilon
$$
for some large $p$. We study shrinkage methods that directly shrink the regression coefficients $\beta_j$ towards zero.
## Ridge Regression
Ridge regression is similar to least squares, except that its objective carries an additional penalty term
$$
\min_{\boldsymbol\beta} (\mathbf{y}-\mathbf{X}\boldsymbol\beta)^T(\mathbf{y}-\mathbf{X}\boldsymbol\beta) + \lambda \|\boldsymbol\beta\|^2
$$
where $\|\boldsymbol\beta\|^2 = \boldsymbol\beta^T\boldsymbol\beta = \sum_{j=1}^p\beta_j^2$. Following the same derivation as for the least squares estimate, we obtain the ridge estimator through regularized least squares:
$$
\boldsymbol{\hat\beta} = (\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T\mathbf{y}
$$
The fitted values are $\mathbf{\hat y} = \mathbf{S}_\lambda \mathbf{y}$ with the smoothing (or hat) matrix
$$
\mathbf{S}_\lambda = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T
$$
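The closed form is easy to verify directly. A minimal sketch on simulated data (predictors centered and the intercept omitted for simplicity): at $\lambda = 0$ it reproduces least squares, and the coefficients shrink toward zero as $\lambda$ grows.

```r
# Ridge estimator via the closed form (X^T X + lambda I)^{-1} X^T y.
set.seed(1)
n = 50; p = 3
X = scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)
y = drop(X %*% c(2, -1, 0.5)) + rnorm(n)
ridge = function(lambda) drop(solve(t(X) %*% X + lambda * diag(p), t(X) %*% y))
round(cbind(ols = ridge(0), lam10 = ridge(10), lam1000 = ridge(1000)), 3)
# each column shrinks toward zero as lambda increases
```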
As in the smoothing spline example above, let's play with ridge regression on an expanded iris dataset, focusing on visualizing the solution paths.
```{r fig.width=10, fig.asp=0.5, fig.align="center"}
library(MASS)
DataX = iris
DataX$SL2 = DataX$Sepal.Length^2
DataX$SW2 = DataX$Sepal.Width^2
DataX$PL2 = DataX$Petal.Length^2
DataX$PW2 = DataX$Petal.Width^2
ss = seq(0,10,by=0.2)
rgfit = lm.ridge(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
+ SW2 + PL2 + PW2, data = DataX, lambda = ss)
par(mfrow=c(1,2))
matplot(ss, t(rgfit$coef), type='l', lty=1, xlim=c(0,max(ss)*1.2), col=4,
xlab = "Ridge Parameter", ylab="Coefficients",
main = 'Ridge Regression: Solution Path')
text(max(ss), tail(t(rgfit$coef),1), colnames(t(rgfit$coef)), pos=4, cex=0.6)
abline(h=0, lty=3)
plot(ss, rowSums(t(rgfit$coef)*t(rgfit$coef)), type='b',
xlab = "Ridge Parameter", ylab="L2 Penalty",
main = 'Ridge Regression: L2 Penalty')
```
Note that ridge regression is often used to control the variance of the parameter estimates, which can explode for ill-posed or near-degenerate problems.
Exercise: find the ridge estimator when $\lambda = 0$ and $\lambda = \infty$, and analyze their bias and variance.
## LASSO
The lasso method is very similar to ridge regression, with the L2 norm replaced by the L1 norm:
$$
\min_{\boldsymbol\beta} (\mathbf{y}-\mathbf{X}\boldsymbol\beta)^T(\mathbf{y}-\mathbf{X}\boldsymbol\beta) + \lambda \|\boldsymbol\beta\|_1
$$
where $\|\boldsymbol\beta\|_1 = \sum_{j=1}^p |\beta_j|$, as introduced by Tibshirani (1996). It has the additional power of automatic variable selection.
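Where do the exact zeros come from? For an orthonormal design, the lasso solution is the soft-thresholded least squares estimate (a standard result, with the threshold proportional to $\lambda$). A one-line sketch of the soft-thresholding operator:

```r
# Soft-thresholding: |b| <= t is set exactly to 0; larger values shrink by t.
soft = function(b, t) sign(b) * pmax(abs(b) - t, 0)
soft(c(-2, -0.5, 0.3, 1.5), t = 1)  # -> -1  0  0  0.5
```

Contrast this with ridge regression, whose orthonormal-design solution $b/(1+\lambda)$ shrinks every coefficient but never sets one exactly to zero.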
```{r fig.width=10, fig.asp=0.5, fig.align="center"}
library(glmnet)
ss = exp(seq(-10, 0, by=0.1))
tmpx = as.matrix(DataX[, c("Sepal.Width", "Petal.Length", "Petal.Width", "SW2", "PL2", "PW2")])
lassofit = glmnet(tmpx, DataX$Sepal.Length, lambda = ss)
par(mfrow=c(1,2))
plot(lassofit, xvar="lambda", label=T)
plot(lassofit, xvar="norm", label=T)
```
**[Optional]** The above `plot.glmnet` method has fixed options set by the package. If we want to add variable-name labels with a custom font size, the following works well.
```{r fig.align="center"}
lbs_fun <- function(fit, ...) {
L <- length(fit$lambda)
x <- log(fit$lambda[L])
y <- fit$beta[, L]
labs <- names(y)
text(x, y, labels=labs, ...)
}
plot(lassofit, xvar="lambda", label=T)
lbs_fun(lassofit, cex=0.8, pos=4)
```
The magic by which the lasso performs variable selection comes from the power of the L1 penalty in capturing model sparsity. See the following charts that compare the L2 norm (ridge regression) and the L1 norm (lasso).
```{r echo=FALSE, fig.align="center", out.width = "800px", eval=T, fig.cap=c("")}
knitr::include_graphics(c('LassoRidge.png'))
```
## Elastic Net
This is a composite of the lasso and ridge regression, introducing an additional mixing parameter $\alpha \in [0,1]$:
$$
\min_{\boldsymbol\beta} (\mathbf{y}-\mathbf{X}\boldsymbol\beta)^T(\mathbf{y}-\mathbf{X}\boldsymbol\beta) + \lambda \Big(\alpha\|\boldsymbol\beta\|_1 + (1-\alpha)\|\boldsymbol\beta\|^2\Big)
$$
```{r echo=FALSE, fig.align="center", out.width = "500px", eval=T, fig.cap=c("")}
knitr::include_graphics(c('ElasticNet.png'))
```
We may play with `glmnet` using different choices of `alpha`.
```{r fig.align="center"}
library(glmnet)
tmpx = as.matrix(DataX[, c("Sepal.Width", "Petal.Length", "Petal.Width", "SW2", "PL2", "PW2")])
enetfit = glmnet(tmpx, DataX$Sepal.Length, alpha=0.1)
plot(enetfit, xvar="lambda")
```
**R::glmnet** package; see [Glmnet Vignette](http://web.stanford.edu/~hastie/glmnet/glmnet_beta.html) by Hastie and Qian (2016); see also Friedman, Hastie and Tibshirani's paper "[Regularization paths for generalized linear models via coordinate descent](https://www.jstatsoft.org/article/view/v033i01)" in Journal of Statistical Software (2010).
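As a toy illustration of the coordinate descent idea in that paper — a teaching sketch under simplifying assumptions (centered response, standardized predictors, no intercept), not the actual glmnet implementation — the lasso objective $\frac{1}{2n}\|\mathbf{y}-\mathbf{X}\boldsymbol\beta\|^2 + \lambda\|\boldsymbol\beta\|_1$ can be minimized by cyclically soft-thresholding one coordinate at a time:

```r
# Cyclic coordinate descent for the lasso (teaching sketch).
soft = function(z, t) sign(z) * pmax(abs(z) - t, 0)
lasso_cd = function(X, y, lambda, niter = 200) {
  n = nrow(X); p = ncol(X)
  b = rep(0, p)
  r = y                                  # current residual y - X b
  for (it in 1:niter) {
    for (j in 1:p) {
      zj = sum(X[, j] * r) / n + b[j]    # coordinate-wise least squares target
      bj = soft(zj, lambda)              # soft-threshold the coordinate
      r = r - X[, j] * (bj - b[j])       # update the residual incrementally
      b[j] = bj
    }
  }
  b
}
set.seed(1)
n = 100
X = scale(matrix(rnorm(n * 4), n, 4))    # standardized predictors
y = drop(X %*% c(3, 0, -2, 0)) + rnorm(n)
y = y - mean(y)                          # centered response, no intercept
round(lasso_cd(X, y, lambda = 0.5), 2)   # the two null coefficients hit 0
```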
***
Before ending this chapter, I would like to recommend an excellent textbook on **sparse modeling** published recently:
```{r echo=FALSE, fig.align="center", out.width = "360px", eval=T, fig.cap=c("CRC (2015)")}
knitr::include_graphics(c('BookStatLearnSparsity.png'))
```
# Appendix: Optional Materials
## Wavelet Shrinkage
* R has limited support for wavelets (as compared to MATLAB).
* Here we demonstrate only the simplest Haar basis, optional.
* Important reference: Donoho and Johnstone (1994), "Ideal spatial adaptation by wavelet shrinkage", Biometrika.
```{r fig.width=10, fig.asp=1.0, fig.align="center"}
HaarFun = function(x) ifelse(x<0 | x>=1, 0, ifelse(x<1/2, 1, -1))
BasisHaar = function(x, nlevel=3) {
u = (x-min(x))/(diff(range(x))+1e-10)
xphi = HaarFun(u)
xname = 0
for (j in 1:nlevel) {
for (k in seq(0,2^j-1)) xphi = cbind(xphi, 2^(j/2)*HaarFun(2^j*u-k))
xname=c(xname,rep(j, 2^j))
}
colnames(xphi) = xname
return(xphi)
}
ffun2 = function(x) 0.5*sin(3*x) + 0.25*sin(10*x)
x = seq(0,4,length.out=256)
set.seed(2017)
y = ffun2(x) + 0.1*rnorm(length(x))
nlevel = 5
xphi= BasisHaar(x,nlevel)
fit = lm(y ~ xphi)
beta = coef(fit)
par(mfrow=c(2,2))
plot(beta[-1], type="h", col=1+as.numeric(colnames(xphi)),
main="Wavelet Coefficients (nlevel=5)", ylab="")
abline(h=0, lty=3, col=1)
matplot(x, cbind(y,fit$fit), ylab = "y",
type="l", lty=1, col=c(1,4), lwd=c(1,2),
main="Wavelet Estimate")
thres = 0.1*sqrt(2*log(length(x))/length(x))
beta = ifelse(abs(beta)>thres, beta, 0)
# sign(beta)*pmax(abs(beta)-thres, 0) # soft
plot(beta[-1], type="h", col=1+as.numeric(colnames(xphi)),
main=paste("Wavelet Shrinkage with lambda =", round(thres,3)), ylab="")
abline(h=0, lty=3, col=1)
abline(h=c(-thres, thres), lty=2, col=1)
yhat = cbind(rep(1,dim(xphi)[1]), xphi)%*% beta
matplot(x, cbind(y,yhat), ylab = "y",
type="l", lty=1, col=c(1,4), lwd=c(1,2),
main="Wavelet Shrinkage Smoothing")
```