---
title: "Tutorial 5: Regularization"
author: "[Simon & Zebin]"
date: "2/27/2018 & 3/1/2018"
output:
html_document:
highlight: tango
number_sections: yes
theme: paper
toc: yes
toc_depth: 3
html_notebook: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This session introduces regularization techniques using smoothing splines and shrinkage methods as examples.
**Boston Housing data**
The Boston data frame has 506 rows and 14 columns. medv is the response variable $y$.
This data frame contains the following columns:
* "crim": per capita crime rate by town.
* "zn": proportion of residential land zoned for lots over 25,000 sq.ft.
* "indus": proportion of non-retail business acres per town.
* "chas": Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
* "nox": nitrogen oxides concentration (parts per 10 million).
* "rm": average number of rooms per dwelling.
* "age": proportion of owner-occupied units built prior to 1940.
* "dis": weighted mean of distances to five Boston employment centres.
* "rad": index of accessibility to radial highways.
* "tax": full-value property-tax rate per \$10,000.
* "ptratio": pupil-teacher ratio by town.
* "black": 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
* "lstat": lower status of the population (percent).
* "medv": median value of owner-occupied homes in \$1000s.
```{r warning=FALSE, echo=TRUE, message=FALSE}
library(MASS)
library(splines)
```
# Smoothing Spline
## Introduction
The smoothing spline is a popular tool for nonparametric smoothing. Unlike regression splines, which control smoothness through knot selection (i.e. the number of basis functions), a smoothing spline uses many basis functions by taking each distinct observation as a knot, and then controls the degree of smoothness by penalizing the roughness/curvature of the target function. Let $\Phi$ be the matrix with $\Phi(i,j) = f_j(x_i)$; the objective function can be written as:
$$
\min_{\boldsymbol\beta}
(y-\Phi\beta)^T(y-\Phi\beta)+\lambda\beta^T\Omega_\Phi\beta
$$
where $\Omega_\Phi(i,j)=\int f_i^{(m)}(t)\,f_j^{(m)}(t)\,dt$ and $f^{(m)}$ denotes the $m$-th derivative with $m=(k+1)/2$ for a degree-$k$ spline (for the cubic case $k=3$, the penalty is on second derivatives). By simple calculus, the coefficients $\hat{\beta}$ that minimize the above function can be written as:
$$\hat{\beta} = (\Phi^T\Phi+\lambda\Omega_\Phi)^{-1}\Phi^Ty$$
The predicted values are a linear function of the observed values:
$$\hat{y} = \Phi(\Phi^T\Phi+\lambda\Omega_\Phi)^{-1}\Phi^Ty = S_\lambda y$$
where $\mathbf{S}_\lambda$ is called the smoothing matrix.
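To make the closed form concrete, here is a minimal sketch using a hypothetical cubic polynomial basis with a hand-computed roughness penalty (illustration only; this is not the basis `smooth.spline` actually uses):

```r
# penalized least squares: beta-hat = (Phi'Phi + lambda*Omega)^{-1} Phi'y
set.seed(1)
x <- seq(0, 1, length.out = 50)
y <- sin(2 * pi * x) + rnorm(50, sd = 0.2)
Phi <- cbind(1, x, x^2, x^3)  # toy basis f_j(x) = x^(j-1)
# Omega(i,j) = integral over [0,1] of f_i''(t) f_j''(t) dt for this basis
Omega <- matrix(c(0, 0, 0, 0,
                  0, 0, 0, 0,
                  0, 0, 4, 6,
                  0, 0, 6, 12), 4, 4)
lambda <- 0.5
A <- t(Phi) %*% Phi + lambda * Omega
beta_hat <- solve(A, t(Phi) %*% y)  # penalized coefficients
S <- Phi %*% solve(A) %*% t(Phi)    # smoothing matrix S_lambda
sum(diag(S))                        # tr(S_lambda) = effective degrees of freedom
```

Setting $\lambda=0$ recovers ordinary least squares, and ${\rm tr}(\mathbf{S}_\lambda)$ shrinks below the number of basis functions as $\lambda$ grows.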
## Parameter Selection
### Popular Cross Validation Methods
**Hold Out Validation**
Hold-out validation is the simplest form of cross-validation. For example, we randomly sample 80% of the original data as the training set; the remaining 20% forms the hold-out validation set. The regression model is built on the training set, and the model's future performance is evaluated on the hold-out set.
```{r}
set.seed(1)  # for reproducibility
sample_index = sample(1:nrow(Boston), floor(nrow(Boston) * 0.80))
Boston_train = Boston[sample_index, ]
Boston_test = Boston[-sample_index, ]
```
**K-fold Cross Validation**
In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k-1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data.
```{r}
library(boot)
model = glm(medv~lstat, data = Boston)
res = cv.glm(data = Boston, glmfit = model, K = 5)
res$delta[2] # the adjusted cross-validation error
```
The choice of K is a trade-off between bias and variance. As K increases, the training sets overlap more and become increasingly similar to one another, so the CV estimate has lower bias but higher variance.
Empirically, we choose K = 5 or 10.
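For intuition, the same procedure can also be written out by hand; a minimal sketch with `lm` on the Boston data (fold labels assigned at random):

```r
library(MASS)  # for the Boston data
set.seed(1)
K <- 5
folds <- sample(rep(1:K, length.out = nrow(Boston)))  # random fold labels 1..K
cv_err <- numeric(K)
for (k in 1:K) {
  fit  <- lm(medv ~ lstat, data = Boston[folds != k, ])  # train on K-1 folds
  pred <- predict(fit, newdata = Boston[folds == k, ])   # predict the held-out fold
  cv_err[k] <- mean((Boston$medv[folds == k] - pred)^2)
}
mean(cv_err)  # K-fold estimate of the test MSE
```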
**Leave-one-out Cross Validation (LOOCV)**
A special case of K-fold CV with $K = n$: each observation is left out once, and $\hat{f}^{[i]}$ denotes the fit computed without the $i$-th observation.
$$
{\rm LOOCV}(\lambda) = \frac{1}{n}\sum_{i=1}^n \Big (y_i - \hat{f}^{[i]}(x_i)\Big)^2
$$
```{r}
cv.glm(data = Boston, glmfit = model, K = nrow(Boston))$delta[2]
```
### Generalized cross-validation (GCV) for smoothing spline
By replacing each diagonal entry of $\mathbf{S}_\lambda$ in the LOOCV formula with the average ${\rm tr}(\mathbf{S}_\lambda)/n$, the CV score can be approximated by the so-called generalized cross-validation (GCV) score:
$$
GCV(\lambda) = \frac{1}{n}\sum_{i=1}^n\Big(\frac{y_i-\hat{f}(x_i)}{1-{\rm tr}(\mathbf{S}_\lambda)/n}\Big)^2
$$
```{r warning=FALSE, message=FALSE}
library(splines)
library(glmnet)
# help(smooth.spline)
x <- Boston$lstat
y <- Boston$medv
n <- length(unique(x))
gcv <- numeric(100)
df <- seq(2, n, length = 100)  # candidate degrees of freedom, i.e. tr(S_lambda)
for (i in 1:100)
{
  tmp <- smooth.spline(x, y, df = df[i], cv = FALSE)  # cv = FALSE -> use GCV
  gcv[i] <- tmp$cv.crit                               # cv.crit: cross-validation score
}
# obtain the best df, i.e. the one giving the minimum GCV score
min_df <- df[which.min(gcv)]
plot(df, gcv, type = "l")
abline(v = min_df, lty = 2, col = "darkgreen")
text(min_df + 70, min(gcv) + 0.05, paste("Min df =", round(min_df, 4)), pos = 3)
```
```{r}
# visualize the fit for different choices of df
plot(x, y, lwd = 1, xlab = 'lstat', ylab = 'medv', main = 'Smoothing spline')
fit = smooth.spline(x, y)  # used only for the sorted unique x values (fit$x)
lines(fit$x, smooth.spline(x, y, df = min_df)$y, col = 1, lwd = 2)
lines(fit$x, smooth.spline(x, y, df = 8)$y, col = 2, lwd = 2)
lines(fit$x, smooth.spline(x, y, df = 5)$y, col = 3, lwd = 2)
lines(fit$x, smooth.spline(x, y, df = 3)$y, col = 4, lwd = 2)
legend("topright", c("GCV-chosen df", "df=8", "df=5", "df=3"), col = c(1, 2, 3, 4), lty = 1)
```
# Shrinkage Methods
Besides variable selection techniques like stepwise regression, here we discuss an alternative: regularizing the coefficients, or equivalently, shrinking the coefficient estimates towards zero. This shrinkage can significantly reduce their variance. The two best-known shrinkage techniques are *ridge regression* and the *lasso*, together with their combination, the so-called *elastic net*.
## Ridge Regression
Ridge regression is similar to least squares except that the coefficients are estimated by minimizing a slightly different quantity. Like OLS, ridge regression seeks coefficient estimates that reduce the RSS; however, it also pays a shrinkage penalty that grows as the coefficients move away from zero. This penalty has the effect of shrinking the coefficient estimates towards zero.
$$\hat{\beta}^{ridge} =
arg\min_{\beta} \{\sum_{i=1}^N(y_i-\beta_0-\sum^p_{j=1}x_{ij}\beta_j)^2+\lambda\sum_{j=1}^p\beta_j^2\}
$$
The parameter $\lambda$ is a complexity parameter that controls the amount of shrinkage: the larger the value of $\lambda$, the greater the shrinkage, while $\lambda=0$ reproduces OLS exactly. Of course, selecting a good value for $\lambda$ is critical, and it should be chosen using cross-validation.
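Since the ridge objective is quadratic, it has the closed-form solution $\hat{\beta}^{ridge}=(X^TX+\lambda I)^{-1}X^Ty$. A minimal sketch on standardized predictors and a centered response (illustration only; `cv.glmnet` below does this work for you):

```r
library(MASS)  # for the Boston data
X <- scale(as.matrix(Boston[, c("lstat", "rm")]))  # two standardized predictors
y <- Boston$medv - mean(Boston$medv)               # centered response, so no intercept
lambda <- 100
beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
beta_ols   <- solve(t(X) %*% X, t(X) %*% y)        # lambda = 0 recovers OLS
cbind(ols = as.vector(beta_ols), ridge = as.vector(beta_ridge))
```

The ridge estimates always have a smaller $L_2$ norm than the OLS estimates; increasing `lambda` pulls them further toward zero.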
**The Advantage of Ridge Regression?**
The advantage of ridge regression lies in the *bias-variance trade-off*. As $\lambda$ increases, the flexibility of the ridge fit decreases, which reduces variance at the cost of a smaller increase in bias. OLS estimates are unbiased but can have high variance, so ridge regression works best in situations where the least squares estimates suffer from high variance. Ridge regression is also much more computationally efficient than any subset-selection method, since the solutions for all values of $\lambda$ can be computed essentially at once.
**Notice**
* Ridge regression is not invariant to the scale of the predictors, so $X$ should be standardized (centered to *mean = 0* and scaled to unit variance) beforehand.
* The shrinkage does not apply to the intercept.
## Lasso
Ridge regression has at least one disadvantage: it includes all $p$ predictors in the final model. The penalty term shrinks many of the coefficients close to zero, but never exactly to zero. This may not hurt prediction accuracy, but it makes the model hard to interpret. The lasso (*least absolute shrinkage and selection operator*) overcomes this disadvantage and can force some of the coefficients exactly to zero.
$$
\hat{\beta}^{lasso} = arg\min_{\beta} \{\sum_{i=1}^N(y_i-\beta_0-\sum^p_{j=1}x_{ij}\beta_j)^2+\lambda\sum_{j=1}^p|\beta_j|\}
$$
The $L_2$ ridge penalty $\sum_{j=1}^p\beta_j^2$ is replaced by the $L_1$ lasso penalty $\sum_{j=1}^p|\beta_j|$. The latter constraint makes the solutions nonlinear in the $y_i$, and there is no closed-form expression as in ridge regression.
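Without a closed form, the lasso is typically solved by coordinate descent: cycle over the coefficients and apply a soft-thresholding update to each. A bare-bones sketch for the objective $\frac{1}{2n}\|y-X\beta\|^2+\lambda\|\beta\|_1$ (illustration only; glmnet's implementation is far more refined):

```r
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# coordinate descent for (1/2n)||y - X beta||^2 + lambda * ||beta||_1
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (it in 1:n_iter) {
    for (j in seq_along(beta)) {
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]  # partial residual
      beta[j] <- soft_threshold(sum(X[, j] * r_j) / n, lambda) /
        (sum(X[, j]^2) / n)
    }
  }
  beta
}

# toy data: true coefficients (2, -1, 0); a moderate lambda keeps the signal
set.seed(1)
X <- matrix(rnorm(100 * 3), 100, 3)
y <- X %*% c(2, -1, 0) + rnorm(100)
lasso_cd(X, y, lambda = 0.1)
```

A sufficiently large `lambda` drives every coefficient exactly to zero, while `lambda = 0` converges to the OLS solution.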
## Elastic Net
The elastic net is another regularized regression that linearly combines the $L_1$ and $L_2$ penalties of the lasso and ridge methods.
$$\hat{\beta}^{elastic} = arg\min_{\beta}\{ \sum_{i=1}^N(y_i-\beta_0-\sum^p_{j=1}x_{ij}\beta_j)^2+\lambda(\alpha \sum_{j=1}^p|\beta_j|+(1-\alpha)\sum_{j=1}^p\beta_j^2)\}$$
```{r}
library(glmnet)
# standardize the predictors (all columns except the response medv in column 14)
for (i in 1:ncol(Boston)) {
  if (i != 14) {
    Boston[, i] = scale(Boston[, i])
  }
}
x <- as.matrix(Boston[1:13])
y <- as.matrix(Boston$medv)
# linear regression with lm, for comparison with the shrinkage fits
lm_y = lm(y ~ x)
summary(lm_y)
```
```{r}
# ridge regression with glmnet
ridge = cv.glmnet(x,y, alpha=0, nfolds =5)
plot(ridge)
ridge$lambda.min
as.matrix(coef(ridge, s=ridge$lambda.min))
```
**Lasso**
```{r}
# lasso regression with glmnet
lasso = cv.glmnet(x, y, alpha = 1, nfolds = 5)
plot(lasso)
lasso$lambda.min
as.matrix(coef(lasso, s = lasso$lambda.min))
# coefficient profile plots: each colored line traces the value of one
# coefficient as the penalty changes. Lambda is the weight on the L1 penalty,
# so as lambda approaches zero the loss approaches the OLS loss function.
lassofit = glmnet(x, y)
par(mfrow = c(1, 2))
plot(lassofit, xvar = "lambda", label = TRUE)
plot(lassofit, xvar = "norm", label = TRUE)
```
# Extension: Logistic Regression Using glmnet
LSData contains the daily stock prices and volumes of the 50 HSI constituent stocks since 2010.
VR = daily volume $\times$ daily range.
```{r echo=FALSE, result='hide'}
setwd("E:/HKU data/Courses/Stat_3612/Tutorial2018/T5")
LSData <- readRDS("LSData.Rds")
```
```{r message=FALSE, warning=FALSE}
library(stringr)
#The data is truncated until 2017-03-03.
LD <- LSData[1:1548,]
head(LD[,1:12],6)
#Extract only the VR variables from the data frame
VRpos <- grep("VR",colnames(LD),value=FALSE)
VRpos <- VRpos[-length(VRpos)] # delete the last column, as it is "HSI"
VR <- LD[VRpos]
# convert the VR variables to their spectral densities
X <- spectrum(VR)
# spectral density is stored in the spec variable.
U <- X$spec
U_train <- U[,-c(5,7,27)] #excludes 0005, 0011 and 0700 as testing
U_test <- U[,c(5,7,27)]
# Fitting a natural cubic spline to the frequencies
H <- ns(X$freq,df=6)
Y <- t(H)%*%U_train
stk_names <- X$snames[-c(5,7,27)]
stk_logClose <- str_replace(stk_names,"VR","logClose")
# Create the dependent binary variable.
# Stocks whose price rises to about 1.3 times its 2017-03-03 value are considered worth buying; Z = 1 for those stocks.
Z <- ifelse(LSData[nrow(LSData),stk_logClose] - LD[nrow(LD),stk_logClose] > 0.262,1,0) #log(1.3)=0.262
Znames <- attr(Z,"dimnames")[[2]]
attr(Z,"dimnames") <- NULL
Z <- factor(Z)
names(Z) <- Znames
logmodel <- cv.glmnet(t(Y),Z, alpha=1,family="binomial")
coef <- as.matrix(coef(logmodel,s=logmodel$lambda.min))
# predicted probability of achieving a 1.3-fold rise for the testing stocks
TY <- t(H) %*% U_test
eta <- coef[1] + t(TY) %*% coef[2:length(coef)]  # linear predictor
exp(eta) / (1 + exp(eta))                        # logistic transform
```