This session introduces regularization techniques through two examples: smoothing splines and shrinkage methods.

Boston Housing data

The Boston data frame has 506 rows and 14 columns. medv is the response variable \(y\).

This data frame contains the following columns: crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, and medv.

library(MASS)
library(splines)

1 Smoothing Spline

1.1 Introduction

Smoothing splines are a popular tool for nonparametric smoothing. Unlike regression splines with knot selection (which control the number of basis functions), a smoothing spline uses a large basis, taking every distinct observation as a knot, and controls the degree of smoothness by penalizing the roughness/curvature of the target function. Let \(\Phi\) be the basis matrix with \(\Phi(i,j) = f_j(x_i)\). The objective function can be written as: \[ \min_{\boldsymbol\beta} (y-\Phi\beta)^T(y-\Phi\beta)+\lambda\beta^T\Omega_\Phi\beta \] where \(\Omega_\Phi(i,j)=\int f_i^{((k+1)/2)}(t)\,f_j^{((k+1)/2)}(t)\, dt\), the integrated product of the \((k+1)/2\)-th derivatives of the basis functions for a degree-\(k\) spline; for the usual cubic smoothing spline (\(k=3\)) this is \(\int f_i''(t)f_j''(t)\,dt\). By simple calculus, the coefficients \(\hat{\beta}\) that minimize the above function can be written as: \[\hat{\beta} = (\Phi^T\Phi+\lambda\Omega_\Phi)^{-1}\Phi^Ty\] The predicted values are then a linear function of the observed values: \[\hat{y} = \Phi(\Phi^T\Phi+\lambda\Omega_\Phi)^{-1}\Phi^Ty = \mathbf{S}_\lambda y\]

where \(\mathbf{S}_\lambda\) is called the smoothing matrix.
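
To make this closed form concrete, the following sketch builds a B-spline basis for lstat and penalizes squared second differences of the coefficients, a P-spline-style surrogate for \(\Omega_\Phi\). This is an illustration only; smooth.spline itself constructs the exact natural-spline penalty internally.

library(MASS)
library(splines)
x <- Boston$lstat
y <- Boston$medv
Phi <- bs(x, df = 20, intercept = TRUE)       # basis matrix with Phi[i, j] = f_j(x_i)
D <- diff(diag(ncol(Phi)), differences = 2)   # second-difference operator
Omega <- crossprod(D)                         # surrogate roughness penalty D'D
lambda <- 10                                  # an arbitrary smoothing parameter
A <- crossprod(Phi) + lambda * Omega
beta_hat <- solve(A, crossprod(Phi, y))       # (Phi'Phi + lambda*Omega)^{-1} Phi'y
S <- Phi %*% solve(A, t(Phi))                 # smoothing matrix S_lambda
y_hat <- S %*% y                              # fitted values are linear in y
sum(diag(S))                                  # effective degrees of freedom = tr(S_lambda)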

1.2 Parameter Selection

1.2.2 Generalized cross-validation (GCV) for smoothing spline

For a linear smoother \(\hat{y} = \mathbf{S}_\lambda y\), the leave-one-out cross-validation score has the closed form \[ CV(\lambda) = \frac{1}{n}\sum_{i=1}^n\Big(\frac{y_i-\hat{f}(x_i)}{1-\mathbf{S}_\lambda(i,i)}\Big)^2. \] Replacing each diagonal entry \(\mathbf{S}_\lambda(i,i)\) by the average \({\rm tr}(\mathbf{S}_\lambda)/n\) gives the so-called generalized cross-validation (GCV) score: \[ GCV(\lambda) = \frac{1}{n}\sum_{i=1}^n\Big(\frac{y_i-\hat{f}(x_i)}{1-{\rm tr}(\mathbf{S}_\lambda)/n}\Big)^2 \]
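
The GCV score can be computed directly from the linear-smoother form. The helper below reuses Phi, Omega, x and y from the sketch in Section 1.1; smooth.spline, used next, computes its own GCV internally.

gcv_by_hand <- function(lambda) {
  A <- crossprod(Phi) + lambda * Omega
  S <- Phi %*% solve(A, t(Phi))
  resid <- y - S %*% y
  mean((resid / (1 - sum(diag(S)) / length(y)))^2)
}
sapply(c(0.1, 1, 10, 100), gcv_by_hand)  # compare a few lambda values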

library(splines)
library(glmnet)
# help(smooth.spline)
x <- Boston$lstat
y <- Boston$medv
n <- length(unique(x))
gcv <- numeric(100)
df <- seq(2, n, length = 100)  # candidate df values; df is the trace of the smoothing matrix S_lambda
for (i in 1:100) 
{
  tmp <- smooth.spline(x, y, df = df[i], cv = FALSE)  # cv = FALSE selects generalized cross-validation (GCV)
  gcv[i] <- tmp$cv.crit  # cv.crit holds the GCV score
}
#Obtain the best df which gives minimum cross-validation score
min_df <- df[which.min(gcv)]

plot(df, gcv, type = "l")
abline(v = min_df, lty = 2, col = "darkgreen")
text(min_df + 70, min(gcv) + 0.05, paste("Min df =", round(min_df, 4)), pos = 3)

# visualize the fit with different df values
plot(x, y, lwd = 1, xlab = 'lstat', ylab = 'medv', main = 'Smoothing spline')
fit <- smooth.spline(x, y)  # fit$x holds the sorted unique x values used for plotting
lines(fit$x, smooth.spline(x, y, df = min_df)$y, col = 1, lwd = 2)
lines(fit$x, smooth.spline(x, y, df = 8)$y, col = 2, lwd = 2)
lines(fit$x, smooth.spline(x, y, df = 5)$y, col = 3, lwd = 2)
lines(fit$x, smooth.spline(x, y, df = 3)$y, col = 4, lwd = 2)
legend("topright", c("GCV df", "df=8", "df=5", "df=3"), col = c(1, 2, 3, 4), lty = 1)

2 Shrinkage Methods

Besides variable selection techniques such as stepwise regression, we discuss an alternative: regularizing the coefficients, or equivalently, shrinking the coefficient estimates towards zero. Shrinking the coefficient estimates can significantly reduce their variance. The two best-known shrinkage techniques are ridge regression and the lasso; their combination is known as the elastic net.

2.1 Ridge Regression

Ridge regression is similar to least squares except that the coefficients are estimated by minimizing a slightly different quantity. Like OLS, ridge regression seeks coefficient estimates that make the RSS small; however, it adds a shrinkage penalty on the size of the coefficients, which pulls the estimates towards zero. \[\hat{\beta}^{\rm ridge} = \arg\min_{\beta} \Big\{\sum_{i=1}^N\Big(y_i-\beta_0-\sum^p_{j=1}x_{ij}\beta_j\Big)^2+\lambda\sum_{j=1}^p\beta_j^2\Big\} \]

The parameter \(\lambda\) is a complexity parameter that controls the amount of shrinkage: the larger the value of \(\lambda\), the greater the shrinkage, while \(\lambda=0\) reproduces the OLS fit exactly. Selecting a good value for \(\lambda\) is critical and is usually done by cross-validation.
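
For intuition, here is a minimal sketch of the closed-form ridge solution \(\hat{\beta} = (X^TX+\lambda I)^{-1}X^Ty\) on standardized predictors. The value lambda = 10 is an arbitrary choice, and the glmnet fit later in this section uses its own algorithm and lambda scaling, so the numbers will not match exactly.

library(MASS)
ridge_coef <- function(X, y, lambda) {
  Xs <- scale(X)                   # standardize predictors (mean 0, sd 1)
  yc <- y - mean(y)                # centre the response; the intercept is then mean(y)
  beta <- solve(crossprod(Xs) + lambda * diag(ncol(Xs)), crossprod(Xs, yc))
  c("(Intercept)" = mean(y), beta[, 1])
}
ridge_coef(as.matrix(Boston[, 1:13]), Boston$medv, lambda = 10)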

The Advantage of Ridge Regression?

The advantage of ridge regression comes from the bias-variance trade-off. As \(\lambda\) increases, the flexibility of the ridge regression fit decreases, which reduces the variance at the cost of a (typically smaller) increase in bias. OLS estimates are unbiased but can have high variance, so ridge regression works best in situations where the least squares estimates have high variance. Ridge regression is also much more computationally efficient than best-subset selection, since the fits for all values of \(\lambda\) can be computed almost as cheaply as a single least squares fit, rather than searching over \(2^p\) subsets.

Notice

  • Ridge regression is not scale-invariant: the penalty treats all coefficients equally, so the predictors \(X\) should be centered (mean 0) and standardized beforehand.

  • The shrinkage does not apply to the intercept.

2.2 Lasso

Ridge regression has at least one disadvantage: it includes all \(p\) predictors in the final model. The penalty term shrinks many coefficients close to zero, but never exactly to zero. This may not hurt prediction accuracy, but it makes the model harder to interpret. The lasso (least absolute shrinkage and selection operator) overcomes this disadvantage and is capable of forcing some of the coefficients exactly to zero.

\[ \hat{\beta}^{\rm lasso} = \arg\min_{\beta} \Big\{\sum_{i=1}^N\Big(y_i-\beta_0-\sum^p_{j=1}x_{ij}\beta_j\Big)^2+\lambda\sum_{j=1}^p|\beta_j|\Big\} \]

The \(L_2\) ridge penalty \(\sum_{j=1}^p\beta_j^2\) is replaced by the \(L_1\) lasso penalty \(\sum_{j=1}^p|\beta_j|\). The latter constraint makes the solutions nonlinear in the \(y_i\), and there is no closed-form expression as in ridge regression.
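
While there is no general closed form, the special case of orthonormal predictors (\(X^TX = I\)) shows how the \(L_1\) penalty zeroes out coefficients exactly: each OLS coefficient is soft-thresholded at \(\lambda/2\). A small illustrative sketch follows; this is not how glmnet computes the solution.

# Lasso solution when X'X = I, for the objective written above
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda / 2, 0)
}
soft_threshold(c(-2.1, 0.3, 1.7), lambda = 1)  # the small coefficient becomes exactly 0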

2.3 Elastic Net

The elastic net is another regularized regression method that linearly combines the \(L_1\) and \(L_2\) penalties of the lasso and ridge methods (a glmnet example appears at the end of this section). \[\hat{\beta}^{\rm elastic} = \arg\min_{\beta}\Big\{ \sum_{i=1}^N\Big(y_i-\beta_0-\sum^p_{j=1}x_{ij}\beta_j\Big)^2+\lambda\Big(\alpha \sum_{j=1}^p|\beta_j|+(1-\alpha)\sum_{j=1}^p\beta_j^2\Big)\Big\}\]

library(glmnet)
# standardize the 13 predictors; leave the response medv (column 14) unscaled
for (i in 1:ncol(Boston)){
    if (i != 14) {
        Boston[,i] = scale(Boston[,i])
    }
}
x <- as.matrix(Boston[1:13])
y <- as.matrix(Boston$medv)
# ordinary least squares with lm, for comparison
lm_y = lm(y ~ x)
summary(lm_y)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 22.53281    0.21095 106.814  < 2e-16 ***
## xcrim       -0.92906    0.28269  -3.287 0.001087 ** 
## xzn          1.08264    0.32016   3.382 0.000778 ***
## xindus       0.14104    0.42188   0.334 0.738288    
## xchas        0.68241    0.21884   3.118 0.001925 ** 
## xnox        -2.05875    0.44262  -4.651 4.25e-06 ***
## xrm          2.67688    0.29364   9.116  < 2e-16 ***
## xage         0.01949    0.37184   0.052 0.958229    
## xdis        -3.10712    0.41999  -7.398 6.01e-13 ***
## xrad         2.66485    0.57770   4.613 5.07e-06 ***
## xtax        -2.07884    0.63379  -3.280 0.001112 ** 
## xptratio    -2.06265    0.28323  -7.283 1.31e-12 ***
## xblack       0.85011    0.24521   3.467 0.000573 ***
## xlstat      -3.74733    0.36216 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
# ridge regression with glmnet
ridge = cv.glmnet(x,y, alpha=0, nfolds =5)
plot(ridge)

ridge$lambda.min
## [1] 0.7438467
as.matrix(coef(ridge, s=ridge$lambda.min))
##                      1
## (Intercept) 22.5328063
## crim        -0.7440596
## zn           0.7450319
## indus       -0.2765354
## chas         0.7381424
## nox         -1.3395798
## rm           2.8214531
## age         -0.1113097
## dis         -2.3028934
## rad          1.2770689
## tax         -0.9274417
## ptratio     -1.8366613
## black        0.8257541
## lstat       -3.3444950

Lasso

# Lasso regression with glmnet
lasso=cv.glmnet(x,y, alpha=1, nfolds =5)
plot(lasso)

lasso$lambda.min
## [1] 0.01758818
as.matrix(coef(lasso, s=lasso$lambda.min))
##                      1
## (Intercept) 22.5328063
## crim        -0.8770080
## zn           1.0036257
## indus        0.0000000
## chas         0.6854079
## nox         -1.9347331
## rm           2.7005815
## age          0.0000000
## dis         -3.0138641
## rad          2.3469474
## tax         -1.7744571
## ptratio     -2.0264072
## black        0.8329292
## lstat       -3.7314758
ss <- exp(seq(0, -10, by = -0.1))  # decreasing lambda grid for glmnet
lassofit <- glmnet(x, y, lambda = ss)
par(mfrow = c(1, 2))
plot(lassofit, xvar = "lambda", label = TRUE)  # coefficient paths against log(lambda)
plot(lassofit, xvar = "norm", label = TRUE)    # coefficient paths against the L1 norm

# coefficient profile plot. 
# In both plots, each colored line represents the value taken by a different coefficient in your model. 
# Lambda is the weight given to the regularization term (the L1 norm), so as lambda approaches zero, the loss function of your model approaches the OLS loss function. 
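
For completeness, the elastic net of Section 2.3 uses the same interface: glmnet's alpha argument is the mixing parameter between the \(L_1\) and \(L_2\) penalties (alpha = 1 gives the lasso, alpha = 0 gives ridge). A short sketch reusing the x and y above, with alpha = 0.5 as an arbitrary choice:

enet <- cv.glmnet(x, y, alpha = 0.5, nfolds = 5)
enet$lambda.min
as.matrix(coef(enet, s = enet$lambda.min))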

3 Extension: Logistic Regression Using glmnet

LSData contains the daily stock prices and volumes of the 50 HSI constituent stocks since 2010. VR = daily volume * daily range.

library(stringr)
# The data are truncated at 2017-03-03 (the first 1548 rows).
LD <- LSData[1:1548,]
head(LD[,1:12],6)
##         Date S0001_Open S0001_High S0001_Low S0001_Close S0001_Volume
## 1 2010-10-29    112.304    112.304   110.889     84.3367      6069243
## 2 2010-11-01    112.304    116.358   112.304     87.7674      5502837
## 3 2010-11-02    115.793    116.453   115.415     87.8389      3454941
## 4 2010-11-03    115.510    119.470   115.415     90.1260      9849580
## 5 2010-11-04    119.281    122.582   119.281     92.8419     10518249
## 6 2010-11-05    124.750    125.316   119.847     92.9134     10008060
##   S0001_Adjusted S0001_Range S0001_Direction S0001_scaled_Volume  S0001_VR
## 1       63.92475       1.415              -1            6.069243  8.587979
## 2       66.52512       4.054              -1            5.502837 22.308501
## 3       66.57931       1.038              -1            3.454941  3.586229
## 4       68.31285       4.055              -1            9.849580 39.940047
## 5       70.37145       3.301              -1           10.518249 34.720740
## 6       70.42564       5.469              -1           10.008060 54.734080
##   S0001_logClose
## 1       4.434817
## 2       4.474690
## 3       4.475504
## 4       4.501209
## 5       4.530898
## 6       4.531668
# Extract only the VR variables from the data frame
VRpos <- grep("VR", colnames(LD), value = FALSE)
VRpos <- VRpos[-length(VRpos)]  # drop the last match, as it is the HSI index itself
VR <- LD[VRpos]

# convert the VR variables to their spectral densities
# (the returned "spec" object's $spec component holds the estimated density of each series)
X <- spectrum(VR)
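
The remainder of this extension is omitted here, but the glmnet call for a penalized logistic regression has the same shape as before with family = "binomial". In the hedged sketch below, the binary response (whether stock S0001 closed up on the day) and the choice of the raw VR columns as predictors are assumptions for illustration, not part of the original analysis.

# Hypothetical example: response and predictors chosen for illustration only
up <- as.numeric(LD$S0001_Direction == 1)  # 1 if the stock moved up that day, else 0
logit_fit <- cv.glmnet(as.matrix(VR), up, family = "binomial", alpha = 1, nfolds = 5)
coef(logit_fit, s = logit_fit$lambda.min)  # lasso-penalized logistic coefficients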