This session introduces regularization techniques using the example of smoothing spline and shrinkage methods.

**Boston Housing data**

The Boston data frame has 506 rows and 14 columns. medv is the response variable \(y\).

This data frame contains the following columns:

- “crim”: per capita crime rate by town.
- “zn”: proportion of residential land zoned for lots over 25,000 sq.ft.
- “indus”: proportion of non-retail business acres per town.
- “chas”: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
- “nox”: nitrogen oxides concentration (parts per 10 million).
- “rm”: average number of rooms per dwelling.
- “age”: proportion of owner-occupied units built prior to 1940.
- “dis”: weighted mean of distances to five Boston employment centres.
- “rad”: index of accessibility to radial highways.
- “tax”: full-value property-tax rate per $10,000.
- “ptratio”: pupil-teacher ratio by town.
- “black”: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
- “lstat”: lower status of the population (percent).
- “medv”: median value of owner-occupied homes in $1000s.

```
library(MASS)
library(splines)
```

Smoothing splines are a popular tool for nonparametric smoothing. Unlike regression splines with knot selection (which control the number of basis functions), a smoothing spline uses many basis functions by taking each distinct observation as a knot, and then controls the degree of smoothness by penalizing the roughness/curvature of the target function. Let \(\Phi\) be the matrix with \(\Phi(i,j) = f_j(x_i)\); the objective function can be written as \[ \min_{\boldsymbol\beta} (y-\Phi\beta)^T(y-\Phi\beta)+\lambda\beta^T\Omega_\Phi\beta \] where \(\Omega_\Phi(i,j)=\int f_i^{((k+1)/2)}(t)\,f_j^{((k+1)/2)}(t)\, dt\). By simple calculus, the coefficients \(\hat{\beta}\) that minimize the above objective are \[\hat{\beta} = (\Phi^T\Phi+\lambda\Omega_\Phi)^{-1}\Phi^Ty\] The predicted values are a linear function of the observed values: \[\hat{y} = \Phi(\Phi^T\Phi+\lambda\Omega_\Phi)^{-1}\Phi^Ty = S_\lambda y\]

where \(\mathbf{S}_\lambda\) is called the smoothing matrix.
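Because \(S_\lambda\) depends only on \(x\) and \(\lambda\) (not on \(y\)), the fitted values must be linear in \(y\). A small sketch checking this numerically with `smooth.spline` (the function is in base R's `stats` package; refitting with the same `lambda` on a rescaled response should rescale the fit exactly):

```r
library(MASS)    # for the Boston data

x <- Boston$lstat
y <- Boston$medv

fit1 <- smooth.spline(x, y, df = 6)                 # fixes a value of lambda
fit2 <- smooth.spline(x, 2 * y, lambda = fit1$lambda)

# Since yhat = S_lambda %*% y, doubling y doubles the fitted values
all.equal(2 * fit1$y, fit2$y)                       # TRUE: the smoother is linear in y
```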

**Hold Out Validation**

Hold-out validation is the simplest kind of cross-validation. For example, we randomly sample 80% of the original data and use it as the training set; the remaining 20% is kept as the hold-out validation set. The regression model is built on the training set, and the future performance of the model is evaluated on the hold-out set.

```
sample_index = sample(1:nrow(Boston),floor(nrow(Boston)*0.80))
Boston_train = Boston[sample_index,]
Boston_test = Boston[-sample_index,]
```
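The split can then be used to fit a model on the training set and score it on the hold-out set. A minimal sketch using out-of-sample MSE (the split is repeated here with a seed so the chunk runs on its own):

```r
library(MASS)    # for the Boston data

set.seed(1)      # for a reproducible split
sample_index <- sample(1:nrow(Boston), floor(nrow(Boston) * 0.80))
Boston_train <- Boston[sample_index, ]
Boston_test  <- Boston[-sample_index, ]

# fit on the training set, evaluate on the hold-out set
fit  <- lm(medv ~ lstat, data = Boston_train)
pred <- predict(fit, newdata = Boston_test)
mean((Boston_test$medv - pred)^2)   # hold-out mean squared error
```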

**K-fold Cross Validation**

In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data.

```
library(boot)
model = glm(medv~lstat, data = Boston)
res = cv.glm(data = Boston, glmfit = model, K = 5)
res$delta[2] # the adjusted cross-validation error
```

`## [1] 38.69906`

The choice of K is actually a trade-off between bias and variance. As K increases, each training fold grows closer to the full sample, so the bias of the CV error estimate decreases; but the training folds also overlap more, which increases the variance of the estimate. Empirically, we choose K = 5 or 10.
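The effect of K can be inspected directly by rerunning `cv.glm` with different fold counts (a sketch; the exact numbers vary with the random fold assignment):

```r
library(MASS)    # for the Boston data
library(boot)    # for cv.glm

set.seed(1)
model <- glm(medv ~ lstat, data = Boston)

# adjusted CV error for several choices of K
sapply(c(3, 5, 10), function(k) cv.glm(data = Boston, glmfit = model, K = k)$delta[2])
```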

**Leave-one-out Cross Validation (LOOCV)**

A special case of K-fold CV with K = n, so each observation is left out exactly once: \[ {\rm LOOCV}(\lambda) = \frac{1}{n}\sum_{i=1}^n \Big (y_i - \hat{f}^{[i]}(x_i)\Big)^2 \] where \(\hat{f}^{[i]}\) is the fit computed with the \(i\)-th observation removed.

`cv.glm(data = Boston, glmfit = model, K = nrow(Boston))$delta[2]`

`## [1] 38.88969`

Under an approximation, the LOOCV score can be replaced by the so-called generalized cross-validation (GCV) score: \[ GCV(\lambda) = \frac{1}{n}\sum_{i=1}^n\Big(\frac{y_i-\hat{f}(x_i)}{1-{\rm tr}(\mathbf{S}_\lambda)/n}\Big)^2 \]

```
# help(smooth.spline)
x <- Boston$lstat
y <- Boston$medv
n <- length(unique(x))
gcv <- numeric(100)
df <- seq(2, n, length = 100)  # candidate df values; df is the trace of S_lambda
for (i in 1:100) {
  tmp <- smooth.spline(x, y, df = df[i], cv = FALSE)  # cv = FALSE -> generalized cross-validation (GCV)
  gcv[i] <- tmp$cv.crit                               # cv.crit: the cross-validation score
}
# obtain the df that gives the minimum cross-validation score
min_df <- df[which.min(gcv)]
plot(df, gcv, type = "l")
abline(v = min_df, lty = 2, col = "darkgreen")
text(min_df + 70, min(gcv) + 0.05, paste("Min df =", round(min_df, 4)), srt = 0.2, pos = 3)
```

```
# visualize the fit with different dfs
plot(x, y, lwd = 1, xlab = 'lstat', ylab = 'medv', main = 'Smoothing spline')
fit <- smooth.spline(x, y)
lines(fit$x, smooth.spline(x, y, df = min_df)$y, col = 1, lwd = 2)
lines(fit$x, smooth.spline(x, y, df = 8)$y, col = 2, lwd = 2)
lines(fit$x, smooth.spline(x, y, df = 5)$y, col = 3, lwd = 2)
lines(fit$x, smooth.spline(x, y, df = 3)$y, col = 4, lwd = 2)
legend("topright", c("gcv", "df=8", "df=5", "df=3"), col = c(1, 2, 3, 4), lty = c(1, 1, 1, 1))
```

Besides variable selection techniques like stepwise regression, here we discuss an alternative: regularizing the coefficients, or equivalently, shrinking the coefficient estimates towards zero. Shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking coefficient estimates towards zero are *ridge regression* and the *lasso*, along with their combination, the so-called *elastic net*.

Ridge regression is similar to least squares except that the coefficients are estimated by minimizing a slightly different quantity. Like OLS, ridge regression seeks coefficient estimates that make the RSS small, but it adds a shrinkage penalty that grows with the size of the coefficients. This penalty has the effect of shrinking the coefficient estimates towards zero. \[\hat{\beta}^{ridge} = \arg\min_{\beta} \Big\{\sum_{i=1}^N\Big(y_i-\beta_0-\sum^p_{j=1}x_{ij}\beta_j\Big)^2+\lambda\sum_{j=1}^p\beta_j^2\Big\} \]

The parameter \(\lambda\) is a complexity parameter that controls the amount of shrinkage: the larger the value of \(\lambda\), the greater the amount of shrinkage, while \(\lambda=0\) behaves exactly like OLS regression. Of course, selecting a good value for \(\lambda\) is critical, and it should be chosen using cross-validation techniques.
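For a centered design, the ridge objective has the closed-form solution \(\hat{\beta} = (X^TX+\lambda I)^{-1}X^Ty\). A small sketch on the Boston data (note that `glmnet` uses a slightly different objective scaling, so its coefficients match this only up to a rescaling of \(\lambda\)):

```r
library(MASS)    # for the Boston data

# closed-form ridge coefficients on standardized predictors (sketch)
X <- scale(as.matrix(Boston[, 1:13]))   # standardize predictors
y <- Boston$medv - mean(Boston$medv)    # center the response, so no intercept is needed
lambda <- 10                            # an arbitrary illustrative value
p <- ncol(X)

beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
head(beta_ridge)
```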

**The Advantage of Ridge Regression?**

The advantage of ridge regression is the *bias-variance trade-off*. As \(\lambda\) increases, the flexibility of the ridge regression fit decreases. This decreases the variance, at the cost of a smaller increase in bias. OLS is unbiased but can have high variance, so ridge regression works best in situations where the least squares estimates have high variance. Ridge regression is also much more computationally efficient than any subset-selection method, since it is possible to simultaneously solve for all values of \(\lambda\).

**Notice**

Ridge regression requires the predictors \(X\) to be centered to have *mean = 0*, so the data must be standardized beforehand. The shrinkage is not applied to the intercept.

Ridge regression has at least one disadvantage: it includes all \(p\) predictors in the final model. The penalty term sets many coefficients close to zero, but never exactly to zero. This may not hurt prediction accuracy, but it makes the model hard to interpret. The lasso (*least absolute shrinkage and selection operator*) overcomes this disadvantage and is capable of forcing some of the coefficients exactly to zero.

\[ \hat{\beta}^{lasso} = arg\min_{\beta} \{\sum_{i=1}^N(y_i-\beta_0-\sum^p_{j=1}x_{ij}\beta_j)^2+\lambda\sum_{j=1}^p|\beta_j|\} \]

The \(L_2\) ridge penalty \(\sum_{j=1}^p\beta_j^2\) is replaced by the \(L_1\) lasso penalty \(\sum_{j=1}^p|\beta_j|\). The latter constraint makes the solutions nonlinear in the \(y_i\), and there is no closed-form expression as in ridge regression.
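In the special case of an orthonormal design (\(X^TX = I\)), however, each lasso coefficient is a soft-thresholded version of the OLS estimate: \(\hat{\beta}_j^{lasso} = \mathrm{sign}(\hat{\beta}_j)\,(|\hat{\beta}_j| - \lambda/2)_+\) under the penalty scaling used above. A sketch of the soft-thresholding operator:

```r
# soft-thresholding: the lasso solution under an orthonormal design,
# applied componentwise to the OLS estimates
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda / 2, 0)
}

# small coefficients are set exactly to zero; large ones are shrunk by lambda/2
soft_threshold(c(-3, -0.4, 0.1, 2.5), lambda = 1)   # -> -2.5  0.0  0.0  2.0
```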

The elastic net is another regularized regression that linearly combines the \(L_1\) and \(L_2\) penalties of the lasso and ridge methods. \[\hat{\beta}^{elastic} = \arg\min_{\beta}\Big\{ \sum_{i=1}^N\Big(y_i-\beta_0-\sum^p_{j=1}x_{ij}\beta_j\Big)^2+\lambda\Big(\alpha \sum_{j=1}^p|\beta_j|+(1-\alpha)\sum_{j=1}^p\beta_j^2\Big)\Big\}\]
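In `glmnet`, the mixing parameter `alpha` interpolates between ridge (`alpha = 0`) and lasso (`alpha = 1`); note that `glmnet`'s own penalty puts a factor of \((1-\alpha)/2\) on the \(L_2\) term rather than \((1-\alpha)\). A sketch of an elastic-net fit with equal mixing:

```r
library(MASS)      # for the Boston data
library(glmnet)

x <- as.matrix(Boston[, 1:13])
y <- Boston$medv

set.seed(1)
enet <- cv.glmnet(x, y, alpha = 0.5, nfolds = 5)  # alpha = 0.5: halfway between ridge and lasso
coef(enet, s = enet$lambda.min)
```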

```
library(glmnet)
# standardize all predictors (column 14 is the response, medv)
for (i in 1:ncol(Boston)) {
  if (i != 14) {
    Boston[, i] <- scale(Boston[, i])
  }
}
x <- as.matrix(Boston[1:13])
y <- as.matrix(Boston$medv)
# baseline: ordinary least squares with lm
lm_y <- lm(y ~ x)
summary(lm_y)
```

```
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.53281 0.21095 106.814 < 2e-16 ***
## xcrim -0.92906 0.28269 -3.287 0.001087 **
## xzn 1.08264 0.32016 3.382 0.000778 ***
## xindus 0.14104 0.42188 0.334 0.738288
## xchas 0.68241 0.21884 3.118 0.001925 **
## xnox -2.05875 0.44262 -4.651 4.25e-06 ***
## xrm 2.67688 0.29364 9.116 < 2e-16 ***
## xage 0.01949 0.37184 0.052 0.958229
## xdis -3.10712 0.41999 -7.398 6.01e-13 ***
## xrad 2.66485 0.57770 4.613 5.07e-06 ***
## xtax -2.07884 0.63379 -3.280 0.001112 **
## xptratio -2.06265 0.28323 -7.283 1.31e-12 ***
## xblack 0.85011 0.24521 3.467 0.000573 ***
## xlstat -3.74733 0.36216 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
```

```
# ridge regression with glmnet
ridge = cv.glmnet(x,y, alpha=0, nfolds =5)
plot(ridge)
```

`ridge$lambda.min`

`## [1] 0.7438467`

`as.matrix(coef(ridge, s=ridge$lambda.min))`

```
## 1
## (Intercept) 22.5328063
## crim -0.7440596
## zn 0.7450319
## indus -0.2765354
## chas 0.7381424
## nox -1.3395798
## rm 2.8214531
## age -0.1113097
## dis -2.3028934
## rad 1.2770689
## tax -0.9274417
## ptratio -1.8366613
## black 0.8257541
## lstat -3.3444950
```

**Lasso**

```
# Lasso regression with glmnet
lasso=cv.glmnet(x,y, alpha=1, nfolds =5)
plot(lasso)
```

`lasso$lambda.min`

`## [1] 0.01758818`

`as.matrix(coef(lasso, s=lasso$lambda.min))`

```
## 1
## (Intercept) 22.5328063
## crim -0.8770080
## zn 1.0036257
## indus 0.0000000
## chas 0.6854079
## nox -1.9347331
## rm 2.7005815
## age 0.0000000
## dis -3.0138641
## rad 2.3469474
## tax -1.7744571
## ptratio -2.0264072
## black 0.8329292
## lstat -3.7314758
```

```
# lasso path: coefficient profiles along a grid of lambda values
ss <- exp(seq(-10, 0, by = 0.1))
lassofit <- glmnet(x, y, lambda = ss)
par(mfrow = c(1, 2))
plot(lassofit, xvar = "lambda", label = TRUE)
plot(lassofit, xvar = "norm", label = TRUE)
```

These are coefficient profile plots. In both plots, each colored line represents the value taken by a different coefficient in the model. \(\lambda\) is the weight given to the regularization term (the \(L_1\) norm), so as \(\lambda\) approaches zero, the loss function of the model approaches the OLS loss function.

LSData contains the daily stock prices and volumes of the 50 HSI constituent stocks since 2010. VR = daily volume × daily range.

```
# the data is truncated at 2017-03-03
LD <- LSData[1:1548, ]
head(LD[, 1:12], 6)
```

```
## Date S0001_Open S0001_High S0001_Low S0001_Close S0001_Volume
## 1 2010-10-29 112.304 112.304 110.889 84.3367 6069243
## 2 2010-11-01 112.304 116.358 112.304 87.7674 5502837
## 3 2010-11-02 115.793 116.453 115.415 87.8389 3454941
## 4 2010-11-03 115.510 119.470 115.415 90.1260 9849580
## 5 2010-11-04 119.281 122.582 119.281 92.8419 10518249
## 6 2010-11-05 124.750 125.316 119.847 92.9134 10008060
## S0001_Adjusted S0001_Range S0001_Direction S0001_scaled_Volume S0001_VR
## 1 63.92475 1.415 -1 6.069243 8.587979
## 2 66.52512 4.054 -1 5.502837 22.308501
## 3 66.57931 1.038 -1 3.454941 3.586229
## 4 68.31285 4.055 -1 9.849580 39.940047
## 5 70.37145 3.301 -1 10.518249 34.720740
## 6 70.42564 5.469 -1 10.008060 54.734080
## S0001_logClose
## 1 4.434817
## 2 4.474690
## 3 4.475504
## 4 4.501209
## 5 4.530898
## 6 4.531668
```

```
# extract only the VR variables from the data frame
VRpos <- grep("VR", colnames(LD), value = FALSE)
VRpos <- VRpos[-length(VRpos)]  # drop the last column, as it is "HSI"
VR <- LD[VRpos]
# convert the VR variables to their spectral densities
X <- spectrum(VR)
```