Smoothing Spline

Letting \(\Phi\) be the matrix with \(\Phi(i,j) = f_j(x_i)\), the objective function can be written as \[(y-\Phi\beta)^T(y-\Phi\beta)+\lambda\beta^T\Omega_\Phi\beta,\] where \(\Omega_\Phi(i,j)=\int f_i^{\left(\frac{k+1}{2}\right)}(t)\,f_j^{\left(\frac{k+1}{2}\right)}(t)\,dt\), the superscript denoting differentiation of order \((k+1)/2\). By simple calculus, the coefficients \(\hat{\beta}\) that minimize this objective are \[\hat{\beta} = (\Phi^T\Phi+\lambda\Omega_\Phi)^{-1}\Phi^Ty.\] The predicted values are therefore a linear function of the observed values: \[\hat{y} = \Phi(\Phi^T\Phi+\lambda\Omega_\Phi)^{-1}\Phi^Ty = S_\lambda y.\] The effective degrees of freedom of a smoothing spline is the trace of the smoother matrix: \[\mathrm{trace}(S_\lambda) = S_\lambda(1,1) + S_\lambda(2,2) + \dots + S_\lambda(n,n).\]
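
To make the linear-smoother view concrete, here is a minimal base-R sketch using a toy polynomial basis and a made-up penalty matrix (both are illustrative stand-ins, not the actual spline basis and penalty): it forms \(\hat{\beta}\), the smoother matrix \(S_\lambda\), and the effective degrees of freedom as the trace of \(S_\lambda\).

# Toy illustration of a penalized least-squares smoother.
# Phi is a polynomial basis and Omega an arbitrary illustrative penalty matrix,
# chosen only so that the formulas above can be checked numerically.
set.seed(1)
xx <- seq(0, 1, length.out = 40)
yy <- sin(2 * pi * xx) + rnorm(40, sd = 0.3)
Phi <- cbind(1, xx, xx^2, xx^3, xx^4, xx^5)   # Phi[i, j] = f_j(x_i)
Omega <- diag(c(0, 0, 2, 6, 12, 20))          # stand-in penalty matrix (assumed)
lambda <- 0.5
beta.hat <- solve(t(Phi) %*% Phi + lambda * Omega, t(Phi) %*% yy)
S <- Phi %*% solve(t(Phi) %*% Phi + lambda * Omega) %*% t(Phi)  # smoother matrix
yhat <- S %*% yy
sum(diag(S))   # effective degrees of freedom = trace(S_lambda)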

Choosing the regularization parameter \(\lambda\) by Cross Validation

\[RSS_{LOOCV}(\lambda) = \frac{1}{n}\sum_{i=1}^n\Big(y_i-\hat{f}_\lambda^{(-i)}(x_i)\Big)^2 = \frac{1}{n}\sum_{i=1}^n\Big(\frac{y_i-\hat{f}_\lambda(x_i)}{1-S_\lambda(i,i)}\Big)^2\] where \(\hat{f}_\lambda^{(-i)}\) denotes the fit computed with the \(i\)-th observation left out.
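
The second equality is what makes LOOCV cheap for linear smoothers: the leave-one-out residuals can be recovered from a single fit. A small base-R check, again on a toy penalized least-squares smoother rather than an actual smoothing spline (the basis and the penalty \(\lambda I\) are illustrative):

set.seed(1)
xx <- seq(0, 1, length.out = 30)
yy <- sin(2 * pi * xx) + rnorm(30, sd = 0.3)
Phi <- cbind(1, xx, xx^2, xx^3)
lambda <- 0.1
S <- Phi %*% solve(t(Phi) %*% Phi + lambda * diag(4)) %*% t(Phi)  # yhat = S y
yhat <- S %*% yy
# Brute-force LOOCV: refit with observation i removed, then predict at x_i
loo <- sapply(1:30, function(i) {
  b <- solve(t(Phi[-i, ]) %*% Phi[-i, ] + lambda * diag(4), t(Phi[-i, ]) %*% yy[-i])
  yy[i] - Phi[i, ] %*% b
})
# Shortcut using the diagonal of S; the two vectors agree up to rounding error
max(abs(loo - (yy - yhat) / (1 - diag(S))))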

library(MASS)
help(mcycle)
x <- mcycle$times
y <- mcycle$accel
help(smooth.spline)
n <- length(unique(x))
# Evaluate the GCV criterion on a grid of df values (cv.crit is GCV with the default cv = FALSE)
gcv <- numeric(100)
df <- seq(2, n, length.out = 100)
for (i in 1:100) gcv[i] <- smooth.spline(x, y, df = df[i])$cv.crit
plot(df, gcv, type = "l")

df[which.min(gcv)]
## [1] 12.22222
plot(x,y,lwd=1,xlab='time',ylab='acceleration',main='Smoothing spline')
fit = smooth.spline(x,y)
lines(fit$x, smooth.spline(x, y, df = df[which.min(gcv)])$y, col = 1, lwd = 2)
lines(fit$x, smooth.spline(x,y,df=8)$y,col=2,lwd=2)
lines(fit$x, smooth.spline(x,y,df=5)$y,col=3,lwd=2) 
lines(fit$x, smooth.spline(x,y,df=3)$y,col=4,lwd=2) 
legend("bottomright", c("GCV-chosen df", "df=8", "df=5", "df=3"), col = c(1,2,3,4), lty = 1)
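
By default smooth.spline chooses the smoothing parameter itself by GCV (cv = FALSE), so the automatic fit above should agree closely with the grid search:

fit$df       # GCV-chosen effective degrees of freedom, close to df[which.min(gcv)]
fit$lambda   # the corresponding smoothing parameter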

Shrinkage Methods

We have talked about variable selection through AIC, BIC, and Mallows' Cp. Those approaches use the least squares estimator to fit each candidate model and an information criterion to estimate its test error. Here we discuss an alternative: fit a model containing all p predictors using a technique that constrains, or regularizes, the coefficient estimates, or equivalently shrinks them towards zero. Shrinking the coefficient estimates can significantly reduce their variance. The two best-known shrinkage techniques are ridge regression and the lasso.

Ridge Regression

Ridge regression is similar to least squares except that the coefficients are estimated by minimizing a slightly different quantity. Like OLS, ridge regression seeks coefficient estimates that make the RSS small, but it adds a shrinkage penalty that grows with the size of the coefficients. This penalty has the effect of shrinking the coefficient estimates towards zero. \[\hat{\beta}^{ridge} = argmin_{\beta}\{ \sum_{i=1}^N(y_i-\beta_0-\sum^p_{j=1}x_{ij}\beta_j)^2+\lambda\sum_{j=1}^p\beta_j^2\}\] The tuning parameter \(\lambda\) controls the amount of shrinkage: the larger the value of \(\lambda\), the greater the shrinkage, while \(\lambda=0\) reproduces the OLS fit exactly. Selecting a good value for \(\lambda\) is therefore critical and should be done by cross-validation.
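
With centered, standardized predictors and the intercept handled separately, the ridge estimate also has a closed form, \(\hat{\beta}^{ridge} = (X^TX+\lambda I)^{-1}X^Ty\). A minimal base-R sketch on simulated data (all names here are illustrative) showing how the coefficients shrink as \(\lambda\) grows:

# Ridge closed form on standardized predictors and a centered response
set.seed(1)
Xs <- scale(matrix(rnorm(100 * 5), 100, 5))
ys <- drop(Xs %*% c(2, -1, 0.5, 0, 0) + rnorm(100))
ys <- ys - mean(ys)                     # centering removes the (unpenalized) intercept
ridge <- function(lambda) solve(t(Xs) %*% Xs + lambda * diag(5), t(Xs) %*% ys)
round(sapply(c(0, 1, 10, 100), ridge), 3)   # columns: lambda = 0, 1, 10, 100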

Why is ridge regression better than least squares?

The advantage of ridge regression comes from the bias-variance trade-off. As \(\lambda\) increases, the flexibility of the ridge regression fit decreases, which reduces the variance at the cost of a (typically smaller) increase in bias. OLS, by contrast, is unbiased under the usual assumptions but can have high variance.

Ridge regression therefore works best in situations where the least squares estimates have high variance. It is also much more computationally efficient than any subset selection method, since the solutions can be computed for all values of \(\lambda\) essentially at once.

Notice:

  • Ridge regression requires the predictors \(X\) to be centered to have mean 0, and because the penalty is not scale-invariant the predictors should be standardized beforehand.

  • The shrinkage does not apply to the intercept.

Lasso

Ridge regression has at least one disadvantage: it includes all \(p\) predictors in the final model. The penalty term shrinks many of the coefficients close to zero, but never exactly to zero. This is not generally a problem for prediction accuracy, but it can make the model more difficult to interpret. The lasso overcomes this disadvantage and is capable of forcing some of the coefficients exactly to zero.

The lasso (least absolute shrinkage and selection operator) is a shrinkage method similar to ridge regression, with subtle but important differences. The lasso coefficients minimize a penalized residual sum of squares: \[\hat{\beta}^{lasso} = argmin_{\beta}\{ \sum_{i=1}^N(y_i-\beta_0-\sum^p_{j=1}x_{ij}\beta_j)^2+\lambda\sum_{j=1}^p|\beta_j|\}\]

The \(L_2\) ridge penalty \(\sum_{j=1}^p\beta_j^2\) is replaced by the \(L_1\) lasso penalty \(\sum_{j=1}^p|\beta_j|\). The latter constraint makes the solutions nonlinear in the \(y_i\), and there is no closed-form expression as in ridge regression.
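
In practice lasso fits are therefore computed numerically, typically by coordinate descent. A short sketch using the glmnet package (assumed installed; variable names are illustrative), with \(\lambda\) chosen by cross-validation:

library(glmnet)
set.seed(1)
Xs <- matrix(rnorm(100 * 6), 100, 6)
ys <- drop(Xs %*% c(2, -1, 0.5, 0, 0, 0) + rnorm(100))
cvfit <- cv.glmnet(Xs, ys, alpha = 1)   # alpha = 1 gives the lasso penalty
coef(cvfit, s = "lambda.min")           # several coefficients are exactly zero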

Elastic Net

The elastic net is another regularized regression method that linearly combines the \(L_1\) and \(L_2\) penalties of the lasso and ridge methods. \[\hat{\beta}^{enet} = argmin_{\beta}\{ \sum_{i=1}^N(y_i-\beta_0-\sum^p_{j=1}x_{ij}\beta_j)^2+\lambda(\alpha \sum_{j=1}^p|\beta_j|+(1-\alpha)\sum_{j=1}^p\beta_j^2)\}\]
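
In glmnet the mixing is controlled by alpha (its penalty puts a factor \((1-\alpha)/2\) on the ridge term, a minor rescaling of the formula above): alpha = 0 gives ridge, alpha = 1 the lasso, and intermediate values the elastic net. A sketch of the coefficient paths (glmnet assumed installed; data simulated as before):

library(glmnet)
set.seed(1)
Xs <- matrix(rnorm(100 * 6), 100, 6)
ys <- drop(Xs %*% c(2, -1, 0.5, 0, 0, 0) + rnorm(100))
fit.enet <- glmnet(Xs, ys, alpha = 0.5)          # elastic net, halfway between ridge and lasso
plot(fit.enet, xvar = "lambda", label = TRUE)    # coefficient paths as lambda varies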

Generate data

# Simulate 50 observations: x1-x3 are strongly correlated, x4-x5 are strongly
# correlated, and x6 is independent of the rest
set.seed(2017)
x1 <- runif(50, 0, 1)
x2 <- x1 + rnorm(50, 0, 0.25)
x3 <- (x1 + x2)/2 + runif(50, 0, 0.1)
x4 <- runif(50, 0, 1)
x5 <- (2*x4 + rnorm(50, 0, 0.25))/2 + runif(50, 0, 0.1)
x6 <- runif(50, 0, 1)
y <- (3 + x1 + x2 + 0.5*x3 + 0.75*x4 + 0.5*x5 + 0.5*x6 + rnorm(50, 0, 1))
x <- scale( cbind(x1,x2,x3,x4,x5,x6) )
trdata <- data.frame( cbind(x,y) )
names(trdata) <- c("sx1", "sx2", "sx3", "sx4", "sx5", "sx6", "y")
attach(trdata)
## The following object is masked _by_ .GlobalEnv:
## 
##     y
cor(trdata)
##             sx1       sx2       sx3        sx4         sx5        sx6
## sx1  1.00000000 0.7395644 0.9157676 0.06348506 -0.03069269 0.09950138
## sx2  0.73956435 1.0000000 0.9419738 0.24065730  0.15736081 0.15027626
## sx3  0.91576759 0.9419738 1.0000000 0.18099247  0.08958100 0.12266726
## sx4  0.06348506 0.2406573 0.1809925 1.00000000  0.92849948 0.13192584
## sx5 -0.03069269 0.1573608 0.0895810 0.92849948  1.00000000 0.11374387
## sx6  0.09950138 0.1502763 0.1226673 0.13192584  0.11374387 1.00000000
## y    0.55255230 0.5052255 0.5469725 0.13385738  0.01192527 0.30679756
##              y
## sx1 0.55255230
## sx2 0.50522548
## sx3 0.54697253
## sx4 0.13385738
## sx5 0.01192527
## sx6 0.30679756
## y   1.00000000
# Use OLS to fit a linear model using all of the predictors.
ols1 <- lm(y ~ sx1 + sx2 + sx3 + sx4 + sx5 + sx6)
summary(ols1)
## 
## Call:
## lm(formula = y ~ sx1 + sx2 + sx3 + sx4 + sx5 + sx6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.65691 -0.51683  0.02177  0.65113  1.92075 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0668     0.1575  32.161   <2e-16 ***
## sx1           1.6955     1.0092   1.680   0.1002    
## sx2           1.6263     1.1918   1.365   0.1795    
## sx3          -2.4255     2.0042  -1.210   0.2328    
## sx4           0.6537     0.4462   1.465   0.1502    
## sx5          -0.6103     0.4449  -1.372   0.1772    
## sx6           0.2932     0.1644   1.783   0.0817 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.114 on 43 degrees of freedom
## Multiple R-squared:  0.434,  Adjusted R-squared:  0.355 
## F-statistic: 5.495 on 6 and 43 DF,  p-value: 0.0002711
# Use a backwards elimination strategy and fit a model with sx3 omitted.
ols2 <- lm(y ~ sx1 + sx2 + sx4 + sx5 + sx6)
summary(ols2)
## 
## Call:
## lm(formula = y ~ sx1 + sx2 + sx4 + sx5 + sx6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.58323 -0.47774  0.00449  0.73949  1.90494 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0668     0.1584  31.992   <2e-16 ***
## sx1           0.5102     0.2446   2.086   0.0428 *  
## sx2           0.2155     0.2491   0.865   0.3916    
## sx4           0.6843     0.4478   1.528   0.1337    
## sx5          -0.6745     0.4440  -1.519   0.1359    
## sx6           0.3289     0.1626   2.022   0.0493 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.12 on 44 degrees of freedom
## Multiple R-squared:  0.4147, Adjusted R-squared:  0.3482 
## F-statistic: 6.235 on 5 and 44 DF,  p-value: 0.000188
# I'll now drop sx2.
ols3 <- lm(y ~ sx1 + sx4 + sx5 + sx6)
summary(ols3)
## 
## Call:
## lm(formula = y ~ sx1 + sx4 + sx5 + sx6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.76129 -0.51261  0.05411  0.73231  1.82631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0668     0.1579  32.082  < 2e-16 ***
## sx1           0.6657     0.1654   4.025 0.000216 ***
## sx4           0.7261     0.4440   1.636 0.108917    
## sx5          -0.6759     0.4428  -1.527 0.133865    
## sx6           0.3404     0.1616   2.106 0.040804 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.117 on 45 degrees of freedom
## Multiple R-squared:  0.4047, Adjusted R-squared:  0.3518 
## F-statistic: 7.649 on 4 and 45 DF,  p-value: 8.619e-05
# Next, drop sx5.
ols4 <- lm(y ~ sx1 + sx4 + sx6)
summary(ols4)
## 
## Call:
## lm(formula = y ~ sx1 + sx4 + sx6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.74844 -0.59126 -0.00469  0.83218  1.95865 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.06677    0.16020  31.628  < 2e-16 ***
## sx1          0.72654    0.16285   4.462 5.22e-05 ***
## sx4          0.09459    0.16347   0.579   0.5656    
## sx6          0.34079    0.16395   2.079   0.0433 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.133 on 46 degrees of freedom
## Multiple R-squared:  0.3739, Adjusted R-squared:  0.3331 
## F-statistic: 9.158 on 3 and 46 DF,  p-value: 7.305e-05
# Now, drop sx4.
ols5 <- lm(y ~ sx1 + sx6)
summary(ols5)
## 
## Call:
## lm(formula = y ~ sx1 + sx6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.71197 -0.58413 -0.01798  0.82183  2.02531 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0668     0.1591  31.854  < 2e-16 ***
## sx1           0.7313     0.1615   4.529 4.05e-05 ***
## sx6           0.3528     0.1615   2.185   0.0339 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.125 on 47 degrees of freedom
## Multiple R-squared:  0.3694, Adjusted R-squared:  0.3425 
## F-statistic: 13.76 on 2 and 47 DF,  p-value: 1.972e-05

Compare with stepwise selection using BIC

null <- lm(y ~ 1)
# Note: k = log(n) is intended to give BIC, but n here is still the value set in the
# smoothing-spline section (length(unique(mcycle$times))); for this data set the
# sample size is nrow(trdata).
ols6 <- step(ols1,scope = list(lower = null, upper = ols1), direction="backward", k = log(n))
## Start:  AIC=35.06
## y ~ sx1 + sx2 + sx3 + sx4 + sx5 + sx6
## 
##        Df Sum of Sq    RSS    AIC
## - sx3   1    1.8176 55.181 32.190
## - sx2   1    2.3109 55.675 32.635
## - sx5   1    2.3360 55.700 32.657
## - sx4   1    2.6636 56.027 32.951
## - sx1   1    3.5029 56.867 33.694
## - sx6   1    3.9446 57.308 34.081
## <none>              53.364 35.059
## 
## Step:  AIC=32.19
## y ~ sx1 + sx2 + sx4 + sx5 + sx6
## 
##        Df Sum of Sq    RSS    AIC
## - sx2   1    0.9390 56.120 28.490
## - sx5   1    2.8938 58.075 30.202
## - sx4   1    2.9283 58.110 30.232
## - sx6   1    5.1278 60.309 32.090
## <none>              55.181 32.190
## - sx1   1    5.4566 60.638 32.361
## 
## Step:  AIC=28.49
## y ~ sx1 + sx4 + sx5 + sx6
## 
##        Df Sum of Sq    RSS    AIC
## - sx5   1    2.9063 59.027 26.472
## - sx4   1    3.3359 59.456 26.834
## <none>              56.120 28.490
## - sx6   1    5.5319 61.652 28.648
## - sx1   1   20.2026 76.323 39.321
## 
## Step:  AIC=26.47
## y ~ sx1 + sx4 + sx6
## 
##        Df Sum of Sq    RSS    AIC
## - sx4   1    0.4297 59.456 22.291
## - sx6   1    5.5443 64.571 26.417
## <none>              59.027 26.472
## - sx1   1   25.5420 84.569 39.907
## 
## Step:  AIC=22.29
## y ~ sx1 + sx6
## 
##        Df Sum of Sq    RSS    AIC
## <none>              59.456 22.291
## - sx6   1    6.0383 65.495 22.584
## - sx1   1   25.9491 85.405 35.856
summary(ols6)
## 
## Call:
## lm(formula = y ~ sx1 + sx6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.71197 -0.58413 -0.01798  0.82183  2.02531 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0668     0.1591  31.854  < 2e-16 ***
## sx1           0.7313     0.1615   4.529 4.05e-05 ***
## sx6           0.3528     0.1615   2.185   0.0339 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.125 on 47 degrees of freedom
## Multiple R-squared:  0.3694, Adjusted R-squared:  0.3425 
## F-statistic: 13.76 on 2 and 47 DF,  p-value: 1.972e-05
ols7 <- step(null,scope = list(lower = null, upper = ols1), direction="forward", k = log(n))
## Start:  AIC=36.26
## y ~ 1
## 
##        Df Sum of Sq    RSS    AIC
## + sx1   1   28.7849 65.495 22.584
## + sx3   1   28.2064 66.073 23.023
## + sx2   1   24.0651 70.214 26.063
## + sx6   1    8.8740 85.405 35.856
## <none>              94.280 36.255
## + sx4   1    1.6893 92.590 39.895
## + sx5   1    0.0134 94.266 40.792
## 
## Step:  AIC=22.58
## y ~ sx1
## 
##        Df Sum of Sq    RSS    AIC
## + sx6   1    6.0383 59.456 22.291
## <none>              65.495 22.584
## + sx2   1    1.9410 63.554 25.623
## + sx3   1    0.9803 64.514 26.373
## + sx4   1    0.9236 64.571 26.417
## + sx5   1    0.0787 65.416 27.067
## 
## Step:  AIC=22.29
## y ~ sx1 + sx6
## 
##        Df Sum of Sq    RSS    AIC
## <none>              59.456 22.291
## + sx2   1   1.25260 58.204 25.770
## + sx3   1   0.63788 58.819 26.295
## + sx4   1   0.42968 59.027 26.471
## + sx5   1   0.00006 59.456 26.834
summary(ols7)
## 
## Call:
## lm(formula = y ~ sx1 + sx6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.71197 -0.58413 -0.01798  0.82183  2.02531 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0668     0.1591  31.854  < 2e-16 ***
## sx1           0.7313     0.1615   4.529 4.05e-05 ***
## sx6           0.3528     0.1615   2.185   0.0339 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.125 on 47 degrees of freedom
## Multiple R-squared:  0.3694, Adjusted R-squared:  0.3425 
## F-statistic: 13.76 on 2 and 47 DF,  p-value: 1.972e-05
ols8 <- step(null,scope = list(lower = null, upper = ols1), direction="both", k = log(n))
## Start:  AIC=36.26
## y ~ 1
## 
##        Df Sum of Sq    RSS    AIC
## + sx1   1   28.7849 65.495 22.584
## + sx3   1   28.2064 66.073 23.023
## + sx2   1   24.0651 70.214 26.063
## + sx6   1    8.8740 85.405 35.856
## <none>              94.280 36.255
## + sx4   1    1.6893 92.590 39.895
## + sx5   1    0.0134 94.266 40.792
## 
## Step:  AIC=22.58
## y ~ sx1
## 
##        Df Sum of Sq    RSS    AIC
## + sx6   1    6.0383 59.456 22.291
## <none>              65.495 22.584
## + sx2   1    1.9410 63.554 25.623
## + sx3   1    0.9803 64.514 26.373
## + sx4   1    0.9236 64.571 26.417
## + sx5   1    0.0787 65.416 27.067
## - sx1   1   28.7849 94.280 36.255
## 
## Step:  AIC=22.29
## y ~ sx1 + sx6
## 
##        Df Sum of Sq    RSS    AIC
## <none>              59.456 22.291
## - sx6   1    6.0383 65.495 22.584
## + sx2   1    1.2526 58.204 25.770
## + sx3   1    0.6379 58.819 26.295
## + sx4   1    0.4297 59.027 26.472
## + sx5   1    0.0001 59.456 26.834
## - sx1   1   25.9491 85.405 35.856
summary(ols8)
## 
## Call:
## lm(formula = y ~ sx1 + sx6)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.71197 -0.58413 -0.01798  0.82183  2.02531 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.0668     0.1591  31.854  < 2e-16 ***
## sx1           0.7313     0.1615   4.529 4.05e-05 ***
## sx6           0.3528     0.1615   2.185   0.0339 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.125 on 47 degrees of freedom
## Multiple R-squared:  0.3694, Adjusted R-squared:  0.3425 
## F-statistic: 13.76 on 2 and 47 DF,  p-value: 1.972e-05

Use Mallows' Cp

library(leaps)
bestss <- regsubsets(y ~ x, data=trdata)
summary(bestss)
## Subset selection object
## Call: regsubsets.formula(y ~ x, data = trdata)
## 6 Variables  (and intercept)
##     Forced in Forced out
## xx1     FALSE      FALSE
## xx2     FALSE      FALSE
## xx3     FALSE      FALSE
## xx4     FALSE      FALSE
## xx5     FALSE      FALSE
## xx6     FALSE      FALSE
## 1 subsets of each size up to 6
## Selection Algorithm: exhaustive
##          xx1 xx2 xx3 xx4 xx5 xx6
## 1  ( 1 ) "*" " " " " " " " " " "
## 2  ( 1 ) "*" " " " " " " " " "*"
## 3  ( 1 ) "*" "*" " " " " " " "*"
## 4  ( 1 ) "*" "*" "*" " " " " "*"
## 5  ( 1 ) "*" "*" " " "*" "*" "*"
## 6  ( 1 ) "*" "*" "*" "*" "*" "*"
plot(summary(bestss)$cp, type="b")
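
Finally, for comparison with the stepwise and best-subset results above, ridge and lasso fits on the same simulated training data can be obtained with the glmnet package (a sketch; glmnet is assumed to be installed, and \(\lambda\) is chosen here by 10-fold cross-validation):

library(glmnet)
set.seed(2017)
cv.ridge <- cv.glmnet(x, y, alpha = 0)   # x is the scaled predictor matrix created above
cv.lasso <- cv.glmnet(x, y, alpha = 1)
coef(cv.ridge, s = "lambda.min")   # ridge keeps all six predictors, shrunk towards zero
coef(cv.lasso, s = "lambda.min")   # the lasso can set some coefficients exactly to zero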