---
title: "STAT3612_2312_DataMining_5th_Tutorial"
author: "YOU Jia"
date: "2/27/2017"
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Smoothing Spline
Letting $\Phi$ be the basis matrix with $\Phi(i,j) = f_j(x_i)$, the objective function can be written as:
$$(y-\Phi\beta)^T(y-\Phi\beta)+\lambda\beta^T\Omega_\Phi\beta$$
where $\Omega_\Phi(i,j)=\int f_i^{((k+1)/2)}(t)\,f_j^{((k+1)/2)}(t)\,dt$, the superscript denoting the derivative of order $(k+1)/2$ (the second derivative for a cubic spline, $k=3$).
By simple calculus, the coefficients $\hat{\beta}$ that minimize the above function can be written as:
$$\hat{\beta} = (\Phi^T\Phi+\lambda\Omega_\Phi)^{-1}\Phi^Ty$$
The fitted values are a linear function of the observed values:
$$\hat{y} = \Phi(\Phi^T\Phi+\lambda\Omega_\Phi)^{-1}\Phi^Ty = S_\lambda y$$
The effective degrees of freedom of a smoothing spline are the trace of the smoother matrix:
$$\mathrm{tr}(S_\lambda) = S_\lambda(1,1) + S_\lambda(2,2) + \cdots + S_\lambda(n,n) $$
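Since a smoothing spline is a linear smoother, the columns of $S_\lambda$ can be recovered by smoothing the unit vectors at a fixed $\lambda$. A small sketch on simulated data (a hypothetical example, not the `mcycle` data used below) checks that the trace agrees with the `df` reported by `smooth.spline`:

```{r}
set.seed(1)
x0 <- seq(0, 1, length.out = 30)
y0 <- sin(2 * pi * x0) + rnorm(30, sd = 0.2)
fit0 <- smooth.spline(x0, y0, df = 5)   # lambda chosen internally to give df about 5
n0 <- length(x0)
# column j of S_lambda = smooth of the j-th unit vector at the same lambda
S <- sapply(1:n0, function(j) {
  e <- numeric(n0); e[j] <- 1
  predict(smooth.spline(x0, e, lambda = fit0$lambda), x0)$y
})
sum(diag(S))   # close to fit0$df
```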
### Choosing the regularization parameter $\lambda$ by Cross Validation
$$RSS_{LOOCV}(\lambda) = \frac{1}{n}\sum_{i=1}^n\Big(y_i-\hat{f}^{(-i)}(x_i)\Big)^2 = \frac{1}{n}\sum_{i=1}^n\Big(\frac{y_i-\hat{f}(x_i)}{1-S_{ii}}\Big)^2$$
where $\hat{f}^{(-i)}$ is the spline fitted with the $i$-th observation left out; the shortcut on the right means LOOCV requires only a single fit.
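The same leave-one-out identity holds for any linear smoother. As a minimal check, here it is verified for the simplest one, an OLS fit, whose hat matrix plays the role of $S$ (simulated data, hypothetical example):

```{r}
set.seed(1)
m <- 20
xo <- runif(m)
yo <- 1 + 2 * xo + rnorm(m)
X <- cbind(1, xo)
H <- X %*% solve(t(X) %*% X, t(X))    # hat matrix: yhat = H y
yhat <- as.vector(H %*% yo)
shortcut <- mean(((yo - yhat) / (1 - diag(H)))^2)
# brute force: refit m times, each time leaving one observation out
brute <- mean(sapply(1:m, function(i) {
  b <- coef(lm(yo[-i] ~ xo[-i]))
  (yo[i] - (b[1] + b[2] * xo[i]))^2
}))
all.equal(shortcut, brute)   # TRUE
```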
```{r}
library(MASS)   # provides the mcycle data (and lm.ridge, used later)
# see ?mcycle and ?smooth.spline for details
x <- mcycle$times
y <- mcycle$accel
```
```{r}
n <- length(unique(x))
gcv <- numeric(100)
df <- seq(2, n, length.out = 100)
for (i in 1:100) gcv[i] <- smooth.spline(x, y, df = df[i])$cv.crit  # GCV by default
plot(df,gcv ,type="l")
df[which.min(gcv)]
```
```{r}
plot(x,y,lwd=1,xlab='time',ylab='acceleration',main='Smoothing spline')
fit <- smooth.spline(x, y)   # smoothing level chosen by GCV (df about 12.2)
lines(fit$x, fit$y, col=1, lwd=2)
lines(fit$x, smooth.spline(x,y,df=8)$y,col=2,lwd=2)
lines(fit$x, smooth.spline(x,y,df=5)$y,col=3,lwd=2)
lines(fit$x, smooth.spline(x,y,df=3)$y,col=4,lwd=2)
legend("bottomright", c("GCV","df=8","df=5","df=3"), col = c(1,2,3,4),lty=c(1,1,1,1))
```
# Shrinkage Methods
We have discussed variable selection via *AIC*, *BIC* and *Mallows' Cp*. Those methods use the least squares estimates of candidate submodels to choose the best model and estimate the test error. Here we discuss an alternative: fit a model containing all *p* predictors using a technique that constrains, or regularizes, the coefficient estimates, i.e. shrinks them towards zero. Shrinking the coefficient estimates can significantly reduce their variance. The two best-known shrinkage techniques are *ridge regression* and the *lasso*.
## Ridge Regression
Ridge regression is similar to least squares except that the coefficients are estimated by minimizing a slightly different quantity. Like OLS, ridge regression seeks coefficient estimates that make the RSS small, but it adds a shrinkage penalty that grows with the size of the coefficients. This penalty has the effect of shrinking the coefficient estimates towards zero.
$$\hat{\beta}^{ridge} = argmin_{\beta}\{ \sum_{i=1}^N(y_i-\beta_0-\sum^p_{j=1}x_{ij}\beta_j)^2+\lambda\sum_{j=1}^p\beta_j^2\}$$
The parameter $\lambda$ is a complexity parameter that controls the amount of shrinkage: the larger the value of $\lambda$, the greater the shrinkage, while $\lambda=0$ reproduces the OLS fit exactly. Selecting a good value for $\lambda$ is critical and should be done by cross validation.
#### Why is ridge regression better than least squares?
The advantage of ridge regression comes from the *bias-variance trade-off*. As $\lambda$ increases, the flexibility of the ridge fit decreases, which lowers the variance at the cost of a small increase in bias. OLS, in contrast, is unbiased but can have high variance.
Ridge regression therefore works best in situations where the least squares estimates have high variance, for example when the predictors are highly correlated. It is also much more computationally efficient than best subset selection, since the solutions for all values of $\lambda$ can be computed essentially in one pass.
#### Notice:
* Ridge regression requires the predictors $X$ to be centered to have *mean = 0*, so the data should be standardized beforehand.
* The shrinkage does not apply to the intercept.
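For centered data, the minimizer above has the closed form $\hat{\beta}^{ridge} = (X^TX + \lambda I)^{-1}X^Ty$. A minimal sketch with simulated data (hypothetical variable names) shows the penalty pulling the coefficients toward zero:

```{r}
set.seed(1)
Xr <- scale(matrix(rnorm(40 * 3), 40, 3))   # standardized predictors
yr <- Xr %*% c(2, -1, 0.5) + rnorm(40)
yrc <- yr - mean(yr)                        # centering the response absorbs the intercept
ridge_beta <- function(lambda)
  solve(t(Xr) %*% Xr + lambda * diag(3), t(Xr) %*% yrc)
b_ols   <- ridge_beta(0)    # lambda = 0 reproduces OLS
b_ridge <- ridge_beta(10)
sum(b_ridge^2) < sum(b_ols^2)   # TRUE: larger lambda, smaller coefficients
```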
## Lasso
Ridge regression has at least one disadvantage: it includes all $p$ predictors in the final model. The penalty shrinks many coefficients close to zero, but never exactly to zero. This isn't generally a problem for prediction accuracy, but it can make the model harder to interpret. The lasso overcomes this disadvantage and is capable of forcing some of the coefficients exactly to zero.
The LASSO (*least absolute shrinkage and selection operator*) is a shrinkage method similar to ridge regression, with subtle but important differences. The lasso coefficients minimize a penalized residual sum of squares:
$$\hat{\beta}^{lasso} = argmin_{\beta}\{ \sum_{i=1}^N(y_i-\beta_0-\sum^p_{j=1}x_{ij}\beta_j)^2+\lambda\sum_{j=1}^p|\beta_j|\}$$
The $L_2$ ridge penalty $\sum_{j=1}^p\beta_j^2$ is replaced by the $L_1$ lasso penalty $\sum_{j=1}^p|\beta_j|$. The latter constraint makes the solutions nonlinear in the $y_i$, and there is no closed-form expression as in ridge regression.
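Although no closed form exists in general, for an orthonormal design the lasso reduces to soft-thresholding the OLS coefficients, which shows how coefficients get set exactly to zero (ridge, by contrast, only rescales them by $1/(1+\lambda)$). A minimal sketch:

```{r}
# soft-thresholding operator: the per-coordinate lasso solution when X is orthonormal
soft <- function(z, g) sign(z) * pmax(abs(z) - g, 0)
soft(c(2, 0.3, -1), 0.5)   # 1.5  0.0  -0.5 : the small coefficient is zeroed exactly
# ridge at a comparable penalty merely shrinks: c(2, 0.3, -1) / (1 + 0.5)
```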
## Elastic Net
The elastic net is another regularized regression that linearly combines the $L_1$ and $L_2$ penalties of the lasso and ridge methods.
$$\hat{\beta}^{enet} = argmin_{\beta}\{ \sum_{i=1}^N(y_i-\beta_0-\sum^p_{j=1}x_{ij}\beta_j)^2+\lambda(\alpha \sum_{j=1}^p|\beta_j|+(1-\alpha)\sum_{j=1}^p\beta_j^2)\}$$
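The penalty interpolates between the two: $\alpha = 1$ gives the lasso and $\alpha = 0$ gives ridge, the same $\alpha$ convention used by `glmnet` below (whose internal scaling of the $L_2$ term differs slightly). A quick sketch of the penalty itself, following the formula above:

```{r}
enet_penalty <- function(beta, lambda, alpha)
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) * sum(beta^2))
b <- c(1, -2, 0.5)
enet_penalty(b, 1, 1)     # 3.5  : pure lasso (L1) penalty
enet_penalty(b, 1, 0)     # 5.25 : pure ridge (L2) penalty
enet_penalty(b, 1, 0.5)   # 4.375: halfway mixture
```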
Generate simulated data with correlated predictors:
```{r}
set.seed(2017)
x1 <- runif(50, 0, 1)
x2 <- x1 + rnorm(50, 0, 0.25)
x3 <- (x1 + x2)/2 + runif(50, 0, 0.1)
x4 <- runif(50, 0, 1)
x5 <- (2*x4 + rnorm(50, 0, 0.25))/2 + runif(50, 0, 0.1)
x6 <- runif(50, 0, 1)
y <- (3 + x1 + x2 + 0.5*x3 + 0.75*x4 + 0.5*x5 + 0.5*x6 + rnorm(50, 0, 1))
x <- scale( cbind(x1,x2,x3,x4,x5,x6) )
trdata <- data.frame( cbind(x,y) )
names(trdata) <- c("sx1", "sx2", "sx3", "sx4", "sx5", "sx6", "y")
attach(trdata)
cor(trdata)
```
```{r}
# Use OLS to fit a linear model using all of predictors.
ols1 <- lm(y ~ sx1 + sx2 + sx3 + sx4 + sx5 + sx6)
summary(ols1)
# Use a backwards elimination strategy and fit a model with sx3 omitted.
ols2 <- lm(y ~ sx1 + sx2 + sx4 + sx5 + sx6)
summary(ols2)
# I'll now drop sx2.
ols3 <- lm(y ~ sx1 + sx4 + sx5 + sx6)
summary(ols3)
# Next, drop sx5.
ols4 <- lm(y ~ sx1 + sx4 + sx6)
summary(ols4)
# Now, drop sx4.
ols5 <- lm(y ~ sx1 + sx6)
summary(ols5)
```
Compare the model with stepwise selection on BIC
```{r}
null <- lm(y ~ 1)
n <- nrow(trdata)   # n = 50; with k = log(n), step() uses the BIC penalty
ols6 <- step(ols1,scope = list(lower = null, upper = ols1), direction="backward", k = log(n))
summary(ols6)
ols7 <- step(null,scope = list(lower = null, upper = ols1), direction="forward", k = log(n))
summary(ols7)
ols8 <- step(null,scope = list(lower = null, upper = ols1), direction="both", k = log(n))
summary(ols8)
```
Use Mallows' Cp
```{r}
library(leaps)
bestss <- regsubsets(y ~ ., data = trdata)
summary(bestss)
plot(summary(bestss)$cp, type="b")
```
Ridge regression with `lm.ridge`. Note that `lm.ridge` penalizes the sum of squared errors while `glmnet()` works with the mean squared error, so their $\lambda$ values are on different scales and the fitted results differ slightly.
```{r}
library(MASS)
ss <- seq(10, 20, by = 0.2)
gcv <- NULL
for (k in 1:length(ss)) {
  rgfit <- lm.ridge(y ~ x, data = trdata, lambda = ss[k])
  gcv <- c(gcv, rgfit$GCV)
}
plot(ss, gcv, type = 'b', xlab = "lambda", ylab = "GCV",
     main = "GCV Selection of the Ridge Penalty")
abline(v = ss[which.min(gcv)], col = 2, lty = 3, lwd = 1)
ss[which.min(gcv)]
ridge0 <- lm.ridge(y ~ x, lambda = ss[which.min(gcv)])  # refit at the GCV-optimal lambda
ridge0$coef
library(glmnet)
# ridge regression with glmnet
ridge=cv.glmnet(x,y, alpha=0)
plot(ridge)
ridge$lambda.min
coef(ridge, ridge$lambda.min)
```
Lasso
```{r}
# Lasso regression with glmnet
lasso=cv.glmnet(x,y, alpha=1)
plot(lasso)
lasso$lambda.min
coef(lasso, lasso$lambda.min)
ss = exp(seq(-10, 0, by=0.1))
lassofit = glmnet(x, y, lambda = ss)
par(mfrow=c(1,2))
plot(lassofit, xvar = "lambda", label = TRUE)
plot(lassofit, xvar = "norm", label = TRUE)
```
```{r}
# Compare in-sample (training) MSEs of the three fits
hat_ridge = predict(ridge, x, s ="lambda.min")
sum((y-hat_ridge)^2)/50
hat_lasso = predict(lasso, x, s ="lambda.min")
sum((y-hat_lasso)^2)/50
hat_ols6 = predict(ols6, newdata=as.data.frame(x))
sum((y-hat_ols6)^2)/50
```