What is Data Mining?

Along the way, you will learn about the related topics listed at the end of each subsection.


1 Toy Example 1: Regression

1.1 Data Generation

\[ y = e^{-(x-3)^2} + \varepsilon, \quad \varepsilon \sim N(0, 0.1^2), \quad x \geq 0 \]

ffun = function(x) exp(-(x-3)^2)    # true signal f(x)
x = seq(0.1, 4, by=0.1)             # design points on (0, 4]
set.seed(2017)                      # fix the seed for reproducibility
y = ffun(x) + 0.1*rnorm(length(x))  # observations: signal plus Gaussian noise
DataX = data.frame(x, y)
plot(x, y, pch=19, main="Observational Data (Simulated)")
lines(x, ffun(x), col=1)            # overlay the true signal

Related topics:

  • Data-Generating Mechanism (DGM): signal plus noise
  • Pseudo-Random Number Generation: set.seed(), rnorm(), … (see the sketch after this list)
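
A quick aside on pseudo-random number generation (an added sketch, not part of the original example; a and b are throwaway names): resetting the seed replays exactly the same draws, which is what makes a simulated study reproducible.

set.seed(2017)
a = rnorm(3)     # three pseudo-random draws
set.seed(2017)
b = rnorm(3)     # reseeding replays the same stream
identical(a, b)  # TRUE: the simulation is reproducible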

1.2 Data Exploration

summary(DataX)
##        x               y           
##  Min.   :0.100   Min.   :-0.19079  
##  1st Qu.:1.075   1st Qu.: 0.07128  
##  Median :2.050   Median : 0.35722  
##  Mean   :2.050   Mean   : 0.42017  
##  3rd Qu.:3.025   3rd Qu.: 0.82010  
##  Max.   :4.000   Max.   : 1.02446
cor(x,y)
## [1] 0.8041693
par(mfrow=c(1,2))
hist(y, main = "Histogram")
boxplot(y, main = "Boxplot")

pairs(DataX)

Related topics:

  • Exploratory Data Analysis (EDA): summary(), cor(), … (see the sketch after this list)
  • Basic Graphs: plot(), hist(), boxplot(), pairs(), …
  • → Lecture 3: Data Exploration
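
As an added sketch (not part of the original notes), summary() is a convenience wrapper over statistics that can also be computed directly; the last line uses the rank-based Spearman option of cor() as a complement to the default Pearson correlation.

mean(y); sd(y)                 # center and spread of the response
quantile(y, c(.25, .5, .75))   # the quartiles reported by summary()
cor(x, y, method="spearman")   # rank correlation, a monotone-association check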

1.3 Linear Regression

plot(x, y, pch=19)
fit1 = lm(y~x) # Linear model 
# abline(coef(fit1), col=2)
lines(x, fit1$fitted.values, col=2)
fit2 = lm(y ~ poly(x,2))  # Quadratic model
lines(x, fit2$fitted.values, col=3)
fit3 = lm(y ~ poly(x,3))  # Cubic model
lines(x, fit3$fitted.values, col=4)
fit4 = lm(y ~ poly(x,5))  # Degree 5
lines(x, fit4$fitted.values, col=5)
fit5 = lm(y ~ poly(x,10))  # Degree 10
lines(x, fit5$fitted.values, col=6)
fit6 = lm(y ~ poly(x,20))  # Degree 20
lines(x, fit6$fitted.values, col=7)
legend("topleft", c("Linear", "Quadratic","Cubic","Degree 5","Degree 10","Degree 20"), lty=1, col=c(2,3,4,5,6,7))
title(main="Polynomial Curve Fitting")

par(mfrow=c(1,2))
plot(x, y, pch=19, main="Linear Fitting")
matlines(x, cbind(ffun(x), fit1$fitted.values), col=c(1,2), lty=1)
plot(x, y, pch=19, main="Poly20 Fitting")
matlines(x, cbind(ffun(x), fit6$fitted.values), col=c(1,2), lty=1)

Related topics:

  • Linear Models
  • Polynomial curve fitting
  • Variable transformation (see the sketch after this list)
  • Underfitting, Overfitting
  • → Lecture 4: Regression
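
To make the variable-transformation point concrete (an added sketch; fit2raw is a new throwaway name): a polynomial fit is still a linear model, just in transformed inputs, so regressing on x and I(x^2) spans the same column space as poly(x, 2) and reproduces its fitted values.

fit2raw = lm(y ~ x + I(x^2))      # explicit transformed predictors
all.equal(unname(fitted(fit2raw)),
          unname(fitted(fit2)))   # TRUE: identical fitted values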

1.4 Mean Squared Error

\[ {\rm MSE} = \frac{1}{n}\sum_{i=1}^n\left[y_i - \hat{f}(x_i)\right]^2 \]

n = length(x)
MSE = numeric(6)   # training MSE for each fitted model
MSE[1] = sum((y-fit1$fitted.values)^2)/n
MSE[2] = sum((y-fit2$fitted.values)^2)/n
MSE[3] = sum((y-fit3$fitted.values)^2)/n
MSE[4] = sum((y-fit4$fitted.values)^2)/n
MSE[5] = sum((y-fit5$fitted.values)^2)/n
MSE[6] = sum((y-fit6$fitted.values)^2)/n
K = c(1,2,3,5,10,20)
plot(K, MSE, type = "b", col=2, 
     xlab="Model Complexity (Polynomial Order)")

xnew = seq(0.5,4.5,by=0.1)
set.seed(9999)
ynew = ffun(xnew) + 0.1*rnorm(length(xnew))
TestX = data.frame(x=xnew, y=ynew)
plot(x, y, pch=19, xlim=c(0,4.5),ylim=c(min(y,ynew),max(y,ynew)))
points(TestX$x, TestX$y, col=4)
legend("topleft", c("Training Data", "Testing Data"), pch=c(19,21), col=c(1,4))

TestX$pred1 = predict(fit1, data.frame(x=xnew))
TestX$pred2 = predict(fit2, data.frame(x=xnew))
TestX$pred3 = predict(fit3, data.frame(x=xnew))
TestX$pred4 = predict(fit4, data.frame(x=xnew))
TestX$pred5 = predict(fit5, data.frame(x=xnew))
nnew = nrow(TestX)
TestMSE = numeric(5)   # test MSE for polynomial orders 1, 2, 3, 5, 10
TestMSE[1] = sum((TestX$y-TestX$pred1)^2)/nnew
TestMSE[2] = sum((TestX$y-TestX$pred2)^2)/nnew
TestMSE[3] = sum((TestX$y-TestX$pred3)^2)/nnew
TestMSE[4] = sum((TestX$y-TestX$pred4)^2)/nnew
TestMSE[5] = sum((TestX$y-TestX$pred5)^2)/nnew
KK = c(1,2,3,5,10)
matplot(KK, cbind(MSE[1:5], TestMSE), type = "b", lty = 1, pch = 1,
        col=c(2,4), xlab="Model Complexity (Polynomial Order)", ylab="MSE")
legend("topleft", c("Training Error", "Test Error"), lty=1, pch=1, col=c(2,4))