1 Softmax (Multinomial) Regression

1.1 Introduction

Softmax (multinomial) regression is a model used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables. One fairly simple way to arrive at the multinomial logit model is to imagine, for \(K\) possible outcomes, running \(K-1\) independent binary logistic regression models, in which one outcome is chosen as a “pivot” and the other \(K-1\) outcomes are separately regressed against the pivot outcome. If outcome \(K\) (the last outcome) is chosen as the pivot, this proceeds as follows:

\[ \ln \frac{\Pr(Y_{i}=1)}{\Pr(Y_{i}=K)} = \boldsymbol{\beta}_{1}\cdot \mathbf{X}_{i} \\ \ln \frac{\Pr(Y_{i}=2)}{\Pr(Y_{i}=K)} = \boldsymbol{\beta}_{2}\cdot \mathbf{X}_{i} \\ \vdots \\ \ln \frac{\Pr(Y_{i}=K-1)}{\Pr(Y_{i}=K)} = \boldsymbol{\beta}_{K-1}\cdot \mathbf{X}_{i} \]

Let \(\mathbf{X}_{i}\) be the covariates of observation \(i\). Exponentiating these log-odds and using the fact that all \(K\) probabilities must sum to one, the softmax regression model requires that

\[ \Pr(Y_{i}=1) = \frac{e^{\boldsymbol{\beta}_{1}\cdot \mathbf{X}_{i}}}{1+\sum_{k=1}^{K-1} e^{\boldsymbol{\beta}_{k}\cdot \mathbf{X}_{i}}} \\ \Pr(Y_{i}=2) = \frac{e^{\boldsymbol{\beta}_{2}\cdot \mathbf{X}_{i}}}{1+\sum_{k=1}^{K-1} e^{\boldsymbol{\beta}_{k}\cdot \mathbf{X}_{i}}} \\ \vdots \\ \Pr(Y_{i}=K-1) = \frac{e^{\boldsymbol{\beta}_{K-1}\cdot \mathbf{X}_{i}}}{1+\sum_{k=1}^{K-1} e^{\boldsymbol{\beta}_{k}\cdot \mathbf{X}_{i}}} \\ \Pr(Y_{i}=K) = \frac{1}{1+\sum_{k=1}^{K-1} e^{\boldsymbol{\beta}_{k}\cdot \mathbf{X}_{i}}} \]

Important assumption (independence of irrelevant alternatives, IIA):

  • The odds of preferring one class over another do not depend on the presence or absence of other “irrelevant” alternatives.
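
Indeed, under the model above, for any two classes \(a\) and \(b\) (taking \(\boldsymbol{\beta}_{K}=\mathbf{0}\) for the pivot),

\[ \frac{\Pr(Y_{i}=a)}{\Pr(Y_{i}=b)} = e^{(\boldsymbol{\beta}_{a}-\boldsymbol{\beta}_{b})\cdot \mathbf{X}_{i}}, \]

which involves only the parameters of classes \(a\) and \(b\).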

1.2 Example: Softmax regression on Wine Quality

The proportional odds assumption is too restrictive in many cases. We try softmax regression to see whether the result can be improved.

1.2.1 The Wine Quality Dataset

The two datasets are related to the red and white variants of the Portuguese “Vinho Verde” wine, and the task is to predict the quality of wines on a score between 0 and 10. The input variables in the two datasets are as follows:

  • 1 - fixed acidity
  • 2 - volatile acidity
  • 3 - citric acid
  • 4 - residual sugar
  • 5 - chlorides
  • 6 - free sulfur dioxide
  • 7 - total sulfur dioxide
  • 8 - density
  • 9 - pH
  • 10 - sulphates
  • 11 - alcohol

Output variable (based on sensory data):

  • 12 - quality (score between 0 and 10)

red_wines <- read.csv("winequality-red.csv")
white_wines <- read.csv("winequality-white.csv")
# tag each record with its wine type before stacking the two datasets
red_wines$type <- "red"
white_wines$type <- "white"
data_wines <- rbind(red_wines,white_wines)
data_wines$type <- factor(data_wines$type)
colnames(data_wines)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "type"

Divide the data into a training set (80%) and a testing set (20%). Note that sample is used without setting a seed, so the split, and hence all results below, will vary between runs.

sample_index <- sample(1:nrow(data_wines),nrow(data_wines)*0.8)
wines_train <- data_wines[sample_index,]
wines_test <- data_wines[-sample_index,]

Wine quality class proportions in the training set

round(prop.table(table(factor(wines_train$quality))),3)
## 
##     3     4     5     6     7     8     9 
## 0.005 0.035 0.330 0.433 0.166 0.030 0.001

Summary of covariates

round(apply(wines_train[,-c(12,13)], 2,
            function(x) c(Mean = mean(x), Median = median(x), SD = sd(x))), 3)
##        fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## Mean           7.211            0.339       0.319          5.514     0.056
## Median         7.000            0.290       0.310          3.000     0.047
## SD             1.294            0.166       0.146          4.805     0.035
##        free.sulfur.dioxide total.sulfur.dioxide density    pH sulphates
## Mean                30.694              116.129   0.995 3.218      0.53
## Median              29.000              119.000   0.995 3.210      0.51
## SD                  17.839               56.781   0.003 0.161      0.15
##        alcohol
## Mean    10.485
## Median  10.300
## SD       1.195

1.2.2 Experiments

require(foreign)
require(nnet)
require(ggplot2)
require(reshape2)
require(dummies)
  1. Perform the multinomial regression analysis
wm <- multinom(quality ~ ., data = wines_train)
## # weights:  98 (78 variable)
## initial  value 10112.895045 
## iter  10 value 6854.704059
## iter  20 value 6509.877843
## iter  30 value 6275.305119
## iter  40 value 5753.599884
## iter  50 value 5603.000991
## iter  60 value 5566.665080
## iter  70 value 5559.623203
## iter  80 value 5557.114951
## iter  90 value 5554.452997
## iter 100 value 5552.538895
## final  value 5552.538895 
## stopped after 100 iterations
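
The optimizer stopped at nnet's default cap of 100 iterations before declaring convergence. If a tighter fit is desired, the cap can be raised through the maxit argument, which multinom passes on to nnet (not run here; all results below use the fit above):

wm <- multinom(quality ~ ., data = wines_train, maxit = 500)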
summary(wm, digits = 3)
## Call:
## multinom(formula = quality ~ ., data = wines_train)
## 
## Coefficients:
##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4        9.31        -0.787           -0.653        3.03       -0.05936
## 5        5.87        -1.003           -4.511        3.04       -0.04712
## 6      -14.12        -0.977           -8.460        2.35        0.00767
## 7       44.28        -0.788          -11.487        2.22        0.07427
## 8       17.32        -0.926           -9.915        3.07        0.10284
## 9      -26.29         0.403           -8.968        3.58        0.11188
##   chlorides free.sulfur.dioxide total.sulfur.dioxide density    pH
## 4     -16.4            -0.08757             0.005046   16.68 -3.82
## 5     -18.0            -0.05014             0.011228   32.34 -5.30
## 6     -19.8            -0.03782             0.004608   44.34 -5.05
## 7     -29.5            -0.03454             0.002805  -25.70 -4.07
## 8     -39.0            -0.01706             0.000688   -2.85 -4.16
## 9     -47.3            -0.00728            -0.012704  -28.34  6.33
##   sulphates alcohol typewhite
## 4      3.88  -0.425    -0.568
## 5      3.60  -0.534    -3.844
## 6      5.29   0.275    -3.848
## 7      6.82   0.846    -4.073
## 8      6.41   1.100    -3.772
## 9      4.36   1.920    10.567
## 
## Std. Errors:
##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4      0.0942         0.127          0.51149      0.5400         0.0525
## 5      0.4430         0.115          0.35689      0.2959         0.0489
## 6      0.3706         0.115          0.34433      0.2766         0.0488
## 7      0.4703         0.118          0.45351      0.3652         0.0496
## 8      0.0780         0.135          0.75161      0.6230         0.0519
## 9      0.0691         0.264          0.00987      0.0268         0.0986
##   chlorides free.sulfur.dioxide total.sulfur.dioxide density    pH
## 4   0.09051              0.0136              0.00586  0.0943 0.449
## 5   0.62120              0.0111              0.00546  0.4365 0.421
## 6   0.61723              0.0110              0.00545  0.3651 0.406
## 7   0.05736              0.0113              0.00556  0.4640 0.428
## 8   0.01454              0.0118              0.00616  0.0790 0.464
## 9   0.00581              0.0432              0.01989  0.0699 0.402
##   sulphates alcohol typewhite
## 4    0.5766   0.164    0.3535
## 5    0.2999   0.150    0.2443
## 6    0.2566   0.148    0.2304
## 7    0.3036   0.150    0.2560
## 8    0.5297   0.164    0.3773
## 9    0.0929   0.183    0.0691
## 
## Residual Deviance: 11105.08 
## AIC: 11261.08

Evaluate p-values (two-sided Wald tests)
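
For each coefficient, the statistic and p-value are

\[ z=\frac{\hat{\beta}}{\widehat{\mathrm{SE}}(\hat{\beta})}, \qquad p=2\left(1-\Phi(\vert z\vert)\right). \]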

z <- summary(wm)$coefficients/summary(wm)$standard.errors
round((1 - pnorm(abs(z), 0, 1)) * 2,3)
##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4           0         0.000            0.202           0          0.258
## 5           0         0.000            0.000           0          0.336
## 6           0         0.000            0.000           0          0.875
## 7           0         0.000            0.000           0          0.134
## 8           0         0.000            0.000           0          0.047
## 9           0         0.127            0.000           0          0.257
##   chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 4         0               0.000                0.390       0  0         0
## 5         0               0.000                0.040       0  0         0
## 6         0               0.001                0.398       0  0         0
## 7         0               0.002                0.614       0  0         0
## 8         0               0.148                0.911       0  0         0
## 9         0               0.866                0.523       0  0         0
##   alcohol typewhite
## 4   0.010     0.108
## 5   0.000     0.000
## 6   0.063     0.000
## 7   0.000     0.000
## 8   0.000     0.000
## 9   0.000     0.000
  2. Evaluate the predicted class probabilities on the testing set
predicted_prob <- predict(wm, newdata = wines_test, type = "probs")
head(round(predicted_prob,4))
##         3      4      5      6      7      8 9
## 8  0.0073 0.0542 0.6163 0.2992 0.0213 0.0017 0
## 11 0.0009 0.0281 0.7697 0.1949 0.0062 0.0003 0
## 12 0.0002 0.0102 0.3570 0.5134 0.1100 0.0093 0
## 14 0.0000 0.0430 0.4089 0.4880 0.0585 0.0016 0
## 15 0.0110 0.0074 0.7257 0.2489 0.0067 0.0002 0
## 23 0.0001 0.0164 0.4405 0.4910 0.0504 0.0015 0
  3. Predict class membership by the class with the largest predicted probability and check the predictive accuracy
predicted_quality <- apply(predicted_prob, 1,
    function(t) as.numeric(colnames(predicted_prob)[which.max(t)]))
length(which(predicted_quality == wines_test$quality))/length(predicted_quality)
## [1] 0.5523077

Relax the matching condition to count predictions within one point of the true quality

length(which(abs(predicted_quality - as.numeric(as.character(wines_test$quality))) <= 1))/length(predicted_quality)
## [1] 0.9515385
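
For a finer-grained view than the overall accuracy, the predictions can be cross-tabulated against the observed qualities (a minimal check; the table is not shown here since it depends on the random split):

table(predicted = predicted_quality, actual = wines_test$quality)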

2 Linear and Quadratic Discriminant Analysis

2.1 Introduction

Let \(Y\) be a categorical variable with \(K\) levels and \(X\) be the set of covariates.

Suppose that \(\pi_{k}=P(Y=k)\) is the prior and \(f_{k}(X|Y=k)\) is the conditional pdf of \(X\) given \(Y=k\).

The conditional probability of \(Y=k\) given \(X\) satisfies

\[\log P(Y=k|X)=\log\pi_{k}+\log f_{k}(X|Y=k)+\text{const},\quad k=1,2,\ldots,K,\]

where the constant does not depend on \(k\) (it is the log marginal density of \(X\)).

  • The linear discriminant analysis assumes that
  1. \(\Sigma_{k}=\Sigma\) for all \(k=1,2,\ldots,K\).

  2. \(f_{k}(X|Y=k) \varpropto\vert \Sigma\vert^{-\frac{1}{2}}\exp[-\frac{1}{2}(X-\mu_{k})^{T}\Sigma^{-1}(X-\mu_{k})].\)

  3. \(X\) is assigned a class \(u\) if \(u=\underset{k}{\arg\max}\log P(Y=k|X).\)

Note that, under this scheme, the decision boundaries are linear functions of \(X\).
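
To see this, expand the quadratic form in \(\log P(Y=k|X)\); the term \(X^{T}\Sigma^{-1}X\) is shared by all classes and cancels when two classes are compared, leaving the linear discriminant

\[ \delta_{k}(X)=X^{T}\Sigma^{-1}\mu_{k}-\frac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_{k}+\log\pi_{k}, \]

so each decision boundary \(\{X:\delta_{k}(X)=\delta_{l}(X)\}\) is a hyperplane.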

(Figure source: https://onlinecourses.science.psu.edu/stat857/node/74)

  • The quadratic discriminant analysis assumes that
  1. \(f_{k}(X|Y=k) \varpropto\vert \Sigma_{k}\vert^{-\frac{1}{2}}\exp[-\frac{1}{2}(X-\mu_{k})^{T}\Sigma_{k}^{-1}(X-\mu_{k})].\)

  2. \(X\) is assigned a class \(u\) if \(u=\underset{k}{\arg\max}\log P(Y=k|X).\)

Note that, under this scheme, the decision boundaries are quadratic functions of \(X\).

(Figure source: https://onlinecourses.science.psu.edu/stat857/node/80)

2.2 Example: Linear and quadratic discriminant analysis on the Wine dataset

2.2.1 Data Introduction

The dataset consists of 178 wines in 3 classes, each described by 13 attributes. Apart from the class variable, the attributes are as follows:

    1. Alcohol
    2. Malic acid
    3. Ash
    4. Alcalinity of ash
    5. Magnesium
    6. Total phenols
    7. Flavanoids
    8. Nonflavanoid phenols
    9. Proanthocyanins
    10. Color intensity
    11. Hue
    12. OD280/OD315 of diluted wines
    13. Proline
wines <- read.csv("wine.data.txt")  # add header = FALSE if the file has no header row
colnames(wines) <- c("class","Alcohol","Malic_acid","Ash","Alcalinity","Magnesium","Total_Phenols","Flavanoids","Nonflavanoid_phenols","Proanthocyanins","Color_intensity","Hue","OD","Proline")

Divide the data into a training set and a testing set

sample_index <- sample(1:nrow(wines),nrow(wines)*0.8)
wines_train <- wines[sample_index,]
wines_test <- wines[-sample_index,]

Calculate the class priors \(\pi_{k}\), \(k=1,2,3\)

prop_class <- prop.table(table(factor(wines_train$class)))
prop_class
## 
##         1         2         3 
## 0.3333333 0.4042553 0.2624113

2.2.2 Experiments

  1. Evaluate the mean of each class via the aggregate function
# transpose so that each column corresponds to one class; the first row holds
# the class label, hence class_mean[-1, k] below extracts the mean vector of class k
class_mean <- t(data.matrix(aggregate(wines_train[, 2:ncol(wines_train)], list(wines_train$class), mean)))
dimnames(class_mean) <- NULL
  2. Evaluate the sample covariance matrix of each class
wines_train1 <- wines_train[wines_train$class==1,]
wines_train2 <- wines_train[wines_train$class==2,]
wines_train3 <- wines_train[wines_train$class==3,]

wcov1 <- cov(wines_train1[,-1])
wcov2 <- cov(wines_train2[,-1])
wcov3 <- cov(wines_train3[,-1])
  3. Evaluate the pooled (common) covariance matrix
wcov <- ((nrow(wines_train1)-1)*wcov1 + (nrow(wines_train2)-1)*wcov2 + (nrow(wines_train3)-1)*wcov3)/(nrow(wines_train)-3)
  4. Evaluate the inverses of the four covariance matrices
inv_wcov1 <- solve(wcov1)
inv_wcov2 <- solve(wcov2)
inv_wcov3 <- solve(wcov3)
inv_wcov <- solve(wcov)
  5. A function to return the log posterior (up to additive constants) of a class given \(X\), \(\pi_k\), \(\Sigma_k^{-1}\) and \(\mu_k\)
log_class_prob <- function(x,p,iccov,cmean) {
    # log prior plus the Gaussian log density, dropping terms constant across
    # classes; see the note below on the omitted log-determinant term
    y <- log(p) - 0.5*t(x-cmean)%*%iccov%*%(x-cmean)
    return(y)
}
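
This implements the log posterior up to additive constants,

\[ \log P(Y=k|X=x)=\log\pi_{k}-\frac{1}{2}\log\vert\Sigma_{k}\vert-\frac{1}{2}(x-\mu_{k})^{T}\Sigma_{k}^{-1}(x-\mu_{k})+\text{const}, \]

except that the \(-\frac{1}{2}\log\vert\Sigma_{k}\vert\) term is dropped. Under LDA this term is the same for every class and cannot change the argmax; for QDA, strictly speaking, it should be included.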
  6. Prediction using linear discriminant analysis (Bayes classifier with normal densities and a common covariance matrix)
predicted_class <- rep(0,length=nrow(wines_test))
for (i in 1:nrow(wines_test)) {
    x <- t(wines_test[i,-1])
    p1 <- log_class_prob(x,prop_class[1],inv_wcov,class_mean[-1,1])
    p2 <- log_class_prob(x,prop_class[2],inv_wcov,class_mean[-1,2])
    p3 <- log_class_prob(x,prop_class[3],inv_wcov,class_mean[-1,3])
    predicted_class[i] <- which.max(c(p1,p2,p3))  # class labels 1,2,3 match the indices
}
length(which(wines_test$class==predicted_class))/nrow(wines_test)
## [1] 1
  7. Prediction using quadratic discriminant analysis (Bayes classifier with class-specific normal densities)
predicted_class <- rep(0,length=nrow(wines_test))
for (i in 1:nrow(wines_test)) {
    x <- t(wines_test[i,-1])
    p1 <- log_class_prob(x,prop_class[1],inv_wcov1,class_mean[-1,1])
    p2 <- log_class_prob(x,prop_class[2],inv_wcov2,class_mean[-1,2])
    p3 <- log_class_prob(x,prop_class[3],inv_wcov3,class_mean[-1,3])
    predicted_class[i] <- which.max(c(p1,p2,p3))  # class labels 1,2,3 match the indices
}
length(which(wines_test$class==predicted_class))/nrow(wines_test)
## [1] 0.9722222
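
As a sanity check, the same analyses can be reproduced with the lda and qda functions in the MASS package (a sketch on the same split, not run here; qda keeps the log-determinant term, so its accuracy may differ slightly from the manual version above):

library(MASS)
wlda <- lda(class ~ ., data = wines_train)   # pooled covariance matrix
wqda <- qda(class ~ ., data = wines_train)   # class-specific covariance matrices
mean(predict(wlda, newdata = wines_test)$class == wines_test$class)
mean(predict(wqda, newdata = wines_test)$class == wines_test$class)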

3 K Nearest Neighbors

Let \(\mathcal{N}_K(\mathbf{x})\) be the set of the \(K\) data points closest to the point \(\mathbf{x}\). KNN estimates the class probabilities of \(Y\) by: \[ \mbox{Pr}(Y=j|\mathbf{x}) = \frac{1}{K}\sum_{i\in \mathcal{N}_K(\mathbf{x})} I(y_i=j), \quad j=1,\ldots,J. \] Finally, KNN classifies \(\mathbf{x}\) to the class with the largest estimated probability.

(Figure source: Wu, J., Cui, Z., Sheng, V.S., Shi, Y. and Zhao, P., 2014. Mixed pattern matching-based traffic abnormal behavior recognition. The Scientific World Journal, 2014.)
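
For standard feature-based classification, this rule is available as, e.g., the knn function in the class package. Below is a minimal sketch on the wine data of Section 2 (not run here; k = 5 is an arbitrary illustrative choice, and the covariates are standardized with the training statistics since kNN is distance-based):

library(class)
X_train <- scale(wines_train[,-1])
X_test <- scale(wines_test[,-1],
                center = attr(X_train,"scaled:center"),
                scale = attr(X_train,"scaled:scale"))
knn_pred <- knn(train = X_train, test = X_test, cl = factor(wines_train$class), k = 5)
mean(knn_pred == wines_test$class)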

3.1 Data Introduction

We use the viewer and movie rating data provided by MovieLens. Since not all movies are rated by all users, our objective is to write a function that predicts the rating of a given movie for a specified viewer. Each record contains:

  • UserID
  • MovieID
  • Rating: 1-5
library(stringr)
library(dplyr)
setwd("E:\\HKU data\\Courses\\Stat_3612\\Tutorial2018\\T8")

ratings <- read.csv("ratings.csv")
head(round(ratings,3))
##   userId movieId rating  timestamp
## 1      1      31    2.5 1260759144
## 2      1    1029    3.0 1260759179
## 3      1    1061    3.0 1260759182
## 4      1    1129    2.0 1260759185
## 5      1    1172    4.0 1260759205
## 6      1    1263    2.0 1260759151

Zero-pad userId and movieId using str_pad, so that the IDs have a fixed width and can be used consistently as names and indices

ratings$userId <- str_pad(ratings$userId,4,side="left",pad="0")
ratings$movieId <- str_pad(ratings$movieId,6,side="left",pad="0")

3.2 Experiments On Movie Rating (User-based kNN)

Find distinct users and distinct movies

users <- unique(ratings$userId)
movies <- unique(ratings$movieId)
  1. Define the user-by-user similarity matrix (values in [0, 1])

Here 0 stands for no similarity and 1 stands for exactly the same ratings. For convenience, we set the diagonal to 0, as well as entries for pairs of users with no commonly rated movies.
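
Concretely, for users \(A\) and \(B\) whose commonly rated movie set is \(M_{AB}\), the similarity computed below is

\[ \mathrm{sim}(A,B)=\frac{1}{1+d_{AB}}, \qquad d_{AB}=\sqrt{\sum_{m\in M_{AB}}\left(r_{A,m}-r_{B,m}\right)^{2}}, \]

where \(r_{A,m}\) denotes user \(A\)'s rating of movie \(m\).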

users_distance <- function(a,b) {
    # collect each user's ratings, named by movieId
    ratingsA <- ratings[ratings$userId == a,"rating"]
    names(ratingsA) <- ratings[ratings$userId == a,"movieId"]
    ratingsB <- ratings[ratings$userId == b,"rating"]
    names(ratingsB) <- ratings[ratings$userId == b,"movieId"]
    # find their common set of rated movies
    common_movies <- intersect(names(ratingsA),names(ratingsB))
    if (length(common_movies) == 0) {
        return(0)  # no overlap: treat as no similarity
    }
    # Euclidean distance over the common movies, mapped into (0, 1]
    dAB <- sqrt(sum((ratingsA[common_movies] - ratingsB[common_movies])^2))
    return(1/(1+dAB))
}

users_dist <- matrix(0,nrow=length(users),ncol=length(users))
# compute only the upper triangle and mirror it, since the similarity is symmetric
for (i in 1:(ncol(users_dist)-1)) {
    for (j in (i+1):nrow(users_dist)) {
        users_dist[i,j] <- users_distance(users[i],users[j])
        users_dist[j,i] <- users_dist[i,j]
    }
}
rownames(users_dist) <- users
colnames(users_dist) <- users
saveRDS(users_dist, file = "users_dist.Rds")

Because computing the full matrix would take about 15 minutes, we load the precomputed user-user similarity matrix instead

users_dist <- readRDS("users_dist.Rds")
round(users_dist[1:6,1:6],3)
##       0001  0002  0003  0004  0005  0006
## 0001 0.000 0.000 0.000 0.182 0.333 0.000
## 0002 0.000 0.000 0.245 0.159 0.267 0.000
## 0003 0.000 0.245 0.000 0.208 0.203 0.258
## 0004 0.182 0.159 0.208 0.000 0.160 0.255
## 0005 0.333 0.267 0.203 0.160 0.000 0.536
## 0006 0.000 0.000 0.258 0.255 0.536 0.000
  2. Average the ratings of the k most similar users who have rated the given movie (user-based kNN)
suggested_rating <- function(u,m,k) {
    if (m %in% ratings[ratings$userId == u,"movieId"]) {
        # the user has already rated this movie; return the actual rating
        return(ratings[ratings$userId == u & ratings$movieId == m,"rating"])
    } else {
        # all users, ordered from most to least similar to u
        d <- names(sort(users_dist[u,],decreasing=TRUE))
        s <- 0
        count <- 0
        i <- 1
        # note: count <= k admits up to k+1 raters; use count < k for exactly k
        while (count <= k & i <= length(d)) {
            if (m %in% ratings[ratings$userId == d[i],"movieId"]) {
                count <- count+1
                s <- s + ratings[ratings$userId == d[i] & ratings$movieId == m,"rating"]
            }
            i <- i+1
        }
        # if no similar user rated the movie, count is 0 and this returns NaN
        return(s/count)
    }
}

Suggested rating examples

suggested_rating(users[450],movies[1000],10)
## [1] 3.727273
suggested_rating(users[350],movies[1000],15)
## [1] 3.6875