# 1 Softmax (Multinomial) Regression

## 1.1 Introduction

Softmax (multinomial logistic) regression is a model used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables. One fairly simple way to arrive at the multinomial logit model is to imagine, for $$K$$ possible outcomes, running $$K-1$$ independent binary logistic regression models, in which one outcome is chosen as a “pivot” and the other $$K-1$$ outcomes are separately regressed against it. If outcome $$K$$ (the last outcome) is chosen as the pivot, this proceeds as follows:

$\ln {\frac {\Pr(Y_{i}=1)}{\Pr(Y_{i}=K)}}={\boldsymbol {\beta }}_{1}\cdot \mathbf {X}_{i}\\ \ln {\frac {\Pr(Y_{i}=2)}{\Pr(Y_{i}=K)}}={\boldsymbol {\beta }}_{2}\cdot \mathbf {X}_{i}\\ \vdots \\ \ln {\frac {\Pr(Y_{i}=K-1)}{\Pr(Y_{i}=K)}}={\boldsymbol {\beta }}_{K-1}\cdot \mathbf {X}_{i}$

Let $$\mathbf{X}_{i}$$ denote the covariates. Exponentiating each equation and using the fact that the $$K$$ probabilities sum to one, softmax regression yields

$\Pr(Y_{i}=1)={\frac {e^{{\boldsymbol {\beta }}_{1}\cdot \mathbf {X}_{i}}}{1+\sum _{k=1}^{K-1}e^{{\boldsymbol {\beta }}_{k}\cdot \mathbf {X}_{i}}}}\\ \Pr(Y_{i}=2)={\frac {e^{{\boldsymbol {\beta }}_{2}\cdot \mathbf {X}_{i}}}{1+\sum _{k=1}^{K-1}e^{{\boldsymbol {\beta }}_{k}\cdot \mathbf {X}_{i}}}}\\ \vdots \\ \Pr(Y_{i}=K-1)={\frac {e^{{\boldsymbol {\beta }}_{K-1}\cdot \mathbf {X}_{i}}}{1+\sum _{k=1}^{K-1}e^{{\boldsymbol {\beta }}_{k}\cdot \mathbf {X}_{i}}}}\\ \Pr(Y_{i}=K)={\frac {1}{1+\sum _{k=1}^{K-1}e^{{\boldsymbol {\beta }}_{k}\cdot \mathbf {X}_{i}}}}$

Important assumption:

• Independence of irrelevant alternatives (IIA): the odds of preferring one class over another do not depend on the presence or absence of other “irrelevant” alternatives.
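As a quick numerical illustration (a minimal sketch with made-up scores, not fitted coefficients), the probabilities above can be computed by appending a score of 0 for the pivot class, exponentiating, and normalizing:

```r
# Softmax sketch: turn K-1 pivot-based linear scores into K class probabilities.
# The scores for classes 1..K-1 are beta_k . X_i; the pivot class K has score 0.
softmax_probs <- function(scores) {
  s <- c(scores, 0)        # append the pivot's score of 0
  e <- exp(s - max(s))     # subtract the max for numerical stability
  e / sum(e)               # normalize so the probabilities sum to 1
}

# Hypothetical linear scores beta_k . X_i for K = 4 classes (pivot = class 4)
p <- softmax_probs(c(1.2, 0.3, -0.5))
round(p, 3)
sum(p)  # always 1
```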

## 1.2 Example: Softmax regression on Wine Quality

The proportional odds assumption is too restrictive in many cases. We try softmax regression to see whether the result can be improved.

### 1.2.1 The Wine Quality Dataset

The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine, and the task is to predict the quality of wines using a score between 0 and 10. The input variables in the two datasets are the following:

• 1 - fixed acidity
• 2 - volatile acidity
• 3 - citric acid
• 4 - residual sugar
• 5 - chlorides
• 6 - free sulfur dioxide
• 7 - total sulfur dioxide
• 8 - density
• 9 - pH
• 10 - sulphates
• 11 - alcohol

Output variable (based on sensory data):

• 12 - quality (score between 0 and 10)

red_wines <- read.csv("winequality-red.csv")
white_wines <- read.csv("winequality-white.csv")
red_wines$type <- "red"
white_wines$type <- "white"
data_wines <- rbind(red_wines,white_wines)
data_wines$type <- factor(data_wines$type)
colnames(data_wines)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide"
##  [7] "total.sulfur.dioxide" "density"              "pH"
## [10] "sulphates"            "alcohol"              "quality"
## [13] "type"

Divide the data set into training set and testing set

sample_index <- sample(1:nrow(data_wines),nrow(data_wines)*0.8)
wines_train <- data_wines[sample_index,]
wines_test <- data_wines[-sample_index,]
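Note that sample() is random, so the split (and all downstream numbers) will differ between runs. For a reproducible split one would typically fix the seed first (a sketch; the outputs shown in this section were generated without a recorded seed, and n stands in for nrow(data_wines)):

```r
# Fixing the RNG seed makes the 80/20 split reproducible across runs.
set.seed(2018)                       # arbitrary fixed seed
n <- 100                             # stands in for nrow(data_wines)
sample_index <- sample(1:n, n * 0.8)
length(sample_index)                 # 80% of the rows
```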

Wine quality variable proportions

round(prop.table(table(factor(wines_train$quality))),3)
##
##     3     4     5     6     7     8     9
## 0.005 0.035 0.330 0.433 0.166 0.030 0.001

Summary of covariates

round(apply(wines_train[,-c(12,13)],2,function(x) return(c(Mean=mean(x),Median=median(x),SD=sd(x)))),3)
##        fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## Mean           7.211            0.339       0.319          5.514     0.056
## Median         7.000            0.290       0.310          3.000     0.047
## SD             1.294            0.166       0.146          4.805     0.035
##        free.sulfur.dioxide total.sulfur.dioxide density    pH sulphates
## Mean                30.694              116.129   0.995 3.218      0.53
## Median              29.000              119.000   0.995 3.210      0.51
## SD                  17.839               56.781   0.003 0.161      0.15
##        alcohol
## Mean    10.485
## Median  10.300
## SD       1.195

### 1.2.2 Experiments

require(foreign)
require(nnet)
require(ggplot2)
require(reshape2)
require(dummies)

1. Perform the multinomial regression analysis

wm <- multinom(quality ~ ., data = wines_train)
## # weights:  98 (78 variable)
## initial  value 10112.895045
## iter  10 value 6854.704059
## iter  20 value 6509.877843
## iter  30 value 6275.305119
## iter  40 value 5753.599884
## iter  50 value 5603.000991
## iter  60 value 5566.665080
## iter  70 value 5559.623203
## iter  80 value 5557.114951
## iter  90 value 5554.452997
## iter 100 value 5552.538895
## final  value 5552.538895
## stopped after 100 iterations

summary(wm,digits = max(3))
## Call:
## multinom(formula = quality ~ ., data = wines_train)
##
## Coefficients:
##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4        9.31        -0.787           -0.653        3.03       -0.05936
## 5        5.87        -1.003           -4.511        3.04       -0.04712
## 6      -14.12        -0.977           -8.460        2.35        0.00767
## 7       44.28        -0.788          -11.487        2.22        0.07427
## 8       17.32        -0.926           -9.915        3.07        0.10284
## 9      -26.29         0.403           -8.968        3.58        0.11188
##   chlorides free.sulfur.dioxide total.sulfur.dioxide density    pH
## 4     -16.4            -0.08757             0.005046   16.68 -3.82
## 5     -18.0            -0.05014             0.011228   32.34 -5.30
## 6     -19.8            -0.03782             0.004608   44.34 -5.05
## 7     -29.5            -0.03454             0.002805  -25.70 -4.07
## 8     -39.0            -0.01706             0.000688   -2.85 -4.16
## 9     -47.3            -0.00728            -0.012704  -28.34  6.33
##   sulphates alcohol typewhite
## 4      3.88  -0.425    -0.568
## 5      3.60  -0.534    -3.844
## 6      5.29   0.275    -3.848
## 7      6.82   0.846    -4.073
## 8      6.41   1.100    -3.772
## 9      4.36   1.920    10.567
##
## Std. Errors:
##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4      0.0942         0.127          0.51149      0.5400         0.0525
## 5      0.4430         0.115          0.35689      0.2959         0.0489
## 6      0.3706         0.115          0.34433      0.2766         0.0488
## 7      0.4703         0.118          0.45351      0.3652         0.0496
## 8      0.0780         0.135          0.75161      0.6230         0.0519
## 9      0.0691         0.264          0.00987      0.0268         0.0986
##   chlorides free.sulfur.dioxide total.sulfur.dioxide density    pH
## 4   0.09051              0.0136              0.00586  0.0943 0.449
## 5   0.62120              0.0111              0.00546  0.4365 0.421
## 6   0.61723              0.0110              0.00545  0.3651 0.406
## 7   0.05736              0.0113              0.00556  0.4640 0.428
## 8   0.01454              0.0118              0.00616  0.0790 0.464
## 9   0.00581              0.0432              0.01989  0.0699 0.402
##   sulphates alcohol typewhite
## 4    0.5766   0.164    0.3535
## 5    0.2999   0.150    0.2443
## 6    0.2566   0.148    0.2304
## 7    0.3036   0.150    0.2560
## 8    0.5297   0.164    0.3773
## 9    0.0929   0.183    0.0691
##
## Residual Deviance: 11105.08
## AIC: 11261.08

Evaluate p-values

z <- summary(wm)$coefficients/summary(wm)$standard.errors
round((1 - pnorm(abs(z), 0, 1)) * 2,3)
##   (Intercept) fixed.acidity volatile.acidity citric.acid residual.sugar
## 4           0         0.000            0.202           0          0.258
## 5           0         0.000            0.000           0          0.336
## 6           0         0.000            0.000           0          0.875
## 7           0         0.000            0.000           0          0.134
## 8           0         0.000            0.000           0          0.047
## 9           0         0.127            0.000           0          0.257
##   chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 4         0               0.000                0.390       0  0         0
## 5         0               0.000                0.040       0  0         0
## 6         0               0.001                0.398       0  0         0
## 7         0               0.002                0.614       0  0         0
## 8         0               0.148                0.911       0  0         0
## 9         0               0.866                0.523       0  0         0
##   alcohol typewhite
## 4   0.010     0.108
## 5   0.000     0.000
## 6   0.063     0.000
## 7   0.000     0.000
## 8   0.000     0.000
## 9   0.000     0.000

1. Evaluate the predicted class probability on the testing set

predicted_prob <- predict(wm, newdata = wines_test, type = "probs", se = TRUE)
head(round(predicted_prob,4))
##         3      4      5      6      7      8 9
## 8  0.0073 0.0542 0.6163 0.2992 0.0213 0.0017 0
## 11 0.0009 0.0281 0.7697 0.1949 0.0062 0.0003 0
## 12 0.0002 0.0102 0.3570 0.5134 0.1100 0.0093 0
## 14 0.0000 0.0430 0.4089 0.4880 0.0585 0.0016 0
## 15 0.0110 0.0074 0.7257 0.2489 0.0067 0.0002 0
## 23 0.0001 0.0164 0.4405 0.4910 0.0504 0.0015 0

1. Use the maximum predicted probability to assign class membership and check the predictive accuracy

predicted_quality <- apply(predicted_prob, 1, function(t) as.numeric(colnames(predicted_prob)[which.max(t)]))
length(which(predicted_quality == wines_test$quality))/length(predicted_quality)
## [1] 0.5523077

Relax the matching condition: count a prediction as correct if it is within one point of the true quality

length(which(abs(predicted_quality - as.numeric(as.character(wines_test$quality))) <= 1))/length(predicted_quality)
## [1] 0.9515385

# 2 Linear and Quadratic Discriminant Analysis

## 2.1 Introduction

Let $$Y$$ be a categorical variable with $$K$$ levels and $$X$$ be the set of covariates. Suppose that $$\pi_{k}=P(Y=k)$$ is the prior and $$f_{k}(X|Y=k)$$ is the conditional pdf of $$X$$ given $$Y=k$$. The conditional probability of $$Y=k$$ given $$X$$ satisfies, up to an additive constant that does not depend on $$k$$,

$\log P(Y=k|X)=\log\pi_{k}+\log f_{k}(X|Y=k), \quad k=1,2,\ldots,K.$

• The linear discriminant analysis assumes that

1. $$\Sigma_{k}=\Sigma$$ for all $$k=1,2,\ldots,K$$.
2. $$f_{k}(X|Y=k) \varpropto\vert \Sigma\vert^{-\frac{1}{2}}\exp[-\frac{1}{2}(X-\mu_{k})^{T}\Sigma^{-1}(X-\mu_{k})].$$
3. $$X$$ is assigned to class $$u$$ if $$u=\underset{k}{\arg\max}\log P(Y=k|X).$$

Note that, under this scheme, the boundaries are linear functions of $$X$$. (Figure source: https://onlinecourses.science.psu.edu/stat857/node/74)

• The quadratic discriminant analysis assumes that

1. $$f_{k}(X|Y=k) \varpropto\vert \Sigma_{k}\vert^{-\frac{1}{2}}\exp[-\frac{1}{2}(X-\mu_{k})^{T}\Sigma_{k}^{-1}(X-\mu_{k})].$$
2. $$X$$ is assigned to class $$u$$ if $$u=\underset{k}{\arg\max}\log P(Y=k|X).$$

Note that, under this scheme, the boundaries are quadratic functions of $$X$$. (Figure source: https://onlinecourses.science.psu.edu/stat857/node/80)

## 2.2 Example: linear discriminant analysis on Wine Quality

### 2.2.1 Data Introduction

The dataset consists of 13 attributes of 178 wines in 3 classes. Apart from the class variable, the attributes are the following:

1. Alcohol
1. Malic acid
1. Ash
1. Alcalinity of ash
1. Magnesium
1. Total phenols
1. Flavanoids
1. Nonflavanoid phenols
1. Proanthocyanins
1. Color intensity
1. Hue
1. OD280/OD315 of diluted wines
1. Proline

wines <- read.csv("wine.data.txt")
colnames(wines) <- c("class","Alcohol","Malic_acid","Ash","Alcalinity","Magesium","Total_Phenols","Flavanoids","Nonflavanoid_phenols","Proanthocyanins","Color_intensity","Hue","OD","Proline")

Dividing the data into training set and testing set

sample_index <- sample(1:nrow(wines),nrow(wines)*0.8)
wines_train <- wines[sample_index,]
wines_test <- wines[-sample_index,]

Calculate $$\pi_{k}$$, $$k=1,2,3$$

prop_class <- prop.table(table(factor(wines_train$class)))
prop_class
##
##         1         2         3
## 0.3333333 0.4042553 0.2624113

### 2.2.2 Experiments

1. Evaluate the mean for each class via the aggregate function

class_mean <- t(data.matrix(aggregate(wines_train[, 2:ncol(wines_train)], list(wines_train$class), mean)))
dimnames(class_mean) <- NULL

1. Evaluate sample covariances for classes

wines_train1 <- wines_train[wines_train$class==1,]
wines_train2 <- wines_train[wines_train$class==2,]
wines_train3 <- wines_train[wines_train$class==3,]

wcov1 <- cov(wines_train1[,-1])
wcov2 <- cov(wines_train2[,-1])
wcov3 <- cov(wines_train3[,-1])
1. Evaluate the common (pooled) covariance
wcov <- ((nrow(wines_train1)-1)*wcov1 + (nrow(wines_train2)-1)*wcov2 + (nrow(wines_train3)-1)*wcov3)/(nrow(wines_train)-3)
1. Evaluate the inverses of the four covariance matrices
inv_wcov1 <- solve(wcov1)
inv_wcov2 <- solve(wcov2)
inv_wcov3 <- solve(wcov3)
inv_wcov <- solve(wcov)
1. A function to return the log class probability (up to an additive constant) given $$X$$, $$\pi_k$$, $$\Sigma^{-1}$$ and $$\mu_k$$
log_class_prob <- function(x,p,iccov,cmean) {
  y <- log(p) - 0.5*t(x-cmean)%*%iccov%*%(x-cmean)
  return(y)
}
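The quantity returned by log_class_prob is the log-posterior up to an additive constant. Expanding it shows why the LDA boundaries are linear: under a common $$\Sigma$$,

$\log P(Y=k|X)=\log\pi_{k}-\tfrac{1}{2}(X-\mu_{k})^{T}\Sigma^{-1}(X-\mu_{k})+\text{const}=\underbrace{\log\pi_{k}+X^{T}\Sigma^{-1}\mu_{k}-\tfrac{1}{2}\mu_{k}^{T}\Sigma^{-1}\mu_{k}}_{\delta_{k}(X),\ \text{linear in }X}-\tfrac{1}{2}X^{T}\Sigma^{-1}X+\text{const},$

and the term $$-\frac{1}{2}X^{T}\Sigma^{-1}X$$ is the same for every class, so comparing classes reduces to comparing the linear discriminants $$\delta_{k}(X)$$. Under QDA the covariances $$\Sigma_{k}$$ differ, the quadratic term no longer cancels, and the boundaries become quadratic.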
1. Prediction using linear discriminant analysis (Bayes classifier with normal densities and a common covariance)
predicted_class <- rep(0,length=nrow(wines_test))
for (i in 1:nrow(wines_test)) {
  x <- t(wines_test[i,-1])
  p1 <- log_class_prob(x,prop_class[1],inv_wcov,class_mean[-1,1])
  p2 <- log_class_prob(x,prop_class[2],inv_wcov,class_mean[-1,2])
  p3 <- log_class_prob(x,prop_class[3],inv_wcov,class_mean[-1,3])
  y <- max(p1,p2,p3)
  predicted_class[i] <- ifelse(p1==y,1,ifelse(p2==y,2,3))
}
length(which(wines_test$class==predicted_class))/nrow(wines_test)
## [1] 1

1. Prediction using quadratic discriminant analysis (Bayes classifier with normal densities and class-specific covariances)

Note that log_class_prob omits the $$-\frac{1}{2}\log\vert\Sigma_{k}\vert$$ term, which is constant across classes under LDA but, strictly speaking, should be included here for QDA.

predicted_class <- rep(0,length=nrow(wines_test))
for (i in 1:nrow(wines_test)) {
  x <- t(wines_test[i,-1])
  p1 <- log_class_prob(x,prop_class[1],inv_wcov1,class_mean[-1,1])
  p2 <- log_class_prob(x,prop_class[2],inv_wcov2,class_mean[-1,2])
  p3 <- log_class_prob(x,prop_class[3],inv_wcov3,class_mean[-1,3])
  y <- max(p1,p2,p3)
  predicted_class[i] <- ifelse(p1==y,1,ifelse(p2==y,2,3))
}
length(which(wines_test$class==predicted_class))/nrow(wines_test)
## [1] 0.9722222
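The same classifiers are also available as ready-made functions. Below is a self-contained sketch using MASS::lda and MASS::qda on the built-in iris data (the wine data would work identically, with class ~ . as the formula); the reported numbers are training-set accuracies:

```r
# lda() and qda() estimate the class means and the pooled / per-class
# covariances internally, including the log-determinant term for QDA.
library(MASS)

lda_fit <- lda(Species ~ ., data = iris)
lda_acc <- mean(predict(lda_fit)$class == iris$Species)

qda_fit <- qda(Species ~ ., data = iris)
qda_acc <- mean(predict(qda_fit)$class == iris$Species)

round(c(LDA = lda_acc, QDA = qda_acc), 3)
```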

# 3 K Nearest Neighbors

Let $$\mathcal{N}_K(\mathbf{x})$$ be the set of $$K$$ data points closest to the data point $$\mathbf{x}$$. KNN estimates the class probabilities of $$Y$$ by: $\mbox{Pr}(Y=j|\mathbf{x}) = \frac{1}{K}\sum_{i\in \mathcal{N}_K(\mathbf{x})} I(y_i=j), \ \ j=1,\ldots,J.$ Finally, KNN classifies $$\mathbf{x}$$ to the class with the largest estimated probability.
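A minimal base-R sketch of this rule on the built-in iris data (knn_one is a hypothetical helper written for illustration, not part of any package):

```r
# Classify one query point x by majority vote among its K nearest
# training points under Euclidean distance.
knn_one <- function(train_x, train_y, x, K = 5) {
  d <- sqrt(rowSums(sweep(train_x, 2, x)^2))   # distances from x to all rows
  nbrs <- train_y[order(d)[1:K]]               # labels of the K nearest points
  names(which.max(table(nbrs)))                # majority class
}

train_x <- as.matrix(iris[, 1:4])
knn_one(train_x, iris$Species, c(5.0, 3.5, 1.5, 0.2), K = 5)
## [1] "setosa"
```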

(Figure source: Wu, J., Cui, Z., Sheng, V.S., Shi, Y. and Zhao, P., 2014. Mixed pattern matching-based traffic abnormal behavior recognition. The Scientific World Journal, 2014.)

## 3.1 Data Introduction

This example uses the viewer and movie rating data provided by MovieLens. Since not all movies are rated by all users, our objective is to write a function that predicts the rating a specified viewer would give to a given movie. Each record contains:

• UserID
• MovieID
• Rating: 1-5
library(stringr)
library(dplyr)
setwd("E:\\HKU data\\Courses\\Stat_3612\\Tutorial2018\\T8")
ratings <- read.csv("ratings.csv")  # MovieLens ratings file

head(round(ratings,3))
##   userId movieId rating  timestamp
## 1      1      31    2.5 1260759144
## 2      1    1029    3.0 1260759179
## 3      1    1061    3.0 1260759182
## 4      1    1129    2.0 1260759185
## 5      1    1172    4.0 1260759205
## 6      1    1263    2.0 1260759151

Zero-pad userId and movieId to a fixed width using str_pad

ratings$userId <- str_pad(ratings$userId,4,side="left",pad="0")
ratings$movieId <- str_pad(ratings$movieId,6,side="left",pad="0")
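What str_pad does here: IDs are left-padded with zeros to a fixed width, so that, for example, the string ordering of the IDs matches their numeric ordering.

```r
library(stringr)

# Left-pad with "0" to width 6, as done for movieId above
str_pad(c("1", "31", "1029"), 6, side = "left", pad = "0")
## [1] "000001" "000031" "001029"
```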

## 3.2 Experiments On Movie Rating (User-based kNN)

Find distinct users and distinct movies

users <- unique(ratings$userId)
movies <- unique(ratings$movieId)
1. Define the user-by-user similarity matrix (values in [0, 1])

Here 0 stands for no similarity and 1 for identical ratings. For convenience, we set the diagonal to 0, and also assign 0 to pairs of users with no rated movies in common.

users_distance <- function(a,b) {
  # find their common set of rated movies
  ratingsA <- ratings[ratings$userId == a,"rating"]
  moviesA <- ratings[ratings$userId == a,"movieId"]
  names(ratingsA) <- moviesA
  ratingsB <- ratings[ratings$userId == b,"rating"]
  moviesB <- ratings[ratings$userId == b,"movieId"]
  names(ratingsB) <- moviesB
  common_movies <- intersect(moviesA,moviesB)
  if (length(common_movies)==0) {
    return(0)
  } else {
    dAB <- sqrt(sum((ratingsA[common_movies] - ratingsB[common_movies])**2))
    return(1/(1+dAB))
  }
}
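A self-contained toy check of the similarity formula (ratings_toy and users_distance_toy are hypothetical stand-ins for the real table and the function above, with the ratings table passed in as an argument):

```r
# Same logic as users_distance: 1/(1 + Euclidean distance on common movies),
# and 0 when two users share no rated movies.
users_distance_toy <- function(a, b, ratings) {
  ra <- ratings[ratings$userId == a, ]
  rb <- ratings[ratings$userId == b, ]
  common <- intersect(ra$movieId, rb$movieId)
  if (length(common) == 0) return(0)          # no overlap -> similarity 0
  dAB <- sqrt(sum((ra$rating[match(common, ra$movieId)] -
                   rb$rating[match(common, rb$movieId)])^2))
  1 / (1 + dAB)                               # identical ratings -> 1
}

ratings_toy <- data.frame(
  userId  = c("A", "A", "B", "B", "C"),
  movieId = c("m1", "m2", "m1", "m2", "m3"),
  rating  = c(5, 3, 4, 3, 2)
)
users_distance_toy("A", "B", ratings_toy)  # sqrt((5-4)^2 + (3-3)^2) = 1 -> 0.5
users_distance_toy("A", "C", ratings_toy)  # no common movies -> 0
```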

users_dist <- matrix(0,nrow=length(users),ncol=length(users))
for (i in 1:(ncol(users_dist)-1)) {
  for (j in (i+1):nrow(users_dist)) {
    users_dist[i,j] <- users_distance(users[i],users[j])
    users_dist[j,i] <- users_dist[i,j]
  }
}
rownames(users_dist) <- users
colnames(users_dist) <- users
saveRDS(users_dist, file = "users_dist.Rds")

Because the loop above takes about 15 minutes to run, we load a precomputed user-user similarity matrix instead:

users_dist <- readRDS("users_dist.Rds")
round(users_dist[1:6,1:6],3)
##       0001  0002  0003  0004  0005  0006
## 0001 0.000 0.000 0.000 0.182 0.333 0.000
## 0002 0.000 0.000 0.245 0.159 0.267 0.000
## 0003 0.000 0.245 0.000 0.208 0.203 0.258
## 0004 0.182 0.159 0.208 0.000 0.160 0.255
## 0005 0.333 0.267 0.203 0.160 0.000 0.536
## 0006 0.000 0.000 0.258 0.255 0.536 0.000
1. Average the ratings of the k most similar users who have rated the given movie (user-based KNN)
suggested_rating <- function(u,m,k) {
  if (m %in% ratings[ratings$userId == u,"movieId"]) {
    # the user has already rated this movie: return the actual rating
    return(ratings[ratings$userId == u & ratings$movieId == m,"rating"])
  } else {
    d <- names(sort(users_dist[u,],decreasing=TRUE))
    s <- 0
    count <- 0
    i <- 1
    while (count <= k & i <= length(d)) {
      if (m %in% ratings[ratings$userId == d[i],"movieId"]) {
        count <- count+1
        ss <- ratings[ratings$userId == d[i] & ratings$movieId == m,"rating"]
        s <- s + ss
      }
      i <- i+1
    }
    s <- s/count
    return(s)
  }
}

Suggested rating examples

suggested_rating(users[450],movies[1000],10)
## [1] 3.727273
suggested_rating(users[350],movies[1000],15)
## [1] 3.6875