---
title: "STAT3612_2312 Data Mining 2nd Tutorial"
author: "Jason J. You"
date: "02/07/2017"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Exploratory Data Analysis
```{r}
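library(MASS)   # provides the cats data set (Sex, Bwt, Hwt)
attach(cats)    # puts the columns on the search path so Hwt, Bwt, Sex can be used directly
head(cats)      # first six rows
dim(cats)       # number of rows and columns
summary(cats)   # per-column summary statistics
str(cats)       # column types and structure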
```
## Data Visualization
#### Histogram
##### A histogram displays the distribution of a numeric variable using bars whose heights show how many observations fall in each interval (bin).
```{r}
hist(Hwt, breaks = 20, freq = FALSE,   # freq = FALSE plots densities, not counts
     main = 'Histogram of Heart Weight',
     xlab = 'Cats Heart Weight', ylab = 'Density')
lines(density(Hwt), col = 4, lty = 1, lwd = 1)   # overlay a kernel density estimate
```
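##### With `freq = FALSE` the bar heights are densities, so the bar areas add up to 1, which is why the kernel density curve can share the same axis. The chunk below is a small check of that, not part of the original tutorial; it uses `cats$Hwt` explicitly, and the names `h` and `bar_area` are only illustrative.

```{r}
library(MASS)   # provides the cats data set

# Compute the histogram without drawing it
h <- hist(cats$Hwt, breaks = 20, plot = FALSE)

# Area of each bar = density * bin width; the areas should sum to 1
bar_area <- sum(h$density * diff(h$breaks))
bar_area
```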
#### Boxplot
##### A boxplot is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum.
##### Boxplot of Heart Weights and Body Weights
```{r}
par(mfrow=c(1,2))
boxplot(Hwt, main='Boxplot of Heart Weight (g)', col=2, cex.main=.8)
boxplot(Bwt, main='Boxplot of Body Weight (kg)', col=3,cex.main=.8)
```
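##### The five-number summary behind a boxplot can also be computed directly. The chunk below is a minimal sketch using base R's `fivenum()` and `boxplot.stats()`; the labels given to the result and the names `five_hwt` and `bp` are assumptions made for illustration.

```{r}
library(MASS)   # provides the cats data set

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five_hwt <- fivenum(cats$Hwt)
names(five_hwt) <- c("min", "lower hinge", "median", "upper hinge", "max")
five_hwt

# boxplot.stats() returns the values the box and whiskers are drawn from;
# observations beyond the whiskers are listed in $out
bp <- boxplot.stats(cats$Hwt)
bp$stats
bp$out
```

##### The hinges returned by `fivenum()` can differ slightly from the quartiles given by `quantile()`, because the two functions use different interpolation rules.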
##### Boxplot of Heart Weights according to Sex
```{r}
par(mfrow=c(1,1))
boxplot(Hwt~Sex, col=c(3,4), main="Boxplot of Heart Weight by Sex")
```
#### Scatterplot
##### A scatterplot is a useful summary of a set of bivariate data (two variables), usually drawn before working out a linear correlation coefficient or fitting a regression line.
##### Scatterplot of Heart Weights VS Body Weights
```{r}
plot(Hwt~Bwt, main = "Heart Weights VS Body Weights")
abline(lm(Hwt~Bwt))   # add the least-squares regression line
```
##### The two variables appear to be positively correlated; we can check this by computing the correlation coefficient
```{r}
cor(Hwt,Bwt)
```
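##### To see what `cor()` computes by default, the Pearson coefficient can be rebuilt from its definition. This is a sketch for illustration only; the names `x`, `y` and `r_manual` are not part of the original tutorial.

```{r}
library(MASS)   # provides the cats data set

x <- cats$Bwt
y <- cats$Hwt

# Pearson correlation: sum of cross-deviations divided by the square root
# of the product of the sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
r_manual

# Should agree with the built-in function
all.equal(r_manual, cor(x, y))

# cor.test() additionally reports a confidence interval for the correlation
cor.test(x, y)$conf.int
```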
##### Scatterplot regarding Sex
```{r}
out.F = lm(Hwt~Bwt, data = cats, subset=(Sex=="F"))
out.M = lm(Hwt~Bwt, data = cats, subset=(Sex=="M"))
plot(Bwt, Hwt, type="n", xlab="Body Weight (kg)",ylab="Heart Weight (g)",
main="Scatterplot Bwt vs Hwt")
points(Bwt[Sex=='F'], Hwt[Sex=="F"], col="red", pch=1)
points(Bwt[Sex=='M'], Hwt[Sex=="M"], col="blue", pch=2)
abline(out.M, col="blue", lty=2, lwd=2)
abline(out.F, col="red", lty=1, lwd=2)
legend("topleft", c("M","F"), col=c("blue","red"), lty=c(2,1), pch=c(2,1))
```
#### `ggplot2`
##### Use `ggplot()` to draw the same plot
```{r}
library(ggplot2)
p0 = ggplot(cats, aes(x=Bwt, y=Hwt)) +
geom_point(aes(color=factor(Sex))) +
geom_smooth(aes(color=factor(Sex)), method="lm") +
ggtitle("Bwt VS Hwt")
p0
```
##### Use `qplot()` to draw the plot again and make it interactive with `ggplotly()`. Can you spot the difference from the previous plot?
```{r}
library(plotly)
p1 = qplot(Bwt, Hwt, data=cats,geom=c("point","smooth"), colour=Sex, main = "Hwt vs Bwt")
ggplotly(p1)
p2 = qplot(Hwt, data=cats, geom="density", colour=Sex, main = "Density Plot of Heart Weights")
ggplotly(p2)
p3 = qplot(Sex, Hwt, data = cats, geom="boxplot",colour=Sex, main="Boxplot of Heart Weights")
ggplotly(p3)
```
## Data Manipulation
##### `apply(X, MARGIN, FUN, ...)`
##### Returns a vector, array, or list of values obtained by applying a function to the margins of an array or matrix (`MARGIN = 1` for rows, `MARGIN = 2` for columns).
```{r}
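tmp = cats[,-1]        # drop the first column (Sex), keeping the numeric columns
head(tmp)
apply(tmp, 2, mean)    # MARGIN = 2: column means
apply(tmp, 2, sd)      # MARGIN = 2: column standard deviations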
```
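##### The effect of `MARGIN` is easiest to see on a tiny matrix. The chunk below is an illustrative sketch; the matrix `m` and the result names are made up for the example.

```{r}
library(MASS)   # provides the cats data set

m <- matrix(1:6, nrow = 2)      # 2 rows, 3 columns, filled column by column
m

row_sums <- apply(m, 1, sum)    # MARGIN = 1: one result per row
col_sums <- apply(m, 2, sum)    # MARGIN = 2: one result per column
row_sums
col_sums

# For column means, colMeans() gives the same answer as apply(..., 2, mean)
all.equal(colMeans(cats[, -1]), apply(cats[, -1], 2, mean))
```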
#### Filter rows with filter()
##### `filter()` allows you to select a subset of rows in a data frame. The first argument is the data frame. The second and subsequent arguments are conditions on its columns; a row is kept only if it satisfies all of them:
```{r}
library(dplyr)
tmp0 = filter(cats, Sex == 'F', Hwt > 10.5)
tmp0
```
##### The following base R code is equivalent to using `filter()`, though more verbose
##### It relies on the logical operators `&` (and) and `|` (or)
```{r}
tmp1 = cats[Sex == 'F'& Hwt > 10.5,]
tmp1
index = which(Sex == 'F'& Hwt > 10.5)
index
cats[index,]
```
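##### The difference between `&` and `|` shows up in how many rows each condition keeps. A minimal sketch, using `cats$` explicitly instead of the attached names; `both` and `either` are illustrative names.

```{r}
library(MASS)   # provides the cats data set

both   <- cats$Sex == "F" & cats$Hwt > 10.5   # TRUE only where both conditions hold
either <- cats$Sex == "F" | cats$Hwt > 10.5   # TRUE where at least one condition holds

sum(both)     # number of rows kept by "and"
sum(either)   # number of rows kept by "or"

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
sum(either) == sum(cats$Sex == "F") + sum(cats$Hwt > 10.5) - sum(both)
```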
#### Arrange rows with arrange()
##### `arrange()` works similarly to `filter()` except that instead of filtering or selecting rows, it reorders them. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
```{r}
arrange(tmp0, Hwt)
arrange(tmp0, desc(Hwt))
```
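##### Tie-breaking with more than one column is easier to see on a small data frame with repeated values. The data frame `df` below is made up for illustration.

```{r}
library(dplyr)

df <- data.frame(g = c("b", "a", "b", "a"),
                 v = c(2, 1, 1, 2),
                 stringsAsFactors = FALSE)

# Order by g first; within equal values of g, order by v in descending order
sorted <- arrange(df, g, desc(v))
sorted
```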
##### Use `order()` to get the same results
```{r}
order(tmp0$Hwt)
tmp0[order(tmp0$Hwt),]
tmp0[order(tmp0$Hwt, decreasing = TRUE),]
tmp0[order(-tmp0$Hwt),]
```
#### Add new columns with mutate()
##### Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. This is the job of `mutate()`:
```{r}
tmp = mutate(cats, Ratio = round(Hwt/Bwt,1), Diff = abs(Hwt-Bwt))
head(tmp)
p1 = qplot(Diff, Ratio, data=tmp,geom=c("point"), colour=Sex, main = "Diff vs Ratio")
ggplotly(p1)
```
#### Grouped operations
##### These verbs become really powerful when you apply them to groups of observations within a dataset. In dplyr, you do this with the `group_by()` function, which breaks a dataset down into specified groups of rows. When you then apply the verbs above to the resulting object, they are automatically applied “by group”. Most importantly, all this is achieved with exactly the same syntax you would use on an ungrouped object.
```{r}
tmp$Hwt = round(tmp$Hwt)
tmp = arrange(tmp, Hwt)
summarise(group_by(tmp, Sex, Hwt), mean(Ratio))
```
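##### The same kind of grouped summary can be written with the pipe operator `%>%` and named summary columns, which some readers find easier to follow. This is a sketch grouped by `Sex` only; the column names `n`, `mean_Hwt` and `mean_ratio` are chosen for illustration.

```{r}
library(MASS)    # provides the cats data set
library(dplyr)

by_sex <- cats %>%
  group_by(Sex) %>%                        # one group per level of Sex
  summarise(n = n(),                       # number of cats in each group
            mean_Hwt = mean(Hwt),          # mean heart weight (g)
            mean_ratio = mean(Hwt / Bwt))  # mean heart-to-body weight ratio
by_sex
```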