Neural Networks in R
A Gentle Introduction
Introduction
In this tutorial, we will cover feed forward neural networks in R and give some insight into why they are so popular these days. We will also compare the behaviour of the method with that of alternative ones. Overall, we will:
- discuss the architecture of neural networks and how it differs from regression,
- discuss the effect of hidden layers,
- discuss the types of losses used for a given architecture,
- show the usage of the neuralnet package,
- show the usage of the keras package.
Package installation
Up to today it has been straightforward to install packages; today we need a special section for it. Installing the neuralnet
package is very straightforward: all you need to do is whatever comes to mind first:
install.packages("neuralnet")
Keras and tensorflow, however, need some attention. Tensorflow is a package developed by a Google team for multiple platforms, but not natively for R. The R version installs a very small Python on your computer, and whenever you call it, the code is sent to Python. Keras, on the other hand, is not actually a neural network package itself; it is an API that makes it easier to use Tensorflow. You can compare Tensorflow to Shakespearean English and Keras to its translation into modern English. More concretely, with plain Tensorflow you would write many lines of code for even a plain classification problem, whereas with Keras you just give simple commands to add a layer, train the model for a certain number of epochs, and evaluate the performance.
Anyway, to install Tensorflow on your system, run the following lines:
install.packages("tensorflow")
library(tensorflow)
install_tensorflow()
install.package("keras")
library(keras)
install_keras()
The third line above will install miniconda, which is a minimal version of Anaconda, which is a distribution of Python, which is a snake. This will take time.
You can check whether the package is working with a small test.
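A typical check is to create and print a small string tensor (a sketch; the exact string here just mirrors the output shown below):
library(tensorflow)
tf$constant("Hellow Tensorflow")  # printing a constant tensor shows its value, shape and dtype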
## tf.Tensor(b'Hellow Tensorflow', shape=(), dtype=string)
If you see some red lines with unimportant details followed by a black line, tf.Tensor(b'Hellow Tensorflow', shape=(), dtype=string),
then it works. Now let’s play.
Types of Neural Networks
There are many different architectures of neural networks: Feed Forward Neural Networks, Recurrent Neural Networks, Convolutional Neural Networks, Autoencoders, Variational Autoencoders, Transformers, Bidirectional Transformers and many more. Each was designed to solve a particular problem. In this course, we will only cover the plain architecture, called the Feed Forward Neural Network.
NNets vs Regression
Regressions and neural networks are essentially the same: both take a bunch of predictors as inputs and produce an output. The difference is that neural networks have one or more hidden layers. So we can say:
- A neural network with no hidden layers is a (logistic) regression
- A neural network with one hidden layer is a (shallow) neural network
- A neural network with more than one hidden layer is a deep neural network
So, let’s redefine what we know in order to generalize from regressions to deep neural networks:
Linear Regression
\[Y = w_0 + w_1X_1 + w_2X_2 + \cdots + w_nX_n = \textbf{w}^T \textbf{X}\]
Logistic Regression
\[Z = w_0 + w_1X_1 + w_2X_2 + \cdots + w_nX_n = \textbf{w}^T \textbf{X}\]
\[P = \sigma(Z)\]
where \(\sigma\) is the sigmoid function, which looks like this:
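As a formula, the sigmoid is
\[\sigma(z) = \frac{1}{1 + e^{-z}},\]
which squashes any real number into the interval (0, 1), so the output can be read as a probability.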
Thus, the illustration of the model is
Here, for the sake of clarity, the sigmoid function is added to the figure separately. However, if the task is a classification task, it is straightforward to append a sigmoid function (i.e. fit a logistic regression), so the illustration of logistic regression is the same as that of linear regression.
Shallow Neural Network
\[\textbf{H} = \sigma(w_0 + w_1X_1 + w_2X_2 + \cdots + w_nX_n) = \sigma(\textbf{w}^T \textbf{X})\]
\[Y = \sigma(m_0 + m_1H_1 + m_2H_2 + \cdots + m_nH_n) = \sigma(\textbf{m}^T \textbf{H})\]
Deep Neural Network
\[\textbf{H}^1 = \sigma(w_0^1 + w_1^1X_1 + w_2^1X_2 + \cdots + w_n^1X_n) = \sigma(\textbf{w}_1^T \textbf{X})\]
\[\textbf{H}^2 = \sigma(w_0^2 + w^2_1H^1_1 + w^2_2H^1_2 + \cdots + w^2_nH^1_n) = \sigma(\textbf{w}_2^{T} \textbf{H}^1)\]
\[Y = \sigma(w_0^3 + w^3_1H^2_1 + w^3_2H^2_2 + \cdots + w^3_nH^2_n) = \sigma(\textbf{w}_3^{T} \textbf{H}^2)\]
Multiple Outputs
The networks above take some input variables to predict a single value, say the sentiment of a sentence, which takes the value 1 if positive and 0 if negative. However, there could be more than one output, e.g. negative / neutral / positive sentiment, or 1/5, 2/5, …, 5/5 stars, etc. In such cases, we can use a multiple-output neural network like the one below:
Why Neural Networks?
Sentiment Classification Example
You may remember from your assignment how we classify sentiment using machine learning: we calculate the term frequencies (TF) for each word that appears in the document (or online post):
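# the 'health' data frame of per-post term frequencies is assumed to have been loaded earlier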
cnames <- c("sugar","size","muscle","calories","calorie","squat","fitness",
"co.op", 'university','anxiety','loss','term','mental.health','counsel',
"IsMentalHealthRelated")
health <- health[,colnames(health) %in% cnames]
set.seed(156)
train_inds <- sample(1:nrow(health), 0.1*nrow(health))
train <- health[ train_inds, ]
test <- health[-train_inds, ]
glm.fit <- glm(IsMentalHealthRelated ~ . , data = train, family="binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
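The learned weights, sorted from most negative to most positive, can be inspected with something like:
sort(coef(glm.fit))  # one weight per word, plus the intercept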
## muscle squat fitness calories calorie
## -196.8621627 -177.5112158 -151.4759095 -133.5882902 -120.0554017
## size sugar loss (Intercept) term
## -110.9271329 -32.3963188 -1.7191457 0.2256593 8.1076944
## anxiety counsel co.op university mental.health
## 9.9431176 73.5607715 79.4645130 91.7411014 135.3315114
Recall that when we fit a logistic regression model, it calculates weights for each word.
Now, let’s consider a similar but simpler example.
Assume you have a dataset of movie comments. You are given the term frequencies of each word. You fit a logistic regression model and the weights calculated for each word are as below:
Weights:
| | hate | bad | good | love | great | \(\dots\) | I |
|---|---|---|---|---|---|---|---|
| \(w\) | -2 | -1 | 1 | 2 | 2 | \(\dots\) | 0 |
If you were to classify the comment below, the computation would be:
- I hate this movie. Just hate.
\[Z = w_{hate} x_{hate} + \cdots + w_{I} x_{I} = 2\times (-2) + 1 \times 0 + \dots =-4\]
\[P(Z) = 0.018\]
Sentiment Classification
- I love this movie. It was good.
\[Z = w_{love} x_{love} + \cdots + w_{I} x_{I} = 1\times 2 + 1 \times 1 + 0 + \dots =3\] \[P(Z) = 0.953\]
Problems with Logistic Regression
It cannot capture complex relations, as the prediction is simply a sum of the products of word frequencies and their weights.
Simple linear combinations cannot capture negations or double negations:
- E.g. I don’t like this movie
- Assume the weight of “don’t” is -1 and the weight of “like” is 1. Then \(Z = -1 + 1 = 0\), i.e. \(P = \sigma(0) = 0.5\): a neutral prediction for a clearly negative sentence.
- E.g. There is nothing I hate in this movie. Here the double negation makes the sentence positive, but a linear combination of individual word weights cannot capture that.
Mental Health Example
library(neuralnet)
## 
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
## 
##     compute
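A minimal sketch of how the neuralnet package can be fit on the same train/test split used above (the hidden-layer size is an assumption, and IsMentalHealthRelated is assumed to be coded 0/1):
# build the formula explicitly, listing every predictor column
nn.formula <- as.formula(paste("IsMentalHealthRelated ~",
                               paste(setdiff(colnames(train), "IsMentalHealthRelated"),
                                     collapse = " + ")))
# a shallow network: one hidden layer with 3 neurons, sigmoid output for classification
nn.fit <- neuralnet(nn.formula, data = train, hidden = 3, linear.output = FALSE)
# predicted probabilities on the test set (compute() is neuralnet's prediction function)
nn.probs <- compute(nn.fit, test[, setdiff(colnames(test), "IsMentalHealthRelated")])$net.result
nn.preds <- ifelse(nn.probs > 0.5, 1, 0)
mean(nn.preds == test$IsMentalHealthRelated)  # test accuracy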
Neural Network Elements and Training
As you may have noticed, if you have, say, 10 input variables, a hidden layer of size 5 and an output layer of size 1, the number of weights to be calculated becomes (ignoring biases) 10 * 5 + 5 * 1 = 55. In regression, this number is 10 * 1 = 10. As the input and hidden layer sizes increase, the number of weights grows very quickly.
Finding the optimal weights becomes very hard when you have this many parameters. The common approach is to use the gradient descent algorithm (GDA). This tutorial will not cover the details of how the weights are actually calculated.
On the other hand, neural networks are universal function approximators, meaning that they can mimic any continuous function. This sounds good, but it also means they can overfit because of this enormous capacity. So, when using GDA, we also keep an eye on the validation loss (see the keras sketch after the list below):
- Split the data into train, validation and test,
- Choose a loss function,
- Run GDA for one epoch,
- Calculate the validation loss. If it decreased compared to the previous epoch (iteration), repeat step 3. If the validation loss starts to increase, stop iterating.
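In keras, this procedure is known as early stopping and is available as a callback. A minimal sketch, assuming a compiled model and training data as in the MNIST example later in this tutorial:
history <- model %>% fit(
  x_train, y_train,
  epochs = 50,                 # an upper bound; training may stop earlier
  validation_split = 0.2,      # hold out 20% of the training data for validation
  callbacks = list(
    callback_early_stopping(monitor = "val_loss", patience = 2)  # stop after 2 epochs without improvement
  )
)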
Loss Functions
Neural networks are generic. We usually don’t care much about the type of the task; it can be regression or classification. The only thing that changes is the loss function we use (a keras sketch follows the lists below).
For regression tasks you can use
- Mean Squared Error
For binary classification tasks you can use
- Binary Cross Entropy Loss
For multiple-output classification tasks you can use
- Cross Entropy Loss
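In keras, the loss is simply an argument to compile(). A minimal sketch, assuming a model object has already been defined:
# regression: mean squared error
model %>% compile(loss = "mse", optimizer = optimizer_adam())
# binary classification: binary cross entropy
model %>% compile(loss = "binary_crossentropy",
                  optimizer = optimizer_adam(), metrics = c("accuracy"))
# multi-class classification: (categorical) cross entropy
model %>% compile(loss = "categorical_crossentropy",
                  optimizer = optimizer_adam(), metrics = c("accuracy"))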
Fitting Neural Networks
library(keras)
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y
dim(x_train)
## [1] 60000 28 28
length(y_train)
## [1] 60000
x_train <- x_train[1:10240,,]
x_test <- x_test[1:2560,, ]
y_train <- y_train[1:10240]
y_test <- y_test[1:2560]
digit <- x_train[3,28:1,1:28]
par(pty="s") # for keeping the aspect ratio 1:1
image(t(digit), col = gray.colors(256), axes = FALSE)
y_train[3] # the label of the digit shown above
## [1] 4
The input image is a 28 by 28 array. We need to flatten it and convert it into (28*28 =) 784 features. Also, neural networks like input values in small ranges, such as 0 to 1 or -1 to 1. So we rescale the inputs to lie between 0 and 1:
# reshape
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
# rescale
x_train <- x_train / 255
x_test <- x_test / 255
The outputs are not binary (0 and 1); they are categorical (i.e. the digits 0, 1, …, 9). We need to convert the output to a categorical (one-hot) encoding:
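With keras this is a single call to to_categorical(); a sketch of the encoding step:
# one-hot encode the digit labels (10 classes: digits 0-9)
y_train <- to_categorical(y_train, 10)
y_test <- to_categorical(y_test, 10)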
model <- keras_model_sequential()
model %>%
layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
layer_dense(units = 128, activation = 'relu') %>%
layer_dense(units = 10, activation = 'softmax')
summary(model)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_2 (Dense) (None, 256) 200960
## ________________________________________________________________________________
## dense_1 (Dense) (None, 128) 32896
## ________________________________________________________________________________
## dense (Dense) (None, 10) 1290
## ================================================================================
## Total params: 235,146
## Trainable params: 235,146
## Non-trainable params: 0
## ________________________________________________________________________________
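As a quick sanity check, the parameter counts in the summary match the hand calculation from the previous section (weights plus biases per layer):
784 * 256 + 256   # first hidden layer (dense_2): 200,960
256 * 128 + 128   # second hidden layer (dense_1): 32,896
128 * 10 + 10     # output layer (dense): 1,290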
model %>% compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_adam(),
metrics = c('accuracy')
)
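The model is then trained and evaluated on the held-out test set. A minimal sketch (the number of epochs and the batch size here are assumptions):
model %>% fit(x_train, y_train,
              epochs = 10, batch_size = 128,  # assumed settings
              validation_split = 0.2)
model %>% evaluate(x_test, y_test)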
## loss accuracy
## 0.2316186 0.9324219
SVM & Decision Tree
library(keras)
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y
dim(x_train)
## [1] 60000 28 28
length(y_train)
## [1] 60000
x_train <- x_train[1:10240,,]
x_test <- x_test[1:2560,, ]
y_train <- y_train[1:10240]
y_test <- y_test[1:2560]
# reshape
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
# rescale
x_train <- x_train / 255
x_test <- x_test / 255
library(rpart)
library(e1071)
dat_train <- data.frame(y=as.factor(y_train), x=x_train)
dat_test <- data.frame(y=as.factor(y_test), x=x_test)
dim(dat_train)
## [1] 10240 785
# fit <- svm(y~ ., data=dat_train)
# save(fit, file="svmfit.rda")
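# training the SVM on 10,240 images takes a while, so a previously saved fit is loaded instead of refitting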
load("svmfit.rda")
preds <- predict(fit, dat_test, type="class")
MLmetrics::Accuracy(preds, y_test)
## [1] 0.8910156
fit <- rpart(y~ ., data=dat_train)
preds <- predict(fit, dat_test, type="class")
MLmetrics::Accuracy(preds, y_test)
## [1] 0.5679688