Neural Networks in R

a more detailed review


Introduction

Last week we covered the basics of neural networks, but we only briefly discussed the details of the training procedure. In this tutorial, we will address these topics in more depth. Overall, we will:

  • summarize the activation functions and use them in different settings,
  • play with the size of the hidden layers to show their impact,
  • compare three types of tasks: binary classification, categorical classification, and regression.

Getting Started

To follow along, you will need

  • tidyverse
  • tensorflow
  • keras

Now, load tidyverse and the other packages:

library('tidyverse')
library(tensorflow)
library(keras)
theme_set(theme_minimal())

Elements of Neural Networks

Activation Functions

As discussed before, activation functions are the distinctive elements of neural networks, and the idea comes from neuroscience. Briefly, scientists showed that the interaction between one neuron and another is maintained over synapses, which transmit current from one neuron to the other. If the first neuron doesn’t send much current to the second one, the impact on the second is either very small or nonexistent.

To mimic this behaviour, computational biologists and computer scientists came up with the idea of activation functions. It is very much a simplification, but it is sufficient to model the behaviour well enough.

Of course, the neural network in computer science has almost nothing to do with our brain. The brain is amazingly complex, it is affected by chemicals such as antidepressants, and taking Prozac does not make you fall in love. Artificial neural networks are agnostic to the thing called love.

Pointwise Functions

Pointwise activation functions are used between fully connected layers, deciding whether (and how strongly) a neuron in one layer will activate a neuron in the next layer. Again, they are not mimicking the brain, but in practice they work better in certain situations:

library("keras")
x <- seq(-3, 3, length.out = 50)

y <- activation_relu(x)$numpy()
ggplot() + 
  geom_line(aes(x,y)) + 
  ggtitle("ReLU")

y <- activation_elu(x)$numpy()
ggplot() + 
  geom_line(aes(x,y)) + 
  ggtitle("eLU")

y <- 1/(1+exp(-x))
ggplot() + 
  geom_line(aes(x,y)) + 
  ggtitle("Sigmoid")

y <- activation_tanh(x)$numpy()
ggplot() + 
  geom_line(aes(x,y)) + 
  ggtitle("Tanh")

Distribution-wise Functions

When training neural networks for classification tasks with more than two classes (nonbinary), we need to decide which neuron in the final layer will be activated. For example, the network may run a series of calculations, and in the final layer there are 10 neurons corresponding to 10 digits. Only one is correct, so we must choose one using a criterion. In such cases we use

  • Softmax
  • LogSoftmax

We will see their usage below.
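
To build intuition before we see them inside Keras models, here is a minimal sketch of softmax computed by hand in base R, using an arbitrary, made-up vector of scores for 10 output neurons:

z <- c(1.2, 0.3, 2.5, -0.7, 0.1, 0.9, -1.3, 0.4, 1.8, 0.0)  # hypothetical scores

softmax <- exp(z) / sum(exp(z))      # turns the scores into a probability distribution
sum(softmax)                         # always sums to 1
which.max(softmax)                   # index of the most likely class

log_softmax <- z - log(sum(exp(z)))  # LogSoftmax is simply the log of the same quantity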

Regularization and Dropout

Neural networks are universal function approximators: they can estimate any weird-looking nonlinear function, as long as you have enough data. But one problem that can easily occur is overfitting: if there is an outlier, the neural network can learn it as if it were part of some weird nonlinear function. This is due to the model complexity that we covered before.

There are strategies to deal with overfitting. Here are some of them:

  • Adding regularization.
  • Adding a dropout layer.

We won’t detail these here within the text, but we will discuss them.
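
As a quick preview (a minimal sketch only; fully trained versions appear in the examples below), this is how the two strategies are attached to a Keras model:

model_sketch <- keras_model_sequential() %>% 
  layer_dense(units = 128, activation = 'relu', input_shape = c(784),
              kernel_regularizer = regularizer_l2(0.001)) %>%  # L2 penalty on this layer's weights
  layer_dropout(rate = 0.3) %>%                                # randomly drop 30% of activations during training
  layer_dense(units = 10, activation = 'softmax')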

Input and Output Layers

Last week we only covered one example using Keras, a categorical classification task. We took 28x28 images, flattened them into 784x1 vectors (as if we had 784 predictors of a house), processed them within the network, and output 10 neurons corresponding to the 10 digits. There is a lot going on here. Now, let’s list the helpers involved:

  • flatten
  • to_categorical
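
For instance, to_categorical turns integer labels into one-hot vectors; a tiny illustration with three made-up labels:

to_categorical(c(0, 2, 1), 3)  # rows become [1,0,0], [0,0,1], [0,1,0]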

Loss Functions and Output Layers

  • When the task is regression, we almost always use MSE (mean squared error).
    • The output of a regression task can be multidimensional as well; you may want to predict the house price and the expected time before it is sold. MSE works very well in such a case.
  • When the task is binary (e.g. whether or not a post is positive in sentiment), we use the binary cross entropy loss.
    • The output (usually) has one neuron that takes a value between 0 and 1, so we use a Sigmoid activation at the end.
    • The output can also be multidimensional, with each dimension taking a value between 0 and 1. An example would be predicting the sentiment of a post along with whether it contains hate speech. Both are binary, and this can be implemented using binary cross entropy.
  • When the task is categorical, we use the categorical cross entropy loss. Image classification is a good example, e.g. predicting the digit or the type of a fashion item (hat, shoes, shirt etc.).
    • In this case, the output cannot be one-dimensional. If the target value is 2, then we must express it in the form [0,0,1,0,0,0,0,0,0,0], where the first element corresponds to 0, the second to 1, and the third to 2.
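
As a rough sketch of how these choices map onto compile() (assuming a model object built with keras_model_sequential(), as in the examples that follow):

# Regression: linear output, mean squared error
model %>% compile(loss = 'mean_squared_error', optimizer = optimizer_adam())

# Binary classification: sigmoid output, binary cross entropy
model %>% compile(loss = 'binary_crossentropy', optimizer = optimizer_adam(), metrics = c('accuracy'))

# Categorical classification: softmax output, categorical cross entropy
model %>% compile(loss = 'categorical_crossentropy', optimizer = optimizer_adam(), metrics = c('accuracy'))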

Examples

Last week we did an example. Let’s go over it with the new information above.

The MNIST dataset can be loaded from within the Keras library, which downloads it automatically:

library(keras)
mnist <- dataset_mnist()

x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y

dim(x_train)
## [1] 60000    28    28
dim(y_train)
## [1] 60000

The train set has 60000 images, each 28 by 28 pixels. Each pixel has a value between 0 (black) and 255 (white). We will use these pixel values as predictors.
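
You can verify the value range yourself:

range(x_train)  # minimum and maximum pixel values (0 and 255)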

To reduce the computation cost, let’s take a subset of the data as below:

x_train <- x_train[1:10240,,]
x_test  <- x_test[1:2560,, ]

y_train <- y_train[1:10240]
y_test  <- y_test[1:2560]

We can plot an image as below:

digit <- x_train[3, 28:1, 1:28]  # take the 3rd image and reverse its rows so image() draws it upright

par(pty="s") # for keeping the aspect ratio 1:1
image(t(digit), col = gray.colors(256), axes = FALSE)

y_train[3]
## [1] 4

Now, we will fit a feed-forward network which takes the data in 1D. This means we cannot input a 28x28 image; we must flatten it into 784 features. This is done using the array_reshape function. Since this is not a neural network course, you can copy-paste this and adapt it to different images when necessary.

# reshape
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))

Also, neural networks like input values in small ranges, such as 0 to 1 or -1 to 1, so we rescale the inputs to lie between 0 and 1:

# rescale
x_train <- x_train / 255
x_test <- x_test / 255

As discussed before, the MNIST outputs are not binary (0 or 1); they are categorical (the digits 0, 1, …, 9). We need to convert the outputs to categorical form. Remember, the network’s output layer will have 10 neurons corresponding to the 10 digits, and one will be correct. The loss function will compare the output, which has shape 1x10, with the true values, which must be in the same form. The to_categorical function does that for us:

y_train <- to_categorical(y_train, 10)
y_test  <- to_categorical(y_test, 10)

head(y_train)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    0    0    1    0    0    0     0
## [2,]    1    0    0    0    0    0    0    0    0     0
## [3,]    0    0    0    0    1    0    0    0    0     0
## [4,]    0    1    0    0    0    0    0    0    0     0
## [5,]    0    0    0    0    0    0    0    0    0     1
## [6,]    0    0    1    0    0    0    0    0    0     0

Now, let’s fit the following neural network:

  • Architecture
    • Input layer of 784 neurons
    • One hidden layer that has 20 neurons
    • One output layer with 10 neurons
  • Activation functions
    • relu activation after each hidden layer
    • Since the output is not binary, we need a distribution-wise activation, e.g. softmax.
model <- keras_model_sequential() 
model %>% 
  layer_dense(units = 20, activation = 'relu', input_shape = c(784)) %>% 
  layer_dense(units = 10, activation = 'softmax')
summary(model)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense_1 (Dense)                     (None, 20)                      15700       
## ________________________________________________________________________________
## dense (Dense)                       (None, 10)                      210         
## ================================================================================
## Total params: 15,910
## Trainable params: 15,910
## Non-trainable params: 0
## ________________________________________________________________________________

The task is not regression, so we cannot use MSE. The output layer has 10 outputs, so we cannot use binary cross entropy. The task is a categorical classification task which demands categorical_crossentropy.

We will use optimizer_adam, which performs gradient descent but adds momentum and adaptive learning-rate terms. Again, we won’t go into details.
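
If you want to experiment, the main knob is the learning rate. As a small sketch (the argument is called learning_rate in recent versions of the keras package and lr in older ones), you could build the optimizer yourself and then pass it to compile():

opt <- optimizer_adam(learning_rate = 0.001)  # 0.001 is the default value
# then pass it on: model %>% compile(..., optimizer = opt)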

Finally, to monitor how the model performs during training, the loss function is tracked by default. Additionally, we will monitor accuracy:

model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_adam(),
  metrics = c('accuracy')
)

We will train the network for 10 iterations (epochs) and randomly split the data into training and validation sets using an 80/20 split. Finally, we will use mini-batch gradient descent:

  • If we set batch_size to the number of observations in the data, it becomes batch gradient descent.
  • If we set it to 1, it becomes stochastic gradient descent.
  • Setting it to something in between makes it mini-batch gradient descent.

history <- model %>% fit(
  x_train, y_train, 
  epochs = 10, 
  batch_size = 128, 
  validation_split = 0.2
)
model %>% evaluate(x_test, y_test)
##      loss  accuracy 
## 0.4067649 0.8785156

The test accuracy is about 0.88. We can improve it by increasing the number of neurons in the hidden layer to capture more detail:

More hidden neurons

  • Architecture
    • Input layer of 784 neurons
    • One hidden layer that has 128 neurons
    • One output layer with 10 neurons
  • Activation functions
    • relu activation after each hidden layer
    • Since the output is not binary, we need a distribution-wise activation, e.g. softmax.
model <- keras_model_sequential() 
model %>% 
  layer_dense(units = 128, activation = 'relu', input_shape = c(784)) %>% 
  layer_dense(units = 10, activation = 'softmax')
  
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_adam(),
  metrics = c('accuracy')
)

history <- model %>% fit(
  x_train, y_train, 
  epochs = 10, 
  batch_size = 128, 
  validation_split = 0.2
)
model %>% evaluate(x_test, y_test)
##      loss  accuracy 
## 0.2654849 0.9195312

That sounds like an improvement.

More hidden layers

  • Architecture
    • Input layer of 784 neurons
    • Two hidden layers, one with 256 neurons and the other with 128 neurons
    • One output layer with 10 neurons
  • Activation functions
    • relu activation after each hidden layer
    • Since the output is not binary, we need a distribution-wise activation, e.g. softmax.
model <- keras_model_sequential() 
model %>% 
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>% 
  layer_dense(units = 128, activation = 'relu') %>% 
  layer_dense(units = 10, activation = 'softmax')
  
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_adam(),
  metrics = c('accuracy')
)

history <- model %>% fit(
  x_train, y_train, 
  epochs = 10, 
  batch_size = 128, 
  validation_split = 0.2
)
model %>% evaluate(x_test, y_test)
##      loss  accuracy 
## 0.2247830 0.9320313

Still improving.

With Regularization

  • Architecture
    • Input layer of 784 neurons
    • Two hidden layers, one with 256 neurons and the other with 128 neurons
    • One output layer with 10 neurons
  • Activation functions
    • relu activation after each hidden layer
    • Since the output is not binary, we need a distribution-wise activation, e.g. softmax.
  • Regularization
    • Add L2 regularization
model <- keras_model_sequential() 
model %>% 
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>% 
  layer_dense(units = 128, activation = 'relu', kernel_regularizer = regularizer_l2(0.001)) %>% 
  layer_dense(units = 10, activation = 'softmax')
 
regularizer_l2() 
## <tensorflow.python.keras.regularizers.L1L2>
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_adam(),
  metrics = c('accuracy')
)

history <- model %>% fit(
  x_train, y_train, 
  epochs = 20, 
  batch_size = 128, 
  validation_split = 0.2
)
model %>% evaluate(x_test, y_test)
##      loss  accuracy 
## 0.2463002 0.9421875

With Dropout

  • Architecture
    • Input layer of 784 neurons
    • Two hidden layers, one with 256 neurons and the other with 128 neurons
    • One output layer with 10 neurons
  • Activation functions
    • relu activation after each hidden layer
    • Since the output is not binary, we need a distribution-wise activation, e.g. softmax.
  • Dropout layer
    • Add dropout with rate = 0.3
model <- keras_model_sequential() 
model %>% 
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>% 
  layer_dense(units = 128, activation = 'relu') %>% 
  layer_dropout(rate = 0.3) %>% 
  layer_dense(units = 10, activation = 'softmax')
 
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_adam(),
  metrics = c('accuracy')
)

history <- model %>% fit(
  x_train, y_train, 
  epochs = 10, 
  batch_size = 128, 
  validation_split = 0.2
)
model %>% evaluate(x_test, y_test)
##      loss  accuracy 
## 0.2260489 0.9332031

Regression Task

Now let’s do a regression example. The Boston housing dataset can also be loaded from within the Keras library, which downloads it automatically:

library(keras)
boston <- dataset_boston_housing()

x_train <- boston$train$x
y_train <- boston$train$y
x_test  <- boston$test$x
y_test  <- boston$test$y

dim(x_train)
## [1] 404  13
dim(y_train)
## [1] 404

The train set has 404 data points and 13 predictors.
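
The targets are median home values, in thousands of dollars, so they are on a much larger scale than neural networks typically like; you can take a quick look as below:

summary(y_train)  # median home values, in thousands of dollars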

  • Architecture
    • Input layer of 13 neurons
    • One hidden layer that has 64 neurons
    • One output layer with 1 neuron
  • Activation functions
    • sigmoid activation after each hidden layer
    • Since this is a regression task, we won’t add an activation to the output layer (it stays linear).
model <- keras_model_sequential() 
model %>% 
  layer_dense(units = 64, activation = 'sigmoid', input_shape = c(13)) %>% 
  layer_dense(units = 1)
  
model %>% compile(
  loss = 'mean_squared_error',
  optimizer = optimizer_adam()
)

history <- model %>% fit(
  x_train, y_train, 
  epochs = 100, 
  batch_size = 64, 
  validation_split = 0.2
)
model %>% evaluate(x_test, y_test)
##     loss 
## 100.0379

With Normalization

Neural networks also prefer targets on a small scale. Below we standardize the outputs manually and add a batch-normalization layer to put the inputs on a comparable scale:

y_mean <- mean(y_train)
y_std  <- sd(y_train)

y_train <- (y_train-y_mean)/y_std
y_test  <- (y_test-y_mean)/y_std
model <- keras_model_sequential() 
model %>% 
  layer_batch_normalization() %>% 
  layer_dense(units = 64, activation = 'sigmoid', input_shape = c(13)) %>% 
  layer_dense(units = 1)
  
model %>% compile(
  loss = 'mean_squared_error',
  optimizer = optimizer_adam()
)

history <- model %>% fit(
  x_train, y_train, 
  epochs = 50, 
  batch_size = 64, 
  validation_split = 0.2
)
model %>% evaluate(x_test, y_test)
##      loss 
## 0.2530227
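
Note that this loss is reported on the standardized scale, so it is not directly comparable with the earlier value of roughly 100. As a quick sketch, you can convert it back to the original price scale by multiplying by the variance of the training targets:

test_mse <- model %>% evaluate(x_test, y_test)
test_mse * y_std^2  # MSE expressed back in (thousands of dollars)^2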

Binary Categorization Task

For a binary categorization task, you can check this example:

https://tensorflow.rstudio.com/guide/tfhub/examples/text_classification/
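
The linked example classifies text using TensorFlow Hub embeddings, but the output side follows the recipe from the loss-function list above. As a minimal sketch (the input_shape of 100 is a placeholder for however many predictors you have):

model <- keras_model_sequential() %>% 
  layer_dense(units = 16, activation = 'relu', input_shape = c(100)) %>%  # 100 is a placeholder input size
  layer_dense(units = 1, activation = 'sigmoid')                          # one output neuron, value between 0 and 1

model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = optimizer_adam(),
  metrics = c('accuracy')
)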