Neural Networks in R
A Gentle Introduction
Introduction
In this tutorial, we will cover feed forward neural networks in R and give some insight into why they are so popular these days. We will also compare the behaviour of the method with that of alternative ones. Overall, we will:
- discuss the architecture of neural networks and how it differs from regression,
- discuss the effect of hidden layers,
- discuss the types of losses used for a given architecture,
- show the usage of the neuralnet package,
- show the usage of the keras package.
Package installation
Up to today it has been straightforward to install packages; today we need a special section for it. Installing the neuralnet
package is very straightforward: all you need to do is whatever comes to mind first:
install.packages("neuralnet")
Keras and tensorflow, however, need some attention. Tensorflow is a package developed by a Google team for multiple platforms, but not natively for R. The R version installs a very small Python on your computer, and whenever you call it, the code is sent to Python. Keras, on the other hand, is not actually a neural network package itself; it is an API that makes it easier to use Tensorflow. You can compare Tensorflow to Shakespearean English and Keras to its translation into modern English. More concretely, with plain Tensorflow you would write many lines of code for even a plain classification problem, whereas with Keras you just give simple commands to add a layer, train the model for a certain number of epochs, and evaluate the performance.
Anyway, to install Tensorflow on your system, run the following lines:
install.packages("tensorflow")
library(tensorflow)
install_tensorflow()
install.package("keras")
library(keras)
install_keras()
The third line above will install miniconda, which is a minimal version of Anaconda, which is a distribution of Python, which is a snake. This will take time.
You can check whether the package is working with a small test.
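A typical check is to create and print a small string tensor (a sketch; the exact string here just mirrors the output shown below):
library(tensorflow)
tf$constant("Hellow Tensorflow")  # printing a constant tensor shows its value, shape and dtype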
## tf.Tensor(b'Hellow Tensorflow', shape=(), dtype=string)
If you see some red lines with unimportant details followed by a black line, tf.Tensor(b'Hellow Tensorflow', shape=(), dtype=string),
then it works. Now let’s play.
Types of Neural Networks
There are many different architectures of neural networks: Feed Forward Neural Networks, Recurrent Neural Networks, Convolutional Neural Networks, Autoencoders, Variational Autoencoders, Transformers, Bidirectional Transformers and many more. Each was designed to solve a particular problem. In this course, we will only cover the plain architecture, called the Feed Forward Neural Network.
NNets vs Regression
Regressions and neural networks are essentially the same: both take a bunch of predictors as inputs and produce an output. The difference is that neural networks have one or more hidden layers. So we can say:
- A neural network with no hidden layers is a (logistic) regression
- A neural network with one hidden layer is a (shallow) neural network
- A neural network with more than one hidden layer is a deep neural network
So, let’s redefine what we know in order to generalize from regressions to deep neural networks:
Linear Regression
\[Y = w_0 + w_1X_1 + w_2X_2 + \cdots + w_nX_n = \textbf{w}^T \textbf{X}\]
Logistic Regression
\[Z = w_0 + w_1X_1 + w_2X_2 + \cdots + w_nX_n = \textbf{w}^T \textbf{X}\]
\[P = \sigma(Z)\]
where \(\sigma\) is the sigmoid function, which looks like this:
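As a formula, the sigmoid is
\[\sigma(z) = \frac{1}{1 + e^{-z}},\]
which squashes any real number into the interval (0, 1), so the output can be read as a probability.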
Thus, the illustration of the model is
Here, for the sake of clarity, the sigmoid function is added to the figure separately. However, if the task is a classification task, it is straightforward to append a sigmoid function (i.e. fit a logistic regression), so the illustration of logistic regression is the same as that of linear regression.
Shallow Neural Network
\[\textbf{H} = \sigma(w_0 + w_1X_1 + w_2X_2 + \cdots + w_nX_n) = \sigma(\textbf{w}^T \textbf{X})\]
\[Y = \sigma(m_0 + m_1H_1 + m_2H_2 + \cdots + m_nH_n) = \sigma(\textbf{m}^T \textbf{H})\]
Deep Neural Network
\[\textbf{H}^1 = \sigma(w_0^1 + w_1^1X_1 + w_2^1X_2 + \cdots + w_n^1X_n) = \sigma(\textbf{w}_1^T \textbf{X})\]
\[\textbf{H}^2 = \sigma(w_0^2 + w^2_1H^1_1 + w^2_2H^1_2 + \cdots + w^2_nH^1_n) = \sigma(\textbf{w}_2^{T} \textbf{H}^1)\]
\[Y = \sigma(w_0^3 + w^3_1H^2_1 + w^3_2H^2_2 + \cdots + w^3_nH^2_n) = \sigma(\textbf{w}_3^{T} \textbf{H}^2)\]
Multiple Outputs
The networks above take some input variables to predict a single value, say the sentiment of a sentence, which takes the value 1 if positive and 0 if negative. However, there could be more than one output, e.g. negative / neutral / positive sentiment, or 1/5, 2/5, …, 5/5 stars, etc. In such cases, we can use a multiple-output neural network like the one below:
Why Neural Networks?
Sentiment Classification Example
You may remember from your assignment how we classify sentiment using machine learning: we calculate the term frequencies (TF) for each word that appears in the document (or online post):
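# the 'health' data frame of per-post term frequencies is assumed to have been loaded earlier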
cnames <- c("sugar","size","muscle","calories","calorie","squat","fitness",
"co.op", 'university','anxiety','loss','term','mental.health','counsel',
"IsMentalHealthRelated")
health <- health[,colnames(health) %in% cnames]
set.seed(156)
train_inds <- sample(1:nrow(health), 0.1*nrow(health))
train <- health[ train_inds, ]
test <- health[-train_inds, ]
glm.fit <- glm(IsMentalHealthRelated ~ . , data = train, family="binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
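The learned weights, sorted from most negative to most positive, can be inspected with something like:
sort(coef(glm.fit))  # one weight per word, plus the intercept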
## muscle squat fitness calories calorie
## -196.8621627 -177.5112158 -151.4759095 -133.5882902 -120.0554017
## size sugar loss (Intercept) term
## -110.9271329 -32.3963188 -1.7191457 0.2256593 8.1076944
## anxiety counsel co.op university mental.health
## 9.9431176 73.5607715 79.4645130 91.7411014 135.3315114
Recall that when we fit a logistic regression model, it calculates weights for each word.
Now, let’s consider a similar but simpler example.
Assume you have a dataset of movie comments. You are given the term frequencies of each word. You fit a logistic regression model and the weights calculated for each word are as below:
Weights:
| | hate | bad | good | love | great | \(\dots\) | I |
|---|---|---|---|---|---|---|---|
| \(w\) | -2 | -1 | 1 | 2 | 2 | \(\dots\) | 0 |
If you were to classify the comment below, the computation would be:
- I hate this movie. Just hate.
\[Z = w_{hate} x_{hate} + \cdots + w_{I} x_{I} = 2\times (-2) + 1 \times 0 + \dots =-4\]
\[P(Z) = 0.018\]
Sentiment Classification
- I love this movie. It was good.
\[Z = w_{love} x_{love} + \cdots + w_{I} x_{I} = 1\times 2 + 1 \times 1 + 0 + \dots =3\] \[P(Z) = 0.953\]
Problems with Logistic Regression
It cannot capture complex relations, as the prediction is simply a sum of the products of word frequencies and their weights.
Simple linear combinations cannot capture negations or double negations:
- E.g. I don’t like this movie
- Assume the weight of “don’t” is -1 and the weight of “like” is 1. Then \(Z = -1 + 1 = 0\), i.e. \(P = \sigma(0) = 0.5\): a neutral prediction for a clearly negative sentence.
- E.g. There is nothing I hate in this movie. Here the double negation makes the sentence positive, but a linear combination of individual word weights cannot capture that.
Mental Health Example
library(neuralnet)
## 
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
## 
##     compute
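A minimal sketch of how the neuralnet package can be fit on the same train/test split used above (the hidden-layer size is an assumption, and IsMentalHealthRelated is assumed to be coded 0/1):
# build the formula explicitly, listing every predictor column
nn.formula <- as.formula(paste("IsMentalHealthRelated ~",
                               paste(setdiff(colnames(train), "IsMentalHealthRelated"),
                                     collapse = " + ")))
# a shallow network: one hidden layer with 3 neurons, sigmoid output for classification
nn.fit <- neuralnet(nn.formula, data = train, hidden = 3, linear.output = FALSE)
# predicted probabilities on the test set (compute() is neuralnet's prediction function)
nn.probs <- compute(nn.fit, test[, setdiff(colnames(test), "IsMentalHealthRelated")])$net.result
nn.preds <- ifelse(nn.probs > 0.5, 1, 0)
mean(nn.preds == test$IsMentalHealthRelated)  # test accuracy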
Neural Network Elements and Training
As you may have noticed, if you have, say, 10 input variables, a hidden layer of size 5 and an output layer of size 1, the number of weights to be calculated becomes (ignoring biases) 10 * 5 + 5 * 1 = 55. In regression, this number is 10 * 1 = 10. As the input and hidden layer sizes increase, the number of weights grows very quickly.
Finding the optimal weights becomes very hard when you have this many parameters. The common approach is to use the gradient descent algorithm (GDA). This tutorial will not cover the details of how the weights are actually calculated.
On the other hand, neural networks are universal function approximators, meaning that they can mimic any continuous function. This sounds good, but it also means they can overfit because of this enormous capacity. So, when using GDA, we also keep an eye on the validation loss (see the keras sketch after the list below):
- Split the data into train, validation and test,
- Choose a loss function,
- Run GDA for one epoch,
- Calculate the validation loss. If it decreased compared to the previous epoch (iteration), repeat step 3. If the validation loss starts to increase, stop iterating.
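In keras, this procedure is known as early stopping and is available as a callback. A minimal sketch, assuming a compiled model and training data as in the MNIST example later in this tutorial:
history <- model %>% fit(
  x_train, y_train,
  epochs = 50,                 # an upper bound; training may stop earlier
  validation_split = 0.2,      # hold out 20% of the training data for validation
  callbacks = list(
    callback_early_stopping(monitor = "val_loss", patience = 2)  # stop after 2 epochs without improvement
  )
)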
Loss Functions
Neural networks are generic. We usually don’t care much about the type of the task; it can be regression or classification. The only thing that changes is the loss function we use (a keras sketch follows the lists below).
For regression tasks you can use
- Mean Squared Error
For binary classification tasks you can use
- Binary Cross Entropy Loss
For multiple-output classification tasks you can use
- Cross Entropy Loss
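In keras, the loss is simply an argument to compile(). A minimal sketch, assuming a model object has already been defined:
# regression: mean squared error
model %>% compile(loss = "mse", optimizer = optimizer_adam())
# binary classification: binary cross entropy
model %>% compile(loss = "binary_crossentropy",
                  optimizer = optimizer_adam(), metrics = c("accuracy"))
# multi-class classification: (categorical) cross entropy
model %>% compile(loss = "categorical_crossentropy",
                  optimizer = optimizer_adam(), metrics = c("accuracy"))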
Fitting Neural Networks
library(keras)
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y
dim(x_train)
## [1] 60000 28 28
length(y_train)
## [1] 60000
x_train <- x_train[1:10240,,]
x_test <- x_test[1:2560,, ]
y_train <- y_train[1:10240]
y_test <- y_test[1:2560]
digit <- x_train[3,28:1,1:28]
par(pty="s") # for keeping the aspect ratio 1:1
image(t(digit), col = gray.colors(256), axes = FALSE)
y_train[3] # the label of the digit shown above
## [1] 4
The input image is a 28 by 28 array. We need to flatten it and convert it into (28*28 =) 784 features. Also, neural networks like input values in small ranges, such as 0 to 1 or -1 to 1. So we rescale the inputs to lie between 0 and 1:
# reshape
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
# rescale
x_train <- x_train / 255
x_test <- x_test / 255
The outputs are not binary (0 and 1); they are categorical (i.e. the digits 0, 1, …, 9). We need to convert the output to a categorical (one-hot) encoding:
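With keras this is a single call to to_categorical(); a sketch of the encoding step:
# one-hot encode the digit labels (10 classes: digits 0-9)
y_train <- to_categorical(y_train, 10)
y_test <- to_categorical(y_test, 10)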
model <- keras_model_sequential()
model %>%
layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
layer_dense(units = 128, activation = 'relu') %>%
layer_dense(units = 10, activation = 'softmax')
summary(model)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense_2 (Dense) (None, 256) 200960
## ________________________________________________________________________________
## dense_1 (Dense) (None, 128) 32896
## ________________________________________________________________________________
## dense (Dense) (None, 10) 1290
## ================================================================================
## Total params: 235,146
## Trainable params: 235,146
## Non-trainable params: 0
## ________________________________________________________________________________
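As a quick sanity check, the parameter counts in the summary match the hand calculation from the previous section (weights plus biases per layer):
784 * 256 + 256   # first hidden layer (dense_2): 200,960
256 * 128 + 128   # second hidden layer (dense_1): 32,896
128 * 10 + 10     # output layer (dense): 1,290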
model %>% compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_adam(),
metrics = c('accuracy')
)
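The model is then trained and evaluated on the held-out test set. A minimal sketch (the number of epochs and the batch size here are assumptions):
model %>% fit(x_train, y_train,
              epochs = 10, batch_size = 128,  # assumed settings
              validation_split = 0.2)
model %>% evaluate(x_test, y_test)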
## loss accuracy
## 0.2316186 0.9324219
SVM & Decision Tree
library(keras)
mnist <- dataset_mnist()
x_train <- mnist$train$x
y_train <- mnist$train$y
x_test <- mnist$test$x
y_test <- mnist$test$y
dim(x_train)
## [1] 60000 28 28
length(y_train)
## [1] 60000
x_train <- x_train[1:10240,,]
x_test <- x_test[1:2560,, ]
y_train <- y_train[1:10240]
y_test <- y_test[1:2560]
# reshape
x_train <- array_reshape(x_train, c(nrow(x_train), 784))
x_test <- array_reshape(x_test, c(nrow(x_test), 784))
# rescale
x_train <- x_train / 255
x_test <- x_test / 255
library(rpart)
library(e1071)
dat_train <- data.frame(y=as.factor(y_train), x=x_train)
dat_test <- data.frame(y=as.factor(y_test), x=x_test)
dim(dat_train)
## [1] 10240 785
# fit <- svm(y~ ., data=dat_train)
# save(fit, file="svmfit.rda")
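# training the SVM on 10,240 images takes a while, so a previously saved fit is loaded instead of refitting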
load("svmfit.rda")
preds <- predict(fit, dat_test, type="class")
MLmetrics::Accuracy(preds, y_test)
## [1] 0.8910156
fit <- rpart(y~ ., data=dat_train)
preds <- predict(fit, dat_test, type="class")
MLmetrics::Accuracy(preds, y_test)
## [1] 0.5679688