Setting up Data for Neural Networks in R

This post walks through how to set up data for a Tensorflow neural network in R. Tensorflow is a deep learning framework that was developed by Google and is now an open source project with contributions from around the world. Although it was originally written for use in Python, an R wrapper called "tensorflow" has been created that can be installed like any other R package.

I found that the combination of this Tensorflow RStudio tutorial and this Google presentation does a great job of getting newbies acclimated to the thought process, math, and coding syntax of Tensorflow neural networks.

However, both of the above-mentioned resources showcase how Tensorflow works with the "mnist" dataset that is included as part of the tensorflow package and is already formatted to be fed into a Tensorflow model. Unfortunately for us, real-world data rarely arrives packaged this neatly.

This tutorial downloads the same "mnist" data from Kaggle's Digit Recognizer competition, which requires the user to parse and transform the data's labels before they can be fed into a Tensorflow model. For the sake of simplicity, this tutorial uses a single-layer neural network. More complex models that produce greater accuracy (and use far more computing power) are explained in the Tensorflow RStudio tutorial and Google presentation linked above.

The first thing to do is load our packages, download the data set from Kaggle, and review its format. If you don't yet have the necessary packages installed, use the first three lines of the code below to do so.

install.packages("data.table")
install.packages("caret")
install.packages("tensorflow")
#load the packages
library(data.table)
library(caret)
library(tensorflow)

#read the data from the Kaggle website
train_data <- fread("https://storage.googleapis.com/kaggle-competitions-data/kaggle/3004/train.csv?GoogleAccessId=competitions-data@kaggle-161607.iam.gserviceaccount.com&Expires=1498767082&Signature=OdedHTphJQnGivk14spG9rHyBoZnm2%2FvOi%2B8NueBOuSJ8norVHcqiYHOkSWdSgzxmnR0O503Zml08C6pr%2FB5VGLFHmFa3ovn7KSBkWnwvsnpoiO4%2F5pB0%2Bp9OFLaFhygOaywWoEsNRDESn9UfWMrgTQJrW3XDIMYnrEzSpudkGLB0WBdfNcmrAj5QioI12BiRODuQxh5JnR92baaAJ8it1TVKTGw0hPh1bc09W%2FS%2BNIDAEbi9GpvqRo3OB8SWU%2BDnphoHYWOUjqwO%2FHFQiCRDI%2BDbaAIdhtIAxv3VTILP63fDBiPHwhINyEGC6wPAwdZqlC06BvqjRxVSRpwL4NgcQ%3D%3D")
dim(train_data)
## [1] 42000   785
head(train_data[, 1:6, with = FALSE])
##    label pixel0 pixel1 pixel2 pixel3 pixel4
## 1:     1      0      0      0      0      0
## 2:     0      0      0      0      0      0
## 3:     1      0      0      0      0      0
## 4:     4      0      0      0      0      0
## 5:     0      0      0      0      0      0
## 6:     0      0      0      0      0      0

The data set contains 785 variables on 42,000 different images. Looking at a snippet of the first several variables, we notice that the very first variable is each image's label (i.e., the ground truth for which digit the handwritten image represents). For a neural network classification model, Tensorflow needs to learn a weight for each of the 784 pixel values plus a bias, and it needs one output neuron for each possible classification. Since each image can only be one of the digits 0-9, 10 output neurons are needed.
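As a quick sanity check before we start transforming anything, we can confirm that all ten digits actually appear in the data and see how the 42,000 images are distributed across them:

#count how many training images there are for each digit (0 through 9)
table(train_data$label)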

Tensorflow models expect their ground truth labels and predictor variables in two separate matrices (similar to other modeling tools such as xgboost). So the next thing we do is separate the image pixel information from the labels.

labels1 <- train_data[, 1, with = FALSE]
images <- train_data[, -1, with = FALSE]
head(labels1)
##    label
## 1:     1
## 2:     0
## 3:     1
## 4:     4
## 5:     0
## 6:     0

Next, we turn the labels1 data into a "one-hot" encoded matrix with 10 columns, where each row has a single "1" identifying which digit the corresponding image represents and "0" in all other positions.

#create a matrix of zeros to hold the one-hot encoded labels
labels <- matrix(data = 0, nrow = nrow(labels1), ncol = max(labels1$label) + 1)

#loop over the rows, placing a "1" in the column that matches each image's digit
for(i in 1:nrow(labels1)) {
  labels[i, as.integer(labels1[i, 1, with = FALSE])+1] <- 1
}

head(labels)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    1    0    0    0    0    0    0    0     0
## [2,]    1    0    0    0    0    0    0    0    0     0
## [3,]    0    1    0    0    0    0    0    0    0     0
## [4,]    0    0    0    0    1    0    0    0    0     0
## [5,]    1    0    0    0    0    0    0    0    0     0
## [6,]    1    0    0    0    0    0    0    0    0     0
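As an aside, the same one-hot matrix can be built without a loop by indexing the matrix with (row, column) pairs. This is just a sketch of an equivalent vectorized approach, assuming the label column holds the integer digits 0 through 9:

#vectorized alternative: place a "1" at each (row, digit + 1) position in one step
labels_vec <- matrix(0, nrow = nrow(labels1), ncol = 10)
labels_vec[cbind(seq_len(nrow(labels1)), labels1$label + 1)] <- 1

#confirm it matches the loop-based version above
identical(labels, labels_vec)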

The image data we downloaded from Kaggle is not already normalized like the mnist data provided with the tensorflow package. Thus, we need to scale all of the pixel values to lie between 0 and 1 with the function below.

range01 <- function(x){(x-min(x))/(max(x)-min(x))}
images <- range01(images)
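Since the raw pixel intensities span 0 to 255 across the data set as a whole, simply dividing by 255 would give the same result. Either way, a quick check confirms that the rescaled values now fall between 0 and 1:

#equivalent shortcut, given that the raw intensities span 0 to 255:
#images <- images / 255

#sanity check: all rescaled pixel values should lie between 0 and 1
range(as.matrix(images))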

I have always thought Kaggle should call its competitions' "test" data sets "validate", since that is what competitors are scored on after using the "train" data sets to train and test their models. In that opinionated spirit, let's split the images and labels data sets into training and testing sets so we can measure the accuracy of our model.

inTrain <- sample(1:nrow(train_data), 40000, replace = FALSE)

train_images <- images[inTrain, ]
test_images <- images[-inTrain, ]
train_labels <- labels[inTrain, ]
test_labels <- labels[-inTrain, ]

rm(train_data, images, labels)
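It's worth a quick check that everything lines up before we hand the data to Tensorflow: 40,000 training rows and 2,000 testing rows, with 784 pixel columns and 10 one-hot label columns.

#confirm the dimensions of each piece
dim(train_images)   #40000 rows, 784 pixel columns
dim(test_images)    #2000 rows, 784 pixel columns
dim(train_labels)   #40000 rows, 10 one-hot columns
dim(test_labels)    #2000 rows, 10 one-hot columns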

The next block of code is the "construction phase" of the Tensorflow program we are developing. The details of this section are not the focus of this particular blog post, so I would encourage readers (again) to check out the Tensorflow RStudio tutorial if they need additional explanation of what's going on here.

#placeholder for the input images of the training set (each 784 pixels long)
x <- tf$placeholder(tf$float32, shape(NULL, 784L))
#variable holding the softmax weights, initialized to zeros
W <- tf$Variable(tf$zeros(shape(784L, 10L)))
#variable holding the biases for each of the 10 output options
b <- tf$Variable(tf$zeros(shape(10L)))

#softmax model implementation using Tensorflow
y <- tf$nn$softmax(tf$matmul(x, W) + b)

#placeholder to input the correct labels of the training data (one-hot encoded)
y_ <- tf$placeholder(tf$float32, shape(NULL, 10L)) 
#cross-entropy error function
cross_entropy <- tf$reduce_mean(-tf$reduce_sum(y_ * tf$log(y), reduction_indices=1L)) 

#using Gradient Descent to optimize the cross-entropy of our training model 
optimizer <- tf$train$GradientDescentOptimizer(0.05) 
train_step <- optimizer$minimize(cross_entropy)
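To make the softmax step a little more concrete, here is a small base-R illustration of what tf$nn$softmax does to a single vector of 10 scores: it exponentiates each score and divides by the sum of the exponentials, producing 10 probabilities that add up to 1. This is purely illustrative and separate from the Tensorflow graph above; the scores are made up.

#plain-R version of softmax for a single vector of scores
softmax <- function(scores) exp(scores) / sum(exp(scores))

example_scores <- c(2, 1, 0.5, 0, 0, 0, 0, 0, 0, -1)  #made-up scores for the 10 digits
round(softmax(example_scores), 3)   #the 10 class probabilities
sum(softmax(example_scores))        #they sum to 1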

Now that we've completed the "construction phase" of our program, it's time to create the "execution phase". This model takes a stochastic approach, training on a random batch of 100 images at a time from the 40,000-image training set, for 3,000 rounds. This is where the work we did separating the data set's labels from its pixel information, and then transforming the labels into a one-hot matrix, pays off.

# Variables must be initialized by running an `init` Op after having
# launched the graph.  We first have to add the `init` Op to the graph.
init_op <- tf$initialize_all_variables()

# Launch the graph and run the ops within a temporary session.
with(tf$Session() %as% sess, {
  # Run the 'init' op
  sess$run(init_op)
  #trains 3000 times over 100 samples from the training data each time
  system.time(
    for (i in 1:3000) {
      batch <- sample(1L:nrow(train_labels), 100L, replace = FALSE)
      batch_xs <- as.matrix(train_images[batch, ])
      batch_ys <- as.matrix(train_labels[batch, ])
      sess$run(train_step,
               feed_dict = dict(x = batch_xs, y_ = batch_ys))
    }
  )
  correct_prediction <- tf$equal(tf$argmax(y, 1L), tf$argmax(y_, 1L)) 
  accuracy <- tf$reduce_mean(tf$cast(correct_prediction, tf$float32))
  print(paste0("Testing Accuracy = ",
               round(sess$run(accuracy,
                        feed_dict=dict(x = as.matrix(test_images),
                                       y_ = as.matrix(test_labels))),3)))

  predictions <- tf$argmax(y, 1L)
  pred <- sess$run(predictions, feed_dict = dict(x = as.matrix(test_images)))
})
## [1] "Testing Accuracy = 0.92"

The final note is to point out the last two lines of the Tensorflow session above. The predictions op picks, via tf$argmax, the most probable of the 10 digits for each image, and running it with sess$run stores those predicted digits in pred. This is how we can generate predictions on new data for those who need them.

#look at the first few predictions of the testing set
head(pred)
## [1] 1 6 6 9 8 7
#confusion matrix of predicted vs. actual digits, using the 'confusionMatrix' function from the 'caret' package
confusionMatrix(pred, labels1[-inTrain]$label)$table
##           Reference
## Prediction   0   1   2   3   4   5   6   7   8   9
##          0 198   0   2   0   0   4   1   0   1   0
##          1   0 200   1   1   0   3   0   2   3   0
##          2   1   1 190   3   1   1   2   2   3   0
##          3   0   2   1 184   0   7   0   2   5   3
##          4   2   0   3   0 172   3   0   4   1   3
##          5   2   1   0   4   0 158   1   0   7   1
##          6   1   2   3   2   3   3 200   0   2   0
##          7   0   1   5   1   1   0   0 176   0   4
##          8   1   1   4   6   1   4   1   0 176   1
##          9   0   0   2   3  11   3   0   6   4 186
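Finally, if the goal were to submit to Kaggle's Digit Recognizer leaderboard, the same sess$run(predictions, ...) call could be pointed at Kaggle's separate test.csv file (normalized the same way as above) and the results written out in the competition's ImageId/Label submission format. A rough sketch of that last step, where kaggle_pred is assumed to hold one predicted digit for each row of Kaggle's test.csv:

#write predictions in the ImageId/Label format Kaggle expects
#kaggle_pred is assumed to be a vector of predicted digits, one per test.csv row
submission <- data.frame(ImageId = seq_along(kaggle_pred), Label = kaggle_pred)
write.csv(submission, file = "submission.csv", row.names = FALSE)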
