Various machine learning algorithms, as key parts of Artificial Intelligence (AI), have been further developed as data size and computing resources have increased. Statistical models and algorithms perform specific tasks effectively by relying on patterns and inference. Deep Neural Networks (DNN), decision-tree-based methods including Random Forest (RF) and Extreme Gradient Boosting (XGBoost), and Support Vector Machines (SVM) are remarkable for real-world regression and classification problems.
Here, we focus on the deep neural network, also known as 'Deep Learning'. Deep learning is a combined approach to machine learning that has drawn on the knowledge of the human brain, statistics, and applied math (Goodfellow et al., 2016).
The deep learning method was not preferred in the past because it did not work well; however, it is very widely used now thanks to several breakthroughs that made it practical and popular. We summarize a brief history of deep learning, starting from the 1940s.
Deep Learning appeared for the first time in the 1940s, and was implemented in its basic form until the early 1980s.
McCulloch and Pitts (1943) created a simple linear model based on the human brain and introduced the concept of the Artificial Neural Network (ANN) for the first time.
Rosenblatt (1958) implemented a probabilistic model named the Perceptron, training a single neuron to calculate a linear combination of inputs and weights. However, the perceptron was not able to learn complex patterns.
From the 1980s to 2006, the evolution of deep learning can be summarized as the Multilayer Perceptron (MLP) and Back Propagation, the ReLU activation, and the introduction of the Convolutional Neural Network (CNN).
Rumelhart et al. (1986) popularized the back propagation algorithm, which made it possible to train MLPs with hidden layers by propagating the error gradient backward through the network.
LeCun et al. (1998) introduced the convolutional neural network LeNet and applied it successfully to handwritten digit recognition.
The modern concept of 'deep learning' emerged around 2006, when Hinton and his colleagues showed that a deep network could be trained effectively with greedy layer-wise pretraining (Hinton et al., 2006).
With its growing history, deep learning is now commonly used for real-world applications such as image classification, Natural Language Processing (NLP), and other AI studies in computer vision.
In this paper, we consider deep learning algorithms as part of supervised learning and try to explain them from a statistical viewpoint. Deep learning is not based on inference; it is a statistical learning method that estimates a function from data. Many developers have reported achievements in prediction accuracy and efficient use of computing resources, but it is hard for practitioners to fully understand how parameters are used mathematically to compute a loss function and why the estimation works well. We organize these topics on the basis of statistical learning theory and associate them with functions of Keras to help users become more familiar with them. The following descriptions and examples are based on statistical regression and classification models that learn features and patterns from input data and estimate the corresponding targets.
The remainder of the paper is organized as follows. In Section 2, we first explain the basic structures and learning procedures of DNN, including CNN, and describe in detail how parameter estimation is done. A deep network contains millions of parameters, and it can be difficult to estimate them appropriately; in Section 3, we explain some advanced techniques to prevent such problems and increase prediction accuracy. In Section 4, we discuss the results of applying advanced CNN models to image classification problems using two typical data sets: MNIST and CIFAR-10. Section 5 provides the concluding remarks.
In this section, we describe the overall learning procedure of the general deep neural network. First we investigate the essential steps of training a model and explain the whole procedure systematically to understand what a model does to predict targets. We also explain how parameters are estimated. At the end of this section, we turn to the Convolutional Neural Network and describe its basic concepts.
A network's learning can be expressed as finding an optimal set of weights (and biases), where each individual weight and bias is used for computation between layers. The parameter set of a network is composed of all weights and biases in the model. Here 'optimal' means that the weights attain the minimum of the loss defined by the model, and the algorithm updates the weights repeatedly toward it.
The entire training procedure can be summarized as repeating the following loop for a fixed number of iterations, determined by the number of epochs and the batch size.
Draw a minibatch training sample of fixed size (m observations)*.
Along the entire network, transform the input data into high-level features by multiplying weights and passing through activation functions.
Calculate the mean loss over the batch using the true and estimated values of the output.
Compute the gradient of the loss with respect to each weight, on the way back from the output layer to the input layer, and update the weights in the direction that decreases the loss.
Draw the next minibatch sample and iterate.
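The loop above can be made concrete with a toy example. The following sketch is in Python rather than the R Keras interface used later in this paper, and it trains a single weight w of a hypothetical linear model y = wx with minibatch gradient descent on made-up data; it is an illustration of the loop, not Keras's implementation.

```python
import random

# Toy data generated from y = 2x; one weight w to learn under MSE loss.
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]]
w, lr, m = 0.0, 0.1, 2  # initial weight, learning rate, minibatch size

for epoch in range(200):
    random.shuffle(data)                          # draw minibatches
    for i in range(0, len(data), m):
        batch = data[i:i + m]
        preds = [w * x for x, _ in batch]         # forward pass
        grad = sum(2 * (p - y) * x                # gradient of the mean loss
                   for (x, y), p in zip(batch, preds)) / len(batch)
        w -= lr * grad                            # update toward lower loss

print(round(w, 3))
```

After enough iterations, w converges to the data-generating value 2.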
A loss is generally defined as the mean squared error for regression and the categorical cross-entropy for classification. We will explain it later in this section using a matrix form.
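As a minimal sketch (plain Python, not Keras's actual implementations, which differ in numerical details), the two losses can be written as:

```python
import math

def mse(y_true, y_pred):
    # Mean squared error for regression targets.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def categorical_crossentropy(y_true, y_pred):
    # y_true: one-hot vectors; y_pred: predicted class probabilities.
    # The loss for one observation is -sum_k t_k * log(p_k); return the batch mean.
    per_obs = [-sum(t * math.log(p) for t, p in zip(ts, ps))
               for ts, ps in zip(y_true, y_pred)]
    return sum(per_obs) / len(per_obs)

print(mse([1.0, 2.0], [1.5, 1.5]))                             # 0.25
print(categorical_crossentropy([[0, 1, 0]], [[0.1, 0.8, 0.1]]))  # -log(0.8)
```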
We can set initial random values for all parameters (weights and biases) in the network once we define our own network structure. These random values can be drawn from certain probability distributions, including the normal and the uniform. Setting the initial values should be considered carefully because it may have significant effects on the subsequent learning procedure.
For example, suppose that we set them all to zero or to a certain identical constant. The initial learning step uses those parameters for computation in all layers from input to output. If they are all zeros, the initial outputs from the network become zeros and are useless for the following steps. If they are all the same constant, the nodes act like a single neuron regardless of the number of nodes and neurons in the network. Either way, learning cannot proceed properly, so we need to choose different values in a reasonable range. The choice of the initial values depends on the activation functions in the network. Figure 1 and Table 1 present three commonly used activation functions: sigmoid, tanh, and ReLU.
The function outputs of sigmoid and tanh saturate for inputs with large absolute values; if the initial weights are too large, the corresponding gradients become nearly zero and learning slows down from the start.
We need to set a proper range of random initial weights to prevent those potential problems. In general, uniform or normal distributions with mean zero are the most popular choices for initial values. The critical point is the variance of the distribution, and there are some empirical initialization schemes that depend on it. Xavier's initialization is now commonly used to generate random values, setting the variance on the basis of the numbers of input and output nodes of each pair of adjacent layers. Glorot and Bengio (2010) presented the idea that keeping the Jacobian of the weight matrix in each layer close to 1 can make training faster and easier. Xavier's normalized initialization considers the variance of weights from both the forward and the backward propagation perspectives; it keeps the variance of the activation values and of the gradients close to 1 so that neither shrinks across layers. Hence, it sets the variance of the weights between two adjacent layers to 2/(n_in + n_out), where n_in and n_out are the numbers of input and output nodes.
He's initialization (He et al., 2015) modifies this idea for ReLU activations, which pass only positive values; it sets the variance of the weights to 2/n_in.
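Both schemes are easy to sketch. The following Python snippet is an illustration of the uniform variants (an illustration only, not Keras's internal code); a uniform distribution on (-a, a) has variance a²/3, so the limits below yield the variances stated above.

```python
import math, random

def glorot_uniform(n_in, n_out):
    # Xavier/Glorot: Var(W) = 2 / (n_in + n_out),
    # i.e. U(-limit, limit) with limit = sqrt(6 / (n_in + n_out)).
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[random.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]

def he_uniform(n_in, n_out):
    # He: Var(W) = 2 / n_in for ReLU layers, i.e. limit = sqrt(6 / n_in).
    limit = math.sqrt(6.0 / n_in)
    return [[random.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]

W = glorot_uniform(784, 128)
limit = math.sqrt(6.0 / (784 + 128))
print(all(-limit <= w <= limit for row in W for w in row))  # True
```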
In Keras, for real-world applications, the default weight initializer (set by the 'kernel_initializer' argument) is 'glorot_uniform', the Xavier uniform initialization, and the biases are initialized to zeros by default.
Given the initial weights, a batch of training samples enters the network and the network calculates the outputs through the sequential layers from beginning to end. Each layer computes output = g(input × W + b), where W is the weight matrix, b the bias vector, and g the activation function.
These are the notations for our quick example in this section: X is the (m × p) matrix of a minibatch of inputs; W[l] and b[l] are the weight matrix and bias vector of layer l; g[l] is its activation function; and A[l] is its matrix of activated output values.
*One might consider using all training data at once for each iteration; however, we employ the more common minibatch training that uses m observations at each iteration. More details about the batch size are in Section 4.
INPUT → HIDDEN[1]: A[1] = g[1](X W[1] + b[1])
In each layer's linear combination, the bias vector has length equal to the number of nodes in that layer and is added to every row of the product.
HIDDEN[1] → HIDDEN[2]: A[2] = g[2](A[1] W[2] + b[2])
HIDDEN[2] → OUTPUT: Ŷ = g[3](A[2] W[3] + b[3])
LOSS: E = (1/m) Σ_i loss(y_i, ŷ_i)
Calculate a loss for each observation and average the m values to obtain the mean loss E of the minibatch.
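The forward pass can be sketched in plain Python. The network below is a smaller, hypothetical one (one hidden layer and made-up weights, rather than the two-hidden-layer example of this section), shown only to illustrate the matrix computation:

```python
def dense(A, W, b, g):
    # One layer: activation(A W + b) for a batch of m input rows.
    out = []
    for row in A:
        z = [sum(a * w for a, w in zip(row, col)) + bk
             for col, bk in zip(zip(*W), b)]
        out.append([g(v) for v in z])
    return out

relu = lambda v: max(0.0, v)
ident = lambda v: v

X  = [[1.0, 2.0], [0.5, -1.0]]            # m = 2 observations, 2 features
W1 = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]]; b1 = [0.0, 0.1, 0.0]
W2 = [[0.2], [0.5], [-0.3]];               b2 = [0.1]

A1   = dense(X, W1, b1, relu)             # INPUT -> HIDDEN
Yhat = dense(A1, W2, b2, ident)           # HIDDEN -> OUTPUT
y    = [1.0, 0.0]
E = sum((t - p[0]) ** 2 for t, p in zip(y, Yhat)) / len(y)  # mean squared loss
print(Yhat, round(E, 4))
```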
Many machine learning algorithms use Gradient Descent (GD) methods to update parameters so as to gradually reduce the objective function value and reach the optimum in the end; DNN also uses this powerful method. The loss function E, defined as the objective of the deep network, is a differentiable function of the parameter set of individual weights W. Therefore, it is possible to compute the differentials of the loss, or the gradient vector, for the weights in each layer. After obtaining the gradient for a certain weight, the model can update that weight in the descent direction of the loss as follows: W ← W − η ∂E/∂W, where η is the learning rate.
The model has to calculate millions of gradients and update all parameters using GD in each iteration. Back propagation makes this computation very efficient: right after a forward pass is done, starting from the loss at the output layer, the model passes each gradient back to the previous layer, toward the input layer. It greatly decreases the learning time.
Depending on the number of observations used at a time, the backward passing method can be distinguished as follows.
Batch Gradient Descent uses all n training observations for every single update.
Stochastic Gradient Descent (SGD) uses one randomly chosen observation per update.
Minibatch Gradient Descent, the most common choice, uses a minibatch of m observations per update.
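The gradient descent variants (batch, stochastic, and minibatch) differ only in how many observations feed each update: with n training observations and batch size m, one epoch performs ⌈n/m⌉ updates. A quick illustration in Python (the counts below assume the MNIST training set size of 60,000 as an example):

```python
# Updates per epoch for each gradient descent flavor; the variants differ
# only in the number of observations m consumed per weight update.
n = 60000
for name, m in [("batch GD", n), ("SGD", 1), ("minibatch GD", 128)]:
    updates_per_epoch = -(-n // m)   # ceiling division
    print(name, updates_per_epoch)
```

This prints 1 update per epoch for batch GD, 60,000 for SGD, and 469 for minibatch GD with m = 128.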
After updating all weights in the network, the network inputs the next minibatch sample and repeats the forward and backward steps. In each iteration, the forward step draws a fixed number of observations and computes the output values for those observations and the resulting mean loss using the most recently updated weights. The backward step computes the gradients of that loss with respect to the weights, propagates them back, and updates all weights. The entire training procedure for a fixed number of epochs adjusts the weights repeatedly by performing forward and backward propagation while expecting the loss to decrease.
CNN, as a specifically structured DNN, uses conceptually identical training and parameter estimation steps. The difference lies in what happens in the 'Convolution' and 'Pooling' layers, each of which conducts a different task.
Suppose an input of a CNN is an image with a fixed size.
A convolutional layer slides a set of small filters over all local regions of the input and computes an activated feature map for each filter.
A pooling layer summarizes each local region of a feature map, typically by its maximum (max pooling), reducing the spatial size of the features.
After the last block, a flatten layer rearranges the final feature maps into a single vector for the fully-connected part of the network.
A typical CNN structure for classification stacks one or more convolutional layers and pooling layers alternately, which conduct feature extraction and subsampling. The final output features from those stacked blocks are flattened and fully-connected (FC) with all nodes of the output layer, which conducts the classification. One can consider adding one or more FC hidden layers right before the final output layer in order to improve classification performance. Figure 3 simplifies this structure, which is based on LeNet (LeCun et al., 1998).
The goal of training a CNN is to extract, by learning all the weights in the filters, the important features of each input image that decide its class. If we use a large filter, the model tries to find features in a large area of the input at each computation, whereas a small filter makes the network view a small area. The filter size is thus one of the tuning parameters, set depending on what we expect from the model.
CNN has a great advantage over MLP: it can drastically reduce the number of parameters to estimate by sharing filters across different local regions of the input. A filter visits all parts of an image and performs an identical computation everywhere. Moreover, it regards several neighboring pixels as one feature of a local region, rather than treating each pixel as an unrelated feature.
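Filter sharing can be made concrete with a small sketch. The plain-Python function below (an illustration with a made-up 4 × 4 image and a 2 × 2 filter, stride 1 and no padding) applies the same four weights to every local region:

```python
def conv2d_valid(image, kernel):
    # Slide one shared kernel over every local region ('valid' padding,
    # stride 1); the same weights are reused at each position.
    H, W = len(image), len(image[0])
    k = len(kernel)
    out = []
    for i in range(H - k + 1):
        row = []
        for j in range(W - k + 1):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(k) for b in range(k))
            row.append(s)
        out.append(row)
    return out

image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [0, 1, 2, 3]]
kernel = [[1, 0], [0, -1]]   # 4 shared weights reused on all 9 regions
print(conv2d_valid(image, kernel))
```

The filter has only 4 parameters yet produces all 9 output values; a fully-connected layer mapping the 16 inputs to 9 outputs would instead need 16 × 9 weights plus biases.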
In this section, we illustrate an example of a 2D CNN for image classification in the Keras context to see the structure of a CNN. Suppose each input is a grayscale image with a height and width of 28 pixels each. Each pixel has a single channel, so the input has shape (28, 28, 1), which is the input shape of MNIST images. The number of channels is 3 if we consider color images with RGB values. Let the number of unique classes be 10. We consider a CNN structure with two repeated convolution-pooling blocks.
With some hyperparameters defined in Table 2, Table 3 and Figure 4 summarize the model structure of our example. We suppose that the activation function of each convolutional layer is ReLU.
We will explain several options that make the structure diverse in detail.
(Zero) 'padding' attaches zeros around the border of the input so that the filter can also visit the edge pixels; with the option 'same', the output keeps the same spatial shape as the input, while the default 'valid' uses no padding and shrinks the output.
'strides' sets how many pixels the filter moves at each step; a stride larger than 1 reduces the size of the output feature map.
The 'pool_size' of a pooling layer sets the size of the local region to be summarized; a (2, 2) max pooling halves the width and height of a feature map.
The illustration of the computation in Figure 4 represents the matrix transformation from an input image array with a single channel to high-level feature maps with 64 channels. Table 3 shows that the number of parameters in the conv2D and dense layers varies from layer to layer, while the number of parameters in the pooling and flatten layers is zero. This is because the pooling and flatten layers do not contain any parameters to estimate; they just reduce and summarize features, or rearrange them in a row.
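The parameter counts in Table 3 can be verified directly: a convolutional layer has (kernel height × kernel width × input channels + 1 bias) parameters per filter, and a dense layer has (input units + 1) parameters per output unit. A quick Python check (illustrative only):

```python
def conv2d_params(kh, kw, in_ch, filters):
    # Each filter has kh*kw weights per input channel plus one bias.
    return (kh * kw * in_ch + 1) * filters

def dense_params(in_units, out_units):
    # Fully-connected: one weight per input per unit, plus one bias per unit.
    return (in_units + 1) * out_units

print(conv2d_params(3, 3, 1, 32))    # first conv2D layer: 320
print(conv2d_params(3, 3, 32, 64))   # second conv2D layer: 18496
print(dense_params(1600, 10))        # output dense layer: 16010
```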
CNNs are designed to deal with 2D images; however, they can also accept 1D-shaped input data such as a time series, a text regarded as a series of words, and other sequential data. The algorithm is identical to the 2D CNN except that the filters also have 1D shapes. The filters in a 1D CNN slide over an input sequence in one direction and compute an activated value for each local subsequence.
Table 4 and Table 5 summarize an example of a 1D CNN for a binary classification with two classes. It can be used for a case of text sentiment classification where 0 and 1 indicate negative and positive, respectively. You can see that the only differences between the 2D CNN and the 1D CNN are the shapes of the inputs and filters. Note that the objective loss function for this example is the binary cross-entropy, and the last activation function is set to sigmoid.
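A 1D convolution can be sketched in a few lines of plain Python (a made-up sequence and filter, stride 1, no padding; an illustration, not the Keras implementation):

```python
def conv1d_valid(seq, kernel):
    # Slide a shared 1D filter along the sequence, one step at a time.
    k = len(kernel)
    return [sum(seq[i + a] * kernel[a] for a in range(k))
            for i in range(len(seq) - k + 1)]

seq = [1, 3, 2, 5, 4, 0]
print(conv1d_valid(seq, [1, 0, -1]))   # differences of values two steps apart
```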
In a large deep neural network for a supervised learning, the model contains a huge number of parameters and is often very complex. The performance sometimes becomes poor on a validation or a holdout test set if the model is too complex. This ‘overfitting’ problem frequently occurs in many machine learning models. Figure 5 shows that the loss of the training dataset decreases continuously while the loss of the validation set decreases at early steps and starts to increase after a certain point.
The parameter estimation procedure relies on the initial weights; consequently, the hyperparameters should be tuned by grid search, random search, or more advanced methods such as Bayesian optimization. When training a large network, the learning procedure can easily go the wrong way.
Recently, some advanced techniques have been developed for training a large neural network to overcome these difficulties and improve prediction performance. We can use those techniques by adding an extra step between some training steps in the layers. In this section, we describe two main techniques that have greatly contributed to the development of deep learning: Dropout and Batch Normalization.
Dropout was presented in 2012 by Hinton and his colleagues to prevent complex co-adaptations of the feature detectors on the training data (Hinton et al., 2012).
A dropout procedure randomly omits (or sets to 0) some nodes in a certain hidden layer at each update in each training epoch. The choice of which nodes to drop is decided at random with probability p, the dropout rate, so that each node is retained with probability 1 − p.
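A minimal sketch of this procedure in plain Python (the so-called inverted dropout, which also underlies the Keras implementation; the activation values are made up for illustration):

```python
import random

def dropout(values, rate, training=True):
    # Inverted dropout: drop each node with probability `rate` and scale
    # the survivors by 1/(1 - rate), so the expected value is unchanged
    # and no rescaling is needed at prediction time.
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if random.random() >= rate else 0.0 for v in values]

random.seed(0)
activations = [0.5, 1.2, -0.3, 0.8, 2.0]
print(dropout(activations, rate=0.5))           # some zeros, others doubled
print(dropout(activations, rate=0.5, training=False))  # unchanged at test time
```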
Along with this technique, Hinton's research group reported experimental results on seven datasets that included image datasets such as MNIST, CIFAR-10, and ImageNet. They found that dropout improved the performance on all datasets compared to neural networks without dropout (Srivastava et al., 2014).
Dropout can be used in general deep neural networks for various applications including image classification, speech recognition, and document classification. In Section 4, we apply the dropout technique in a case study of a 2D CNN on the MNIST dataset to see how much it improves the classification performance.
Many applications of deep learning in various domains use stochastic gradient descent generalized to a minibatch of samples at a time, to reduce the computational complexity and obtain effective learning. However, training such deep networks is usually difficult because the loss tends to fluctuate at every update. Ioffe and Szegedy (2015) claimed that "While stochastic gradient is simple and effective, it requires careful tuning of the model hyperparameters, specifically the learning rate used in optimization, as well as the initial values for the model parameters". They developed Batch Normalization in 2015 and achieved more than 95% test accuracy in ImageNet classification. It is now widely used in many advanced CNN architectures, and Keras provides applications of those architectures such as ResNet50, Inception V3, and Xception (Chollet, 2017).
As we described in the matrix form in Section 2, all the inputs of each layer except for the first input layer are the activated values (denoted A) of the previous layer. Batch normalization standardizes each of these inputs over the current minibatch to have mean zero and variance one, and then scales and shifts it with two additional parameters learned during training; this stabilizes the distribution of the inputs of each layer across updates.
In Keras, we can separate an activation layer from a convolutional layer or a fully-connected layer and put a batch normalization layer between them.
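A minimal sketch of the batch normalization transform in plain Python (per-feature statistics over a single minibatch; the moving averages that Keras keeps for prediction time are omitted, and the data are made up):

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-3):
    # Standardize each feature over the minibatch to mean 0 and variance 1,
    # then scale by gamma and shift by beta (both learned during training).
    n = len(batch)
    out = [[0.0] * len(batch[0]) for _ in range(n)]
    for j in range(len(batch[0])):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = gamma[j] * (col[i] - mean) / math.sqrt(var + eps) + beta[j]
    return out

batch = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]   # m = 3, two features
print(batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0]))
```

With gamma = 1 and beta = 0 each column of the output has mean zero and variance close to one, whatever the scale of the raw inputs.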
This section aggregates the main concepts and techniques in CNN models and applies them to two widely used datasets for image classification study. We first use the standard MNIST handwritten digits dataset (LeCun et al., 1998), and then consider the more complex CIFAR-10 dataset.
The Keras interface provides examples for training neural networks, including MLP, CNN, and RNN, on real datasets demonstrated with R code. Among them, we built a convolutional network following a simple CNN implementation example on the MNIST dataset. Table 6 shows that the model has two conv2D layers and one max-pooling layer in succession. It then contains an additional hidden dense layer right after the values are flattened. Both before the flatten layer and after the hidden layer, the model also has dropout steps to avoid potential overfitting. All images have an identical shape of (28, 28, 1); the first two dimensions are the size of the images, and the last dimension is one because they are grayscale. The digits are located in the middle of the images. The output shapes of the convolutional layers become gradually smaller and deeper because we did not pad the images. We trained the model for 12 epochs with batch size 128, as in the example, and saved the last model after the final epoch ended. We randomly split a 10,000-image validation set from the 60,000 training images by using the 'validation_split' option.
We trained another model with an identical structure, only without the dropout layers, again splitting 20% of the original training set at random in the fitting context. Table 7 summarizes the evaluated statistics for each data set. The model with dropout layers shows similar accuracies on the training and test sets, whereas the training accuracy of the model without dropout layers is close to one while its test accuracy is 98.88%. This suggests that the dropout layers helped the network not to overfit the training set.
We also applied several machine learning models to the same data set in order to put the high performance of the trained CNNs above in context; however, these models require a different data structure from CNNs, in which each pixel value becomes an independent feature of an image. Therefore, we used 786 predictor variables: 784 pixel values plus two summary statistics, the mean and standard deviation of all pixels. We tuned all three models using cross-validation. In the Random Forest, we adjusted parameters including the number of variables randomly sampled as candidates at each split and selected the best values by comparing OOB error rates. In XGBoost and KNN, we used 2-fold CV.
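The 786-feature construction for these baseline models can be sketched as follows (plain Python with a made-up image; our experiments themselves used R):

```python
import math

def flatten_features(image):
    # Turn a 28x28 grayscale image into 786 features: the 784 raw pixel
    # values plus their mean and standard deviation.
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    std = math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))
    return pixels + [mean, std]

image = [[(i + j) % 256 for j in range(28)] for i in range(28)]  # dummy image
features = flatten_features(image)
print(len(features))   # 786
```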
Figure 7 illustrates the error rates of the five different models on the same test images. Even the CNN without dropout performs better than the other three models.
CIFAR-10 (Krizhevsky, 2009) consists of 60,000 color images of shape (32, 32, 3) in ten classes, split into 50,000 training images and 10,000 test images.
In this case, we started from a basic structure that has three conv2D-pooling blocks and tried to find the optimal structure that performs well on unknown test images. Each block has one or more convolutional layers followed by one max-pooling layer. We randomly divided the original training images into 40,000 training images and 10,000 validation images. Then, monitoring the validation loss and accuracy, we adjusted the structure of the network in various ways: adding more convolutional layers to each block, adding fully-connected hidden layers before the output layer, or applying the batch normalization technique to each convolutional layer and dropout after a hidden dense layer to prevent overfitting. We fixed all dropout rates at 0.1.
We fixed the other hyperparameters as described in Table 8. In each convolutional layer, we used zero-padding to maintain the output shapes regardless of the number of convolutional layers in a block, and used the ReLU activation. We also used 'He Uniform' to draw the initial weights in each convolutional and dense layer; it can be set in Keras layers using the 'kernel_initializer' argument.
Keras provides several useful callbacks during training to save the best model, which has the optimal monitored statistic, and to stop training at an early epoch. In the R interface they can be written, for instance, as
callback_model_checkpoint(filepath, monitor = 'val_loss', save_best_only = TRUE)
callback_early_stopping(monitor = 'val_loss', patience = 20)
The example functions above specify a filepath and monitor the validation loss at the end of each epoch. If the current validation loss has decreased, the current model is saved to the filepath; otherwise, the next epoch simply begins. If the validation loss is at its minimum at the 10th epoch and does not decrease for the next 20 epochs, the last saved point is the 10th epoch and the entire learning is stopped at the 30th epoch by the early stopping criterion. In our experiment, we trained eight different models with varying structures and saved each model by monitoring the validation accuracy with a patience of 20 epochs (Table 9).
We usually set 100 epochs in total with a patience of 20 epochs, and used 300 epochs with a patience of 50 for the deeper networks that contained fully-connected hidden layers.
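The combined checkpoint and early stopping logic can be sketched independently of Keras (plain Python; the loss sequence below is made up for illustration):

```python
def early_stopping_epoch(val_losses, patience):
    # Return (best_epoch, stop_epoch): the model is "saved" whenever the
    # monitored loss improves; training stops after `patience` epochs
    # without improvement (epochs numbered from 1).
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0   # checkpoint saved
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# The loss bottoms out at epoch 3; with patience 2, training stops at epoch 5.
print(early_stopping_epoch([0.9, 0.7, 0.5, 0.6, 0.55, 0.4], patience=2))
```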
Figure 9 and Table 9 show the error rates on the 10,000 test images. We used the validation images to select the best model for each structure and compared the performances on the test images. In summary, the test error decreases as the number of convolutional layers per block increases. The batch normalization technique in the convolutional layers is also helpful. However, additional deep hidden dense layers do not improve the test performance consistently.
Models M6 and M7 both seem to be the best. The only difference between the two models is whether the model contains an additional hidden layer (and the following dropout layer). Because the test performances are very similar, we selected the simpler model M6 as our final model. It has three repeated blocks of three convolutional layers with one max-pooling layer, and each convolutional layer conducts batch normalization before activating the values.
We also compared our final CNN model with the other machine learning methods: Random Forest, XGBoost, and KNN. For those methods, we used the 32 × 32 = 1024 pixel values and their mean and standard deviation for each color channel; an image therefore has (1024 + 2) × 3 = 3078 predictor variables, a significantly large feature dimension. It also involved a lengthy training time: it took about half a day to train the XGBoost model with tuning, while the CNN models took less than half an hour on a GPU on average (for the most complex CNN, we needed about two minutes per epoch on a CPU and less than 30 seconds on a GPU). For each machine learning method, we tuned the models using the same validation set as for our CNNs.
Figure 10 compares the performances on the same 10,000 test images. Unlike MNIST, the gap between the CNN and the other methods is large. Color images have a very complex data structure, so the CNN benefits more from reducing the number of parameters and capturing local regions at once.
Increases in computing resources and data size have led more people to pay attention to deep learning. Keras helps them quickly get used to deep learning methods through its core functions. The deep neural network is useful for supervised learning with unconventional data such as images and texts. We organized the concepts of deep learning for statisticians in terms of parameter estimation procedures and the advanced techniques that make a model perform better, and described the computation of a deep network in matrix form to clarify the procedure. We also focused on the convolutional neural network, which excels at dealing with image data, and found that it performs better than other machine learning methods that cannot learn local features.
We distinguished the parameters to be estimated from the main hyperparameters to be predefined before fitting a model in Table 10, particularly in the convolutional, pooling, and dense layers. Among them, the number of units (or filters) and the kernel size decide the number of parameters and their shapes in the layer, while others, such as the dropout rate, initial values, optimizers, epochs, batch size, and patience of early stopping, affect the progress of training.
Users can vary their models from three main perspectives.
 Hyperparameters in each layer that contains parameters to estimate
 Initializers (distributions) for sampling random values of parameters before training begins
 Optimizers (algorithms) for updating parameters and their options such as a learning rate
In our experiments on CIFAR-10, we varied the number of layers, concentrating on the depth of the CNN models. VGGNet (Simonyan and Zisserman, 2014) also emphasizes the depth of a CNN as an important design aspect, with other parameters fixed. Zhang and Wallace (2015) varied the number and size of filters while fixing the number of convolutional layers at 1, and applied the models to seven datasets for sentence classification; they found that the number and size of filters can have important effects and should be tuned. Users can construct their own CNN models with experiments on the depth of layers and the number and size of filters.
We used
We found that a network with more stacked convolutional layers learns higher-level features from the input images and estimates the outputs on the CIFAR-10 images more accurately. To improve the model performance, we might therefore stack the convolutional layers deeper rather than add fully-connected layers at the end of the network. Our objective is to understand the network clearly rather than to obtain the best accuracy; still, there are many large convolutional networks that have achieved outstanding accuracy in image classification on MNIST and CIFAR-10.
We also verified the advantages of the two advanced techniques: dropout helped the model estimate parameters without focusing excessively on the training set, and batch normalization helped the model achieve better prediction accuracy. All the R code of our experiments is written using Keras and accessible on the web (http://home.ewha.ac.kr/~josong/CNNCSAM.html). It covers defining, training, and evaluating the models, as well as transforming the data shapes to fit each model used in this paper. It would be helpful for anyone who wants to fit a CNN for their own study.
Three widely used activation functions.
A diagram of forward pass with two hidden layers.
A typical structure of deep CNN between input and output layer.
Example: An illustration of a computation from an input image to output.
Example: Evaluated losses by epoch on both a training and validation set in a CNN model.
MNIST; Training history of the model in Table 6.
MNIST; Test error performances of five different machine learning models.
CIFAR-10; Sampled images of ten different classes.
CIFAR-10; Test error rates of all eight CNN models.
CIFAR-10; Test error performances of four different machine learning models.
Three commonly used activation functions

Function  Equation

Sigmoid  1 / (1 + e^(−x))
ReLU  max(0, x)
tanh  (e^x − e^(−x)) / (e^x + e^(−x))
Example: Hyperparameters setting in each conv2D layer
Convolutional layer  Number of filters  Kernel size 

1  32  (3, 3) 
2  64  (3, 3) 
Example: Summary of 2D CNN model structure

Layer(options)  Output shape  Number of parameters

Input  (None, 28, 28, 1)  
Conv2D(filters=32, kernel_size=(3, 3), activation='relu')  (None, 26, 26, 32)  320
MaxPooling2D(pool_size=c(2, 2))  (None, 13, 13, 32)  0
Conv2D(filters=64, kernel_size=(3, 3), activation='relu')  (None, 11, 11, 64)  18496
MaxPooling2D(pool_size=c(2, 2))  (None, 5, 5, 64)  0
Flatten  (None, 1600)  0
Dense(units=10, activation='softmax')  (None, 10)  16010
Example: Hyperparameters setting in each conv1D layer
Convolutional layer  Number of filters  Kernel size 

1  32  3 
2  64  3 
Example: Summary of 1D CNN model structure

Layer(options)  Output shape  Number of parameters

Input  (None, 200, 1)  
Conv1D(filters=32, kernel_size=3, activation='relu')  (None, 198, 32)  128
MaxPooling1D(pool_size=2)  (None, 99, 32)  0
Conv1D(filters=64, kernel_size=3, activation='relu')  (None, 97, 64)  6208
MaxPooling1D(pool_size=2)  (None, 48, 64)  0
Flatten  (None, 3072)  0
Dense(units=1, activation='sigmoid')  (None, 1)  3073
MNIST; Summary of the CNN model structure
Layer(options)  Output shape  Number of parameters 

Input  (None, 28, 28, 1)  
Conv2D(filters=32, kernel_size=(3, 3), activation=‘relu’)  (None, 26, 26, 32)  320 
Conv2D(filters=64, kernel_size=(3, 3), activation=‘relu’)  (None, 24, 24, 64)  18496 
MaxPooling2D(pool_size=c(2, 2))  (None, 12, 12, 64)  0 
Dropout(rate=0.25)  (None, 12, 12, 64)  0
Flatten  (None, 9216)  0 
Dense(units=128, activation=‘relu’)  (None, 128)  1179776 
Dropout(rate=0.5)  (None, 128)  0
Dense(units=10, activation=‘softmax’)  (None, 10)  1290 
MNIST; Evaluation for each set in two CNN models
Set  Model with dropout  Model without dropout  

Loss  Accuracy  Loss  Accuracy  
Training  0.0258  0.9915  0.0045  0.9986 
Validation  0.0401  0.9897  0.0560  0.9876 
Test  0.0302  0.9920  0.0465  0.9888 
CIFAR-10; Overall structure of all CNN models
Layer  Output shape  Kernel size 

Input  (None, 32, 32, 3)   
Convolution Block 1  (None, 16, 16, 16)  (3, 3) 
Convolution Block 2  (None, 8, 8, 32)  (3, 3) 
Convolution Block 3  (None, 4, 4, 64)  (3, 3) 
Flatten  (None, 1024)   
(Dense)  (None, 128)   
Output  (None, 10)   
CIFAR-10; Summary of model structures and test errors of all CNN models
Model  Number of Conv. Layers  BN  Number of FC Layers  Error 

M1  1  NO  1  0.2896 
M2  2  NO  1  0.2734 
M3  2  YES  1  0.2389 
M4  2  YES  2  0.2365 
M5  2  YES  3  0.2419 
M6  3  YES  1  0.2171 
M7  3  YES  2  0.2170 
M8  3  YES  3  0.2189 
Parameters and mainly used hyperparameters with their default values in Keras

Parameters

Layer  Variable  Default value in Keras
Dense, Convolutional  weights, bias  kernel_initializer='glorot_uniform', bias_initializer='zeros'

Hyperparameters

Layer or other  Variable with default value in Keras  Options
Dense    number of fully-connected layers to stack
Dense  units  number of hidden nodes
Dense  activation  activation function of the dense layer; if not defined, the output is just the linear combination
Convolutional    number of convolutional layers to stack
Convolutional  filters  number of output filters
Convolutional  kernel_size  length of the window of each filter (1D); width and height of each filter (2D)
Convolutional  activation  activation function of the convolutional layer
Convolutional  padding='valid'  'valid' (no padding); 'same' (padding to make the output's shape the same as the input's)
Convolutional  strides=1  steps the convolutional filter moves at once
Pooling  pool_size  size of pooling, an integer (1D) or a list of two integers (2D)
Pooling  strides  steps of pooling; if not defined, set to pool_size
Dropout  rate  the sampling rate of the input units to drop
Global  epochs  number of entire training epochs
Global  batch_size=32  number of observations used at once per update
Global  patience  number of epochs with no improvement after which training will be stopped
Optimizers (ex. RMSprop)  lr  learning rate of the optimizer
Optimizers (ex. RMSprop)  decay  amount of learning rate decline over each update