ConvNets
Summary of Convolutional Neural Nets

A neural network is an abstract model of single- or multiply-connected neurons, where the network is organized into layers and each layer has an input and an output. A convolutional neural network is one type of neural network, characterized by its use of convolutions to extract useful pieces of information called features. The term "convolution" refers to scanning an image with an arbitrary filter, which is a loose description of the first stage inside such a network. The entire process of extracting useful features through multiple neurons and deriving meaningful conclusions from them is the basis of the Convolutional Neural Network. Before going into detail on neural nets, we first give a quick breakdown of the essential backpropagation algorithm, then go through the important layers and components involved in learning, and finally the learning algorithms themselves.

1. Forward, Backprop, Gradients

For now, this section only gives a summary; a far more thorough treatment can be found in the blog written by Karpathy.

Both the terms neural net and deep learning shroud and decorate the true substance: a large composite function with one or many inputs and one or many outputs. The term "forward" means to evaluate the function $f$ on an input $x$ and retrieve its output $y$. "Backward" means calculating the derivative of $f$, that is, how $y$ changes with respect to $x$. When calculating the derivative with respect to multiple inputs, we end up calculating the "gradient". How these relate to neural networks is that, abstractly, forwarding inputs and backpropagating gradients are the clockwork inside any neural network. What do we forward and backpropagate? Initially the input (perhaps an image or sentiment data) is fed into a function, which feeds into another function and so on, until a final output is received. As you may have guessed by now, a neural network consists of functions layered on top of functions, which can be thought of as one very complicated function.
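As a minimal sketch of these ideas (the functions and values below are invented for illustration, not taken from the text), the forward pass evaluates a composite function stage by stage, and the backward pass applies the chain rule to obtain the derivative of the output with respect to the input:

```python
# Forward and backward passes through a two-stage composite function y = g(f(x)),
# with the chain rule applied by hand. The function choices are illustrative.

def f(x):            # first "layer": a squaring function
    return x ** 2

def df_dx(x):        # its derivative
    return 2 * x

def g(z):            # second "layer": a scaling function
    return 3 * z

def dg_dz(z):        # its derivative
    return 3.0

x = 2.0
z = f(x)             # forward through the first function
y = g(z)             # forward through the second function

# backward: chain rule dy/dx = (dg/dz) * (df/dx)
grad_x = dg_dz(z) * df_dx(x)

print(y, grad_x)     # y = 12.0, dy/dx = 12.0
```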

The first purpose of these functions is not simply to calculate an output, but to use a known output, called a 'label', and manipulate the parameters so that the function applied to the inputs results in that label. This process of learning is the repetitive task of following the direction given by the gradient (the derivative of the output with respect to a parameter tells us how the output changes as that parameter changes) to introduce incremental changes to the parameters, in order to calculate, or learn, an optimal vector that reproduces the labels given a set of inputs.

Thus, in order for learning to proceed, derivatives must be available for all of the functions. Then the inputs are fed into the network in order to train it. And presto: you have a classifier vector that can be used on other inputs of the same kind to obtain a 'label'. In other words, you have obtained a set of numbers that, when applied to your input, results in a label related to the labels you used when training your network. Using this as a classification tool, anything that has categories - weather, handwriting, medical conditions - has been and is being disassembled by this sort of artificial intelligence.
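As a hypothetical illustration of using such a classifier vector (the labels, weights, and input below are invented), classification amounts to scoring a new input with the learned parameters and reading off the best label:

```python
import numpy as np

# One "classifier vector" per label, assumed to have been learnt already.
labels = ["sunny", "rainy", "cloudy"]            # example weather categories
theta = np.array([[ 0.9, -0.2,  0.1,  0.4],
                  [-0.3,  0.8, -0.1,  0.2],
                  [ 0.1,  0.1,  0.6, -0.5]])
x = np.array([0.2, 1.0, -0.5, 0.3])              # a new, unseen input of the same kind

scores = theta @ x                               # one score per label
print(labels[int(np.argmax(scores))])            # 'rainy' for these numbers
```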

2. Layers of a Neural Network

Arguably, learning happens step by step, in the sense that the process of thinking happens sequentially. Representing that process with a neural network is no different, as the stages in a convolutional neural network imply. This section describes, by analogy, how a sequential neural network can "think" and arrive at conclusions for classification. We will explain by way of an abstract neural network that has three layers. This abstract network generalizes to other neural networks, including Convolutional Neural Networks and some Recurrent Neural Networks.

2.1 Discriminant Layer

The first layer will be called the discriminant layer, an informal name chosen for generality. This layer is in direct contact with the data to be processed. Its main purpose is to use discriminant functions to model the input data, where each discriminant function can be seen as one neuron. The function used can be a linear regression (a linear neuron), where one would minimize the mean squared error to obtain useful output, or a more complex model that uses the least squares estimate with a regularizer (ridge regression).

In both cases, the goal is to obtain a function that models the input and is differentiable (a vital characteristic of every layer). This means that if our input data, say an image, contains a pattern associated with a polynomial equation of degree 50, then we would like our discriminant function to have the same number of oscillations ("wiggles" of a graphed polynomial function) as a polynomial of degree 50 - which can be achieved by adjusting the function, or the regularizer.

Our analysis of this layer will not dwell on ridge regression, as its only relevance to neural networks here is that it models the input. The description so far is a high-level picture of the layer, so here are some elaborations. In order to obtain useful information - features from an image - we must choose a model and initialize the parameters important to that model, which requires supervision.

The model must ultimately fit the data, and its fitness is measured by regression analysis (measuring the distance between the data points and the points predicted by the model). These are expensive operations, in terms of both supervision and data. For each function that attempts to model the data, one must determine the regularization constant (should they use ridge regression) through trial and error, or the degree of the polynomial should they use a non-linear function.

Both cases are realistic: ridge regression has been shown experimentally to produce better-fitting models, and the case for non-linear models is that real-world data generally follows a non-linear function. Methods for choosing good initialization parameters are a topic currently under development, with some work done by [Denil et al., 2013].
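For concreteness, here is a small sketch of ridge regression, the regularized least-squares fit mentioned above; the toy data, the regularization constant `lam`, and the helper name `ridge_fit` are illustrative assumptions, not taken from the text:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: theta = (X^T X + lam*I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# toy data roughly following y = 1 + 2*x
X = np.column_stack([np.ones(5), np.arange(5.0)])   # bias column + one feature
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])
theta = ridge_fit(X, y, lam=0.1)                    # lam chosen by the supervisor
print(theta)                                        # roughly [1.0, 2.0]
```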

Once we have the correct models, an output message is produced and sent to the next layer: the input is multiplied with the parameters to produce an output. In [Figure 1.3], we only have a single discriminant function, which is linear. Our linear function has the form $f(x_i) = \theta_0 + x_i \theta_1$, where $\theta_0$ is the bias term that translates our model up and down, and $\theta_1$ determines the curvature, or in this case the slope, of our model. [Figure 1] models this as $f(x_i) = x_i \theta$, where $x$ is an image tensor multiplied by a vector of parameters $\theta$ (or weights $w$), and $\theta$ contains $[\theta_0, \theta_1]$ (the dimensions of $x$ will be further elaborated in Section 4). This linear model multiplies each input $x_i$ by a weight; the result is then sent on to be further transformed by a logistic function.
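A minimal sketch of this discriminant layer's forward pass, assuming illustrative values for $\theta_0$, $\theta_1$ and a tiny flattened input:

```python
import numpy as np

# Forward pass of the single linear discriminant neuron f(x_i) = theta_0 + x_i * theta_1,
# applied elementwise to a flattened input (values are illustrative).
theta_0, theta_1 = 0.5, 2.0            # bias and slope
x = np.array([0.1, -0.3, 0.8])         # a tiny "image" flattened to a vector

output = theta_0 + x * theta_1         # the message sent on to the Softmax layer
print(output)                          # [ 0.7 -0.1  2.1]
```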

2.2 Softmax Layer

The next layer is the Softmax, or conventionally the Log Softmax layer, which is itself a logistic regression (note that the squasher/sigmoid layer has been combined into this layer, which I personally find more intuitive). The name refers to the way the input $(x_i \theta_1)$ to the Softmax layer is normalized so that it becomes bounded between two values. The Softmax function is an equation of the form shown in [Figure 2.4], which describes a multinomial logistic regression [Cawley, 2007]. Because it is a logistic function, its outputs are always bounded between 0 and 1 [Hosmer, 2013], and it is a function that computes a probability [Cawley, 2007].
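A minimal sketch of this Softmax normalization (the input scores continue the earlier linear-layer sketch and are illustrative):

```python
import numpy as np

def softmax(scores):
    """Exponentiate each score and divide by the sum, so outputs lie in (0, 1) and sum to 1."""
    e = np.exp(scores - np.max(scores))   # subtract the max for numerical stability
    return e / np.sum(e)

print(softmax(np.array([0.7, -0.1, 2.1])))   # roughly [0.18, 0.08, 0.74]
```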

In reality, a real neuron fires a signal when it reaches a threshold, whereas this layer 'fires' when the value of the normalized input approaches a threshold close to 1, the firing indicating that it has recognized a feature.

The inputs to this layer are the outputs of the discriminant layer, and this layer transforms those outputs into signals. The sigmoid function is also commonly used in place of the Softmax function [Funahashi, 1989]. In other words, the Softmax must be a logistic function that calculates the probability, and in this case the certainty, of an input $x_i$ when the parameter $\theta$ is being signaled (when $\Pr[\theta_j = 1]$).

You will often see that, during calculations, the exponentiation is simplified by taking the log of each Softmax, hence we often use the Log Softmax [Figure 1.2]. Afterwards, the outputs of the LogSoftmaxes are summed to create the Negative Log-Likelihood (NLL); in other words, we sum the certainty of the inputs when signaled by the different parameters $\theta_j$.
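A sketch of the Log Softmax computed with the usual log-sum-exp simplification (the function name and inputs are illustrative):

```python
import numpy as np

def log_softmax(scores):
    """log softmax(s_j) = s_j - log(sum_k exp(s_k)), computed stably."""
    shifted = scores - np.max(scores)                  # shift for numerical stability
    return shifted - np.log(np.sum(np.exp(shifted)))

print(log_softmax(np.array([0.7, -0.1, 2.1])))         # the logs of the softmax outputs above
```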

2.3 Negative Log-Likelihood Layer

As the Negative Log-Likelihood (NLL) is considered a second function, it is represented as a separate layer. The cost function $C(\theta)$, or negative log-likelihood (NLL), is the cross-entropy function, which essentially calculates the probability, or likelihood, of a feature that has been propagated to this layer. It is called the cross-entropy because the function [Figure 1.1] has the same form as the entropy function [Figure 2.1].

Indeed, as the Softmaxes are essentially calculations of the probability of an input given a parameter, $\Pr[x_i|\theta]$, the negated sum of the LogSoftmaxes is a calculation of cross-entropy, a measure of uncertainty. Note that the likelihood function is the sum of all LogSoftmaxes, and the Negative Log-Likelihood takes the negative of that sum.

This means the NLL is a measure of uncertainty; thus the NLL is a loss function whose objective is to be minimized, hence the alias cost function [Murata et al., 1994] - because you would like to be the least uncertain about a feature given a parameter.

This is analogous to summing a "curve signal" with other rounded signals to arrive at the conclusion of a pre-labelled "face", where each normalized signal here is a Softmax. Going back, the balancing of the weights associated with a normalized signal was done in the discriminant layer (or "filter layer", discussed later for ConvNets). The likelihood function takes the likelihood of each parameter (candidate signal) and sums them using a differentiable entropy function.
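Putting the last two layers together, here is a sketch of the NLL as the negated log-probability assigned to the correct label, shown for a single example (names and values are illustrative):

```python
import numpy as np

def log_softmax(scores):
    shifted = scores - np.max(scores)
    return shifted - np.log(np.sum(np.exp(shifted)))

def nll_loss(scores, target_index):
    """Minus the log-probability the model assigns to the correct label."""
    return -log_softmax(scores)[target_index]

# small loss when the model is certain of the correct label (index 2 here)
print(nll_loss(np.array([0.7, -0.1, 2.1]), target_index=2))
```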

2.4 The Last Step

Again, all equations discussed so far are differentiable, which is an important property for optimization. Differentiability allows the final step of performing an efficient search for minima: with a loss function, we are able to use an optimization algorithm - most notably Stochastic Gradient Descent (SGD) [Bottou, 1991]. The name comes from the computed gradient, a vector of partial derivatives, and gradient descent, the process of stepping towards the optimal solution.

Stochastic means the gradient at each step is estimated from a randomly chosen subset (or a single example) of the data rather than the full dataset [Bottou, 1991]. The intuition behind gradient descent for finding a solution can be shown by finding the minimum of a parabola: one can either calculate the point at which the slope is zero, or follow the curve by descending towards the vertex. The former method will not work for functions with multiple "valleys", thus the latter is preferred. Our cost functions for neural nets can be optimized by SGD even if they are non-convex (contain not just one bowl) [Dauphin et al., 2014].

Therefore, the last step of this process is to differentiate the likelihood function in order to find the optimal solution by doing back-propagation (again, see the first section for back-prop). Our discussion will not go through the different types of optimization functions; we will simply state that they are used to find the solution of the negative log-likelihood function.
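A minimal sketch of SGD minimizing a mean-squared-error loss for the linear model of Section 2.1; the synthetic data, learning rate, and number of steps are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=100)   # noisy line we want to recover

theta = np.zeros(2)          # [theta_0, theta_1]
lr = 0.1                     # learning rate (step size)

for step in range(1000):
    i = rng.integers(len(x))                     # pick one random example ("stochastic")
    pred = theta[0] + theta[1] * x[i]            # forward
    err = pred - y[i]
    grad = np.array([err, err * x[i]])           # gradient of 0.5*err^2 w.r.t. theta
    theta -= lr * grad                           # step against the gradient

print(theta)                                     # approaches roughly [1.0, 2.0]
```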


Figure 1


2.5 Conclusion

One characteristic of these networks is that they are most expensive, in computation and storage, in the first layer, and the computations decrease as the output is propagated to the next layer - the process we call forward propagation. A more important attribute, however, is that these networks are sequential and are composite functions [Figure 2.2].

We have discussed each layer and briefly described what we obtain after solving the NLL: the output classifier at the end of a forward propagation, a scalar sum of LogSoftmaxes. With respect to the entire input image $X$, which is a tensor, it is a vector with each element being one such sum. Modern learning is all about improving this classifier vector, which leads us to look towards the recent advances in back-propagating neural networks for image classification.


Figure 2


3. Convolutional Neural Nets

Our broad explanation of a neural network is not enough to describe a Convolutional Neural Network (CNN); however, the two share many of the same requirements: inputs, outputs, and functions and derivatives for each layer. We have already stated that ConvNets are characterized by their use of filters, where each scan of an image is called a convolution.

We will assume that the input data, $X$, is an image and speak in terms of images, although the actual input can be any data that can be represented numerically. Naturally, a ConvNet is a composition of layers, and thus we will split this section into its corresponding layers: Filter, Max-Pooling, ReLU, and Softmax.

3.1 Filter Bank Layer

When an image is first processed, it is filtered through a parameter $\theta$ (or $W$, weight) [Krizhevsky et al., 2012]. Note that the image is a tensor, i.e. multidimensional, where a depth of three represents RGB, and the width and height are the dimensions of the actual image [Krizhevsky et al., 2012].

Like an actual filter, $\theta$ represents some distinguishing feature, so if $\theta$ is a model that represents a vertical line, it will extract vertical lines from the image tensor. The image tensor is filtered with a $\theta$ that is also multi-dimensional (2D or 3D) and produces a feature map [Cimpoi et al., 2014].

The process whereby the filter "scans" the entire image is called a convolution. Imagine a 1D image $X$ and a 1D filter $W$ (another way of denoting $\theta$) which produces a feature map $Z$. The number of filters $W_i$ in $W$ is also the number of feature maps that will be produced, so this is the most computationally and storage-expensive layer.

In relation to our abstract model, each of the $W$ parameters is some discriminant function - so the number of filters in $W$ is the number of neurons for this stage, where each $W_i$ is some linear/polynomial/vertical-line discriminant neuron that produces $X_i W_i$.

In practice, the image tensor $X$ is padded with 0's on the edges so that the filter is able to scan the edges (that way, the feature map ends up around the same size as the image). Additionally, the filter $W$ slides across the image in strides of some fixed size (think of scanning). The ConvNet of ImageNet 2013 convolves with strides of 2, as image artifacts were found when strides were set to four [Zeiler and Fergus, 2013].
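A sketch of the 1D convolution described above, with zero-padding and a configurable stride (the image, filter, and helper name `conv1d` are illustrative):

```python
import numpy as np

def conv1d(X, W, stride=1, pad=1):
    """Slide filter W across zero-padded image X in fixed strides, producing feature map Z."""
    Xp = np.pad(X, pad)                           # pad with 0's so the edges can be scanned
    out_len = (len(Xp) - len(W)) // stride + 1
    Z = np.empty(out_len)
    for i in range(out_len):
        window = Xp[i * stride : i * stride + len(W)]
        Z[i] = np.dot(window, W)                  # filter response at this position
    return Z

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])           # 1-D "image"
W = np.array([1.0, 0.0, -1.0])                    # a simple edge-detecting filter
print(conv1d(X, W, stride=1, pad=1))              # feature map Z, same length as X
```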

3.2 Non-Linearity Unit - ReLU

The ConvNet "squashing" function is most often a Rectified Linear Unit (ReLU) layer, a choice popularized since the early 2010s by [Krizhevsky et al., 2012]. In our previous abstract model, we used a logistic model for signal transformation. Though a logistic or tanh function is what generally squashes the neuron signal, the ReLU layer is used because it is computationally less expensive and its effect on the output is minimal or non-existent compared to the other functions [Krizhevsky et al., 2012].

The ConvNet that has a ReLU unit also has a multinomial logistic Softmax, but it comes after the max-pooling layer [Krizhevsky et al., 2012]. This differs from our abstract model in that we combined signal transformation and the Softmax calculation into one layer.

The function of a ReLU is extremely simple: $f(x) = \max(0, x)$ [Krizhevsky et al., 2012]. This means the model is a flat line until $x = 0$, after which the slope becomes one. Consequently, if some training samples produce positive responses, the signal will activate for those inputs and we will "learn" them [Krizhevsky et al., 2012].
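In code, the ReLU is a one-liner applied elementwise to a feature map (the inputs are illustrative):

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: f(x) = max(0, x), applied elementwise."""
    return np.maximum(0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))   # [0.  0.  0.  1.5]
```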

After the inputs pass through the ReLU unit, we feed them into pooling. The reason this comes before pooling is that it filters out inputs from the Filter Bank layer that are negative and "need not be learnt", which increases efficiency. Thus, most of the computation is again at the filter layer.

3.3 Feature Pooling Layer

After the ReLU unit we perform max-pooling operations on sets of adjacent units in the modified feature map. When adjacent pooling windows overlap (the stride is smaller than the pooling window), this is called overlapping pooling [Krizhevsky et al., 2012].

This is a very simple operation: all it does is take the max of the inputs in some $n$ by $m$ unit, $\Omega$, or kernel map [Krizhevsky et al., 2012], and that max becomes the output of this layer. (Our model assumes it takes the max of the inputs, though some choose to take the average.) This operation is important, as it allows not just compression but also invariance [Krizhevsky et al., 2012].

Invariance is vital after training a CNN in stages, since we must test it with other images [Krizhevsky et al., 2012]. Essentially, it allows pictures to not be exactly the same and thus helps prevent over-fitting [Krizhevsky et al., 2012], where over-fitting is when a neural network becomes so accustomed to the training data that it does extremely poorly on the test data - discussed later.

Note that the forward and back-propagation equations retain a similar structure, except for a small difference. Because the output of a forward propagation is simply the result of a max operation, the back-propagation results in a sparse feature map with all zeroes except for the cell that was the maximum of the $\Omega$ pool. This is because only the selected feature was forwarded, so we only have a gradient for that value.
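A sketch of 2x2 max-pooling together with the mask that makes the backward pass sparse, as described above (the feature-map values are illustrative):

```python
import numpy as np

def maxpool2x2(Z):
    """Max-pool over non-overlapping 2x2 regions; mask marks the selected cells
    (the only cells that receive gradient on the backward pass)."""
    h, w = Z.shape
    out = np.zeros((h // 2, w // 2))
    mask = np.zeros_like(Z)
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            block = Z[i:i+2, j:j+2]
            r, c = np.unravel_index(np.argmax(block), block.shape)
            out[i // 2, j // 2] = block[r, c]
            mask[i + r, j + c] = 1.0
    return out, mask

Z = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 1.],
              [0., 1., 5., 2.],
              [2., 0., 1., 3.]])
pooled, mask = maxpool2x2(Z)
print(pooled)   # [[4. 2.] [2. 5.]]
print(mask)     # sparse map: 1 where the max was taken, 0 elsewhere
```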

3.4 SoftMax Layer

This will be brief, as nothing has changed from our abstract model. Still, this is a logistic regression that sums all of the signals that have passed the ReLU test. However, [Figure 3] illustrates Softmax layers in practice, in that the last three layers all calculate the Softmax and the last layer would be the NLL layer. In practice, the NLL is often combined with the Softmax layers as it is a simple sum of Softmaxes. Also note that layers 6-8 are fully-connected [Krizhevsky et al., 2012], meaning every neuron is connected to every neuron in the next layer.

Figure 3: “...The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440– 186,624–64,896–64,896–43,264– 4096–4096–1000...The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels.” [Krizhevsky et al.,2012]

4. Training, Testing

Once a neural net algorithm has been built, we can train it using labelled data. Training refers to the use of pre-labelled data to configure the outputs of a neural net.
After the model has found the optimal solution by back-propagating outputs and computing gradients for each labelled image, we have a trained neural network for image classification.

A convolutional network can be thought of as the power of imagination: the ability to visualize a model without data after it has been trained. Testing comes when we must validate our model - our imagination of the world - against real-world test data. We will explain validation by way of example. If a million duck images were used to pre-train our network and produced an output classifier labelled 'duck' (since the images were pre-labelled, we know the output classifier is for ducks), then we can test a new image of a duck that was not in our training data. If the network does not correctly label the duck, the resulting error value is recorded, and we can use it to make further adjustments to the model.

The process of splitting the original dataset into training and test data, then letting the model train and be evaluated on these splits, is called cross-validation. Cross-validation is generally considered part of the neural network algorithm and is its final process. As of now, this process is generally supervised, as adjusting the number of filters or the types of filters (which models to include) is determined by the supervisor. After this process is finished, we have a fully trained model able to classify all images with the same labels as those in our dataset.
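A minimal sketch of the train/test split that underlies this validation step (the dataset size and 80/20 proportion are assumptions made for illustration):

```python
import numpy as np

# Shuffle the indices of a labelled dataset and reserve 20% for testing.
rng = np.random.default_rng(0)
n = 1000                                  # illustrative dataset size
indices = rng.permutation(n)
split = int(0.8 * n)
train_idx, test_idx = indices[:split], indices[split:]
# train on train_idx, then measure the error on test_idx to validate the model
print(len(train_idx), len(test_idx))      # 800 200
```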

5. Modern-day Results (2015)

Results in [Figure 4] show the extreme difference between a pre-trained model and a model with no previously learnt knowledge. The top two rows represent the state-of-the-art image classification models that do not use ConvNets [Zeiler and Fergus, 2013]. We see that non-ConvNet models achieve better results when there was no previous data (55.2% > 38.8% best, 40.5% > 9.0% worst) but definitely cannot match the scores of pre-trained ConvNets.

# Train               Acc % (15/class)   Acc % (30/class)   Acc % (45/class)   Acc % (60/class)
(Sohn et al., 2011)   35.1               42.1               45.7               47.9
(Bo et al., 2013)     40.5 ± 0.4         48.0 ± 0.2         51.9 ± 0.2         55.2 ± 0.3
*Non-PreTrained       9.0 ± 1.4          22.5 ± 0.7         31.2 ± 0.5         38.8 ± 1.4
*ImageNet-PreTr.      65.7 ± 0.2         70.6 ± 0.2         72.7 ± 0.4         74.2 ± 0.3

Figure 4 – The results by number of labelled images per class, from ImageNet 2013 [Zeiler and Fergus, 2013]


The results by [Simonyan and Zisserman, 2014] rely on many ingenious tricks, such as combining two ConvNets, scale jittering (changing the scale during training), and using large sets of crops during testing, to achieve outstanding results. It is considered a notable feat to have achieved such results without the complexity of the model that won ImageNet 2014. From [Figure 5], they were able to obtain 93.2% classification accuracy using their CNN, which perhaps signifies that their fusion network hints at the real-world 'true' neural network being a composition of CNNs.

Combined ConvNet models                         top-1 val error   top-5 val error   top-5 test error
ILSVRC-2014 submission                          24.7              7.5               7.3
Post-ILSVRC, using multi-crop & dense eval.     23.7              6.8               6.8

Figure 5 – The results from ImageNet 2014 [Simonyan and Zisserman, 2014]


6. Challenges

Modern-day ConvNets are now considered to be close to their peak, in that any image can be classified, given a dataset, with minimal error and supervision [Simonyan and Zisserman, 2014]. In other words, supervision is only required in developing the ConvNet; afterwards, anyone can use the ConvNet to classify images. For example, if we pre-train a ConvNet and cross-validate its results on the dataset, then take a random image that was not in the dataset but whose label is among the dataset's classes, the network should be able to classify it with high accuracy.

In recent years, back-propagation-based visualization techniques, namely DeConvNets, have allowed intermediate outputs to be visualized in order to understand ConvNets properly and thus make further improvements [Zeiler and Fergus, 2013]. However, a challenge remains when we do not have big datasets.

Following that, "one-shot learning" addresses the extreme case where only one or a few images are available for training [Zeiler and Fergus, 2013], and "zero-shot learning" addresses truly unsupervised learning of labels without pre-labelled data [Socher, 2013].

As seen from the results, ConvNets of 2012 need more than 60 images per class to be semi-competent, versus the human brain, which can more or less recognize a person having seen an image of their face and body once. Current research is still investigating the optimal parameters/filters for a CNN, the use of minimal data to achieve accurate classification, and good initialization parameters [Denil et al., 2013].



References

[Freitas, 2015] Freitas, Nando De. ”Course: Machine Learning 2015”. Oxford 2015. https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/.

[Zeiler and Fergus, 2013] Zeiler, M.D., Fergus, R., 2013. Visualizing and Understanding Convolutional Networks. ArXiv13112901 Cs.

[Simonyan and Zisserman, 2014] Simonyan, K., Zisserman, A., 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. ArXiv14091556 Cs.

[Denil et al., 2013] Denil, M., Shakibi, B., Dinh, L., Ranzato, M.A., de Freitas, N., 2013. Predicting Parameters in Deep Learning, in: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems 26. Curran Associates, Inc., pp. 2148–2156.

[Krizhevsky et al.,2012] Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet Classification with Deep Convolutional Neural Networks, in: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems 25. Curran Associates, Inc., pp. 1097–1105.

[Hosmer, 2013] Hosmer, Jr. David W., Stanley Lemeshow, and Rodney X. Sturdivant. “The Multiple Logistic Regression Model.” In Applied Logistic Regression, Third Edition, 35–47. John Wiley & Sons, Inc., 2013. http://onlinelibrary.wiley.com/doi/10.1002/9781118548387.ch2/summary.

[Cawley, 2007] Cawley, Gavin C., Nicola LC Talbot, and Mark Girolami. “Sparse Multinomial Logistic Regression via Bayesian l1 Regularisation.” Advances in Neural Information Processing Systems 19 (2007): 209.

[Funahashi, 1989] Funahashi, Ken-Ichi. “On the Approximate Realization of Continuous Mappings by Neural Networks.” Neural Networks 2, no. 3 (1989): 183–92.

[Murata et al., 1994] Murata, Noboru, Shuji Yoshizawa, and Shun-ichi Amari. “Network Information Criterion-Determining the Number of Hidden Units for an Artificial Neural Network Model.” Neural Networks, IEEE Transactions on 5, no. 6 (1994): 865–72.

[Martens, 2010] Martens, James. “Deep Learning via Hessian-Free Optimization.” In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 735–42, 2010.

[Dauphin et al., 2014] Dauphin, Yann N., Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. “Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization.” In Advances in Neural Information Processing Systems, 2933–41, 2014.

[Bottou, 1991] Bottou, Léon. “Stochastic Gradient Learning in Neural Networks.” Proceedings of Neuro-Nîmes 91, no. 8 (1991).

[Hinton et al., 2012] Hinton, Geoffrey E., Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. “Improving Neural Networks by Preventing Co-Adaptation of Feature Detectors.” arXiv Preprint arXiv:1207.0580, 2012. http://arxiv.org/abs/1207.0580.

[Cimpoi et al., 2014] Cimpoi, Mircea, Subhransu Maji, and Andrea Vedaldi. “Deep Convolutional Filter Banks for Texture Recognition and Segmentation.” arXiv:1411.6836 [cs], November 25, 2014. http://arxiv.org/abs/1411.6836.

[He et al., 2015] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.” arXiv:1502.01852 [cs], February 6, 2015.

[Socher, 2013] Socher, Richard, Milind Ganjoo, Christopher D. Manning, and Andrew Ng. “Zero-Shot Learning through Cross-Modal Transfer.” In Advances in Neural Information Processing Systems, 935–43, 2013. http://papers.nips.cc/paper/5027-zero-shot-learning-through-cross-modal-transfer.




Last updated: April 2nd, 2016