Deep Learning

From Grundy

Deep learning is an artificial intelligence (AI) function that imitates the workings of the human brain in processing data and creating patterns for use in decision making. Deep learning is a subset of machine learning in AI that has networks capable of learning unsupervised from data that is unstructured or unlabeled. Neural networks are the workhorses of deep learning. And while they may look like black boxes, deep down they are trying to accomplish the same thing as any other model — to make good predictions.

Deep Learning has received a lot of attention in recent years due to its impressive performance on many applications, including language translation, medical diagnosis from X-rays, image recognition for self-driving cars, beating top Go players and high-ranking Dota 2 players, and learning to play Atari games from pixel data alone, to name a few of its recent accomplishments. Follow this article to get a glimpse of some of the amazing applications & breakthroughs of neural networks.

This post gives a high-level overview of, and links for understanding, the fundamentals of neural networks, along with some renowned neural network architectures, their implementation using various frameworks & their applications.

Deep Learning Frameworks in Python

Deep Learning today is powered by very efficient libraries that run on GPUs. Some of the exciting libraries to look at are:

  • TensorFlow - The tool powering many Google products such as Gmail, Google Translate, Google Search and Google Speech. TensorFlow is open source on GitHub and has a Python API that makes machine learning easier and more efficient. Have a look at our TensorFlow tutorial to find a list of TensorFlow resources.
  • PyTorch - Based on the Torch library, PyTorch was primarily developed by Facebook's AI Research lab & is used for applications such as computer vision and natural language processing. Here are some interesting tutorials to get you started with PyTorch.
  • Theano
  • Keras - Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation.

You can choose to work with whichever of these libraries you are comfortable with. For all algorithms & concepts here, we will try to provide links containing implementations in PyTorch, TensorFlow & Keras.

Fundamentals of Neural Networks

In 1943, scientists tried to model the functioning of the neurons in the human brain using electrical circuits. Since the brain is highly effective at learning, computer scientists tried to combine this structure with the processing power of computers, and the first multilayered network was developed in 1975. You might wonder, "But what is a neural network?", and this video does an excellent job of explaining this. In fact, watch the complete playlist to get an idea of how a neural network is trained and of the topics mentioned below.

A neural network is a system of connected perceptrons; read the article to understand what a perceptron is. In short, a perceptron takes inputs that are multiplied by certain weights, summed up, and passed through an activation function to determine the output of a node.

  • To learn more about the functioning of Perceptron Algorithm, head here.
  • Here is a simple implementation of this algorithm in Python, without the use of DL frameworks.
  • This video briefly explains the correct choice of activation functions.
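
The weighted-sum-then-activation rule described above can be sketched in a few lines of plain Python. This is only an illustration: the step activation and the hand-picked weights and bias (chosen so the perceptron behaves like a logical AND gate) are assumptions, not something taken from the linked articles.

```python
def step(z):
    """Heaviside step activation: 1 if z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0


def perceptron_output(inputs, weights, bias):
    """Multiply inputs by weights, sum them with the bias, then activate."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return step(z)


# Hand-picked (illustrative) parameters that realise a logical AND gate.
AND_WEIGHTS = [1.0, 1.0]
AND_BIAS = -1.5

and_table = {
    (a, b): perceptron_output([a, b], AND_WEIGHTS, AND_BIAS)
    for a in (0, 1)
    for b in (0, 1)
}
```

Only the input (1, 1) pushes the weighted sum above zero, so `and_table` reproduces the AND truth table. Training a perceptron amounts to finding such weights automatically instead of by hand.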

Feedforward Neural Networks + Backpropagation

Recall that a hidden layer is a set of neurons/perceptrons that take an input and return a weighted sum of inputs followed by some activation. The situation becomes complex as our network becomes deeper.

The inputs and weights are used to calculate the results in a forward manner, while the errors have to be propagated backward so that gradient descent can reduce a loss function and thereby improve the accuracy of the network's predictions. This reverse computation is called backpropagation. While the math behind backprop is not that difficult to understand, coding it is not trivial. Even though DL frameworks allow you to skip this implementation, we recommend that you first try implementing it on your own. Now that you know about the basics of neural networks, here are links to relevant articles & videos -

  • This article sums up all the aspects of a neural network.
  • This article explains the math behind backprop.
  • Here is an implementation of backprop in Python.
  • Here is why you should implement backprop on your own.
  • See how using PyTorch makes implementing a feed-forward neural network child's play.
  • We highly recommend that you complete this tutorial on introduction to using PyTorch.
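
As a rough illustration of the forward and backward passes described above, here is a minimal sketch for a network with one hidden layer. The sigmoid activations, squared-error loss, and all numeric values are assumptions chosen for the example; the linked tutorials may use different choices. The finite-difference check at the end is a common way to verify a hand-written backprop.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Forward pass through one hidden layer; returns (hidden, output)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output


def loss(x, target, W1, b1, W2, b2):
    """Squared-error loss for a single example."""
    _, output = forward(x, W1, b1, W2, b2)
    return 0.5 * (output - target) ** 2


def backward(x, target, W1, b1, W2, b2):
    """Backpropagation: gradient of the loss w.r.t. every parameter."""
    hidden, output = forward(x, W1, b1, W2, b2)
    # Error signal at the output; sigmoid'(z) = s * (1 - s).
    delta_out = (output - target) * output * (1.0 - output)
    grad_W2 = [delta_out * h for h in hidden]
    grad_b2 = delta_out
    # Chain rule: push the error back through W2 into the hidden layer.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    grad_b1 = delta_hidden
    return grad_W1, grad_b1, grad_W2, grad_b2


# Illustrative (made-up) input, target and parameters.
x, target = [0.5, -1.0], 1.0
W1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]
W2, b2 = [0.7, -0.5], 0.2

grad_W1, grad_b1, grad_W2, grad_b2 = backward(x, target, W1, b1, W2, b2)

# Sanity check: compare one analytic gradient with a finite difference.
eps = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][0] += eps
W1_minus = [row[:] for row in W1]
W1_minus[0][0] -= eps
numeric_grad = (loss(x, target, W1_plus, b1, W2, b2)
                - loss(x, target, W1_minus, b1, W2, b2)) / (2 * eps)
```

If the backward pass is correct, `numeric_grad` and `grad_W1[0][0]` agree to many decimal places. A gradient descent step then subtracts a small multiple of each gradient from the corresponding parameter.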

Practical Aspects of Deep Learning

When we are starting on a new application, it’s almost impossible to correctly guess the right values for all the hyperparameters. So in practice, applied machine learning is a highly iterative process: we start with an idea, code it up, run an experiment, and get back a result that tells us how well this particular network or configuration works. Based on the outcome, we then refine the idea and change our choices to try to find a better and better neural network.

Up next, we discuss some of the important practical issues & relevant solutions/alternatives to tackle such issues & improve the model accuracy.


Weight Initialization

Training your neural network requires specifying an initial value of the weights. A well chosen initialization can:

  • Speed up the convergence of gradient descent
  • Increase the odds of gradient descent converging to a lower training (and generalization) error

The notable types of initialization include:

1. Zero Initialization: Here, we initialize all parameters (weights & biases) to zeros. It can be shown that if the weights of any two neurons in a layer are the same initially, the gradient steps for both of them turn out to be exactly the same, and hence they learn the same thing.

2. Random Initialization: To break symmetry, the weights are randomly assigned. Following random initialization, each neuron can then proceed to learn a different function of its inputs. It is however okay to initialize the biases to zeros. It works as long as the weights are initialized differently. But this too has its own issues - poor initialization can lead to vanishing/exploding gradients (we will discuss this in the next section), which also slows down the optimization algorithm.

3. Xavier & He Initializations: These are methods of initializing weights randomly, but with a pre-defined variance based on the number of neurons in the layer as well as the type of activation function used. Check out this article to know more about the theory & implementations of these methods.

Refer to this article to get more insights into the various weight initialization techniques.
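
To make the idea of a "pre-defined variance" concrete, here is a small sketch of the normal-distribution variants of the two schemes. The formulas used (standard deviation sqrt(2 / (fan_in + fan_out)) for Xavier and sqrt(2 / fan_in) for He) are the commonly quoted ones; the helper names and layer sizes are made up for illustration.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    """Xavier/Glorot normal: std = sqrt(2 / (fan_in + fan_out))."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """He normal: std = sqrt(2 / fan_in), usually paired with ReLU."""
    return math.sqrt(2.0 / fan_in)


def init_weights(fan_in, fan_out, std, rng):
    """Return a fan_out x fan_in weight matrix drawn from N(0, std^2)."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
W = init_weights(fan_in=256, fan_out=128, std=he_std(256), rng=rng)
biases = [0.0] * 128    # biases can safely start at zero
```

In practice you would rely on the initializers shipped with your framework rather than rolling your own; the sketch only shows how the variance depends on the layer size.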

TensorFlow, PyTorch & Keras each provide a wide range of weight initializers for your neural networks.

Vanishing & Exploding Gradient Issues

Training a neural network is all about calculating gradients & learning new weights based on backpropagation. Hence, learning is extremely susceptible to problems that arise directly from the nature of the computed gradients.

  • Vanishing Gradient Problem arises in cases where the gradient propagated backwards is very small. During backpropagation, the gradient is repeatedly multiplied by small factors as it passes through each layer, so it shrinks and “vanishes” the deeper it travels into the network. The earliest layers then barely update, and the network forgets long-range dependencies and fails to fit the data. This is a major problem, especially while using Recurrent Neural Networks or RNNs [this can be understood better once we get to the section about RNNs]. It can be mitigated by using activation functions like ReLU, whose derivative is exactly 1 for positive inputs. It can also be tackled by using the Long Short-Term Memory (LSTM) architecture instead of simple RNNs. Here is an article along with a video at the end to explain this concept.
  • Exploding Gradient Problem is the exact opposite of the vanishing gradient problem: the gradients grow too large as they are propagated backwards. During backpropagation, this leads to very large weight updates that make training unstable and render learning inefficient. It can also make the weight of a particular node very high with respect to the others, rendering them insignificant. A common technique used in RNNs to avoid this is gradient clipping, in which the gradient is scaled down whenever its norm exceeds a threshold. This can be achieved using:
    • tf.clip_by_global_norm() (Tensorflow)
    • torch.nn.utils.clip_grad_norm_() (PyTorch)
    • clipnorm parameter available while initializing an optimizer (Keras)
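
The framework functions listed above all rescale gradients based on their norm. The following plain-Python sketch shows the idea behind clipping by global norm; it is a simplified illustration with a hypothetical helper name, not the implementation used by any of those libraries.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is <= max_norm.

    `grads` is a list of per-layer gradient lists. Returns the (possibly
    rescaled) gradients and the norm measured before clipping.
    """
    global_norm = math.sqrt(sum(g * g for layer in grads for g in layer))
    if global_norm <= max_norm:
        return grads, global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in layer] for layer in grads], global_norm


grads = [[3.0, 4.0], [12.0]]  # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=6.5)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length shrinks.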

Here is another article-cum-video tutorial that addresses the issues of both Vanishing & Exploding Gradients.


While it is desirable to fit the training data with great accuracy, it is at the same time undesirable for the network to memorize the training data instead of learning to generalize. A situation in which the network memorizes the training data really well but fails to achieve good accuracy on the test data is called Overfitting. It can arise when the network is trained on a relatively small dataset, when the parameters are not tuned properly, or when the architecture itself is not good.

One approach to reduce overfitting is to fit all possible different neural networks on the same dataset and to average the predictions from each model. This is not feasible in practice, and can be approximated using a small collection of different models, called an ensemble. A problem even with the ensemble approximation is that it requires multiple models to be fit and stored, which can be a challenge if the models are large.

We look at a few techniques to regularise the network.

Loss Function

  • Lasso Regression - Also referred to as L1 regularisation. This technique adds the absolute values of the weights to the loss function as a penalty so that they don't become large enough to overfit the data.
  • Ridge Regression - Also referred to as L2 regularisation. Unlike the L1 method, this method adds the squared values of the weights to the loss function, with the same aim.

Let us try to understand why this is helpful in reducing overfitting. One popular explanation is that adding the penalty term reduces the magnitude of the weights, which in turn leads to smaller inputs to the activation functions. Activation functions such as sigmoid and tanh are almost linear at values close to 0, so the model becomes simpler, finds it harder to fit complicated functions, and hence overfits less.

  • Read the article to get into the details of L1 and L2 regularisation techniques.
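
The two penalties can be written down directly. In this sketch the regularisation strength `lam` and the example weights are hypothetical values chosen for illustration.

```python
def l1_penalty(weights, lam):
    """L1 (lasso) penalty: lam times the sum of absolute weight values."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """L2 (ridge) penalty: lam times the sum of squared weight values."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam, kind="l2"):
    """Total loss = loss on the data + penalty on the weights."""
    if kind == "l1":
        return data_loss + l1_penalty(weights, lam)
    return data_loss + l2_penalty(weights, lam)


weights = [0.5, -2.0, 1.0]
loss_l1 = regularized_loss(1.0, weights, lam=0.1, kind="l1")  # 1 + 0.1 * 3.5
loss_l2 = regularized_loss(1.0, weights, lam=0.1, kind="l2")  # 1 + 0.1 * 5.25
```

Note how the single large weight (-2.0) dominates the L2 penalty, which is why L2 pushes hardest against large weights, while L1 penalises all weights in proportion to their magnitude.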


Dropout

Dropout is a regularization method that approximates training a large number of neural networks with different architectures in parallel. On each iteration, we randomly shut down some neurons (units) on each layer and don’t use those neurons in either forward propagation or back-propagation. Since the units dropped on each iteration are random, the learning algorithm has no idea which neurons will be shut down; this forces it to spread out the weights and not focus on some specific features.

  • Browse through this article to get a firm grasp on the mathematical aspects underlying the technique of using Dropouts & why they work in practice.
  • Here is another video tutorial to walk you through the intuition behind using dropouts in your NN models.
  • Tutorial to implement Dropout in Tensorflow.
  • Tutorial to implement Dropout in PyTorch (the first part covers the Batch Normalization concept, which is explained next, while the section on Dropout comes later in the tutorial. You may prefer to go through it after understanding Batch Normalization to get a better overview).
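
A minimal sketch of the random shut-down step is given below. It uses the "inverted dropout" variant, which rescales the surviving units during training so that nothing needs to change at test time; that variant, the helper name, and the `keep_prob` value are assumptions made for the example.

```python
import random


def dropout(activations, keep_prob, rng, training=True):
    """Inverted dropout on a list of activations.

    During training each unit is zeroed with probability 1 - keep_prob and
    the survivors are divided by keep_prob, so the expected activation is
    unchanged. At test time the activations pass through untouched.
    """
    if not training:
        return list(activations)
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
activations = [1.0] * 10000
dropped = dropout(activations, keep_prob=0.8, rng=rng)
kept_fraction = sum(1 for d in dropped if d != 0.0) / len(dropped)
```

Roughly 80% of the units survive each call, and a different random subset is dropped every time, so no single unit can be relied upon.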

Data Augmentation

It is one of the most popular regularization techniques in Neural Networks, meant especially for images. Usually, when our data set is small, the model tends to overfit the data and fails to generalise. Collecting more data is not an easy job, so a popular way to tackle this is to augment the images available. Basically, we add new images to our data set by transforming the old ones, for example by flipping the image horizontally or vertically, translating it, rotating it slightly, or zooming into it randomly. Such methods help the model generalise well. To get an intuition as to what is really going on, you can think of it this way: a cat is still a cat in any orientation and at any location in the image. Image augmentation can be easily done using Image Data Generators in frameworks like TensorFlow or the others mentioned earlier.

  • Dive into the article to get the hang of what is really going on in the augmentation technique.
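
To show what such transformations do at the pixel level, here is a toy sketch that treats a grayscale image as a nested list of numbers. The representation and helper names are assumptions for illustration; real pipelines would use the augmentation utilities of the chosen framework.

```python
def flip_horizontal(image):
    """Mirror the image left-to-right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Mirror the image top-to-bottom."""
    return [row[:] for row in image[::-1]]


def translate_right(image, shift, fill=0):
    """Shift pixels right by `shift` columns, padding the left with `fill`."""
    return [[fill] * shift + row[:len(row) - shift] for row in image]


image = [[1, 2, 3],
         [4, 5, 6]]

# One original image yields several additional training examples.
augmented = [
    flip_horizontal(image),
    flip_vertical(image),
    translate_right(image, shift=1),
]
```

Each augmented copy keeps the original label, which is what lets the model learn that the object is the same regardless of orientation or position.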

Batch Normalization

Batch normalization (also known as batch norm) is a technique for improving the speed, performance, and stability of NNs. It is called batch normalization because, during training, we normalize each layer’s inputs by using the mean and standard deviation (or variance) of the values in the current batch.

Let us try to understand the purpose of normalising the values. As a matter of fact, networks learn faster and better on normalised input. So, when it is beneficial to normalise inputs to the network, we can try to normalise the inputs to the hidden layers too so that the learning is efficient at all layers and hence the entire network. Therefore, we carry out the normalization process at each layer.
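
The normalisation step for a single feature can be sketched as follows. The learnable scale `gamma`, shift `beta`, and the small constant `eps` are part of the usual formulation of batch norm but are not described in the text above, so treat them as assumptions of this example.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift.

    gamma and beta would be learned during training; eps avoids a
    division by zero when the batch variance is tiny.
    """
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


normalised = batch_norm([2.0, 4.0, 6.0, 8.0])
```

After the call, the values in `normalised` have a mean of approximately 0 and a variance of approximately 1, whatever the scale of the original batch was.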

Optimization Algorithms

Optimization algorithms help us minimize an error function E(x). As you must now know, E(x) is a mathematical function dependent on the model’s internal learnable parameters, which are used in computing the target values (Y) from the set of predictors (X) used in the model.

The internal parameters of a model play a very important role in training it efficiently and effectively and in producing accurate results. This is why we use various optimization strategies and algorithms to compute and update appropriate values of the model’s parameters, which influence the model’s learning process and its output. Optimization algorithms fall into two major categories -

  • First Order - These algorithms minimize or maximize a loss function E(x) using its gradient values with respect to the parameters.
  • Second Order - Second-order methods use the second-order derivatives, collected in a matrix of second-order partial derivatives called the Hessian, to minimize or maximize the loss function.

Although second-order derivatives are costly to compute, the advantage of a second-order optimization technique is that it does not ignore the curvature of the loss surface. Also, in terms of step-wise performance, second-order methods are better. Head here to get an overview of the algorithms discussed below. Again, DL frameworks allow you to use these out of the box, but knowing the math behind them is essential to know which one to use when.


Momentum

Momentum is basically a modification of stochastic gradient descent. The high-variance oscillations in SGD make it hard to reach convergence, so a technique called Momentum was invented, which accelerates SGD by navigating along the relevant direction and softens the oscillations in irrelevant directions.

  • This article explains with graphs, the benefits of using momentum.
  • This one implements it as well.
  • Watch Siraj Raval's video to get an introduction to all the optimization algos.
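
One common way to write the momentum update is to keep a running "velocity" that accumulates past gradients. The sketch below applies it to a simple one-dimensional quadratic; the function name, learning rate, momentum coefficient `beta`, and test function are illustrative assumptions.

```python
def sgd_momentum(grad_fn, theta, lr=0.1, beta=0.9, steps=300):
    """Gradient descent with momentum on a single scalar parameter."""
    velocity = 0.0
    for _ in range(steps):
        # The velocity is a decaying sum of past gradients ...
        velocity = beta * velocity + grad_fn(theta)
        # ... and the parameter moves along the velocity, not the raw gradient.
        theta = theta - lr * velocity
    return theta


# Minimise f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_final = sgd_momentum(lambda t: 2.0 * (t - 3.0), theta=0.0)
```

Gradients that keep pointing the same way build up speed, while gradients that flip sign from step to step partly cancel out in the velocity, which is what dampens the oscillations.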


Adagrad

Adagrad allows the learning rate to adapt to each parameter. It makes big updates for infrequently updated parameters and small updates for frequently updated ones, and for this reason it is well-suited for dealing with sparse data. In other words, it uses a different learning rate for every parameter θ at each time step, based on the past gradients that were computed for that parameter. The main benefit of Adagrad is that we don’t need to manually tune the learning rate. Most implementations use a default value of 0.01 and leave it at that.
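
The per-parameter adaptation can be sketched as follows: each parameter keeps a running sum of its own squared gradients, and its step is divided by the square root of that sum. The helper name, the small constant `eps`, and the toy gradient sequence are assumptions made for the example.

```python
import math


def adagrad_step(theta, grad, cache, lr=0.01, eps=1e-8):
    """One Adagrad update for a list of parameters."""
    # Accumulate squared gradients separately for every parameter.
    new_cache = [c + g * g for c, g in zip(cache, grad)]
    # Parameters with a large accumulated history take smaller steps.
    new_theta = [t - lr * g / (math.sqrt(c) + eps)
                 for t, g, c in zip(theta, grad, new_cache)]
    return new_theta, new_cache


theta = [0.0, 0.0]
cache = [0.0, 0.0]
step_sizes = [0.0, 0.0]
for step in range(10):
    # Parameter 0 gets a gradient every step; parameter 1 only on the last.
    grad = [1.0, 1.0 if step == 9 else 0.0]
    new_theta, cache = adagrad_step(theta, grad, cache)
    step_sizes = [abs(n - t) for n, t in zip(new_theta, theta)]
    theta = new_theta
```

On the final step both parameters see the same gradient, yet the rarely updated parameter 1 moves by about 0.01 while the frequently updated parameter 0 moves by only about 0.003.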


RMSProp

RMSProp is yet another modification to SGD. Read these to understand RMSProp -


Adam

Adam basically combines the advantages of RMSProp & AdaGrad. RMSProp adapts the per-parameter learning rates using a running average of the squared gradients (the second moment, i.e. the uncentered variance); Adam additionally keeps a running average of the gradients themselves (the first moment, i.e. the mean). Read the following to understand Adam Optimizer -
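
A single Adam update for a scalar parameter can be sketched like this. The default hyperparameters shown are the commonly quoted ones, and the bias-correction terms follow the usual formulation of the algorithm; the function name and the example gradients are illustrative assumptions.

```python
import math


def adam_step(theta, grad, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (uncentered)
    m_hat = m / (1 - beta1 ** t)                # correct the bias towards 0
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# The first step has size ~lr no matter how large the gradient is.
theta_small, _, _ = adam_step(theta=0.0, grad=-6.0, m=0.0, v=0.0, t=1)
theta_large, _, _ = adam_step(theta=0.0, grad=-600.0, m=0.0, v=0.0, t=1)
```

Dividing the mean gradient by the root of the mean squared gradient makes the step size largely independent of the gradient's scale, which is one reason Adam needs little learning-rate tuning.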


A Restricted Boltzmann Machine (RBM) is a generative stochastic artificial neural network that can learn a probability distribution over its set of inputs. First proposed by Paul Smolensky under the name Harmonium and later popularised by Geoffrey Hinton, it is an algorithm useful for dimensionality reduction, classification, regression, collaborative filtering, feature learning and topic modeling. Given their relative simplicity and historical importance, restricted Boltzmann machines are the first neural network we'll tackle.

Hinton showed that RBMs can be stacked and trained in a greedy manner to form so-called Deep Belief Networks (DBNs). DBNs are graphical models which learn to extract a deep hierarchical representation of the training data. They are an amalgamation of probability and statistics with machine learning and neural networks. Deep Belief Networks consist of multiple layers of hidden units, with connections between the layers but not between the units within a layer. The main aim is to help the system classify the data into different categories.

  • This video will serve to provide a gentle introduction including the architecture & applications of DBNs.

Convolutional Neural Networks (CNNs)

Convolutional neural networks (ConvNets or CNNs) are one of the main categories of neural networks used for image recognition and image classification. Object detection and face recognition are some of the areas where CNNs are widely used. Broadly speaking, to train and test a deep learning CNN model, each input image is passed through a series of convolution layers with filters (kernels), pooling layers and fully connected (FC) layers, and a Softmax function is then applied to classify the object with probabilistic values between 0 and 1.

  • To get started with intuition behind developing & using CNNs, here is an elaborate article.
  • This article provides the necessary mathematical background to understand & implement your own CNNs.
  • Here is a video tutorial to help you understand the basic mechanics of CNN layers.
  • After understanding the basics of CNNs, this tutorial will walk you through some advanced concepts like saliency maps, dilated convolutions, etc. This tutorial will also introduce you to the architectures of some classical ConvNets that have taken the world by storm.
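
The building blocks named above (convolution with a kernel, pooling, and softmax) can each be sketched in plain Python on nested lists. This is a toy illustration with made-up data and hypothetical helper names, using stride 1, no padding, and 2x2 max pooling as assumptions; real CNN layers also handle channels, batches and learned kernels.

```python
import math


def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    return [[max(feature_map[i][j], feature_map[i][j + 1],
                 feature_map[i + 1][j], feature_map[i + 1][j + 1])
             for j in range(0, len(feature_map[0]) - 1, 2)]
            for i in range(0, len(feature_map) - 1, 2)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# A 4x4 image with a vertical edge, and a kernel that responds to such edges.
image = [[0, 0, 1, 1]] * 4
kernel = [[-1, 1],
          [-1, 1]]

feature_map = conv2d_valid(image, kernel)   # strongest response at the edge
pooled = max_pool_2x2(feature_map)
probabilities = softmax([2.0, 1.0, 0.1])    # e.g. scores for three classes
```

The feature map lights up only in the column where the dark-to-bright edge sits, which is the sense in which a filter "detects" a pattern.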


Autoencoders

An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal noise. Autoencoders consist of 3 components: encoder, code and decoder. The encoder compresses the input and produces the code; the decoder then reconstructs the input using only this code.

Sequence Modeling

Sequence Modeling is the task of predicting what word/letter comes next. Unlike in feedforward NNs and CNNs, in sequence modeling the current output depends on the previous inputs, and the length of the input is not fixed. Sequence models, in supervised learning, can be used to address a variety of applications including financial time series prediction, speech recognition, music generation, sentiment classification, machine translation and video activity recognition.

Sequence problems can be broadly categorized into the following categories:

  • One-to-One: There is one input and one output. A typical example of a one-to-one sequence problem is the case where you have an image and you want to predict a single label for the image.
  • Many-to-One: In many-to-one sequence problems, we have a sequence of data as input and we have to predict a single output. Text classification is a prime example of many-to-one sequence problems where we have an input sequence of words and we want to predict a single output tag.
  • One-to-Many: In one-to-many sequence problems, we have a single input and a sequence of outputs. A typical example is an image and its corresponding description.
  • Many-to-Many: Many-to-many sequence problems involve a sequence input and a sequence output. For instance, stock prices of 7 days as input and stock prices of next 7 days as outputs. Chatbots are also an example of many-to-many sequence problems where a text sequence is an input and another text sequence is the output.

Recurrent Neural Networks (RNNs)

Traditional feedforward neural networks do not share features across different positions of the network. In other words, these models assume that all inputs (and outputs) are independent of each other. Such a model would not work in sequence prediction, since the previous inputs are inherently important in predicting the next output. Recurrent Neural Networks (RNNs) are a type of neural network where the output from the previous step is fed as input to the current step. The main and most important feature of an RNN is its hidden state, which remembers some information about a sequence.
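
The role of the hidden state can be sketched with a bare-bones recurrent step. The tanh activation, the tiny weight matrices, and the helper names are assumptions chosen for illustration; the same weights are reused at every time step, which is what lets the network handle sequences of any length.

```python
import math


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """h_t = tanh(W_xh . x_t + W_hh . h_prev + b_h)"""
    return [math.tanh(sum(w * x for w, x in zip(W_xh[i], x_t))
                      + sum(w * h for w, h in zip(W_hh[i], h_prev))
                      + b_h[i])
            for i in range(len(b_h))]


def run_rnn(sequence, W_xh, W_hh, b_h):
    """Feed a sequence through the cell and return the final hidden state."""
    h = [0.0] * len(b_h)  # the hidden state starts empty
    for x_t in sequence:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    return h


# Made-up weights: 1 input feature, 2 hidden units.
W_xh = [[0.5], [-0.3]]
W_hh = [[0.1, 0.2], [0.0, 0.4]]
b_h = [0.0, 0.0]

h_ab = run_rnn([[1.0], [2.0]], W_xh, W_hh, b_h)
h_ba = run_rnn([[2.0], [1.0]], W_xh, W_hh, b_h)
h_long = run_rnn([[1.0], [2.0], [0.5]], W_xh, W_hh, b_h)
```

Feeding the same inputs in a different order gives a different final hidden state, which shows that the state carries information about what came earlier in the sequence.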

Long Short Term Memory (LSTM)

RNNs suffer from short-term memory. If a sequence is long enough, they’ll have a hard time carrying information from earlier time steps to later ones. During backpropagation, recurrent neural networks suffer from the vanishing gradient problem (discussed earlier). LSTMs have internal mechanisms called gates that can regulate the flow of information. These gates can learn which data in a sequence is important to keep or throw away. By doing that, they can pass relevant information down the long chain of sequences to make predictions. The success of LSTMs lies in their claim to be one of the first implementations to overcome these technical problems and deliver on the promise of recurrent neural networks.

Gated Recurrent Units (GRU)

Gated recurrent units are a gating mechanism in recurrent neural networks, introduced in 2014. GRUs are like LSTMs with a forget gate but have fewer parameters than LSTMs, as they lack an output gate.


The encoder-decoder architecture for recurrent neural networks is the standard method to address sequence-to-sequence problems, sometimes called seq2seq. Here is a good tutorial (with code) to understand the working of encoder-decoder RNN models for Machine Translation.

Attention is a mechanism that was developed to improve the performance of the Encoder-Decoder RNN on machine translation. It is proposed as a solution to the limitation of the Encoder-Decoder model encoding the input sequence to one fixed length vector from which to decode each output time step. This issue is believed to be more of a problem when decoding long sequences.
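
The core idea, letting the decoder look at all encoder states instead of one fixed-length vector, can be sketched as follows. Note that this example uses simple dot-product scoring, which is a swapped-in simplification; the attention mechanism originally proposed for encoder-decoder RNNs scores states with a small learned network. All names and numbers are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return attention weights and the weighted-sum context vector."""
    # Score each encoder state by its dot product with the decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    # The context is a weighted average of all encoder states.
    context = [sum(w * state[k] for w, state in zip(weights, encoder_states))
               for k in range(len(encoder_states[0]))]
    return weights, context


# One (made-up) encoder state per input position.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_states)
```

A fresh set of weights is computed at every decoding step, so each output position can focus on a different part of the input sequence.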

Generative Adversarial Networks (GANs)

Generative adversarial networks (GANs) are algorithmic architectures that use two neural networks, pitting one against the other (thus the “adversarial”) in order to generate new, synthetic instances of data that can pass for real data. They are used widely in image generation, video generation and voice generation. To get a feel for the potential of GANs, here is an interesting read about some really cool applications of GANs.

Broadly in GANs, one neural network, called the generator, generates new data instances, while the other, the discriminator, evaluates them for authenticity. The goal of the generator is to lie (generate images/video/sound) without being caught. The goal of the discriminator is to identify images/videos/sounds coming from the generator as fake.

Two major problems that generally occur with the GANs:

  • Discriminator overpowering Generator: Sometimes the discriminator begins to classify all generated examples as fake due to the slightest differences. To rectify this, we leave the output of the discriminator unscaled instead of passing it through a sigmoid (which squashes every value into the range between zero and one).
  • Mode Collapse: The generator discovers some potential weakness in the discriminator and exploits that weakness to continually produce a similar example regardless of variation in input.

There are several types of GANs based on their architecture:

If you want to get started with Deep Learning, specifically GANs, follow this Guide handcrafted by us.

See Also