Convolutional Neural Network (CNN): CNNs are used largely for image processing tasks (classification, object detection, localization, etc.) and consist of four major operations, namely convolution, non-linearity, pooling, and classification, as described below:

**1. Convolution:** The primary purpose of convolution in a ConvNet is to extract features from the input image. Convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. As we discussed above, every image can be considered as a matrix of pixel values. Consider a 5 x 5 image whose pixel values are only 0 and 1:

Also, consider another 3 x 3 matrix as shown below:

Then, the Convolution of the 5 x 5 image and the 3 x 3 matrix can be computed as shown in the animation below:

We slide the orange matrix over our original image (green) 1 pixel at a time (this step size is called the **‘stride’**) and, for every position, we compute the element-wise multiplication between the two matrices and add the products to get a single integer, which forms one element of the output matrix (pink). Note that the 3×3 matrix “sees” only a part of the input image at each position.

In CNN terminology, the 3×3 matrix is called a ‘**filter**’ and the matrix formed by sliding the filter over the image and computing the dot product is called the ‘Convolved Feature’, the ‘Activation Map’, or the ‘**Feature Map**’.
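To make the sliding-window computation concrete, here is a minimal NumPy sketch. The pixel and filter values are illustrative assumptions (the original figures are not reproduced here), and `convolve2d` is a hypothetical helper name, not a library function.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    feature_map = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = image[r:r + k_h, c:c + k_w]
            # Element-wise multiply, then sum: one element of the output.
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

# Assumed example values: a 5 x 5 binary image and a 3 x 3 filter.
image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

feature_map = convolve2d(image, kernel, stride=1)
print(feature_map)
# [[4 3 4]
#  [2 4 3]
#  [2 3 4]]
```

With a 5 x 5 input, a 3 x 3 filter, and a stride of 1, the feature map has (5 − 3) / 1 + 1 = 3 rows and columns. Like most deep learning libraries, this sketch does not flip the filter, so strictly speaking it computes a cross-correlation.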

Convolution operation on an image:

Notice how these two different filters generate different feature maps from the same original image. In practice, a CNN *learns* the values of these filters on its own during the training process. The more filters we have, the more image features get extracted and the better our network becomes at recognizing patterns in unseen images.
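As a rough illustration of how two filters respond differently to the same image, the sketch below applies two hand-written edge filters to a small synthetic image. The image, the filter values, and the helper `convolve2d` are all assumptions made for this example; a trained CNN would learn its filter values itself.

```python
import numpy as np

def convolve2d(image, kernel):
    """Stride-1 convolution without padding, as described in the text."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    return np.array([
        [np.sum(image[i:i + k_h, j:j + k_w] * kernel) for j in range(out_w)]
        for i in range(out_h)
    ])

# Hypothetical 5 x 5 image: dark left side (0), bright right side (1).
image = np.array([[0, 0, 1, 1, 1]] * 5)

# Two hand-written 3 x 3 filters (assumed values, for illustration only).
vertical_edge = np.array([[-1, 0, 1]] * 3)
horizontal_edge = vertical_edge.T

vertical_map = convolve2d(image, vertical_edge)
horizontal_map = convolve2d(image, horizontal_edge)

print(vertical_map)    # non-zero where the dark-to-bright edge lies
print(horizontal_map)  # all zeros: this image has no horizontal edge
```

The same input produces two different feature maps because each filter responds to a different pattern.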

**2. Non-Linearity (ReLU):**

ReLU stands for Rectified Linear Unit and is a non-linear operation. The purpose of ReLU is to introduce non-linearity into our ConvNet, since most of the real-world data we would want our ConvNet to learn is non-linear. Its output is given by:

**Output = max(0, Input)**

ReLU is applied element-wise and replaces all negative pixel values in the feature map with zero.
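A minimal NumPy sketch of this operation follows. The feature map values are made up for illustration, and `relu` is a hypothetical helper name.

```python
import numpy as np

def relu(feature_map):
    """Element-wise max(0, x): negative values become zero."""
    return np.maximum(0, feature_map)

# Hypothetical feature map containing negative values.
feature_map = np.array([
    [ 3, -1,  0],
    [-5,  2, -2],
    [ 1, -4,  6],
])

rectified = relu(feature_map)
print(rectified)
# [[3 0 0]
#  [0 2 0]
#  [1 0 6]]
```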

**3. Pooling (to reduce the dimensions of each feature map):**

Spatial Pooling (also called subsampling or downsampling) reduces the dimensionality of each feature map but retains the most important information. Spatial Pooling can be of different types, such as Max Pooling and Average Pooling. Max Pooling, for instance, keeps only the largest value in each small window of the feature map.
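The sketch below shows one way pooling could be written in NumPy. It assumes a 2 x 2 window with a stride of 2 (a common choice, not something the text specifies), and the feature map values and the helper `pool2d` are hypothetical.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Reduce each `size` x `size` window to a single value."""
    if mode not in ("max", "average"):
        raise ValueError("mode must be 'max' or 'average'")
    reducer = np.max if mode == "max" else np.mean
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            pooled[i, j] = reducer(feature_map[r:r + size, c:c + size])
    return pooled

# Hypothetical 4 x 4 feature map (for example, the output of ReLU).
feature_map = np.array([
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)   # [[6. 8.]
                    #  [3. 4.]]
print(avg_pooled)   # [[3.25 5.25]
                    #  [2.   2.  ]]
```

Both variants shrink the 4 x 4 map to 2 x 2. Max pooling keeps the strongest response in each window, while average pooling keeps the mean.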

**4. Fully Connected (Dense) Layer for Classification:**

The term “Fully Connected” implies that every neuron in the previous layer is connected to every neuron in the next layer. The output from the convolutional and pooling layers represents high-level features of the input image. The purpose of the Fully Connected layer is to use these features to classify the input image into various classes based on the training dataset.
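As a sketch of what “every neuron connected to every neuron” means in practice, a fully connected layer can be written as a matrix multiplication over the flattened feature maps. The shapes, the random values, and the choice of 4 classes below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pooled output: 2 feature maps, each of size 2 x 2.
pooled_maps = rng.random((2, 2, 2))

# Flatten all feature maps into a single feature vector.
features = pooled_maps.reshape(-1)          # shape (8,)

# One weight per (class, feature) pair: every input feeds every output.
num_classes = 4                             # assumed number of classes
weights = rng.normal(scale=0.1, size=(num_classes, features.size))
biases = np.zeros(num_classes)

# One raw score per class; these are not yet probabilities.
scores = weights @ features + biases
print(scores.shape)  # (4,)
```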

We will use the Softmax function for the final classification. The **Softmax function** takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one.
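A minimal NumPy sketch of Softmax is shown below. The input scores are made up, and subtracting the maximum score is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Map real-valued scores to probabilities that sum to one."""
    shifted = scores - np.max(scores)   # avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical raw scores for 4 classes.
scores = np.array([2.0, 1.0, 0.1, -1.0])
probabilities = softmax(scores)

print(probabilities)            # four values between 0 and 1
print(probabilities.sum())      # sums to one, up to rounding
print(np.argmax(probabilities)) # index of the most likely class
```

The class with the highest score also receives the highest probability, so the predicted class is the index of the largest output.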

#### The overall training process of the Convolutional Network may be summarized as follows:

**Step 1:** We initialize all filters and parameters/weights with random values.

**Step 2:** The network takes a training image as input, goes through the forward propagation step (convolution, ReLU, and pooling operations, along with forward propagation in the Fully Connected layer), and finds the output probabilities for each class.

**Step 3:** Calculate the total error at the output layer (summation over all 4 classes).

**Total Error = ∑ ½ (target probability − output probability)²**

**Step 4:** Use backpropagation to calculate the *gradients* of the error with respect to all weights in the network, and use *gradient descent* to update all filter values, weights, and other parameters to minimize the output error.

**Step 5:** Repeat steps 2–4 for all images in the training set.
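The total error from Step 3 can be computed as in the sketch below. The target and output probabilities are hypothetical values for a 4-class problem in which the third class is the correct one.

```python
import numpy as np

def total_error(target, output):
    """Sum over classes of 0.5 * (target - output)^2."""
    return np.sum(0.5 * (target - output) ** 2)

# Hypothetical values: the image belongs to class index 2.
target = np.array([0.0, 0.0, 1.0, 0.0])
output = np.array([0.2, 0.4, 0.1, 0.3])

error = total_error(target, output)
print(error)  # approximately 0.55
```

A perfect prediction would give an error of zero. The further the output probabilities are from the target, the larger the error that gradient descent has to reduce.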

The above steps *train* the ConvNet. This essentially means that all the weights and parameters of the ConvNet have now been optimized to correctly classify images from the training set.
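The five steps can be sketched end to end in NumPy under a strong simplification: only a single fully connected layer with Softmax is trained, and fixed feature vectors stand in for the output of the convolution, ReLU, and pooling stages. A real ConvNet would also backpropagate into its filters. All data, sizes, and the learning rate below are assumptions chosen for illustration.

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / np.sum(exps)

rng = np.random.default_rng(0)
num_classes, num_features = 4, 4

# Hypothetical training set: one feature vector per class, standing in
# for the flattened output of the convolution / ReLU / pooling stages.
features = np.eye(num_features) + 0.1 * rng.random((4, num_features))
targets = np.eye(num_classes)            # one-hot target probabilities

# Step 1: initialize the weights with small random values.
weights = rng.normal(scale=0.1, size=(num_classes, num_features))
biases = np.zeros(num_classes)
learning_rate = 0.5                      # assumed value

epoch_errors = []
for epoch in range(200):
    epoch_error = 0.0
    for x, t in zip(features, targets):  # Step 5: every training example
        # Step 2: forward propagation to class probabilities.
        p = softmax(weights @ x + biases)
        # Step 3: total error = sum of 0.5 * (target - output)^2.
        epoch_error += np.sum(0.5 * (t - p) ** 2)
        # Step 4: backpropagation through the error and the Softmax,
        # followed by a gradient descent update.
        grad_p = p - t                                   # dE/dp
        grad_scores = p * (grad_p - np.dot(grad_p, p))   # dE/dscores
        weights -= learning_rate * np.outer(grad_scores, x)
        biases -= learning_rate * grad_scores
    epoch_errors.append(epoch_error)

print(epoch_errors[0], epoch_errors[-1])  # error before and after training

# Forward propagation on a new, unseen feature vector (hypothetical).
new_features = np.array([0.05, 0.9, 0.1, 0.0])
new_probabilities = softmax(weights @ new_features + biases)
predicted_class = int(np.argmax(new_probabilities))
print(predicted_class)
```

The error summed over the training set should shrink as the epochs progress, and the unseen vector, which resembles the second training example, should be assigned to the same class.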

When a new (unseen) image is input into the ConvNet, the network goes through the forward propagation step and outputs a probability for each class. For a new image, the output probabilities are calculated using the weights that were optimized to correctly classify the training examples.

Source : https://medium.com/@ravishankar_22148/a-picture-is-worth-a-thousand-words-lets-figure-out-the-relevant-ones-15cbb56443ae