The central operation in the convolutional layer of a CNN is a *convolution*. In this section, we will describe convolutions.

Convolutions work with a small matrix called a *kernel*. The *dot product* is taken between a slice of the input image and the kernel. Here, the dot product means multiplying two equally sized matrices element by element and then adding up the results. For example, the dot product *A ∙ B* between two matrices A and B is calculated as follows:

All we do is multiply the values at corresponding positions in the two matrices. For example, the top left corner gives *1 x 0 = 0*, the top right gives *25 x 4 = 100*, the bottom left gives *3 x 2 = 6*, and the bottom right gives *7 x 70 = 490*. Then we add these values up to get the final dot product: *A ∙ B* = 0 + 100 + 6 + 490 = 596.
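The arithmetic above can be checked with a short Python sketch. The entries of A and B are read off the four products quoted in the text; the helper name `dot_product` is only illustrative, and plain lists are used to keep the example dependency-free.

```python
# A and B are reconstructed from the products quoted in the text
# (1 x 0, 25 x 4, 3 x 2, 7 x 70).
A = [[1, 25],
     [3, 7]]
B = [[0, 4],
     [2, 70]]

def dot_product(a, b):
    """Multiply matching entries of two same-shaped matrices, then sum them."""
    return sum(x * y
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))

result = dot_product(A, B)
print(result)  # 596
```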

To get a full convolution, we need to “slide” the kernel across the image, taking the dot product for each slice of the image. Let’s see an example. We have this kernel:

We apply this kernel across this input image:

We first take the dot product of the kernel with the upper left 2x2 slice of the image, shown in the dotted lines below:

This corresponds to the following dot product:

The final dot product for this portion of the image is: 1 x 4 + (-5) x 5 + (-3) x 9 + 7 x 2 = -34. This gives us the first number of the next convolutional layer:

To fill in the question mark on the upper right side of the next convolutional layer, we take the dot product of the kernel and the upper right side of the input image, specifically this portion of the image:

This corresponds to the following dot product:

The final dot product for this portion of the image is: 1 x 1 + (-5) x 2 + (-3) x 7 + 7 x 2 = -16.

We do this for the bottom left and bottom right portions of the input image, and this gives us the final result matrix:
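The sliding procedure can be sketched in a few lines of Python. The kernel values and the top two rows of the image are taken from the arithmetic in the text. Everything else is an assumption made for illustration: the image is taken to be 4x4, the kernel is moved two positions at a time (the two slices in the text do not overlap), and the bottom two rows are made up. The function name `convolve2d` is also hypothetical.

```python
def dot_product(a, b):
    """Multiply matching entries of two same-shaped matrices, then sum them."""
    return sum(x * y
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b))

def convolve2d(image, kernel, stride=1):
    """Slide the kernel across the image, taking a dot product at each position."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for r in range(0, len(image) - k_rows + 1, stride):
        out_row = []
        for c in range(0, len(image[0]) - k_cols + 1, stride):
            # Cut out the slice of the image currently under the kernel.
            window = [row[c:c + k_cols] for row in image[r:r + k_rows]]
            out_row.append(dot_product(window, kernel))
        out.append(out_row)
    return out

kernel = [[1, -5],
          [-3, 7]]
image = [[4, 5, 1, 2],
         [9, 2, 7, 2],
         [3, 0, 6, 1],   # hypothetical row, not from the text
         [8, 4, 2, 5]]   # hypothetical row, not from the text

result = convolve2d(image, kernel, stride=2)
print(result)  # [[-34, -16], [7, 30]]
```

The top row of the output reproduces the -34 and -16 computed above; the bottom row depends entirely on the made-up values.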

The kernel can be thought of as a *weight matrix*. Just as we learn m and b weights in linear and logistic regression and the edge weights in dense neural networks, the weights we learn in the convolutional layers of a CNN are the values of the kernels that we slide across the input image.

Several kernel matrices are slid across the input image, and each of them produces a new matrix. Those new matrices have kernels of their own, each with separate weight values. We apply this process several times to get the basic architecture of a CNN:

This pairing of image-derived matrices and kernel matrices is repeated for several layers. After several of these convolutional layers, a few *dense layers* are placed at the end of the network. Putting this together, a complete CNN looks like this:
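As a rough illustration of this layout, the sketch below chains two convolutional layers and one dense unit in plain Python. It is heavily simplified: each layer uses a single kernel rather than several, the input image and all weights are hypothetical hand-picked numbers (a real network would learn them), and the function names `convolve2d` and `dense` are illustrative.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image one position at a time."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out_cols)]
            for r in range(out_rows)]

def dense(inputs, weights, bias):
    """One dense unit: a weighted sum of every input value, plus a bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

# Hypothetical 4x4 input image and hand-picked kernel weights.
image = [[1, 0, 2, 1],
         [0, 1, 1, 0],
         [2, 1, 0, 1],
         [1, 0, 1, 2]]
kernel_1 = [[1, 0],
            [0, 1]]
kernel_2 = [[0, 1],
            [1, 0]]

layer_1 = convolve2d(image, kernel_1)        # 4x4 input -> 3x3 matrix
layer_2 = convolve2d(layer_1, kernel_2)      # 3x3 matrix -> 2x2 matrix
flat = [v for row in layer_2 for v in row]   # flatten to 4 values for the dense layer
output = dense(flat, weights=[0.5, -0.5, 0.25, 0.25], bias=0.1)

print(layer_2)  # [[2, 3], [3, 4]]
```

Each convolution shrinks the matrix, and the dense unit at the end combines every remaining value into a single number.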

In practice, there are a few extra modifications made to the basic CNN architecture above. Fortunately, all of these modifications are pretty straightforward to understand.

One of those is called *pooling*. The idea behind pooling is simple: you move a sliding window across the input matrix – just like in a convolution – except instead of calculating the dot product, you calculate a summary statistic about the contents of the window. Here is an example input matrix and the result of *max pooling* with a 2x2 sliding window, where the maximum of each window is calculated:

Taking all of the maximum values, we get the input matrix to the next convolutional layer:
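Max pooling can be sketched the same way as a convolution, with `max` in place of the dot product. The input matrix below is hypothetical, and the sketch assumes the 2x2 windows do not overlap (the window moves two positions at a time); the function name `max_pool` is illustrative.

```python
def max_pool(matrix, size=2, stride=2):
    """Slide a size x size window across the matrix and keep each window's maximum."""
    out = []
    for r in range(0, len(matrix) - size + 1, stride):
        out_row = []
        for c in range(0, len(matrix[0]) - size + 1, stride):
            # Collect every value inside the current window.
            window = [v for row in matrix[r:r + size] for v in row[c:c + size]]
            out_row.append(max(window))
        out.append(out_row)
    return out

# Hypothetical 4x4 input matrix.
matrix = [[1, 3, 2, 4],
          [5, 6, 1, 2],
          [7, 2, 9, 0],
          [1, 4, 3, 8]]

pooled = max_pool(matrix)
print(pooled)  # [[6, 4], [7, 9]]
```

Swapping `max` for another summary statistic, such as the mean of the window, gives other pooling variants with the same sliding logic.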

The benefit of max pooling is that it down-samples the input representation, reducing its dimensionality. It also keeps the “sharpest” features of an image (the ones with the highest pixel values in each window). Why max pooling works as well as it does is not fully understood, but it has performed well in experiments.