Learn Convolution through Understanding torch.nn.Conv2d() Class
A good way to understand an algorithm is to get your hands dirty and implement it from scratch, or to learn from other people's code. The purpose of this article is to understand how convolution works in 2D and, along the way, to get a grasp of the torch Conv2d class.
Instead of relying only on theory or only on code, this article looks at PyTorch's library implementation of 2D convolution and tries to build an in-depth view of the algorithm.
torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros')
The API of Conv2d is defined as above in the PyTorch framework. Simple, right? First of all, Conv2d is one of the many pre-implemented layers in the torch.nn module.
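For reference, the formula given in the documentation [1] for an input of size (N, C_in, H, W) can be transcribed roughly as below, where \star denotes the valid 2D cross-correlation operator. This is a transcription for convenience; [1] remains the authoritative form.

```latex
\mathrm{out}(N_i, C_{\mathrm{out}_j}) = \mathrm{bias}(C_{\mathrm{out}_j})
  + \sum_{k=0}^{C_{\mathrm{in}}-1} \mathrm{weight}(C_{\mathrm{out}_j}, k) \star \mathrm{input}(N_i, k)
```

In words: each output channel j is its bias plus the sum, over all input channels k, of that input channel cross-correlated with the matching slice of filter j.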
The seemingly complicated formula in the documentation [1] simply describes the relationship between the input and the output of a convolution layer. The following figure from Stanford CS231n [2] demonstrates how convolution works. In the figure, the input is assumed to be a 7x7x3 tensor, where the number of input channels (in_channels) is 3, representing the R, G, B color channels of an image.
The number of output channels depends on how many filters are applied. In the above example, out_channels=2, so 2 filters of dimensions 3x3x3 (cubes) are used.
As its name suggests, kernel_size is the size of the kernel. Its value can be an integer or a tuple. A single integer is used for both height and width, for example kernel_size=3 means a 3x3 kernel, while a tuple such as (3, 5) gives the height and width separately.
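The shapes described above can be checked directly. The sketch below mirrors the CS231n-style setup (3 input channels, 2 filters, 3x3 kernel); stride=2 is an assumption made here so that a 7x7 input yields a 3x3 output, since the text above does not state the stride.

```python
import torch
import torch.nn as nn

# 3 input channels (R, G, B), 2 filters, 3x3 kernel.
# stride=2 is an assumption for this sketch, not something stated in the text.
conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3, stride=2)

# The weight holds one 3x3x3 filter per output channel:
# (out_channels, in_channels, kernel_height, kernel_width)
square_weight_shape = tuple(conv.weight.shape)

# PyTorch expects (batch, channels, height, width), so 7x7x3 becomes (1, 3, 7, 7).
x = torch.randn(1, 3, 7, 7)
out_shape = tuple(conv(x).shape)

# A tuple kernel_size sets height and width separately.
rect = nn.Conv2d(3, 2, kernel_size=(3, 5))
rect_weight_shape = tuple(rect.weight.shape)

print(square_weight_shape, out_shape, rect_weight_shape)
```

Note that an integer kernel_size only fixes the spatial extent; the filter always spans all input channels as well.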
self.conv1 = nn.Conv2d(3, 6, 5)
The code above deserves a closer look. We build a convolution layer with 3 input channels, 6 output channels and a 5x5 kernel. This means the layer uses 6 different filters, each of size 5x5x3. Each filter spans all 3 input channels, sums over them, and produces a single output channel, so 6 filters give 6 output channels.
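To make this concrete, here is a small sketch that inspects the layer's parameters and recomputes one output value by hand. The 32x32 input size and the chosen output position are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv1 = nn.Conv2d(3, 6, 5)

# 6 filters, each covering 3 input channels with a 5x5 window, plus one bias per filter.
weight_shape = tuple(conv1.weight.shape)
bias_shape = tuple(conv1.bias.shape)

x = torch.randn(1, 3, 32, 32)  # arbitrary example input: (batch, channels, H, W)
y = conv1(x)
out_shape = tuple(y.shape)     # 32 - 5 + 1 = 28 along each spatial dimension

# Recompute a single output value manually: filter j applied to the
# 3x5x5 input patch whose top-left corner is at (r, c).
j, r, c = 4, 10, 7
patch = x[0, :, r:r + 5, c:c + 5]
manual = (conv1.weight[j] * patch).sum() + conv1.bias[j]
matches = torch.allclose(manual, y[0, j, r, c], atol=1e-4)

print(weight_shape, bias_shape, out_shape, matches)
```

The manual value is an elementwise product summed over all three input channels, which is why one filter yields exactly one output channel.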
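Stride and padding are straightforward concepts: stride is the step by which the kernel moves across the input, and padding is the border (zeros by default) added around the input. No further theory will be given here.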
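As a quick check of how these two parameters affect the output size, the sketch below compares the output-size formula from the documentation [1] against actual layers. The helper name conv_out_size is hypothetical, introduced only for this example.

```python
import torch
import torch.nn as nn


def conv_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension (hypothetical helper)."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


x = torch.randn(1, 3, 32, 32)

# padding=1 with a 3x3 kernel and stride=1 keeps the spatial size.
same = nn.Conv2d(3, 6, kernel_size=3, stride=1, padding=1)
# stride=2 roughly halves it.
half = nn.Conv2d(3, 6, kernel_size=3, stride=2, padding=1)

same_shape = tuple(same(x).shape)
half_shape = tuple(half(x).shape)

predicted_same = conv_out_size(32, 3, stride=1, padding=1)
predicted_half = conv_out_size(32, 3, stride=2, padding=1)

print(same_shape, half_shape, predicted_same, predicted_half)
```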
Dilation is a special way to perform convolution; the figure from [3] best describes how it works. How and why dilation works will be described in a future article. The value controls the spacing between kernel elements. In PyTorch, the default dilation=1 means the kernel elements are adjacent, which is an ordinary convolution, whereas a kernel with a one-element gap between its elements corresponds to dilation=2.
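The effect of the dilation value can be seen in a small sketch. It assumes a 7x7 single-channel input and also shows that a 3x3 kernel with dilation=2 behaves like a 5x5 kernel whose in-between entries are zero.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 7, 7)

plain = nn.Conv2d(1, 1, kernel_size=3, dilation=1)    # ordinary convolution
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)  # one-element gaps

plain_shape = tuple(plain(x).shape)
dilated_shape = tuple(dilated(x).shape)

# The dilated layer still has only 3x3 learnable weights...
dilated_weight_shape = tuple(dilated.weight.shape)

# ...but it covers a 5x5 window. Spread the 3x3 weights over a 5x5 kernel
# with zeros in between and apply it as an ordinary convolution.
sparse = torch.zeros(1, 1, 5, 5)
sparse[:, :, ::2, ::2] = dilated.weight.detach()
equivalent = F.conv2d(x, sparse, dilated.bias.detach())
same_result = torch.allclose(equivalent, dilated(x), atol=1e-4)

print(plain_shape, dilated_shape, dilated_weight_shape, same_result)
```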
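Finally, the groups parameter controls the connections between input channels and output channels. Both in_channels and out_channels must be divisible by groups. With the default value of 1, every output channel is computed from all input channels. With groups=2, the layer behaves like two convolutions side by side, each seeing half of the input channels and producing half of the output channels.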
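A minimal sketch of this behaviour follows. The channel counts (4 in, 8 out) and the 8x8 input are arbitrary assumptions chosen so that they divide evenly by the group counts used.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

full = nn.Conv2d(4, 8, kernel_size=3, groups=1)       # every output sees all 4 inputs
grouped = nn.Conv2d(4, 8, kernel_size=3, groups=2)    # each output sees 2 inputs
depthwise = nn.Conv2d(4, 8, kernel_size=3, groups=4)  # each output sees 1 input

# Weight shape is (out_channels, in_channels / groups, kH, kW).
full_shape = tuple(full.weight.shape)
grouped_shape = tuple(grouped.weight.shape)
depthwise_shape = tuple(depthwise.weight.shape)

# With groups=2, output channels 0..3 depend only on input channels 0..1.
# Zeroing input channels 2..3 must leave the first half of the output unchanged.
x = torch.randn(1, 4, 8, 8)
x_modified = x.clone()
x_modified[:, 2:] = 0.0
with torch.no_grad():
    y, y_modified = grouped(x), grouped(x_modified)
first_half_unchanged = torch.allclose(y[:, :4], y_modified[:, :4], atol=1e-6)
second_half_changed = not torch.allclose(y[:, 4:], y_modified[:, 4:], atol=1e-6)

print(full_shape, grouped_shape, depthwise_shape)
print(first_half_unchanged, second_half_changed)
```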
Future articles will discuss the other APIs in more detail.
Reference:
[1] https://pytorch.org/docs/stable/nn.html#conv2d
[2] https://cs231n.github.io/convolutional-networks/#conv
[3] https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md