Building upon the fundamentals covered in Part 1, this guide delves into the specifics of Convolutional Neural Networks (ConvNets), providing a more detailed understanding of their architecture and functionality. While some topics are inherently complex, this post aims to provide a concise yet comprehensive overview, with links to research papers for deeper exploration.
Stride and Padding: Fine-Tuning Convolutional Layers
Let’s revisit the core of ConvNets: convolutional layers. Recall from Part 1 the filters, the receptive fields, and the convolution operation itself. Beyond the filter size, two crucial parameters determine a layer’s behavior: stride and padding.
Stride dictates how far the filter shifts across the input volume at each step of the convolution. In Part 1, the filter shifted one unit at a time, implying a stride of 1. The stride is typically chosen so that the output volume has integer dimensions, avoiding fractional results.
Consider a 7×7 input volume, a 3×3 filter, and a stride of 1. This is the standard scenario:
Standard convolution with a stride of 1, illustrating the movement of the filter across the input volume.
Now, let’s increase the stride to 2 and observe the effect on the output volume:
Convolution with a stride of 2, showing how the receptive field shifts by two units, reducing the output volume size.
With a stride of 2, the receptive field shifts by two units at a time, shrinking the output volume from 5×5 to 3×3. A stride of 3, in this case, would not fit: the 3×3 filter cannot tile the 7×7 input evenly, so the receptive field would run off the edge of the input. Programmers typically increase the stride when they want less overlap between receptive fields and smaller spatial dimensions.
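To make the fit argument concrete, here is a quick plain-Python check (my own illustration, not part of the original post) using the output-size formula formalized later in this section:

```python
# Output width of a convolution with no padding: (W - K) / S + 1.
# The stride S must divide (W - K) evenly for the filter to tile the input.
W, K = 7, 3  # 7x7 input, 3x3 filter
for S in (1, 2, 3):
    out = (W - K) / S + 1
    fits = (W - K) % S == 0
    print(f"stride {S}: output width = {out} ({'fits' if fits else 'does not fit'})")
# stride 1 -> 5.0 (fits), stride 2 -> 3.0 (fits), stride 3 -> 2.33... (does not fit)
```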
Now, let’s discuss padding. Consider applying three 5x5x3 filters to a 32x32x3 input volume. The resulting output volume is 28x28x3. The spatial dimensions shrink with every convolutional layer, and applied repeatedly this discards information, especially around the borders, faster than we would like in the early layers of the network.
To maintain the original spatial dimensions (32x32x3 in this case), we can apply zero-padding. Zero-padding adds a border of zeros around the input volume. A zero-padding of 2 would transform the 32x32x3 input volume into a 36x36x3 volume.
Illustration of zero-padding, where the input volume is padded with zeros around the border to control the output volume size.
With a stride of 1, setting the zero-padding size to P = (K − 1) / 2, where K is the filter size, ensures that the input and output volumes have the same spatial dimensions. For example, a 5×5 filter needs a padding of 2, and a 3×3 filter needs a padding of 1.
The general formula for calculating the spatial output size of a convolutional layer is:

O = (W − K + 2P) / S + 1

where:
- O = output height/length
- W = input height/length
- K = filter size
- P = padding
- S = stride
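As a sketch, the formula can be wrapped in a small helper to verify the examples above (the function name and defaults are my own, purely illustrative):

```python
def conv_output_size(W: int, K: int, P: int = 0, S: int = 1) -> float:
    """Spatial output size of a convolutional layer: O = (W - K + 2P) / S + 1."""
    return (W - K + 2 * P) / S + 1

# 32x32 input, 5x5 filter, no padding, stride 1 -> 28x28 output
print(conv_output_size(32, 5))        # 28.0
# The same setup with a zero-padding of 2 preserves the 32x32 spatial size
print(conv_output_size(32, 5, P=2))   # 32.0
# 7x7 input, 3x3 filter, stride 2 -> 3x3 output
print(conv_output_size(7, 3, S=2))    # 3.0
```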
Hyperparameter Optimization: Finding the Right Configuration
Determining the optimal number of layers, the number of convolutional layers, the filter sizes, and the stride and padding values is a complex challenge. There is no universally accepted standard, because the ideal network architecture depends heavily on the data: its size, the complexity of the images, and the specific image processing task all play a role. When selecting hyperparameters, look for a combination that creates abstractions of the image at an appropriate scale.
ReLU (Rectified Linear Units) Layers: Introducing Non-Linearity
Following each convolutional layer, it’s standard practice to apply a non-linear layer, also known as an activation layer. The purpose of this layer is to introduce non-linearity into the system, which primarily performs linear operations during the convolutional layers (element-wise multiplications and summations). Historically, non-linear functions like tanh and sigmoid were used, but research has shown that ReLU layers are superior. ReLU layers enable faster training (due to computational efficiency) without significantly impacting accuracy. They also mitigate the vanishing gradient problem, where the lower layers of the network train slowly due to the exponential decrease of the gradient through the layers. The ReLU layer applies the function f(x) = max(0, x) to all values in the input volume, effectively changing all negative activations to 0. This enhances the non-linear properties of the model and the overall network without affecting the receptive fields of the convolutional layer.
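As a minimal illustration of the operation (PyTorch is assumed here; the post does not prescribe a framework), ReLU simply zeroes out every negative value:

```python
import torch
import torch.nn as nn

# ReLU applies f(x) = max(0, x) element-wise, turning negative activations into 0
x = torch.tensor([[-2.0, 0.5],
                  [ 3.0, -1.5]])
print(nn.ReLU()(x))
# tensor([[0.0000, 0.5000],
#         [3.0000, 0.0000]])
```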
Pooling Layers: Downsampling and Feature Abstraction
After the ReLU layers, programmers may incorporate a pooling layer, also called a downsampling layer. Max pooling is the most common type: it slides a filter (typically 2×2) with a stride of the same length over the input volume and outputs the maximum value within each subregion the filter covers.
Illustration of max pooling, where the maximum value within each filter region is selected as the output.
Other pooling options include average pooling and L2-norm pooling. The rationale behind pooling layers is that once a specific feature has been detected in the original input volume (indicated by a high activation value), its precise location becomes less critical than its location relative to other features. This layer significantly reduces the spatial dimensions (length and width, but not depth) of the input volume, which serves two primary purposes: a 2×2 filter with a stride of 2 discards 75% of the activations, which shrinks the representation that later layers must process (and with it their parameter count and computational cost), and it helps control overfitting. Overfitting occurs when a model is tuned so closely to the training examples that it fails to generalize to the validation and test sets. A telltale sign of overfitting is a model that achieves near-perfect accuracy (99-100%) on the training set but performs poorly (e.g., 50% accuracy) on the test data.
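Here is a small sketch of 2×2 max pooling with a stride of 2 (again assuming PyTorch): each 2×2 block of the input collapses to its maximum, which quarters the number of activations, the 75% reduction mentioned above.

```python
import torch
import torch.nn as nn

# A 1x1x4x4 activation map (batch, depth, height, width)
x = torch.tensor([[[[1., 3., 2., 1.],
                    [4., 6., 5., 0.],
                    [7., 2., 9., 8.],
                    [0., 1., 3., 4.]]]])

pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x))
# tensor([[[[6., 5.],
#           [7., 9.]]]])   # each 2x2 block reduced to its maximum: 4x4 -> 2x2
```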
Dropout Layers: Preventing Overfitting
Dropout layers serve a specific purpose in neural networks: combating overfitting. As discussed, overfitting occurs when the network’s weights are too closely aligned with the training examples, leading to poor performance on new data. The concept of dropout is simple: this layer randomly “drops out” a set of activations by setting them to zero. This forces the network to be redundant, meaning it should be able to produce the correct classification or output even if some activations are missing. This prevents the network from becoming overly “fitted” to the training data, thereby alleviating overfitting. Importantly, dropout layers are only used during training, not during testing.
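A brief sketch of that training-only behavior, assuming PyTorch:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # randomly zero out 50% of activations during training
x = torch.ones(1, 8)

drop.train()               # training mode: activations are dropped (and rescaled)
print(drop(x))             # roughly half the entries become 0, survivors become 2.0

drop.eval()                # evaluation/test mode: dropout is a no-op
print(drop(x))             # all ones, unchanged
```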
Network in Network Layers: 1×1 Convolutions
A network in network layer uses a 1×1 filter in a convolutional layer. At first, the utility of this layer might seem questionable, since receptive fields are normally larger than the single location they map to. However, these 1×1 filters span the full depth of the input volume, so each one can be viewed as a 1x1xD convolution, where D is the input depth. At every spatial location, the filter performs a D-dimensional element-wise multiplication and sum across the channels; applying N such filters maps a depth-D volume to a depth-N volume without changing the spatial dimensions.
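A short sketch of this in PyTorch (my choice of framework and channel counts, not the post’s):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 28, 28)   # one 28x28 volume with a depth of 64
conv1x1 = nn.Conv2d(in_channels=64, out_channels=32, kernel_size=1)
y = conv1x1(x)
print(y.shape)                   # torch.Size([1, 32, 28, 28])
# Each output value is a weighted sum (plus bias) over the 64 input channels
# at a single spatial position: a mix across depth, not across space.
```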
Classification, Localization, Detection, Segmentation: Expanding Applications
Part 1 focused on image classification, where the goal is to input an image and output a class label from a predefined set of categories. However, object localization requires not only a class label but also a bounding box indicating the object’s location within the image.
An example of object localization, where the task is to identify the object and draw a bounding box around it.
Object detection extends localization by identifying and localizing all objects within an image, resulting in multiple bounding boxes and class labels.
Finally, object segmentation involves outputting a class label along with a precise outline of every object in the input image.
An example of object detection, where multiple objects are identified and localized within an image.
Transfer Learning: Leveraging Pre-trained Models
A common misconception is that creating effective deep learning models requires vast amounts of data. While data is critical, transfer learning has significantly reduced data requirements. Transfer learning involves taking a pre-trained model (a network whose weights and parameters have been trained on a large dataset) and “fine-tuning” it with your own dataset. The pre-trained model acts as a feature extractor. You remove the last layer and replace it with your own classifier (tailored to your specific problem). Then, you freeze the weights of the other layers and train the network. “Freezing” layers means preventing their weights from changing during gradient descent/optimization.
This approach works because the lower layers of a network trained on a dataset like ImageNet (14 million images with over 1,000 classes) learn to detect basic features like edges and curves. Unless your dataset and problem space are exceptionally unique, your network will likely need to detect these same features. Instead of training the entire network from scratch with random weight initialization, you can leverage the pre-trained model’s weights (freezing them) and focus training on the higher-level layers. If your dataset differs significantly from ImageNet, you may need to train more layers and only freeze a few of the lower layers.
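A minimal sketch of this fine-tuning recipe, assuming PyTorch and torchvision’s ResNet-18 as the pre-trained model (the post does not name a specific framework or architecture):

```python
import torch.nn as nn
from torchvision import models

# Load a network pre-trained on ImageNet to act as a feature extractor
# (older torchvision versions use pretrained=True instead of the weights argument)
model = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the existing layers so their weights do not change during optimization
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer with one sized for the new problem
num_classes = 10   # hypothetical number of classes in your own dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new layer's parameters are then passed to the optimizer and trained
```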
Data Augmentation Techniques: Expanding Datasets Artificially
Given the importance of data in ConvNets, let’s explore methods for expanding existing datasets with simple transformations. Computers process images as arrays of pixel values. Even a minor change, like shifting the entire image one pixel to the left, is imperceptible to humans but can be significant to a computer. Data augmentation techniques alter the training data in ways that change the array representation while preserving the label, artificially expanding the dataset. Common augmentations include grayscale conversions, horizontal flips, vertical flips, random crops, color jitters, translations, rotations, and more. Applying just a few of these transformations can easily double or triple the number of training examples.
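As a sketch, a training-time augmentation pipeline might look like the following (torchvision.transforms is assumed; the particular transforms and parameters are one possible choice, not the post’s prescription):

```python
from torchvision import transforms

# Each transform changes the pixel array while leaving the label untouched,
# so every training epoch effectively sees slightly different versions of each image.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomCrop(32, padding=4),   # random translation via a padded crop
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
])
# Pass this as the `transform` argument of a dataset,
# e.g. torchvision.datasets.CIFAR10(..., transform=train_transforms).
```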