In the late 1990s, Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner created a convolutional neural network (CNN) architecture called LeNet. The LeNet-5 architecture was developed to recognize handwritten and machine-printed characters, a task that showcased the potential of deep learning in practical applications. This article provides an in-depth exploration of the LeNet-5 architecture, examining each component and its contribution to deep learning.
Introduction to LeNet-5
LeNet-5 is a convolutional neural network (CNN) architecture that demonstrated the effectiveness of CNNs for image recognition and introduced several ideas that have become standard in modern deep learning, including convolution, pooling, and hierarchical feature extraction.
Originally designed for handwritten digit recognition, the principles behind LeNet-5 have been extended to various applications, including:
- Handwriting recognition in postal services and banking.
- Object and face recognition in images and videos.
- Autonomous driving systems for recognizing and interpreting road signs.
Architecture of LeNet-5
The LeNet-5 architecture contains seven layers, not counting the input layer. Here is a detailed breakdown:
1. Input Layer
- Input Size: 32x32 pixels.
- The input is larger than the largest character in the database, which is at most 20x20 pixels, centered in a 28x28 field. The larger input size ensures that distinctive features such as stroke endpoints or corners can appear in the center of the receptive field of the highest-level feature detectors.
- Normalization: Input pixel values are normalized so that the background (white) corresponds to a value of -0.1 and the foreground (black) corresponds to 1.175. This makes the mean input roughly 0 and the variance roughly 1, which accelerates learning. A small preprocessing sketch follows below.
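The sketch below is a minimal NumPy illustration of this preprocessing, not code from the original paper; the helper name and the assumption of [0, 255] ink-on-white input values are hypothetical. It rescales pixel values to the -0.1/1.175 range and pads a 28x28 field to 32x32 with the background value.

```python
import numpy as np

def preprocess(field_28x28: np.ndarray) -> np.ndarray:
    """Rescale a 28x28 grayscale field (0 = white background, 255 = black ink)
    so that background maps to -0.1 and full ink to 1.175, then pad to 32x32."""
    scaled = -0.1 + (field_28x28.astype(np.float32) / 255.0) * 1.275
    return np.pad(scaled, pad_width=2, mode="constant", constant_values=-0.1)

x = preprocess(np.zeros((28, 28), dtype=np.uint8))
print(x.shape)  # (32, 32)
```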
2. Layer C1 (Convolutional Layer)
- Feature Maps: 6 feature maps.
- Connections: Each unit is connected to a 5x5 neighborhood in the input, producing 28x28 feature maps; this size keeps the 5x5 receptive fields from falling off the input boundary.
- Parameters: 156 trainable parameters and 122,304 connections (verified in the snippet below).
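As a quick sanity check on these figures, a modern framework reproduces the parameter count directly; the PyTorch snippet below is illustrative rather than the original implementation.

```python
import torch.nn as nn

c1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)  # 32x32 input -> 6 x 28x28
print(sum(p.numel() for p in c1.parameters()))  # 6 * (5*5*1 + 1) = 156
print(28 * 28 * 6 * (5 * 5 + 1))                # 122,304 connections
```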
3. Layer S2 (Subsampling Layer)
- Feature Maps: 6 feature maps.
- Size: 14x14 (each unit connected to a 2x2 neighborhood in C1).
- Operation: Each unit sums its four inputs, multiplies the sum by a trainable coefficient, adds a trainable bias, and passes the result through a sigmoid function.
- Parameters: 12 trainable parameters and 5,880 connections.
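Because each map applies a single trainable coefficient and bias after summing its 2x2 window, this layer is not plain average pooling. A minimal PyTorch sketch of such a subsampling layer (hypothetical module, for illustration only):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Subsample(nn.Module):
    """Sum each 2x2 window, scale by a per-map coefficient, add a per-map bias,
    and squash with a sigmoid (one coefficient + one bias per feature map)."""

    def __init__(self, num_maps: int):
        super().__init__()
        self.coeff = nn.Parameter(torch.ones(1, num_maps, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_maps, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        window_sum = 4.0 * F.avg_pool2d(x, kernel_size=2, stride=2)  # sum of each 2x2 window
        return torch.sigmoid(self.coeff * window_sum + self.bias)

s2 = Subsample(num_maps=6)
print(sum(p.numel() for p in s2.parameters()))  # 6 coefficients + 6 biases = 12
```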
4. Layer C3 (Convolutional Layer)
- Feature Maps: 16 feature maps.
- Connections: Each unit is connected to several 5x5 neighborhoods at identical locations in a subset of S2’s feature maps.
- Partial Connectivity: C3 is not fully connected to S2. Limiting the connections keeps the number of parameters manageable and breaks symmetry, forcing the feature maps to learn different, complementary features.
- Parameters and Connections: 1,516 trainable parameters and 151,600 connections (see the check below).
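Under the connection scheme used in the original paper (six C3 maps each see three S2 maps, nine see four, and one sees all six), the parameter and connection counts work out as follows:

```python
# Each input map contributes a 5x5 kernel; each C3 map adds one bias.
params = 6 * (3 * 25 + 1) + 6 * (4 * 25 + 1) + 3 * (4 * 25 + 1) + 1 * (6 * 25 + 1)
print(params)            # 1516 trainable parameters
print(params * 10 * 10)  # 151,600 connections (each C3 map is 10x10)
```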
5. Layer S4 (Subsampling Layer)
- Feature Maps: 16 feature maps.
- Size: 5x5 (each unit connected to a 2x2 neighborhood in C3).
- Parameters: 32 trainable parameters and 2,000 connections.
6. Layer C5 (Convolutional Layer)
- Feature Maps: 120 feature maps.
- Size: 1x1 (each unit connected to a 5x5 neighborhood on all 16 of S4’s feature maps, effectively fully connected due to input size).
- Parameters: 48,120 trainable parameters and 48,120 connections (a quick arithmetic check follows below).
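The count follows directly from the geometry: 120 units, each with 16 x 5 x 5 weights plus a bias.

```python
print(120 * (16 * 5 * 5 + 1))  # 48,120 parameters; equal to the number of
                               # connections because each C5 map is 1x1
```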
7. Layer F6 (Fully Connected Layer)
- Units: 84 units.
- Connections: Each unit is fully connected to C5, resulting in 10,164 trainable parameters.
- Activation: Uses a scaled hyperbolic tangent function f(a) = A \tanh(Sa), where A = 1.7159 and S = 2/3 (see the helper below).
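Written out, the activation is a one-liner; the PyTorch helper below is a hypothetical illustration, not code from the paper.

```python
import torch

def scaled_tanh(a: torch.Tensor) -> torch.Tensor:
    """F6 activation: f(a) = A * tanh(S * a) with A = 1.7159 and S = 2/3."""
    return 1.7159 * torch.tanh((2.0 / 3.0) * a)
```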
8. Output Layer
In the output layer of LeNet-5, each class is represented by a Euclidean Radial Basis Function (RBF) unit. Here's how the output y_i of each RBF unit is computed:
y_i = \sum_{j} (x_j - w_{ij})^2
In this equation:
- x_j represents the inputs to the RBF unit (the outputs of F6).
- w_{ij} represents the components of the unit's parameter vector.
- The summation runs over all inputs to the RBF unit.
In essence, the output of each RBF unit is the squared Euclidean distance between its input vector and its parameter vector. The larger the distance between the input pattern and the parameter vector, the larger the RBF output. This output can be interpreted as a penalty term measuring how well the input pattern fits the model of the class associated with the RBF unit.
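A minimal sketch of such an output layer, written as a hypothetical PyTorch module: in the paper the parameter vectors are fixed, stylized 7x12 character bitmaps, whereas here they are randomly initialized purely for illustration.

```python
import torch
import torch.nn as nn

class RBFOutput(nn.Module):
    """Each class i holds an 84-dimensional parameter vector w_i; the output y_i
    is the squared Euclidean distance between the F6 activation x and w_i,
    so smaller outputs indicate a better match."""

    def __init__(self, in_features: int = 84, num_classes: int = 10):
        super().__init__()
        # Randomly initialized here; the paper uses fixed character-bitmap codes.
        self.weight = nn.Parameter(torch.randn(num_classes, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 84) -> y: (batch, num_classes)
        return ((x.unsqueeze(1) - self.weight.unsqueeze(0)) ** 2).sum(dim=-1)
```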
Detailed Explanation of the Layers
- Convolutional Layers (Cx): These layers apply convolution operations to the input, using multiple filters to extract different features. The filters slide over the input image, computing the dot product between the filter weights and the input pixels. This process captures spatial hierarchies of features, such as edges and textures.
- Subsampling Layers (Sx): These layers perform pooling operations (a trainable variant of average pooling in the case of LeNet-5) to reduce the spatial dimensions of the feature maps. This helps to control overfitting, reduce the computational load, and make the representation more compact.
- Fully Connected Layers (Fx): These layers are densely connected, meaning each neuron in these layers is connected to every neuron in the previous layer. This allows the network to combine features learned in previous layers to make final predictions.
The overall architecture of LeNet-5, with its combination of convolutional, subsampling, and fully connected layers, was designed to be both computationally efficient and effective at capturing the hierarchical structure of handwritten digit images. The careful normalization of input values and the structured layout of receptive fields contribute to the network's ability to learn and generalize from the training data effectively.
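Putting the pieces together, the sketch below is a compact, modern PyTorch approximation of LeNet-5 rather than a faithful reimplementation: C3 is fully connected to S2 instead of using the paper's partial connection table, plain average pooling stands in for the trainable subsampling, tanh replaces the scaled sigmoid, and a linear layer replaces the RBF output, as is common in present-day variants.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """A modern approximation of LeNet-5 (fully connected C3, linear output)."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),     # C1: 32x32 -> 6 x 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                    # S2: 6 x 14x14
            nn.Conv2d(6, 16, kernel_size=5),    # C3: 16 x 10x10 (fully connected here)
            nn.Tanh(),
            nn.AvgPool2d(2),                    # S4: 16 x 5x5
            nn.Conv2d(16, 120, kernel_size=5),  # C5: 120 x 1x1
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84),                 # F6
            nn.Tanh(),
            nn.Linear(84, num_classes),         # output (in place of the RBF layer)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = LeNet5()
dummy = torch.randn(1, 1, 32, 32)  # one 32x32 grayscale input
print(model(dummy).shape)          # torch.Size([1, 10])
```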