[AlexNet] ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton

import torch
import torch.nn as nn

class AlexNet(nn.Module):
    def __init__(self, num_classes=1000) -> None:
        super(AlexNet, self).__init__()

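        # Feature extractor: five convolutional layers, each followed by ReLU.
        # Padding is an assumption, chosen so that a 3 x 227 x 227 input
        # yields the 256 x 6 x 6 feature map expected by the classifier.
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),

            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),

            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),

            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),

            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )

        # Classifier: three fully connected layers, with dropout (p=0.5)
        # applied before the first two to reduce overfitting.
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )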

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x


AlexNet is the breakthrough convolutional neural network (CNN) introduced by Alex Krizhevsky et al. in 2012. It achieved state-of-the-art top-1 and top-5 test error rates of 37.5% and 17.0%, respectively, on the ImageNet ILSVRC-2010 dataset. The model featured a much deeper architecture than was common at the time (five convolutional layers followed by three fully connected layers), ReLU (Rectified Linear Unit) activation functions, and dropout regularization to reduce overfitting.


ReLU activations ($f(x) = \max(0, x)$) are used in place of the then-standard tanh activation function ($f(x) = \tanh(x)$), because ReLU-based networks train several times faster than their tanh-based equivalents.
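A common explanation for the speed-up is that tanh saturates: its gradient shrinks toward zero for large inputs, while the ReLU gradient stays at 1 for any positive input. The following is a minimal sketch of that contrast using only the standard library; the helper names `relu`, `relu_grad`, and `tanh_grad` are illustrative and not taken from the paper.

```python
import math


def relu(x):
    """ReLU activation: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which vanishes as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


# Compare the two gradients as the input grows.
for x in (0.5, 2.0, 5.0):
    print(f"x={x}: relu'={relu_grad(x)}, tanh'={tanh_grad(x):.6f}")
```

As `x` grows, the tanh gradient decays rapidly while the ReLU gradient does not, so weight updates driven by large positive pre-activations are not attenuated.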

Local response normalization is given by the expression

$$b_c = a_c \left(k + \frac{\alpha}{n} \sum_{c'=\max(0,\,c-n/2)}^{\min(N-1,\,c+n/2)} (a_{c'})^2\right)^{-\beta}$$

where $a_c$ is the activation in channel $c$ at a given spatial position, $N$ is the number of channels, and the hyper-parameters are set to $k = 2$, $n = 5$, $\alpha = 10^{-4}$, and $\beta = 0.75$. Using response normalization reduces the top-1 and top-5 error rates by 1.4% and 1.2% respectively.
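To make the expression concrete, here is a minimal pure-Python sketch that applies it to the channel activations at a single spatial position. The function name `local_response_norm` is hypothetical, and $n/2$ is assumed to mean integer division (2 neighbours on each side for $n = 5$). The default `beta=0.75` follows the value passed to `nn.LocalResponseNorm` in the listing above.

```python
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize a list of per-channel activations at one spatial position."""
    num_channels = len(a)
    half = n // 2  # assumption: n/2 is taken as integer division
    out = []
    for c in range(num_channels):
        lo = max(0, c - half)
        hi = min(num_channels - 1, c + half)
        # Sum of squared activations over the neighbouring channels.
        sq_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[c] * (k + alpha / n * sq_sum) ** (-beta))
    return out


activations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
normalized = local_response_norm(activations)
print(normalized)
```

Each output is the input scaled down by a factor that depends on the squared activity of neighbouring channels, so a channel surrounded by strongly active neighbours is damped more.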

Pooling layers are used to summarize the feature maps of the convolutional layers. A pooling layer can be thought of as consisting of a grid of pooling units spaced $s$ pixels apart, each summarizing a neighborhood of size $z \times z$ centered at the grid unit's location. Overlapping pooling is used with $s = 2$ and $z = 3$, which reduces the top-1 and top-5 error rates by 0.4% and 0.3% respectively.
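The effect of choosing $s < z$ can be seen in a one-dimensional sketch. The helpers `pool_output_size` and `max_pool_1d` below are illustrative names, and the example assumes no padding, so windows are aligned to their left edge rather than centered.

```python
def pool_output_size(size, z, s):
    """Number of pooling units along one dimension (no padding)."""
    return (size - z) // s + 1


def max_pool_1d(values, z, s):
    """Max-pool a 1-D list with window size z and stride s."""
    return [max(values[i:i + z]) for i in range(0, len(values) - z + 1, s)]


row = [1, 2, 9, 3, 4, 5, 6]

# Overlapping pooling (z=3, s=2): adjacent windows share one element.
overlapping = max_pool_1d(row, z=3, s=2)

# Non-overlapping pooling (z=2, s=2): windows are disjoint.
non_overlapping = max_pool_1d(row, z=2, s=2)

print(overlapping, non_overlapping)
```

With $z = 3$ and $s = 2$, the value 9 falls inside two adjacent windows, which is what "overlapping" means here. The same size formula gives `pool_output_size(55, 3, 2) == 27` for a 55-pixel-wide feature map.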

Response normalization layers follow the first and second convolutional layers, and each is followed by a max-pooling layer. The fifth convolutional layer is also followed by a max-pooling layer.
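The layer ordering above determines the spatial size of the feature map that reaches the classifier. The sketch below traces that size through the convolution and pooling layers. It assumes a 227 x 227 input and size-preserving padding on the second through fifth convolutions, neither of which is stated in the text. The names `conv_out` and `trace_feature_sizes` are hypothetical. Normalization layers are omitted because they do not change the shape.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size along one dimension for a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_feature_sizes(input_size=227):
    """Return (layer name, spatial size) pairs through the feature extractor."""
    # (name, kernel, stride, padding); padding values are assumptions.
    layers = [
        ("conv1", 11, 4, 0),
        ("pool1", 3, 2, 0),
        ("conv2", 5, 1, 2),
        ("pool2", 3, 2, 0),
        ("conv3", 3, 1, 1),
        ("conv4", 3, 1, 1),
        ("conv5", 3, 1, 1),
        ("pool5", 3, 2, 0),
    ]
    sizes = []
    size = input_size
    for name, kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
        sizes.append((name, size))
    return sizes


sizes = trace_feature_sizes()
for name, size in sizes:
    print(f"{name}: {size} x {size}")
```

Under these assumptions the final map is 6 x 6 with 256 channels, which matches the `256 * 6 * 6` input size of the first linear layer.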

