[AlexNet] ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton

Model
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    def __init__(self, num_classes: int = 1000) -> None:
        super().__init__()

        # Convolutional feature extractor. Expects a 3 x 227 x 227 input
        # (the paper's stated 224 x 224 needs an effective size of 227
        # to produce the 55 x 55 output of the first convolutional layer).
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4),                   # -> 96 x 55 x 55
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),                        # -> 96 x 27 x 27

            nn.Conv2d(96, 256, kernel_size=5, padding=2),                 # -> 256 x 27 x 27
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),                        # -> 256 x 13 x 13

            nn.Conv2d(256, 384, kernel_size=3, padding=1),                # -> 384 x 13 x 13
            nn.ReLU(inplace=True),

            nn.Conv2d(384, 384, kernel_size=3, padding=1),                # -> 384 x 13 x 13
            nn.ReLU(inplace=True),

            nn.Conv2d(384, 256, kernel_size=3, padding=1),                # -> 256 x 13 x 13
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                        # -> 256 x 6 x 6
        )

        # Fully connected classifier with dropout regularization, as in the paper.
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        x = torch.flatten(x, 1)   # flatten all dimensions except the batch dimension
        x = self.classifier(x)
        return x
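
A quick sanity check of the module above with a random batch (a sketch, assuming the 227 × 227 effective input size noted in the code comments):

# Sketch: verify the forward pass and output shape with a dummy batch.
model = AlexNet(num_classes=1000)
dummy = torch.randn(8, 3, 227, 227)   # batch of 8 RGB images, 227 x 227
logits = model(dummy)
print(logits.shape)                   # torch.Size([8, 1000])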

Introduction

AlexNet is the breakout convolutional neural network (CNN) introduced by Alex Krizhevsky et al. in 2012. It achieved state-of-the-art top-1 and top-5 error rates of 37.5% and 17.0%, respectively, on the ILSVRC-2010 ImageNet dataset. The model featured a much deeper architecture than previous approaches, ReLU (Rectified Linear Unit) activation functions, and dropout regularization to prevent overfitting.

Architecture

ReLU activations ($f(x) = \max(0, x)$) are used in favor of the standard tanh activation function ($f(x) = \tanh(x)$), as ReLU-based models can be trained several times faster than tanh-based models.
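
A minimal sketch (not from the replica code) comparing the gradients of the two activations, since ReLU does not saturate for positive inputs while tanh saturates for large $|x|$:

import torch

x = torch.tensor([-4.0, -1.0, 0.5, 4.0], requires_grad=True)

torch.relu(x).sum().backward()
print(x.grad)   # tensor([0., 0., 1., 1.])  -- constant gradient for x > 0

x.grad = None
torch.tanh(x).sum().backward()
print(x.grad)   # approx. tensor([0.0013, 0.4200, 0.7864, 0.0013]) -- vanishes for large |x|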

Local response normalization is given by the expression $b_c = a_c \left(k + \frac{\alpha}{n} \sum_{c'=\max(0,\,c-n/2)}^{\min(N-1,\,c+n/2)} (a_{c'})^2\right)^{-\beta}$, where the hyper-parameters are set to $k = 2$, $n = 5$, $\alpha = 10^{-4}$ and $\beta = 0.75$. Using response normalization reduces the top-1 and top-5 error rates by 1.4% and 1.2% respectively.
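
As a small sanity check (a sketch, not part of the replica itself), PyTorch's `nn.LocalResponseNorm` implements this expression, and a manual computation at an interior channel should agree with it:

import torch
import torch.nn as nn

torch.manual_seed(0)
a = torch.randn(1, 96, 55, 55)                        # activations after conv1 + ReLU
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2)
b = lrn(a)

# Manual check at channel c = 10, spatial position (0, 0).
c, n, k, alpha, beta = 10, 5, 2.0, 1e-4, 0.75
window = a[0, c - n // 2 : c + n // 2 + 1, 0, 0]      # channels c-2 .. c+2
manual = a[0, c, 0, 0] * (k + alpha / n * window.pow(2).sum()) ** (-beta)
print(torch.isclose(b[0, c, 0, 0], manual))           # tensor(True)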

Pooling layers are used to summarize the feature maps of the convolutional layers. A pooling layer can be thought of as a grid of pooling units spaced $s$ pixels apart, each summarizing a neighborhood of size $z \times z$ centered at the grid unit's location. Overlapping pooling is used with $s = 2$ and $z = 3$, which reduces the top-1 and top-5 error rates by 0.4% and 0.3% respectively.
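
As a sketch (using the 55 × 55 conv1 output size from the paper), the overlapping scheme $s = 2$, $z = 3$ produces the same output size as the non-overlapping $s = z = 2$ scheme that the paper compares against; only the neighborhoods overlap:

import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)                        # conv1 output size from the paper
overlapping = nn.MaxPool2d(kernel_size=3, stride=2)   # z = 3, s = 2: s < z, neighborhoods overlap
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)   # z = s = 2: no overlap

print(overlapping(x).shape)       # torch.Size([1, 96, 27, 27])
print(non_overlapping(x).shape)   # torch.Size([1, 96, 27, 27])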

Response normalization layers follow the first and second convolutional layers, and each is followed by a max-pooling layer. The fifth convolutional layer is also followed by a max-pooling layer.

Model diagram

Training

Waiting for the datasets to finish downloading. The model will be trained on the ILSVRC 2010 dataset for object classification and on the ILSVRC 2012 dataset for object localization and object detection.

Results

WIP