Replica
[AlexNet] ImageNet Classification with Deep Convolutional Neural Networks
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton
import torch
import torch.nn as nn


class AlexNet(nn.Module):
    # Expects 3 x 227 x 227 inputs so that the final feature map is 256 x 6 x 6.
    def __init__(self, num_classes: int = 1000) -> None:
        super().__init__()
        self.features = nn.Sequential(
            # Conv1: 227 -> 55, then LRN and overlapping max-pool: 55 -> 27
            nn.Conv2d(3, 96, kernel_size=11, stride=4),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # Conv2: padding keeps 27 x 27, then LRN and max-pool: 27 -> 13
            nn.Conv2d(96, 256, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=0.0001, beta=0.75, k=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            # Conv3 to Conv5: padding keeps 13 x 13, final max-pool: 13 -> 6
            nn.Conv2d(256, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        # Flatten every dimension except the batch dimension
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x
Introduction
AlexNet is the breakout convolutional neural network (CNN) introduced by Alex Krizhevsky et al. in 2012. It achieved state-of-the-art top-1 and top-5 error rates of 37.5% and 17.0%, respectively, on the ILSVRC 2010 ImageNet test set. The model featured a deep architecture of five convolutional and three fully connected layers, ReLU (Rectified Linear Unit) activation functions, and dropout regularization to reduce overfitting.
Architecture
ReLU activations ($f(x) = \max(0, x)$) are used in favor of the standard tanh activation function ($f(x) = \tanh(x)$), as ReLU-based models can be trained several times faster than tanh-based models.
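The speed difference is usually explained by saturation: the gradient of tanh shrinks toward zero for large inputs, while the gradient of ReLU stays at 1 for any positive input. The following is a minimal sketch of that contrast in plain Python; the helper names `relu`, `relu_grad`, and `tanh_grad` are illustrative and not part of the model code.

```python
import math


def relu(x: float) -> float:
    """ReLU activation: f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x: float) -> float:
    """Derivative of tanh: 1 - tanh(x)^2, which vanishes as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


if __name__ == "__main__":
    for x in (0.5, 2.0, 5.0):
        print(f"x={x}: relu'={relu_grad(x)}, tanh'={tanh_grad(x):.6f}")
```

For an input of 5.0 the tanh gradient is already below 0.001, whereas the ReLU gradient is unchanged.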
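Local response normalization is given by the expression $b^i_{x,y} = a^i_{x,y} / \left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} (a^j_{x,y})^2\right)^{\beta}$, where $a^i_{x,y}$ is the ReLU output of kernel $i$ at position $(x, y)$, $N$ is the number of kernels in the layer, and the sum runs over $n$ adjacent kernel maps at the same position. The hyper-parameters are set as $k = 2$, $n = 5$, $\alpha = 10^{-4}$, and $\beta = 0.75$. Using response normalization reduces the top-1 and top-5 error rates by 1.4% and 1.2% respectively.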
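To make the normalization concrete, here is a small sketch that applies the paper's formula to the channel activations at a single spatial position. It is plain Python rather than the PyTorch layer used in the model, and the function name `local_response_norm` is an illustrative choice.

```python
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize a list of activations a[i], one per channel, at one position.

    Each activation is divided by (k + alpha * sum of squares over up to n
    neighbouring channels) ** beta, with the window clipped at the channel
    boundaries.
    """
    num_channels = len(a)
    half = n // 2
    out = []
    for i in range(num_channels):
        lo = max(0, i - half)
        hi = min(num_channels - 1, i + half)
        sq_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * sq_sum) ** beta)
    return out


if __name__ == "__main__":
    # A strong neighbour suppresses the response of the channel next to it.
    print(local_response_norm([1.0, 0.0]))
    print(local_response_norm([1.0, 100.0]))
```

Note that `torch.nn.LocalResponseNorm` is documented as scaling `alpha` by `1 / size` inside the sum, so its `alpha` argument may not correspond one-to-one to the paper's $\alpha$; treat this sketch as a reading of the paper's formula, not of the PyTorch layer.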
Pooling layers are used to summarize the feature maps of the convolutional layers. A pooling layer can be thought of as consisting of a grid of pooling units spaced $s$ pixels apart, each summarizing a neighborhood of size $z \times z$ centered at the grid unit's location. Overlapping pooling is used with $s = 2$ and $z = 3$, which reduces the top-1 and top-5 error rates by 0.4% and 0.3% respectively, compared with the non-overlapping scheme $s = 2$, $z = 2$.
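The difference between overlapping and non-overlapping pooling is easiest to see in one dimension. The sketch below is a plain-Python illustration with a made-up input row; `max_pool_1d` is a hypothetical helper, not part of the model code.

```python
def max_pool_1d(row, z=3, s=2):
    """Max-pool a 1D list with window size z and stride s (no padding)."""
    return [max(row[i:i + z]) for i in range(0, len(row) - z + 1, s)]


row = [3, 1, 7, 2, 0, 4, 6]

# z > s: adjacent windows share one element (overlapping pooling)
overlapping = max_pool_1d(row, z=3, s=2)

# z == s: windows tile the input without sharing elements
non_overlapping = max_pool_1d(row, z=2, s=2)

if __name__ == "__main__":
    print(overlapping)      # [7, 7, 6]
    print(non_overlapping)  # [3, 7, 4]
```

With $z = 3$ and $s = 2$, the value 7 at index 2 falls inside both the first and the second window, which is exactly the overlap the text describes.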
Response normalization layers follow the first and second convolutional layers, and each is in turn followed by a max-pooling layer. The fifth convolutional layer is also followed by a max-pooling layer.
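The layer ordering above determines the spatial size that reaches the classifier. The following sketch traces that size through the convolution and pooling layers using the standard output-size formula. It assumes a 227 x 227 input and paddings of 2, 1, 1, and 1 on the second to fifth convolutions; neither is stated in the text above, so treat both as assumptions. Normalization and ReLU layers are omitted because they do not change the spatial size.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding); paddings are assumed, see the lead-in
LAYERS = [
    ("conv1", 11, 4, 0),
    ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),
    ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),
    ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]


def trace_sizes(size=227):
    """Return a dict mapping each layer name to its output height/width."""
    sizes = {}
    for name, kernel, stride, padding in LAYERS:
        size = out_size(size, kernel, stride, padding)
        sizes[name] = size
    return sizes


if __name__ == "__main__":
    for name, size in trace_sizes().items():
        print(f"{name}: {size} x {size}")
```

Under these assumptions the final feature map is 6 x 6 with 256 channels, which matches the `256 * 6 * 6` input size of the first fully connected layer.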
Training
Training is pending while the datasets are downloaded. The model will be trained on the ILSVRC 2010 dataset for object classification and on the ILSVRC 2012 dataset for object localization and object detection.
Results
WIP