were inspired by biological processes: the
connectivity pattern between neurons resembles the organization
of the animal visual cortex. Individual cortical neurons respond
to stimuli only in a restricted region of the visual field known as
the receptive field. The receptive fields of different neurons partially
overlap such that they cover the entire visual field.
Convolutional Neural Networks have achieved remarkable performance in several
tasks, including digit recognition, image classification, and face recognition.
The key idea
behind CNNs is to automatically learn a complex model that is able to extract
visual features from the pixel-level content, exploiting a sequence of simple
operations such as filtering, local contrast normalization, non-linear
activation, and local pooling. Traditional methods instead use handcrafted
features, that is, the feature extraction pipeline is the result of human
intuition and an understanding of the raw data. For
instance, the Viola-Jones features [22] come from the observation that the
shape of a pedestrian is characterized by abrupt changes of pixel intensity in
the regions corresponding to the contour of the body.
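The sequence of simple operations mentioned above can be sketched in a few lines of NumPy. This is a minimal illustrative pipeline, not the implementation of any specific network: a single hand-written edge filter, a whole-map contrast normalization, a ReLU activation, and non-overlapping max pooling, applied to one single-channel image.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with one filter."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def local_contrast_normalize(x, eps=1e-5):
    """Subtract the mean and divide by the standard deviation of the map."""
    return (x - x.mean()) / (x.std() + eps)

def relu(x):
    """Non-linear activation: keep positive responses, zero out the rest."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling over size x size windows."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((8, 8))
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)  # simple vertical-edge detector

features = max_pool(relu(local_contrast_normalize(conv2d(image, edge_filter))))
print(features.shape)  # (3, 3)
```

In a real CNN each layer applies many such filters in parallel and the filter weights are learned rather than hand-designed, but the building blocks are exactly these.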
Convolutional Neural Networks do not exploit human intuitions but only rely on
large training datasets and a training procedure based on backpropagation,
coupled with an optimization algorithm such as gradient descent. The training
procedure aims at automatically learning both the weights of the filters, so
that they are able to extract visual concepts from the raw image content, and a
suitable classifier. The first layers of the network typically identify
low-level concepts such as edges and details, whereas the final layers are able
to combine low-level features so as to identify complex visual concepts.
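How filter weights can be learned by gradient descent is easiest to see on a deliberately tiny, hypothetical example: a single one-dimensional filter trained to reproduce the responses of an unknown target filter. The setup and names below are illustrative only; they do not correspond to any specific detector.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy supervision: an unknown 3-tap filter generates the target responses.
true_filter = np.array([1.0, -2.0, 1.0])
w = np.zeros(3)        # filter weights to be learned
lr = 0.01              # gradient-descent step size

def conv1d_valid(x, k):
    """Valid 1-D cross-correlation of signal x with kernel k."""
    n = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n)])

for step in range(2000):
    x = rng.standard_normal(16)             # random training signal
    y_true = conv1d_valid(x, true_filter)   # desired responses
    y_pred = conv1d_valid(x, w)             # current responses
    err = y_pred - y_true                   # dL/dy for L = 0.5 * sum(err**2)
    # Backpropagation: gradient of the loss w.r.t. each filter weight,
    # since y_pred[i] = sum_j w[j] * x[i + j].
    grad = np.array([np.dot(err, x[j:j + len(err)]) for j in range(len(w))])
    w -= lr * grad                          # gradient-descent update

print(np.round(w, 3))  # close to [ 1. -2.  1.]
```

A CNN trained for pedestrian detection works on the same principle, only with two-dimensional filters stacked in many layers and the error signal propagated backwards from the classification loss through the entire network.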
Convolutional Neural Networks are typically trained resorting to a supervised
procedure that, besides learning ad-hoc features, defines a classifier as the
last layer of the network. Despite being powerful and effective, the
interpretability of such models is limited. Moreover, being very complex models
consisting of up to hundreds of millions of parameters, CNNs need large
annotated training datasets to yield accurate results [23].
In the context of pedestrian detection, the
last layer typically consists of just one neuron, and acts as a binary
classifier that determines whether an input region depicts a pedestrian. The
higher the output of this neuron, the higher the probability that the
corresponding region contains a pedestrian. Binary classification is obtained
by thresholding the output score of this neuron.
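The decision rule can be made concrete with a short sketch. Here the raw neuron output is mapped to a probability with a sigmoid (a common choice, assumed here rather than stated in the text) and compared against a threshold; the scores are made up for illustration.

```python
import math

def pedestrian_probability(score):
    """Map the raw output of the last neuron to a probability via the sigmoid."""
    return 1.0 / (1.0 + math.exp(-score))

def is_pedestrian(score, threshold=0.5):
    """Binary decision: probability of the region containing a pedestrian
    is compared against a detection threshold."""
    return pedestrian_probability(score) >= threshold

print(is_pedestrian(2.3))   # high raw score -> detection
print(is_pedestrian(-1.7))  # low raw score -> no detection
```

Raising the threshold trades recall for precision: fewer regions are reported as pedestrians, but those that are reported carry a higher confidence.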