Convolutional networkswere inspired by biological processes in which theconnectivity pattern between neurons is inspired by the organizationof the animal visual cortex. Individual cortical neurons respondto stimuli only in a restricted region of the visual field known asthe receptive field. The receptive fields of different neurons partiallyoverlap such that they cover the entire visual field.ConvolutionalNeural Networks recorded amazingly good performance in several tasks, includingdigit recognition, image classification and face recognition. The key ideabehind CNNs is to automatically learn a complex model that is able to extractvisual features from the pixel-level content, exploiting a sequence of simpleoperations such as filtering, local contrast normalization, non-linearactivation, local pooling.
Traditional methods use handcrafted features, that is, the feature extractionpipeline is the result of human intuitions and understanding of the raw data. Forinstance, the Viola-Jones 22 features come from the observation that theshape of a pedestrian is characterized by abrupt changes of pixel intensity inthe regions corresponding to the contour of the body.Conversely,Convolutional Neural Networks do not exploit human intuitions but only rely onlarge training datasets and a training procedure based on back propagation,coupled with an optimization algorithm such as gradient descent. The trainingprocedure aims at automatically learning both the weights of the filters, sothat they are able to extract visual concepts from the raw image content, and asuitable classifier. The first layers of the network typically identifylow-level concepts such as edges and details, whereas the final layers are ableto combine low-level features so as to identify complex visual concepts.
Convolutional Neural Networks are typically trained resorting to a supervisedprocedure that, besides learning ad-hoc features, defines a classifier as thelast layer of the network. Despite being powerful and effective, theinterpretability of such models is limited. Moreover, being very complex modelconsisting of up to hundreds of millions of parameters, CNNs need largeannotated training datasets to yield accurate results 23. In the context of pedestrian detection, thelast layer typically consists of just one neuron, and acts as a binaryclassifier that determines whether an input region depicts a pedestrian. Thehigher the output of such neuron, the higher the probability of thecorresponding region containing a pedestrian.
Binary classification is obtainedby properly thresholding the output score of such neuron.