Chapter 4 Flashcards
What’s so hard about object recognition? (dogs)
If the input is simply the pixels of the image, then the program first has to figure out which are “dog” pixels and which are “non-dog” pixels
different dogs look very different
they can be facing in various directions
the lighting can vary considerably between images;
parts of the dog can be blocked by other objects
“dog pixels” might look a lot like “cat pixels” or the pixels of other animals.
Under some lighting conditions, a cloud in the sky might even look very much like a dog.
The deep-learning revolution
The ability of machines to recognize objects in images and videos underwent a quantum leap in the 2010s due to advances in the area called deep learning.
deep learning
methods for training deep neural networks
the most successful deep networks are those whose structure mimics parts of the brain’s visual system.
Fukushima - cognitron and neocognitron
inspired by David Hubel and Torsten Wiesel’s discoveries of hierarchical organization in visual systems
Fukushima reported some success training the neocognitron to recognize handwritten digits, but the specific learning methods he used did not seem to extend to more complex visual tasks.
the neocognitron was an important inspiration for later approaches to deep neural networks, including today’s most influential and widely used approach: convolutional neural networks
object recognition in the brain
- when the eyes focus on a scene ⇒ they receive light of different wavelengths that has been reflected by the objects and surfaces in the scene
- this light activates cells in each retina ⇒ these grids of neurons at the back of each eye communicate their activations through the optic nerves into the brain, eventually activating neurons in the visual cortex, which resides at the back of the head
- visual cortex is roughly organized as a hierarchical series of layers of neurons ⇒ neurons in each layer communicate their activations to neurons in the succeeding layer
object recognition
recognizing a particular group of pixels in an image as a particular object category (e.g. chair, dog, balloon etc.)
ConvNets
inspired by neocognitron
the driving force behind today’s deep learning revolution
first proposed by Yann LeCun in the 1980s
Object Recognition in ConvNets
neurons in different layers of this hierarchy act as “detectors” that respond to increasingly complex features appearing in the visual scene
there is a bottom-up (or feed-forward) flow of information, representing connections from lower to higher layers
units in each layer provide input to units in the next layer
ConvNet input
image ⇒ array of numbers corresponding to the brightness and color of the image’s pixels
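A minimal sketch of this input representation (NumPy; the pixel values and sizes are invented for illustration):

```python
import numpy as np

# A grayscale image is just a 2-D array of pixel brightness values
# (here scaled 0.0 to 1.0); a color image adds a third dimension with
# one channel each for red, green, and blue.
gray_image = np.array([
    [0.0, 0.2, 0.9],
    [0.1, 0.8, 0.9],
    [0.0, 0.7, 1.0],
])
color_image = np.zeros((224, 224, 3))   # height x width x RGB channels

print(gray_image.shape)    # (3, 3)
print(color_image.shape)   # (224, 224, 3)
```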
ConvNet output
network’s confidence (0 percent to 100 percent) for each category (“dog” and “cat”)
ConvNet goal
have the network learn to output a high confidence for the correct category and a low confidence for the other category
the network will learn what set of features of the input image are most useful for this task
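A numerical sketch of the output and goal cards above. Softmax and a one-hot “correct values” vector with a cross-entropy loss are standard choices assumed here for illustration; the cards themselves don’t specify them:

```python
import numpy as np

def softmax(scores):
    """Turn raw network scores into confidences that sum to 1."""
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

# Hypothetical raw scores for the two categories ("dog", "cat").
scores = np.array([2.0, -1.0])
confidences = softmax(scores)       # roughly [0.95, 0.05]
target = np.array([1.0, 0.0])       # the correct answer is "dog"

# The training goal: push the confidences toward the target, here
# measured by cross-entropy (lower loss = closer to correct).
loss = -np.sum(target * np.log(confidences))
print(confidences, loss)
```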
activation maps
in the book’s figure, each layer of the network is drawn as a set of three overlapping rectangles; these rectangles represent activation maps
units in a ConvNet act as detectors for important visual features
each unit looks for its designated feature in a specific part of the visual field
each layer in ConvNet consists of several grids of these units
each grid forms an activation map for a specific visual feature
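A sketch of what one layer’s output looks like as data: a stack of activation maps, one grid per visual feature (all sizes invented for illustration):

```python
import numpy as np

num_features, height, width = 3, 28, 28
layer_output = np.zeros((num_features, height, width))

# layer_output[k, i, j] says how strongly feature k appears in the
# receptive field near location (i, j) of the input.
print(layer_output.shape)   # (3, 28, 28)
```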
edge detectors
each neuron only looks at part of the visual scene (its receptive field)
the neuron becomes active only if its receptive field contains a particular kind of edge (e.g. horizontal or vertical)
these feed into higher-level processing regions, where neurons might detect certain shapes, objects, or faces
Each unit in each map calculates an activation value that measures the degree to which the region “matches” the unit’s preferred edge orientation
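A sketch of a single edge-detecting unit. The weights below form a vertical-edge detector, hand-set here for illustration; real ConvNets learn their weights from data:

```python
import numpy as np

weights = np.array([
    [1.0, 0.0, -1.0],
    [1.0, 0.0, -1.0],
    [1.0, 0.0, -1.0],
])

# A 3x3 patch from the unit's receptive field: bright on the left,
# dark on the right -- a vertical edge.
patch = np.array([
    [0.9, 0.5, 0.1],
    [0.9, 0.5, 0.1],
    [0.9, 0.5, 0.1],
])

activation = np.sum(weights * patch)
print(activation)   # about 2.4 -- high, because the patch matches the
                    # preferred edge; a uniform patch would give 0.0
```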
receptive field of a unit
Each unit in a map corresponds to the analogous location in the input image, and each unit gets its input from a small region around that location —its receptive field.
(The receptive fields of neighboring units typically overlap.)
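A sketch of those overlapping receptive fields: sliding a 3x3 field across a 5x5 image one pixel at a time yields a 3x3 grid of units, and neighboring units’ fields share most of their pixels (sizes invented for illustration):

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)
field = 3

patches = [
    image[i:i + field, j:j + field]    # receptive field of unit (i, j)
    for i in range(image.shape[0] - field + 1)
    for j in range(image.shape[1] - field + 1)
]
print(len(patches))                              # 9 units -> a 3x3 map
print(patches[0][:, 1:] == patches[1][:, :-1])   # neighbors overlap: all True
```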
convolution
multiplying each value in a receptive field by its corresponding weight and summing the results
Image patches inside receptive fields are arrays of pixel values.
Each unit receives as input the pixel values in its receptive field. The unit then multiplies each input by its weight and sums the results to produce the unit’s activation
A key to the ConvNet’s success is that—again, inspired by the brain—these maps are hierarchical
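Combining the last two cards, a sketch of a full convolution: one feature’s weights slide across the image, and each multiply-and-sum produces one entry of that feature’s activation map (plain NumPy loops for clarity; real libraries do this far more efficiently):

```python
import numpy as np

def convolve(image, weights):
    """Slide the weights over the image; each output value is one
    unit's activation: multiply the patch by the weights, then sum."""
    k = weights.shape[0]
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    activation_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + k, j:j + k]           # receptive field
            activation_map[i, j] = np.sum(patch * weights)
    return activation_map

weights = np.array([[1.0, 0.0, -1.0]] * 3)   # the vertical-edge weights again
image = np.random.rand(6, 6)                 # invented stand-in image
print(convolve(image, weights).shape)        # (4, 4) activation map
```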
classification module
layers 1 to 4 of the network are called convolutional layers because each performs convolutions on the output of the preceding layer
At this point, it’s time for the classification module to use these features to predict what object the image depicts.
The classification module is actually an entire traditional neural network
inputs to classification module are activation maps from the highest convolutional layer
output is a set of percentage values, one for each possible category, rating the network’s confidence that the input depicts an image of that category
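A sketch of the classification module: flatten the highest convolutional layer’s activation maps into one long vector, feed it through a fully connected layer, and read off percentages. All sizes and weights here are invented for illustration:

```python
import numpy as np

def softmax(scores):
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

maps = np.random.rand(8, 4, 4)            # 8 activation maps, each 4x4
features = maps.flatten()                 # 128 input values
W = np.random.randn(2, features.size)     # one row of weights per category
b = np.zeros(2)

confidences = softmax(W @ features + b)   # two values summing to 1.0
print({"dog": confidences[0], "cat": confidences[1]})
```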
explain what a ConvNet is
Inspired by Hubel and Wiesel’s findings on the brain’s visual cortex, a ConvNet takes an input image and transforms it, via convolutions, into a set of activation maps with increasingly complex features.
The features at the highest convolutional layers are fed into a traditional neural network (classification module), which outputs confidence percentages for the network’s known object categories.
The object category with the highest confidence is returned as the network’s classification of the image.
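A sketch of that whole pipeline in PyTorch (an assumed framework choice; the layer sizes are invented, not taken from the book):

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self, num_categories=2):
        super().__init__()
        self.features = nn.Sequential(       # convolutional layers
            nn.Conv2d(3, 16, kernel_size=3), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(     # classification module
            nn.Flatten(),
            nn.Linear(32 * 6 * 6, num_categories),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

net = TinyConvNet()
image = torch.rand(1, 3, 32, 32)                 # one 32x32 RGB image
confidences = torch.softmax(net(image), dim=1)   # one percentage per category
print(confidences)
```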
Training a ConvNet
in real-world ConvNets, edge detectors aren’t built in.
Instead, ConvNets learn from training examples what features should be detected at each layer and how to set the weights in the classification module
just as in traditional neural networks, all the weights can be learned from data via back-propagation
Training a ConvNet, step by step
collect a training set
label each image in the training set
Your training program initially sets all the weights in the network to random values.
one by one, each image is given as the input to the network; the network performs its layer-by-layer calculations and finally outputs confidence percentages for each of the categories
For each image, your training program compares these output values to the “correct” values
Then the training program uses the back-propagation algorithm to change the weights throughout the network just a bit, so that the next time this image is seen, the confidences will be closer to the correct values.
epoch
one pass through the entire training set: for each image, input it to the network, calculate the error at the output, then change the weights
Training a ConvNet requires many epochs, during which the network processes each image over and over again
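A sketch of that training recipe in PyTorch (assumed framework; the “training set” is random stand-in data and the model is a minimal invented ConvNet, not the book’s network). PyTorch initializes the weights randomly by default, matching the first step above:

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 30 * 30, 2),        # two categories: "dog", "cat"
)
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()       # compares outputs to the correct labels

images = torch.rand(10, 3, 32, 32)    # ten stand-in training images
labels = torch.randint(0, 2, (10,))   # 0 = "dog", 1 = "cat"

for epoch in range(5):                # one epoch = one pass over all images
    for image, label in zip(images, labels):
        output = net(image.unsqueeze(0))            # forward pass
        loss = loss_fn(output, label.unsqueeze(0))  # how wrong was it?
        optimizer.zero_grad()
        loss.backward()               # back-propagation computes gradients
        optimizer.step()              # change the weights just a bit
```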