Lecture 6: Deep Learning, CNNs Flashcards
What are the 3 major breakthroughs in deep learning?
1.Speech Recognition & machine translation (2010+)
2.Image Recognition & computer vision(2012+)
3.Natural language processing (2014+)
How to compute input to hidden?
1.compute the net activation: net_h = x*W + b
x–>inputs
W–>weights
b–>bias weights
2.apply the activation function: h = S(net_h)
How to compute hidden to output?
o = S(h*W + b), using a separate weight matrix W and bias b for the hidden-to-output layer
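The two steps above can be sketched in numpy. This is a minimal illustration, assuming a sigmoid activation S and hypothetical layer sizes (3 inputs, 4 hidden units, 2 outputs); the weights are random stand-ins for learned values.

```python
import numpy as np

def S(z):                                # sigmoid activation function
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
x  = rng.standard_normal(3)              # inputs
W1 = rng.standard_normal((3, 4))         # input-to-hidden weights
b1 = np.zeros(4)                         # hidden bias weights
W2 = rng.standard_normal((4, 2))         # hidden-to-output weights
b2 = np.zeros(2)

net_h = x @ W1 + b1                      # net activation of the hidden layer
h = S(net_h)                             # hidden activations
o = S(h @ W2 + b2)                       # output activations
```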
What are the three initial drawbacks?
1.Standard backpropagation with sigmoid activation does not scale well to multiple layers
2.Overfitting
3.Multilayered ANNs need lots of labeled data
What are the two types of problems when multiplying the gradients many times for each layer?
1.Vanishing gradient problem
2.Exploding gradient problem
What does the vanishing gradient problem consist of(4)?
-gradients shrink exponentially with nb of layers–>weight updates get smaller–>weights of early layers change very slowly –> learning very slow
What does the exploding gradient problem consist of(3)?
-multiplying gradients makes them grow exponentially–>weight updates get larger and larger–>weights become so large as to overflow and result in NaN values
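The shrinkage in the vanishing case can be seen numerically: the sigmoid derivative S'(z) = S(z)(1 - S(z)) is at most 0.25, and backpropagation multiplies one such factor per layer. A minimal sketch, assuming a 20-layer network evaluated at the point of maximum slope:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.0                                  # point where the sigmoid is steepest
deriv = sigmoid(z) * (1 - sigmoid(z))    # 0.25, the maximum possible value

# Backprop multiplies one derivative factor per layer,
# so the gradient reaching early layers shrinks exponentially.
grad = 1.0
for layer in range(20):                  # a 20-layer network
    grad *= deriv
print(grad)                              # 0.25**20 ~ 9.1e-13: effectively zero
```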
What are the two solutions to initial drawback #1?
1.Use other activation functions
2.Do gradient clipping: set bounds on the gradients
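Gradient clipping can be sketched as follows; this version clips by L2 norm, one common way of bounding the gradients (clipping each component to a fixed range is another).

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)  # shrink, keeping the direction
    return grad

g = np.array([300.0, -400.0])            # an exploding gradient, norm = 500
print(np.linalg.norm(clip_gradient(g)))  # 5.0: bounded, direction preserved
```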
What is overfitting(second initial drawback)?
-Large network–> lots of parameters–>increased capacity to learn by heart
What are the 2 solutions to overfitting?
1.Regularization
2.Dropout
What does regularization consist of?
-modify the error function that we minimize to penalize large weights
What does dropout consist of(2)?
-keep a neuron active with some probability p, or set it to 0 otherwise
-prevents the network from becoming too dependent on any one neuron
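A sketch of dropout on a layer of activations. Assumption: this uses the "inverted dropout" variant, which rescales the survivors by 1/p so expected activations are unchanged at test time; the lecture's exact formulation may differ.

```python
import numpy as np

def dropout(h, p=0.5, rng=np.random.default_rng(0)):
    """Keep each activation with probability p, zero it otherwise.
    Scaling by 1/p (inverted dropout, an assumption here) keeps the
    expected value of each activation unchanged."""
    mask = rng.random(h.shape) < p       # True with probability p
    return h * mask / p

h = np.ones(10)
print(dropout(h))  # roughly half the entries zeroed, survivors scaled up
```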
What is the problem with the third initial drawback?
Most data is not labeled
What is the solution to the third initial drawback?
Pre-train the network with features found automatically via unsupervised learning on unlabeled data –> automatic feature learning
What is Classic ML?
-Manual extraction of features
What does classic ML require?
-Labeled data and hand-crafted features
What are 3 cons of classic ML?
-Needs expert knowledge
-Time-consuming and expensive
-Does not generalize to other domains
What does automatic feature learning consist of?
Each layer learns more abstract features that are then combined/composed into higher-level features automatically
What are 3 pros of automatic feature learning?
-We feed the network the raw data
-The features are learned by the network
-Features learned can be re-used in similar tasks
What are 5 advantages of unsupervised feature learning?
-more unlabeled data available than labeled data
-Humans learn first from unlabeled examples
-less risk of over-fitting
-no need for manual feature engineering
-features are organized into multiple layers : each level creates new features from combinations of features from level below + more abstract than the ones below (hierarchy of features)
What are the 2 steps of the general architectures of a deep network?
1.Unsupervised pre-training of the neural network using unlabeled data, e.g. an autoencoder
2.Supervised training with labeled data using the features learned above with a standard classifier, e.g. an ANN
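The two-step recipe above can be sketched with a toy linear autoencoder in numpy. Everything here is a hypothetical illustration (random data, a linear encoder/decoder, plain gradient descent); real pre-training would use nonlinear autoencoders or deep belief networks as the flashcards note.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))       # unlabeled data: 100 samples, 8 features

# --- Step 1: unsupervised pre-training (linear autoencoder, 8 -> 3 -> 8) ---
W_enc = rng.standard_normal((8, 3)) * 0.1   # encoder weights: learned features
W_dec = rng.standard_normal((3, 8)) * 0.1   # decoder weights: reconstruction
lr = 0.01
for _ in range(500):
    H = X @ W_enc                        # encode
    X_hat = H @ W_dec                    # decode: try to reconstruct the input
    err = X_hat - X                      # reconstruction error
    W_dec -= lr * H.T @ err / len(X)     # gradient step on the decoder
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)  # gradient step on the encoder

# --- Step 2: reuse the learned encoder as a feature extractor ---
features = X @ W_enc                     # 3 learned features per sample
# ...feed `features` plus labels to a standard supervised classifier (e.g. an ANN)
```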
What are 2 ways to learn a representation of the data(1st step)?
-Deep Belief Networks (mid 2000s)
-Autoencoders(2006)
What is a CNN?
Convolutional Neural Network
What does the convolutional layer consist of?
-Uses a filter/kernel that convolves on the image
-The filter is a small weight matrix to learn
What is the objective of the convolutional layer?
The network learns the values of the filter(s) that activate when they see some visual feature that is useful to identify the object(final classification)
What are the 2 convolution hyper-parameters?
1.Stride
2.Padding
What is stride?
The number of pixels the filter moves at each step
What is padding?
Adding extra values (usually zeros) around the border of the image, so the filter can also be applied at the edges and the size of the output can be controlled
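Both hyper-parameters can be seen in a minimal numpy sketch of a 2D convolution. As in most deep learning libraries, this is technically a cross-correlation (the filter is not flipped); image size, filter, stride, and padding values are hypothetical.

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """2D convolution with a stride and zero padding."""
    image = np.pad(image, pad)                   # zeros around the border
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1     # output height
    ow = (image.shape[1] - kw) // stride + 1     # output width
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)   # elementwise product, then sum
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3))
print(conv2d(img, k).shape)                   # (2, 2): no padding, stride 1
print(conv2d(img, k, stride=1, pad=1).shape)  # (4, 4): padding preserves size
```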
What are pooling layers used for(2)?
-to reduce the size of the activation maps
-so that we reduce the nb of parameters of the network and avoid overfitting
What is max pooling?
Similar to the convolution step, but instead of an elementwise multiply-and-sum with a filter, max pooling takes the maximum value within the window.
What is average pooling?
Taking the average value over an input window for each channel of the input.
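Both pooling variants can be sketched together. This assumes non-overlapping windows (stride equal to window size, a common default) on a single-channel input.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling: size x size windows, stride = size."""
    h, w = x.shape[0] // size, x.shape[1] // size
    windows = x[:h*size, :w*size].reshape(h, size, w, size)
    if mode == "max":
        return windows.max(axis=(1, 3))          # max pooling
    return windows.mean(axis=(1, 3))             # average pooling

a = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [0., 0., 1., 1.],
              [0., 4., 1., 1.]])
print(pool2d(a, mode="max"))   # [[4. 8.] [4. 1.]]
print(pool2d(a, mode="avg"))   # [[2.5 6.5] [1.  1. ]]
```

Note how a 4x4 activation map shrinks to 2x2, which is exactly the parameter reduction the flashcards mention.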
What is the architecture of a CNN?
1.Stack:
-convolutional layers
-pooling layers
2.Finish off with a fully connected layer at the end for final classification
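The whole stack can be put together in a toy forward pass. Everything here is a hypothetical sketch (random image, one random filter, ReLU after the convolution, a single fully connected layer); a real CNN would learn the filter and FC weights by backpropagation.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def conv2d(image, kernel):                       # valid convolution, stride 1
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(x, s=2):                            # non-overlapping max pooling
    h, w = x.shape[0] // s, x.shape[1] // s
    return x[:h*s, :w*s].reshape(h, s, w, s).max(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.standard_normal((10, 10))              # toy grayscale image
kernel = rng.standard_normal((3, 3))             # a filter (random stand-in)
W_fc = rng.standard_normal((16, 2))              # fully connected: 16 -> 2 classes
b_fc = np.zeros(2)

# Stack: convolutional layer -> pooling layer -> fully connected classifier
fmap = relu(conv2d(img, kernel))                 # 10x10 -> 8x8 activation map
pooled = max_pool(fmap)                          # 8x8 -> 4x4
logits = pooled.reshape(-1) @ W_fc + b_fc        # flatten 16 values -> 2 scores
pred = np.argmax(logits)                         # final classification
```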
What are 2 examples of successful CNN networks?
LeNet:
-first successful applications of CNNs
-1990s
-used to read zip codes
AlexNet:
-first work that made CNNs popular for computer vision
History of AI…
Artificial Intelligence(1950s- 1980s)–> rules written by experts:(
Machine Learning(1980s-2010s)–>rules learned from the data BUT features identified by experts
Deep Learning(2010s-)–>rules AND features learned from the data