Deep Learning Flashcards
Sparse coding
Humans and AI don't represent things in the same way (e.g., a human sees an image, the computer sees pixels).
Sparse coding lets the computer build its own representation. It searches for sparse features.
Sparse coding is posed as an optimization problem: the first term is the reconstruction error, the second is the regularization term.
x is the original data, i.e. the data in its original form. The weighted sum of the aᵢ and φᵢ is how the sparse coding algorithm represents that same data, using weights aᵢ and features φᵢ (basis vectors: features that are sparse but very informative, chosen by the computer). This weighted sum is subtracted from x to obtain the reconstruction error.
We want to find basis functions φᵢ that allow approximating the original data x with weights that are 0 most of the time.
How do we find these? Through the optimization problem: the first term minimizes the reconstruction error, and the second term is a regularizer defined on the weights a (low cost for weight vectors a in which many entries are 0). This regularizer is often the L1 norm; its cost goes up linearly with the size of a. We must also add the constraint that the φᵢ stay small (e.g. bounded norm): otherwise the optimizer could scale the φᵢ up and the aᵢ down, driving all aᵢ toward 0 without the representation becoming genuinely sparse.
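Written out, a standard formulation consistent with the description above (the trade-off weight λ and the counts m samples / k bases are assumed notation):

\[
\min_{\{a^{(j)}\},\,\{\phi_i\}} \; \sum_{j=1}^{m} \Big\| x^{(j)} - \sum_{i=1}^{k} a_i^{(j)} \phi_i \Big\|_2^2 \;+\; \lambda \sum_{j=1}^{m} \sum_{i=1}^{k} \big|a_i^{(j)}\big| \qquad \text{subject to } \|\phi_i\|_2^2 \le 1
\]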
Basis vectors φᵢ and their activations aᵢ make certain predictions easier than starting from pixel-based representations (images).
Motivation Deep Learning
If agents want to use high-level reasoning, they have to have some kind of representation of the state they're in. Symbol grounding problem: the problem of constructing a symbolic view of the environment. What is an object? What does it mean for something to be an object?
Assumption of Q-learning: a table-based Q-value representation, with a single table entry for every possible state-action pair.
Deep learning makes this easier: a learned representation can generalize over states instead of needing a separate entry for each one.
Good vs Sparse Features
Good features: very expressive features that don't occur in a large percentage of the data.
Sparse feature: a feature that doesn't occur very often, but when it does, it is very informative.
Autoencoder
Humans and AI don't represent things in the same way (e.g., a human sees an image, the computer sees pixels).
An autoencoder lets the computer build its own representation.
1 input layer, 1 constrained hidden layer (represents the input by finding structure in fewer dimensions that allows reconstructing the input), 1 output layer: the reconstructed input. Fewer nodes in the middle layer force the network to represent information that is high-dimensional in the input with only a few dimensions in the middle (so it had better make sure the features represented by those fewer nodes are very informative). The output nodes then learn how to reconstruct the input from those few middle nodes. This bottleneck forces the network to learn informative features. Design question, however: how few nodes do we need?
An autoencoder can denoise images: if the training data never contained noise, the network doesn't know how to represent noise, so it reconstructs the image without it -> predicts missing values.
It can also remove certain objects it has never seen before -> removes those values.
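A minimal sketch of such a bottleneck autoencoder in NumPy (the sizes, learning rate, tanh/linear activations, and omitted biases are all simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))          # toy data: 500 samples, 20 dims
n_in, n_hid = 20, 5                     # bottleneck: 20 -> 5 -> 20

W1 = rng.normal(scale=0.1, size=(n_in, n_hid))   # encoder weights
W2 = rng.normal(scale=0.1, size=(n_hid, n_in))   # decoder weights
lr = 0.01

for epoch in range(200):
    H = np.tanh(X @ W1)                 # hidden code (the bottleneck)
    X_hat = H @ W2                      # reconstructed input
    err = X_hat - X                     # reconstruction error
    # gradients of the mean squared reconstruction cost
    dW2 = H.T @ err / len(X)
    dH = err @ W2.T * (1 - H ** 2)      # backprop through tanh
    dW1 = X.T @ dH / len(X)
    W1 -= lr * dW1
    W2 -= lr * dW2

print("final reconstruction cost:", np.mean(err ** 2))
```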
Sparse and Variational autoencoder
Sparse autoencoders: force each node to be active in only, say, 5% of the cases (over whatever data samples we have). This forces sparsity (one common penalty is sketched below).
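One common way to enforce such a target is a KL-divergence penalty between the target activation rate and each hidden node's average activation (a sketch; sigmoid activations in (0, 1) and the rho/beta values are assumptions):

```python
import numpy as np

def sparsity_penalty(H, rho=0.05, beta=3.0):
    """KL(rho || rho_hat) summed over hidden nodes, added to the cost.

    H: hidden activations in (0, 1), shape (n_samples, n_hidden).
    rho: target average activation (e.g. active 5% of the time).
    """
    rho_hat = H.mean(axis=0)            # average activation per node
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return beta * kl.sum()
```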
Variational autoencoder: forces a bottleneck in the middle by adding noise to the code.
Stacked Autoencoder
Autoencoders can be repeated/stacked. A second autoencoder, trained on the codes of the first, can learn new features that may be even better than the first layer's (higher-level features). With this, we can delete part of the network/model to build a prediction model that uses the higher-level features. Example: on images, multiple stacked autoencoders can represent first pixels, then edges, then object parts, and lastly objects.
-> feature discovery
Stacking produces object neurons that detect specific objects and can then be combined. E.g., one node detects a certain type of eye, another a nose, etc.; the active neurons are the ones whose object is present, and combining them goes from a generic face to a specific one (see the sketch below).
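A sketch of the stacking idea: train one autoencoder, encode the data, then train a second autoencoder on those codes (the train_ae helper condenses the toy training loop above; all sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_ae(X, n_hid, lr=0.01, epochs=200):
    """Train one tanh autoencoder layer; return its encoder weights."""
    W1 = rng.normal(scale=0.1, size=(X.shape[1], n_hid))
    W2 = rng.normal(scale=0.1, size=(n_hid, X.shape[1]))
    for _ in range(epochs):
        H = np.tanh(X @ W1)
        err = H @ W2 - X
        dH = err @ W2.T * (1 - H ** 2)  # backprop through tanh
        W2 -= lr * H.T @ err / len(X)
        W1 -= lr * X.T @ dH / len(X)
    return W1

X = rng.normal(size=(500, 20))
W_a = train_ae(X, 10)                   # layer 1: trained on raw input
H1 = np.tanh(X @ W_a)                   # low-level features
W_b = train_ae(H1, 4)                   # layer 2: trained on layer-1 codes
H2 = np.tanh(H1 @ W_b)                  # higher-level features for a predictor
```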
Convolutional filters
A set of weights applied to an image as a filter. A filter is just a matrix of numbers that gets applied to the picture over and over again. New pixel value in the middle cell = first value in the filter * first value in the image patch + second value in the filter * second value in the patch + ... (see the sketch below).
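A minimal sketch of that operation (technically cross-correlation, which is what most deep learning libraries implement; "valid" borders, i.e. no padding, assumed):

```python
import numpy as np

def apply_filter(image, filt):
    """Slide filt over image; each output pixel is the elementwise
    product of the filter and the patch under it, summed up."""
    fh, fw = filt.shape
    ih, iw = image.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + fh, c:c + fw] * filt)
    return out

# example: a 3x3 vertical-edge filter on a random 8x8 image
image = np.random.default_rng(0).random((8, 8))
edge = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])
print(apply_filter(image, edge).shape)  # (6, 6)
```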
Examples:
Face detection works by creating a filter of a face and applying it to an image. The whole output turns dark except for the part that matches the face in the filter → a bright spot wherever there is a face.
Convolutional Networks
- Input layer (the edges between layers are shared weights).
- Second layer: filter neurons, each connected to only a local part of the input (only some nodes connect to them). For example, in face detection an image is split into 4 parts: 4 corners = 4 neurons. The weights are repeated for every neuron, so they all detect the same thing; if one weight gets updated, all copies of that shared weight get updated.
- Max-pooling layer: takes the maximum value of the previous nodes it is linked to. This turns the local filters into a global filter (face found anywhere in the total picture).
- Last node: global filter (applies to all).
- Softmax: translates the feature values into a probability distribution over the classes that you want to predict (see the sketch below).
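Small sketches of the two operations named above (shapes and pooling size are assumptions):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Keep the maximum of each size x size block: a local detection
    anywhere in the block survives, making the filter more global."""
    h, w = feature_map.shape
    fm = feature_map[:h - h % size, :w - w % size]
    return fm.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    e = np.exp(scores - scores.max())   # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1
```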
planning & search
simplest AI
try all possible actions from a certain state. Build a search tree, where depth
is the number of moves you look ahead. Best move to play is the move that leads to the highest number of
wins (if the game would be played to the end). However, the problem is that your search tree would be way
too big. So we reduce the size of the search tree: limit the width or limit the depth by learning from either
human or own experience.
Go -> limit width by only visiting moves often played by experienced players; limit depth by predicting the outcome of a position.
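A sketch of the exhaustive version for a generic two-player game (the Game interface with moves/play/is_over/winner is a hypothetical stand-in):

```python
def win_count(game, player, depth):
    """Count how many leaves of the search tree are wins for `player`,
    exploring every move up to `depth` plies ahead (exhaustive search)."""
    if depth == 0 or game.is_over():
        return 1 if game.winner() == player else 0
    return sum(win_count(game.play(m), player, depth - 1)
               for m in game.moves())

def best_move(game, player, depth):
    # the move whose subtree contains the most wins for `player`
    return max(game.moves(),
               key=lambda m: win_count(game.play(m), player, depth - 1))
```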
Types of ML
- Classification: look at the search and say "some of these moves are not played very often by experienced players", so they're probably not very good moves and we can eliminate them (if the probability is too low).
- Regression: predict the number of wins a position is expected to generate. Can be a probability or a count.
Saliency Map
Where would humans say the information is in the picture/data, i.e. the important info according to them.
Example: in a picture you look at people and signs, but not at the background.
GO
Fully observable, not stochastic
A policy gives the best action in each state.
It is hard to find a board evaluation function -> we need a function that generalizes.
There are so many possible moves that it is impossible to represent them all in a tree.
Before deep learning, we didn't know how to represent the state/game.
The first model had 57% accuracy, using supervised learning: classification.
To improve the model, they let it play against itself: actions that lead to a win are made more likely, and actions that lead to a loss are made less likely. To prevent a mess, they always fixed one player → its policy didn't change (avoiding the messy part of multi-agent RL). Version 1.1 vs 1.2, 1.2 vs 1.3, 1.3 vs 1.7, etc. They did this for a week, until the last version won 80% of the games against the first one. This is reinforcement learning of the policy, using a policy gradient algorithm, where the policy is represented by a neural net. z = 1 if the game is won, z = -1 if the game is lost.
Human expert positions -> supervised learning policy network -> self-play -> reinforcement learning policy network -> self-play -> self-play data -> value network -> MCTS to search intelligently (a sketch of the policy gradient step follows below).
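A sketch of that policy gradient update for one finished self-play game (a linear softmax policy over moves; the feature vectors and shapes are assumptions, not the actual AlphaGo network):

```python
import numpy as np

def policy_gradient_update(W, states, actions, z, lr=0.01):
    """REINFORCE-style update: after a finished game, make every action
    taken more likely if z = +1 (win) and less likely if z = -1 (loss).

    W: (n_features, n_moves) weights of a linear softmax policy.
    states: list of feature vectors; actions: list of chosen move indices.
    """
    for s, a in zip(states, actions):
        logits = s @ W
        p = np.exp(logits - logits.max())
        p /= p.sum()                    # softmax move probabilities
        grad = -np.outer(s, p)          # d log pi(a|s) / dW, softmax part
        grad[:, a] += s                 # plus the chosen-action term
        W += lr * z * grad              # step scaled by the game outcome z
    return W
```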
Learning parameters NN
Learning the parameters in a NN is done by taking random initial values for the parameters and then adjusting the weights to minimize the cost function (for an autoencoder the cost is made up of the inaccuracies in the reconstruction). Once the cost function is defined, the NN uses examples to calculate the cost and the gradient of the cost in order to find a global or local minimum → finding the minimum of a function can be done by following the gradient: the local value of the derivative.
Now, the NN has a specific structure, where each layer's activation is a function of the activations of the previous layer, and the same goes for that layer in turn. Thus the gradient is computed by the chain rule = a product of derivatives. However, multiplying many numbers is very unstable if they are not close to 1: factors larger than 1 make the gradient explode, factors smaller than 1 make it vanish. If some of them, or even only one, is 0, the whole gradient is 0. And this gradient is what carries information across the network, so if it is 0, a lot of nodes will not know what to change.
To help with this: ReLU -> rectified linear units, a function that has gradient 0 over a large part of its domain (if the node is not contributing, it should not change) and gradient 1 where the node is contributing.
Early networks used a nonlinear (logistic or tanh) function for every node in the network; the logistic's derivative is never larger than ¼ (its maximum, at x = 0), which was a huge problem for the product of derivatives.
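A tiny demonstration of why that ¼ bound matters and what ReLU changes (a depth of 20 layers is chosen arbitrarily):

```python
import numpy as np

def logistic_deriv(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)                  # at most 0.25, reached at x = 0

def relu_deriv(x):
    return (x > 0).astype(float)          # exactly 0 or 1

x = np.zeros(20)                          # best case for the logistic
print(np.prod(logistic_deriv(x)))         # 0.25**20 ~ 9e-13: gradient vanishes
print(np.prod(relu_deriv(np.ones(20))))   # 1.0: a contributing path keeps its gradient
```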
Limiting the width & depth GO
Width: consider each position together with the probability of certain moves being played, taken from online games and from experts playing Go.
Depth: use the same architecture but add an additional regression layer to predict the win probability for that state: a prediction of 0 means a bad position, a prediction of 1 means an excellent position from which you'll definitely win. The gradient is driven by how much the current estimate of who is going to win differs from the observation of who actually won the game. z is used to make the step along the gradient larger or smaller, depending on how much it deviates from the network's current estimate: supervised learning, regression (a sketch follows below).
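A sketch of that regression update, with a linear value function v(s) = sigmoid(w·s) assumed for brevity; the outcome is encoded here as 1 = win, 0 = loss to match the (0, 1) prediction range, which is an assumption:

```python
import numpy as np

def value_update(w, s, z, lr=0.01):
    """One supervised regression step: move the predicted win
    probability v(s) toward the observed outcome z (1 = win, 0 = loss)."""
    v = 1.0 / (1.0 + np.exp(-w @ s))    # current estimate in (0, 1)
    error = z - v                       # step grows with the deviation
    return w + lr * error * s           # gradient step, logistic-regression form
```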
Not sufficient by itself! We need to search intelligently -> Monte Carlo tree search.
Monte Carlo tree search
A way to search the search tree intelligently. Four phases: selection, expansion, simulation, backpropagation (a skeleton follows after the phase descriptions).
Selection: decide which moves are most likely to be played (which part of the subtree to investigate further).
Expansion: add a new leaf node, which then needs to be evaluated. So we play out the rest of the game 'randomly' → simulation. We run multiple simulations to estimate how likely the game is to be won from that position.
Backpropagation: combine the search results with the predictions already made and adjust the estimated likelihood of winning for all moves above the examined node.
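A compact UCT-style skeleton of the four phases (the Game interface with moves/play/is_over/winner is the same hypothetical stand-in as above; the exploration constant c = 1.4 is an assumption):

```python
import math
import random

class Node:
    def __init__(self, game, parent=None):
        self.game, self.parent = game, parent
        self.children, self.wins, self.visits = {}, 0, 0

def uct(node, c=1.4):
    """Selection score: exploit high win rates, explore rarely visited nodes."""
    return (node.wins / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts(root, player, n_iter=1000):
    for _ in range(n_iter):
        node = root
        # 1. selection: walk down fully expanded nodes via UCT
        while node.children and len(node.children) == len(node.game.moves()):
            node = max(node.children.values(), key=uct)
        # 2. expansion: add one untried move as a new leaf
        untried = [m for m in node.game.moves() if m not in node.children]
        if untried and not node.game.is_over():
            m = random.choice(untried)
            node.children[m] = Node(node.game.play(m), parent=node)
            node = node.children[m]
        # 3. simulation: play out the rest of the game randomly
        game = node.game
        while not game.is_over():
            game = game.play(random.choice(game.moves()))
        # 4. backpropagation: update win/visit counts up to the root
        while node is not None:
            node.visits += 1
            node.wins += 1 if game.winner() == player else 0
            node = node.parent
    return max(root.children, key=lambda m: root.children[m].visits)
```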