Deep Learning Flashcards
Sparse coding
Humans and AI don't represent things in the same way (e.g., a human sees an image, the computer sees pixels).
Sparse coding lets the computer build its own representation. It searches for sparse features.
Sparse coding is posed as an optimization problem: the first term is the reconstruction error, the second is the regularization term.
x is the original data, i.e. the data in its original form. The weighted sum of the aᵢ and φᵢ is how the sparse coding algorithm represents that same data, using weights aᵢ and features φᵢ (basis vectors: features that are sparse but very informative, chosen by the computer). This weighted sum is subtracted from x to obtain the reconstruction error.
We want to find basis functions φᵢ that allow approximating the original data x with weights that are 0 most of the time.
How do we find these? Through the optimization problem: the first term minimizes the reconstruction error, and the second term is a regularizer defined on the weights a (low cost for weight vectors a in which many entries are 0). This regularizer is often the L1 norm; its cost goes up linearly with the size of a. We must also add the constraint that the φᵢ stay small (e.g. bounded norm): otherwise the optimizer could scale the φᵢ up and the aᵢ down, driving all aᵢ toward 0 without the representation becoming genuinely sparse.
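Written out, a standard formulation consistent with the description above (the trade-off weight λ and the counts m samples / k bases are assumed notation):

\[
\min_{\{a^{(j)}\},\,\{\phi_i\}} \; \sum_{j=1}^{m} \Big\| x^{(j)} - \sum_{i=1}^{k} a_i^{(j)} \phi_i \Big\|_2^2 \;+\; \lambda \sum_{j=1}^{m} \sum_{i=1}^{k} \big|a_i^{(j)}\big| \qquad \text{subject to } \|\phi_i\|_2^2 \le 1
\]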
Basis vectors φᵢ and their activations aᵢ make certain predictions easier than starting from pixel-based representations (images).
Motivation Deep Learning
If agents want to use high-level reasoning, they have to have some kind of representation of the state they're in. Symbol grounding problem: the problem of constructing a symbolic view of the environment. What is an object? What does it mean for something to be an object?
Assumption of Q-learning: a table-based Q-value representation, with a single table entry for every possible state-action pair.
Deep learning makes this easier: a learned representation can generalize over states instead of needing a separate entry for each one.
Good vs Sparse Features
Good features: very expressive features that don't occur in a large percentage of the data.
Sparse feature: a feature that doesn't occur very often, but when it does, it is very informative.
Autoencoder
Humans and AI don't represent things in the same way (e.g., a human sees an image, the computer sees pixels).
An autoencoder lets the computer build its own representation.
1 input layer, 1 constrained hidden layer (represents the input by finding structure in fewer dimensions that allows reconstructing the input), 1 output layer: the reconstructed input. Fewer nodes in the middle layer force the network to represent information that is high-dimensional in the input with only a few dimensions in the middle (so it had better make sure the features represented by those fewer nodes are very informative). The output nodes then learn how to reconstruct the input from those few middle nodes. This bottleneck forces the network to learn informative features. Design question, however: how few nodes do we need?
An autoencoder can denoise images: if the training data never contained noise, the network doesn't know how to represent noise, so it reconstructs the image without it -> predicts missing values.
It can also remove certain objects it has never seen before -> removes those values.
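A minimal sketch of such a bottleneck autoencoder in NumPy (the sizes, learning rate, tanh/linear activations, and omitted biases are all simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))          # toy data: 500 samples, 20 dims
n_in, n_hid = 20, 5                     # bottleneck: 20 -> 5 -> 20

W1 = rng.normal(scale=0.1, size=(n_in, n_hid))   # encoder weights
W2 = rng.normal(scale=0.1, size=(n_hid, n_in))   # decoder weights
lr = 0.01

for epoch in range(200):
    H = np.tanh(X @ W1)                 # hidden code (the bottleneck)
    X_hat = H @ W2                      # reconstructed input
    err = X_hat - X                     # reconstruction error
    # gradients of the mean squared reconstruction cost
    dW2 = H.T @ err / len(X)
    dH = err @ W2.T * (1 - H ** 2)      # backprop through tanh
    dW1 = X.T @ dH / len(X)
    W1 -= lr * dW1
    W2 -= lr * dW2

print("final reconstruction cost:", np.mean(err ** 2))
```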
Sparse and Variational autoencoder
Sparse autoencoders: force each node to be active in only, say, 5% of the cases (over whatever data samples we have). This forces sparsity (one common penalty is sketched below).
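One common way to enforce such a target is a KL-divergence penalty between the target activation rate and each hidden node's average activation (a sketch; sigmoid activations in (0, 1) and the rho/beta values are assumptions):

```python
import numpy as np

def sparsity_penalty(H, rho=0.05, beta=3.0):
    """KL(rho || rho_hat) summed over hidden nodes, added to the cost.

    H: hidden activations in (0, 1), shape (n_samples, n_hidden).
    rho: target average activation (e.g. active 5% of the time).
    """
    rho_hat = H.mean(axis=0)            # average activation per node
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return beta * kl.sum()
```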
Variational autoencoder: forces a bottleneck in the middle by adding noise to the code.
Stacked Autoencoder
Autoencoders can be repeated/stacked. A second autoencoder, trained on the codes of the first, can learn new features that may be even better than the first layer's (higher-level features). With this, we can delete part of the network/model to build a prediction model that uses the higher-level features. Example: on images, multiple stacked autoencoders can represent first pixels, then edges, then object parts, and lastly objects.
-> feature discovery
Stacking produces object neurons that detect specific objects and can then be combined. E.g., one node detects a certain type of eye, another a nose, etc.; the active neurons are the ones whose object is present, and combining them goes from a generic face to a specific one (see the sketch below).
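A sketch of the stacking idea: train one autoencoder, encode the data, then train a second autoencoder on those codes (the train_ae helper condenses the toy training loop above; all sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def train_ae(X, n_hid, lr=0.01, epochs=200):
    """Train one tanh autoencoder layer; return its encoder weights."""
    W1 = rng.normal(scale=0.1, size=(X.shape[1], n_hid))
    W2 = rng.normal(scale=0.1, size=(n_hid, X.shape[1]))
    for _ in range(epochs):
        H = np.tanh(X @ W1)
        err = H @ W2 - X
        dH = err @ W2.T * (1 - H ** 2)  # backprop through tanh
        W2 -= lr * H.T @ err / len(X)
        W1 -= lr * X.T @ dH / len(X)
    return W1

X = rng.normal(size=(500, 20))
W_a = train_ae(X, 10)                   # layer 1: trained on raw input
H1 = np.tanh(X @ W_a)                   # low-level features
W_b = train_ae(H1, 4)                   # layer 2: trained on layer-1 codes
H2 = np.tanh(H1 @ W_b)                  # higher-level features for a predictor
```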
Convolutional filters
A set of weights applied to an image as a filter. A filter is just a matrix of numbers that gets applied to the picture over and over again. New pixel value in the middle cell = first value in the filter * first value in the image patch + second value in the filter * second value in the patch + ... (see the sketch below).
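A minimal sketch of that operation (technically cross-correlation, which is what most deep learning libraries implement; "valid" borders, i.e. no padding, assumed):

```python
import numpy as np

def apply_filter(image, filt):
    """Slide filt over image; each output pixel is the elementwise
    product of the filter and the patch under it, summed up."""
    fh, fw = filt.shape
    ih, iw = image.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + fh, c:c + fw] * filt)
    return out

# example: a 3x3 vertical-edge filter on a random 8x8 image
image = np.random.default_rng(0).random((8, 8))
edge = np.array([[1, 0, -1], [1, 0, -1], [1, 0, -1]])
print(apply_filter(image, edge).shape)  # (6, 6)
```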
Examples:
Face detection works by creating a filter of a face and applying it to an image. The whole output turns dark except for the part that matches the face in the filter → a bright spot wherever there is a face.
Convolutional Networks
- Input layer (the edges between layers are shared weights).
- Second layer: filter neurons, each connected to only a local part of the input (only some nodes connect to them). For example, in face detection an image is split into 4 parts: 4 corners = 4 neurons. The weights are repeated for every neuron, so they all detect the same thing; if one weight gets updated, all copies of that shared weight get updated.
- Max-pooling layer: takes the maximum value of the previous nodes it is linked to. This turns the local filters into a global filter (face found anywhere in the total picture).
- Last node: global filter (applies to all).
- Softmax: translates the feature values into a probability distribution over the classes that you want to predict (see the sketch below).
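Small sketches of the two operations named above (shapes and pooling size are assumptions):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Keep the maximum of each size x size block: a local detection
    anywhere in the block survives, making the filter more global."""
    h, w = feature_map.shape
    fm = feature_map[:h - h % size, :w - w % size]
    return fm.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    e = np.exp(scores - scores.max())   # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities summing to 1
```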
planning & search
simplest AI
try all possible actions from a certain state. Build a search tree, where depth
is the number of moves you look ahead. Best move to play is the move that leads to the highest number of
wins (if the game would be played to the end). However, the problem is that your search tree would be way
too big. So we reduce the size of the search tree: limit the width or limit the depth by learning from either
human or own experience.
Go -> limit width by only visiting moves often played by experienced players; limit depth by predicting the outcome of a position.
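A sketch of the exhaustive version for a generic two-player game (the Game interface with moves/play/is_over/winner is a hypothetical stand-in):

```python
def win_count(game, player, depth):
    """Count how many leaves of the search tree are wins for `player`,
    exploring every move up to `depth` plies ahead (exhaustive search)."""
    if depth == 0 or game.is_over():
        return 1 if game.winner() == player else 0
    return sum(win_count(game.play(m), player, depth - 1)
               for m in game.moves())

def best_move(game, player, depth):
    # the move whose subtree contains the most wins for `player`
    return max(game.moves(),
               key=lambda m: win_count(game.play(m), player, depth - 1))
```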
Types of ML
- Classification: look at the search and say "some of these moves are not played very often by experienced players", so they're probably not very good moves and we can eliminate them (if the probability is too low).
- Regression: predict the number of wins a position is expected to generate. Can be a probability or a count.
Saliency Map
Where would humans say the information is in the picture/data, i.e. the important info according to them.
Example: in a picture you look at people and signs, but not at the background.
GO
Fully observable, not stochastic
A policy gives the best action in each state.
It is hard to find a board evaluation function -> we need a function that generalizes.
There are so many possible moves that it is impossible to represent them all in a tree.
Before deep learning, we didn't know how to represent the state/game.
The first model had 57% accuracy, using supervised learning: classification.
To improve the model, they let it play against itself: actions that lead to a win are made more likely, and actions that lead to a loss are made less likely. To prevent a mess, they always fixed one player → its policy didn't change (avoiding the messy part of multi-agent RL). Version 1.1 vs 1.2, 1.2 vs 1.3, 1.3 vs 1.7, etc. They did this for a week, until the last version won 80% of the games against the first one. This is reinforcement learning of the policy, using a policy gradient algorithm, where the policy is represented by a neural net. z = 1 if the game is won, z = -1 if the game is lost.
Human expert positions -> supervised learning policy network -> self-play -> reinforcement learning policy network -> self-play -> self-play data -> value network -> MCTS to search intelligently (a sketch of the policy gradient step follows below).
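A sketch of that policy gradient update for one finished self-play game (a linear softmax policy over moves; the feature vectors and shapes are assumptions, not the actual AlphaGo network):

```python
import numpy as np

def policy_gradient_update(W, states, actions, z, lr=0.01):
    """REINFORCE-style update: after a finished game, make every action
    taken more likely if z = +1 (win) and less likely if z = -1 (loss).

    W: (n_features, n_moves) weights of a linear softmax policy.
    states: list of feature vectors; actions: list of chosen move indices.
    """
    for s, a in zip(states, actions):
        logits = s @ W
        p = np.exp(logits - logits.max())
        p /= p.sum()                    # softmax move probabilities
        grad = -np.outer(s, p)          # d log pi(a|s) / dW, softmax part
        grad[:, a] += s                 # plus the chosen-action term
        W += lr * z * grad              # step scaled by the game outcome z
    return W
```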
Learning parameters NN
Learning the parameters in a NN is done by taking random initial values for the parameters and then adjusting the weights to minimize the cost function (for an autoencoder the cost is made up of the inaccuracies in the reconstruction). Once the cost function is defined, the NN uses examples to calculate the cost and the gradient of the cost in order to find a global or local minimum → finding the minimum of a function can be done by following the gradient: the local value of the derivative.
Now, the NN has a specific structure, where each layer's activation is a function of the activations of the previous layer, and the same goes for that layer in turn. Thus the gradient is computed by the chain rule = a product of derivatives. However, multiplying many numbers is very unstable if they are not close to 1: factors larger than 1 make the gradient explode, factors smaller than 1 make it vanish. If some of them, or even only one, is 0, the whole gradient is 0. And this gradient is what carries information across the network, so if it is 0, a lot of nodes will not know what to change.
To help with this: ReLU -> rectified linear units, a function that has gradient 0 over a large part of its domain (if the node is not contributing, it should not change) and gradient 1 where the node is contributing.
Early networks used a nonlinear (logistic or tanh) function for every node in the network; the logistic's derivative is never larger than ¼ (its maximum, at x = 0), which was a huge problem for the product of derivatives.
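A tiny demonstration of why that ¼ bound matters and what ReLU changes (a depth of 20 layers is chosen arbitrarily):

```python
import numpy as np

def logistic_deriv(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)                  # at most 0.25, reached at x = 0

def relu_deriv(x):
    return (x > 0).astype(float)          # exactly 0 or 1

x = np.zeros(20)                          # best case for the logistic
print(np.prod(logistic_deriv(x)))         # 0.25**20 ~ 9e-13: gradient vanishes
print(np.prod(relu_deriv(np.ones(20))))   # 1.0: a contributing path keeps its gradient
```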
Limiting the width & depth GO
Width: consider each position together with the probability of certain moves being played, taken from online games and from experts playing Go.
Depth: use the same architecture but add an additional regression layer to predict the win probability for that state: a prediction of 0 means a bad position, a prediction of 1 means an excellent position from which you'll definitely win. The gradient is driven by how much the current estimate of who is going to win differs from the observation of who actually won the game. z is used to make the step along the gradient larger or smaller, depending on how much it deviates from the network's current estimate: supervised learning, regression (a sketch follows below).
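A sketch of that regression update, with a linear value function v(s) = sigmoid(w·s) assumed for brevity; the outcome is encoded here as 1 = win, 0 = loss to match the (0, 1) prediction range, which is an assumption:

```python
import numpy as np

def value_update(w, s, z, lr=0.01):
    """One supervised regression step: move the predicted win
    probability v(s) toward the observed outcome z (1 = win, 0 = loss)."""
    v = 1.0 / (1.0 + np.exp(-w @ s))    # current estimate in (0, 1)
    error = z - v                       # step grows with the deviation
    return w + lr * error * s           # gradient step, logistic-regression form
```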
Not sufficient by itself! We need to search intelligently -> Monte Carlo tree search.
Monte Carlo tree search
A way to search the search tree intelligently. Four phases: selection, expansion, simulation, backpropagation (a skeleton follows after the phase descriptions).
Selection: decide which moves are most likely to be played (which part of the subtree to investigate further).
Expansion: add a new leaf node, which then needs to be evaluated. So we play out the rest of the game 'randomly' → simulation. We run multiple simulations to estimate how likely the game is to be won from that position.
Backpropagation: combine the search results with the predictions already made and adjust the estimated likelihood of winning for all moves above the examined node.
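A compact UCT-style skeleton of the four phases (the Game interface with moves/play/is_over/winner is the same hypothetical stand-in as above; the exploration constant c = 1.4 is an assumption):

```python
import math
import random

class Node:
    def __init__(self, game, parent=None):
        self.game, self.parent = game, parent
        self.children, self.wins, self.visits = {}, 0, 0

def uct(node, c=1.4):
    """Selection score: exploit high win rates, explore rarely visited nodes."""
    return (node.wins / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts(root, player, n_iter=1000):
    for _ in range(n_iter):
        node = root
        # 1. selection: walk down fully expanded nodes via UCT
        while node.children and len(node.children) == len(node.game.moves()):
            node = max(node.children.values(), key=uct)
        # 2. expansion: add one untried move as a new leaf
        untried = [m for m in node.game.moves() if m not in node.children]
        if untried and not node.game.is_over():
            m = random.choice(untried)
            node.children[m] = Node(node.game.play(m), parent=node)
            node = node.children[m]
        # 3. simulation: play out the rest of the game randomly
        game = node.game
        while not game.is_over():
            game = game.play(random.choice(game.moves()))
        # 4. backpropagation: update win/visit counts up to the root
        while node is not None:
            node.visits += 1
            node.wins += 1 if game.winner() == player else 0
            node = node.parent
    return max(root.children, key=lambda m: root.children[m].visits)
```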