ML-Final Flashcards
What is a threshold logic unit (TLU)?
A simple model of a neuron.
Each input value is multiplied by the corresponding weight value, and these weighted values are then summed.
If the weighted sum is larger than a certain threshold value, the output is set to one, and zero otherwise.
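A minimal sketch in Python (the inputs, weights, and threshold below are made-up values for illustration):

```python
import numpy as np

def tlu(x, w, threshold):
    """Output 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    return 1 if np.dot(w, x) > threshold else 0

x = np.array([1.0, 0.0, 1.0])   # input values
w = np.array([0.5, -0.2, 0.8])  # connection weights
print(tlu(x, w, 1.0))           # 0.5 + 0.8 = 1.3 > 1.0, so this prints 1
```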
What is a weight parameter?
A parameter representing the ‘strength’ of a connection between two nodes.
What is a perceptron?
A model where the output is calculated from the weighted summed input with an activation function (also called a gain function, transfer function, or output function).
Give examples of activation functions (gain functions, transfer functions, output functions).
Sigmoid and tanh; the step (threshold) function and ReLU are other common examples.
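A quick sketch of the two functions named above, plus a perceptron-style output built on the sigmoid (the weights and bias are arbitrary example values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes any input into (0, 1)

def tanh(z):
    return np.tanh(z)                 # squashes any input into (-1, 1)

x = np.array([0.5, -1.0])
w = np.array([1.2, 0.7])
b = 0.1
y = sigmoid(np.dot(w, x) + b)         # perceptron output with a bias term
```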
Why do we add a bias term to a perceptron?
A bias allows a perceptron to shift its decision boundary away from the origin, so the prediction can better fit the data.
What are the similarities between an SVM and a perceptron?
A linear SVM is a special case of a perceptron: both compute a linear decision boundary from the weighted sum of the inputs.
What is the difference between deep learning and an SVM?
An SVM solves the optimization problem with specific, fixed transformations of the feature space (the kernel).
Deep learning aims at learning the appropriate transformations from the data.
What is the delta term (delta rule)?
δ = (y(i) − y) · y(1 − y), where y(i) is the target, y is the output, and y(1 − y) is the derivative of the sigmoid.
The delta rule is a gradient descent learning rule for updating the weights of the inputs: w ← w + η δ x.
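A minimal sketch of one delta-rule update for a sigmoid unit (the learning rate and data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta = 0.1                           # learning rate
x = np.array([1.0, 0.5])            # one training input
target = 1.0                        # its label, y(i)
w = np.array([0.2, -0.3])           # current weights

y = sigmoid(np.dot(w, x))           # forward pass
delta = (target - y) * y * (1 - y)  # delta term
w = w + eta * delta * x             # gradient descent weight update
```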
Why is a multilayer feedforward network a universal function approximator?
For every (continuous) function f, there is guaranteed to be a neural network such that for every possible input x, the value f(x) (or a value arbitrarily close to it) is output from the network.
Given enough hidden nodes, any such function can be approximated with arbitrary precision by these networks.
What is error backpropagation (backpropagation)?
The calculation of the gradient proceeds backwards through the network: the gradient of the final layer of weights is calculated first and the gradient of the first layer of weights last, reusing the chain-rule terms computed at each step.
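A minimal backpropagation sketch for a network with one hidden layer of sigmoid units and squared error (all sizes and values are arbitrary example choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(3)                   # input
t = 1.0                             # target
W1 = rng.normal(size=(4, 3)) * 0.5  # first-layer weights
w2 = rng.normal(size=4) * 0.5       # output-layer weights

# Forward pass
h = sigmoid(W1 @ x)                 # hidden activations
y = sigmoid(w2 @ h)                 # network output

# Backward pass: the output-layer gradient is computed first...
delta_out = (t - y) * y * (1 - y)
grad_w2 = delta_out * h
# ...then the error is propagated back to the first layer.
delta_hidden = delta_out * w2 * h * (1 - h)
grad_W1 = np.outer(delta_hidden, x)

eta = 0.1
w2 += eta * grad_w2                 # weight updates that reduce the squared error
W1 += eta * grad_W1
```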
Batch, mini-batch and online learning?
Online learning helps avoid local minima (the updates are noisy); mini-batch learning suits large datasets; full-batch learning requires a lot of memory.
The “no free lunch” theorem?
There is no one model that works best for every problem.
The assumptions of a great model for one problem may not hold for another problem.
What is cross entropy (the negative log probability)? What is it used for?
The negative log probability of the given label under the current model (a probability distribution); it is used as the loss function for training classifiers.
H(p, q) = − Σ_y p(y) log q(y)
p: the true distribution of the data (the labels)
q: the model distribution; the neural network represents the probability q(y|x; w)
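A numerical sketch (both distributions below are made-up numbers for a 3-class problem):

```python
import numpy as np

p = np.array([1.0, 0.0, 0.0])       # true label as a one-hot distribution
q = np.array([0.7, 0.2, 0.1])       # the model's predicted distribution

H = -np.sum(p * np.log(q + 1e-12))  # the small epsilon guards against log(0)
# With a one-hot p this reduces to -log q(true class) = -log 0.7.
```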
Derive the learning rule.
KL-divergence: what is it equivalent to? How is it related to cross entropy? How are they used in neural networks?
It is related to cross entropy:
H(p, q) = H(p) + KL(p||q)
Minimizing the cross entropy is equivalent to minimizing the KL-divergence, since H(p) does not depend on the model.
Both are closely related to the maximum (log-)likelihood principle.
They are used to derive the learning rule (as the loss that gradient descent minimizes); see the numerical check below.
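A numerical check of the identity H(p, q) = H(p) + KL(p||q) on made-up distributions:

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])     # "true" distribution
q = np.array([0.5, 0.25, 0.25])   # model distribution

H_pq = -np.sum(p * np.log(q))     # cross entropy
H_p = -np.sum(p * np.log(p))      # entropy of p
KL = np.sum(p * np.log(p / q))    # KL-divergence
print(np.isclose(H_pq, H_p + KL)) # True
```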
What is the softmax function? Why and where is it used in a neural network?
The softmax function is a generalization of the logistic function that “squeezes” the outputs into the range (0, 1) so that they sum to one.
It is used to highlight the largest values and suppress values which are significantly below the maximum.
It is used in the final layer of a neural network, turning scores into a probability distribution over classes.
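A short sketch (the scores are made-up final-layer outputs):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtracting the max improves numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))         # ~[0.66, 0.24, 0.10], sums to 1
```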
How are neural networks related to probabilistic regression?
Through cross entropy and KL-divergence: minimizing these losses amounts to maximum-likelihood fitting of the probability distribution the network represents.
What is the relationship between maximizing the log probability and the cross entropy?
We want to maximize the log probability of the labels given the data. Since the cross entropy is the negative of this, maximizing the log probability of the labels given the data is equivalent to minimizing the cross entropy.
What is deep learning?
Deep learning basically refers to neural networks with many layers.
What is a filter in a CNN?
It is a vector (or small matrix) of weights describing a pattern to be detected in the input.
What is convolution?
Convolution is the operation of multiplying and adding while shifting the filter across the input; see the sketch below.
What is a stride in a CNN?
It is how many steps you shift the filter in each iteration of the convolution.
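A 1-D sketch of both ideas (the signal and filter are made-up example values; CNNs typically use this correlation form of convolution):

```python
import numpy as np

def conv1d(signal, filt, stride=1):
    n_out = (len(signal) - len(filt)) // stride + 1
    out = np.zeros(n_out)
    for i in range(n_out):
        start = i * stride  # the stride controls how far the filter shifts
        out[i] = np.dot(signal[start:start + len(filt)], filt)  # multiply and add
    return out

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
filt = np.array([1.0, 0.0, -1.0])      # a simple edge-detecting pattern
print(conv1d(signal, filt, stride=1))  # [-2. -2. -2. -2.]
print(conv1d(signal, filt, stride=2))  # [-2. -2.]
```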
What is a pooling operation in convolutional neural networks and why is this operation important?
Pooling is taking the average or the maximum of the previous output in a certain area of the filtered image.
It compresses the image down into a higher-level representation.
This is usually called downsampling; the operation is important because it reduces the dimensionality of the features and the computational cost.
It also helps to prevent overfitting.
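A sketch of 2×2 max pooling with stride 2 on a made-up 4×4 feature map:

```python
import numpy as np

def max_pool_2x2(fmap):
    h, w = fmap.shape
    out = np.zeros((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            # each output value is the maximum of a non-overlapping 2x2 region
            out[i // 2, j // 2] = fmap[i:i + 2, j:j + 2].max()
    return out

fmap = np.array([[1., 3., 2., 0.],
                 [4., 2., 1., 1.],
                 [0., 1., 5., 6.],
                 [2., 2., 7., 8.]])
print(max_pool_2x2(fmap))  # [[4. 2.] [2. 8.]]
```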
Briefly explain `dropout’ and why it is used in deep networks.
Dropout: randomly (e.g. with p = 0.5) ignoring hidden nodes for a specific input during learning; they are temporarily turned off.
The reason we use it is that it is a regularization technique that helps to prevent overfitting.
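A sketch using the common “inverted dropout” convention (the activations are made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                             # probability of keeping a node
h = np.array([0.3, 1.2, 0.7, 0.9])  # hidden activations for one input

mask = rng.random(h.shape) < p      # which nodes stay on for this input
h_dropped = h * mask / p            # dividing by p keeps the expected value unchanged
# At test time nothing is dropped (and with this scaling, no rescaling is needed).
```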
Sparse representation, compressed representation and fully distributed representation.
In a sparse representation only a few nodes are active for a given input; in a fully distributed representation the input is encoded across many (or all) nodes; a compressed representation uses fewer nodes than the input dimensionality, as in an autoencoder's bottleneck.
What is an autoencoder? Why is it useful?
An autoencoder is a neural network that tries to reconstruct its input.
It is a feature extraction algorithm: it helps us find a (compressed) representation for our data.
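A minimal sketch of the idea (the layer sizes and weights are arbitrary; training would minimize the reconstruction error):

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(3, 8)) * 0.1  # encoder: 8 inputs -> 3-node bottleneck
W_dec = rng.normal(size=(8, 3)) * 0.1  # decoder: 3 -> 8 reconstruction

x = rng.random(8)
code = np.tanh(W_enc @ x)              # the compressed representation (features)
x_hat = W_dec @ code                   # reconstruction of the input
reconstruction_error = np.mean((x - x_hat) ** 2)  # the quantity training minimizes
```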
What is the relation between ridge regression and a Gaussian prior?
Ridge regression uses L2 regularization, and L2 regularization is equivalent to placing a Gaussian prior on the weights.
What is batch normalization?
Batch normalization: normalize the input to each hidden layer over each mini-batch.
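A sketch for one layer's inputs over a mini-batch (gamma and beta are the learned scale and shift of standard batch normalization; the data are made-up values):

```python
import numpy as np

x = np.array([[1.0, 2.0],
              [3.0, 0.5],
              [2.0, 1.5]])              # mini-batch of 3 samples, 2 features

mu = x.mean(axis=0)                     # per-feature mean over the batch
var = x.var(axis=0)                     # per-feature variance over the batch
x_hat = (x - mu) / np.sqrt(var + 1e-5)  # normalize each feature

gamma, beta = np.ones(2), np.zeros(2)   # learnable scale and shift
y = gamma * x_hat + beta
```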
What are skip connections in a neural network?
Connections that skip one or more layers (e.g. convolutional layers), feeding a layer's output directly to a later layer.
What are Recurrent Neural Networks, where does the term ‘recurrent’ come from, and what are they used for?
Recurrent Neural Networks perform the same task for every element of a sequence; ‘recurrent’ refers to the output for each element depending on the previous computations through a hidden state that is fed back into the network.
They are used for sequence processing, e.g. for machine translation.
Explain what backpropagation-through-time is in an RNN.
The RNN is unrolled in time into a feedforward network with one copy of the network per time step; ordinary backpropagation is then applied to this unrolled network, summing the gradients of the weights that are shared across time steps.
What is a gated neural network?
A gated recurrent network has an extra memory state (the gated cell state) that is carried from the current step to the next step.
A forgetting gate and a write gate can modify its value.
Examples of such networks are the LSTM (Long Short-Term Memory) and the gated recurrent unit (GRU).
What is a Boltzmann machine? What is the challenge with it?
A special form of recurrent network in which the connections between nodes are symmetric.
The challenge is finding practical training rules
What is reinforcement learning (RL)?
A learning system with actions and rewards: the agent learns which actions to take by receiving rewards.
In reinforcement learning, what is a policy?
A policy in reinforcement learning is used to determine the action to take in each state.
What are the RL challenges?
1. Credit assignment.
2. Exploration versus exploitation trade-off.
What is the Markov condition, or the Markov Decision Process? (This concerns the transition function in RL.)
The transition function depends only on the previous state and the action taken in that state.
What is the reward function in RL?
r_{t+1} = ρ(s_t, a_t)
It returns the reward the agent receives when entering state s_{t+1} after taking action a_t in state s_t.
What is a policy in RL?
A policy in reinforcement learning is used to determine the action to take in each state: a_t = π(s_t).
Value function and optimal value function?
They are based on the reward and the discounted reward.
The state-action value function tells us how good action a is in state s.
Value function (state-action): Q^π(s, a)
Value function (state): V^π(s) = Q^π(s, π(s))
Optimal value function: V*(s) = max_a Q*(s, a)
Optimal policy?
Optimal policy: π*(s) = argmax_a Q*(s, a).
What is model-based reinforcement learning?
We assume that the agent has a model of the environment and its behaviour by knowing the reward function ρ(s, a) and the transition function τ(s, a).
What is model-free reinforcement learning?
The agent does not know the reward function or the transition function; it learns value estimates (and thus a policy) directly from sampled experience, as in SARSA and Q-learning.
SARSA?
An on-policy temporal-difference algorithm; its update uses the action actually taken in the next state:
Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]
Q-learning?
An off-policy temporal-difference algorithm; its update uses the best action in the next state, regardless of which action is actually taken:
Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t)]
Explain the difference between the SARSA and Q-Learning algorithm.
SARSA is an on-policy approach to RL.
In the term γ Q(s_{t+1}, a_{t+1}) we see that it uses the action selected by the current policy to generate the update, so the learned values depend on the policy being followed.
Hence the name: State-Action-Reward-State-Action.
Q-learning is an off-policy approach to RL.
The term γ max_{a'} Q(s_{t+1}, a') is the part that differs from SARSA.
Here we do not restrict how the next action is selected, which means the values learned in Q-learning do not depend on the policy being followed; see the sketch below.
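Both updates side by side on a toy Q-table (all states, actions, and numbers are made up; alpha is the learning rate):

```python
import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9

s, a = 0, 1    # current state and action
r = 1.0        # reward received
s_next = 2     # next state
a_next = 0     # action actually chosen in s_next by the behaviour policy

# SARSA (on-policy): bootstrap from the action actually taken next.
Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

# Q-learning (off-policy): bootstrap from the best action in the next state,
# no matter which action the behaviour policy will actually take.
Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
```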
Epsilon-greedy policy?
With probability ε a random action is chosen (exploration); otherwise the action with the highest estimated value is chosen (exploitation).
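A short sketch of the selection rule (the Q-values are made-up numbers):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: random action
    return int(np.argmax(q_values))              # exploit: best-known action

rng = np.random.default_rng(0)
a = epsilon_greedy(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng)
```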
What is the difference between on-policy and off-policy?
An on-policy method (e.g. SARSA) evaluates and improves the same policy it uses to select actions; an off-policy method (e.g. Q-learning) learns about one policy (the greedy one) while following another (e.g. an epsilon-greedy exploration policy).
Basic Bellman equation?
The value of a state-action pair equals the immediate reward plus the discounted value of what follows: Q^π(s, a) = ρ(s, a) + γ Q^π(s', π(s')), where s' = τ(s, a); for the optimal values, Q*(s, a) = ρ(s, a) + γ max_{a'} Q*(s', a').
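A value-iteration sketch that repeatedly applies the Bellman optimality backup on a tiny made-up deterministic MDP:

```python
import numpy as np

n_states, n_actions = 3, 2
rho = np.array([[0., 1.],
                [0., 0.],
                [5., 0.]])      # reward rho(s, a)
tau = np.array([[1, 2],
                [0, 2],
                [2, 2]])        # next state tau(s, a)
gamma = 0.9

Q = np.zeros((n_states, n_actions))
for _ in range(100):            # iterate the backup toward its fixed point
    for s in range(n_states):
        for a in range(n_actions):
            Q[s, a] = rho[s, a] + gamma * Q[tau[s, a]].max()
# Q now approximates Q*, and argmax over each row gives the optimal policy.
```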
What can we learn about SARSA and Q-learning?
SARSA will avoid mistakes due to exploration (it learns the values of the policy it actually follows), while Q-learning still has the ability to learn with a different exploring policy.
What is the reward function in RL? What is the transition function?
The reward function ρ(s, a) returns the reward for taking action a in state s; the transition function τ(s, a) returns the resulting next state.
What is a non-Markovian condition?
A non-Markovian condition would be a case in which the next state depends on a series of previous states and actions, not just the most recent ones.
Temporal difference?
Temporal-difference (TD) learning updates a value estimate toward a target built from the observed reward plus the discounted estimate for the next state: V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]. The bracketed term is the TD error; SARSA and Q-learning are TD methods.
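One TD(0) update on made-up numbers:

```python
import numpy as np

V = np.zeros(4)            # value estimates for a 4-state toy problem
alpha, gamma = 0.1, 0.9

s, r, s_next = 0, 1.0, 2   # one observed transition and its reward
td_error = r + gamma * V[s_next] - V[s]
V[s] += alpha * td_error   # move V(s) toward the bootstrapped target
```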