Neural Network Basic Ingredients Flashcards
Activation functions are typically what type of functions?
Nonlinear functions
List three example activation functions
Sigmoid, ReLU, tanh
The softmax function is a generalization of what?
The logistic function, but to multiple dimensions
The softmax function is also known as what?
Softargmax or normalized exponential function
Softmax does what?
It converts a vector z of K real numbers (the entries may be positive, negative, or a mix of the two) into a vector s of K probabilities (i.e. a probability distribution), where the mapping between z and s is order preserving and all the elements of s sum to one.
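A minimal NumPy sketch of this card (the function name and the max-subtraction stability trick are illustrative, not part of the card):

```python
import numpy as np

def softmax(z):
    """Map a vector of K reals to K probabilities that sum to 1."""
    # Subtracting the max is a common numerical-stability trick; it doesn't
    # change the result because softmax is invariant to shifting all inputs.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])  # mixed signs are fine
s = softmax(z)
print(s, s.sum())               # order-preserving probabilities summing to 1.0
```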
What is forward propagation?
The process of a neural network generating an output for a given input
Explain the specificity of the term back propagation
Strictly speaking, backpropagation is only the process of calculating the gradient of the loss function with respect to the weights, iteratively using the chain rule; it doesn't refer to how the gradient is used. Loosely speaking, however, the term is often used for the entire learning algorithm, including how the gradient is used (such as by stochastic gradient descent).
Stochastic gradient descent
Gradient descent in which each iteration computes the gradient from a single, randomly chosen training example (i.e. a batch size of 1)
Rprop
Resilient backpropagation
Batch
In gradient descent, a batch is the set of training examples (i.e. training samples) used to compute the gradient in a single iteration.
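A minimal sketch of one gradient step computed from a batch, assuming a simple linear model with mean-squared-error loss (the model, names, and sizes here are illustrative):

```python
import numpy as np

def batch_gradient(X, y, w):
    """Gradient of the mean squared error of y_hat = X @ w over one batch."""
    residual = X @ w - y                 # one entry per training example
    return 2 * X.T @ residual / len(y)   # averaged over the batch

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), np.zeros(3)
w -= 0.1 * batch_gradient(X, y, w)       # one descent step on a batch of 8 examples
```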
Batch size for stochastic gradient descent
1
Convex problems
have only one minimum; that is, only one place where the slope of the loss function (with the weight(s) as the independent variable) is exactly 0
ln(1)
0
Cross entropy
A measure of how good a classifier is. More specifically, when classifying an item over a discrete set of classes x ∈ X, we declare p(x) to be the true probability distribution over the classes; p is usually expressed as a one-hot vector, e.g. <0 0 0 0 1.0 0 0> if the class set is of size 7. Our model predicts some distribution q(x), and the cross entropy H(p, q) = −Σ p(x) log q(x) measures how well q matches p (for a one-hot p this reduces to −log q(true class); lower is better).
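A worked example of the one-hot case in NumPy (the q values are made up for illustration):

```python
import numpy as np

# True distribution p: one-hot over 7 classes (the item is class 4, 0-indexed)
p = np.array([0, 0, 0, 0, 1.0, 0, 0])
# Model's predicted distribution q (must sum to 1)
q = np.array([0.05, 0.05, 0.1, 0.1, 0.5, 0.1, 0.1])

# H(p, q) = -sum_x p(x) * log q(x); with a one-hot p this is -log q(true class)
print(-np.sum(p * np.log(q)))  # ~0.693; as q(true class) -> 1, H -> ln(1) = 0
```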
What does the graph of a logistic function look like?
An S-shaped curve that flattens toward 0 on the left and toward 1 on the right
What are the domain and range of the sigmoid function?
Domain: -infinity to infinity
Range: 0 to 1
Why is softmax (softargmax) considered the generalization of the logistic function to multiple dimensions?
The logistic function takes a scalar of any real value and converts it into a scalar in the range (0, 1). Softmax takes a d-dimensional vector of any real values and converts it into another d-dimensional vector whose values are all in the range [0, 1] (and sum to 1)
When is the softmax function used?
It is used as the activation function in the output layer of neural networks performing multi-class classification
What is the formula for the Jacobian of the softmax function?
∂s_i/∂z_j = s_i (δ_ij − s_j), where δ is the Kronecker delta; in matrix form, J = diag(s) − s sᵀ
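A small sketch that builds this Jacobian and checks one of its properties (function names are mine):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """J[i, j] = ds_i/dz_j = s_i * (delta_ij - s_j), i.e. diag(s) - s s^T."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

J = softmax_jacobian(np.array([2.0, -1.0, 0.5]))
print(J.sum(axis=0))  # each column sums to ~0, since the outputs always sum to 1
```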
Categorical cross entropy loss is also called what?
Softmax loss
Classification problems can be subdivided into what two categories?
Multi-class classification and multi-label classification
Explain the difference between multi-class and multi-label classification
multi-class classification - each sample belongs to only one class (mutually exclusive)
multi-label classification - each sample may belong to multiple classes (or to no class)
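One way to see the difference at the output layer, sketched in NumPy (the logits are made up):

```python
import numpy as np

z = np.array([2.0, -1.0, 0.5])  # logits for 3 classes

# Multi-class: softmax couples the classes; the probabilities sum to 1,
# so exactly one class is picked.
e = np.exp(z - z.max())
print(e / e.sum())

# Multi-label: an independent sigmoid per class; each probability stands
# alone, so a sample can get several labels (or none).
print(1 / (1 + np.exp(-z)))
```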
Kronecker delta function
δ_ij = 1 if i = j, and 0 otherwise
List four benefits of ReLU neurons
1) it helps prevent vanishing gradients
2) it's idempotent
3) it creates sparse activations
4) it’s faster to compute in both forward and backward passes
What is a downside of using ReLUs?
Dying ReLUs
What would three consecutive sigmoid activations look like?
A nearly flat curve: each sigmoid squashes its input into (0, 1), so sigmoid(sigmoid(sigmoid(x))) lands in a very narrow band (roughly 0.62 to 0.68 for every x), which illustrates the vanishing-gradient problem
what would ReLU(ReLU(ReLU(2))) yield?
2
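A quick check of this card (and of the idempotence claim above) in NumPy:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

print(relu(relu(relu(2))))   # 2  -- ReLU is idempotent: relu(relu(x)) == relu(x)
print(relu(relu(relu(-3))))  # 0  -- a negative input is clipped once and stays 0
```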
Principal difference between sigmoid and tanh?
Sigmoid range is (0, 1). Tanh range is (-1, 1)
Sigmoid formula
1/(1+e^(-x))
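A minimal sketch confirming the formula and the domain/range cards above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Any real input is accepted; every output lands strictly between 0 and 1
print(sigmoid(np.array([-10.0, 0.0, 10.0])))  # [~0.0000454, 0.5, ~0.9999546]
```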
LU in ReLU stands for what?
Linear Unit
FC layer stands for what?
Fully connected layer
A unit with an identity activation function is equivalent to what?
Multiple linear regression
A unit with a sigmoid activation function is equivalent to what?
Logistic regression
ReLU stands for what?
Rectified linear unit
Activation functions serve what two primary purposes?
1) help a model account for interaction effects
2) help a model account for non-linear effects
What is an interactive (interaction) effect?
It is when one variable A affects a prediction differently depending on the value of another variable B. For example, if my model wanted to know whether a certain body weight indicated an increased risk of diabetes, it would also have to know the individual's height.
Give an example of how the output of a node is calculated
You might have inputs A and B, a bias b, weights wA = 2 and wB = 3, and ReLU as the activation function F. The calculation would then be ReLU(2A + 3B + b)
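The same calculation sketched in NumPy (the input and bias values are made up):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

A, B, b = 1.0, -2.0, 0.5          # two inputs and a bias (illustrative values)
wA, wB = 2.0, 3.0                 # the weights from the card
print(relu(wA * A + wB * B + b))  # ReLU(2 - 6 + 0.5) = ReLU(-3.5) = 0.0
```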
Tanh range
-1 to 1
A Sigmoid function is especially useful when we have to predict what?
The probability of something
Why is the sigmoid function useful for predicting probability?
Because the probability of something must be between 0 and 1, which is the same range as the sigmoid's output
What layer is sigmoid mostly used in?
The output layer
Why is sigmoid mainly used in the output layer?
Because when used in other layers it can cause the network to get stuck during training (its gradients saturate)
Explain the dying ReLu phenomenon
When the weights lead a neuron's pre-activation to be negative for essentially every input, the neuron outputs 0 (and its gradient is 0), so it stops learning and no longer affects downstream neurons; such a neuron can be called “dead”
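A sketch of why a dead neuron stays dead (relu_grad here is the rectifier's derivative, taken as 0 at x <= 0):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def relu_grad(x):
    return (x > 0).astype(float)

# Hypothetical neuron whose pre-activation is negative for every input it sees
pre = np.array([-3.2, -0.7, -5.1])
print(relu(pre))       # [0. 0. 0.] -- contributes nothing downstream
print(relu_grad(pre))  # [0. 0. 0.] -- no gradient flows back, so it never recovers
```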
Give two examples of saturating non linearities
tanh(x) and sigmoid(x)
ReLU formula
ReLU(x) = max(x, 0)
Rectified networks
Networks that use the rectifier function (ReLU) for hidden layers
Why is it useful that part of ReLU is linear?
Because ReLU preserves many of the useful properties of linear models
List two of the useful properties of linear activation functions
1) They’re easy to optimize with gradient based methods 2) they generalize well
Is ReLu considered linear or nonlinear?
Nonlinear
List two common learning rate schedules
1) constant learning rate
2) a monotonically decaying rate after an optional warmup (see the sketch below)
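A sketch of the two schedules; the warmup length, base rate, and 1/sqrt(step) decay form here are illustrative choices, not the only options:

```python
def constant_lr(step, base_lr=0.1):
    return base_lr

def warmup_then_decay_lr(step, base_lr=0.1, warmup_steps=100):
    if step < warmup_steps:                      # linear warmup from ~0 to base_lr
        return base_lr * (step + 1) / warmup_steps
    return base_lr * (warmup_steps / (step + 1)) ** 0.5  # then monotonic decay

for step in (0, 50, 100, 1000):
    print(step, constant_lr(step), warmup_then_decay_lr(step))
```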
MLP stands for?
Multilayer perceptron
What is an MLP?
A sequence of layers of neurons
What is an intuitive understanding of what the bias is?
The bias of a neuron controls its trigger-happiness: the higher the bias, the more likely the neuron is to fire (with a ReLU activation function); more generally, it just adds a scalar offset to the neuron's pre-activation
In rectified networks does every hidden layer use ReLUs?
No, not necessarily
Are activation functions part of a neuron, or are they applied after a neuron, i.e. is there an entire activation-function layer applied to the outputs of a layer?
Logically, activation functions are part of a neuron: they convert the neuron's pre-activation into its output.
Some frameworks or notations, however, apply the activation function as a separate layer, element-wise over a whole layer's outputs (see the sketch below).
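A minimal sketch of the two equivalent views (the weights and layer sizes are made up):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 3)), np.zeros(4)  # one FC layer: 3 inputs -> 4 neurons
x = rng.normal(size=3)

# View 1: the activation is part of each neuron's computation
out_fused = relu(W @ x + b)

# View 2: a separate element-wise "activation layer" after the FC layer
pre = W @ x + b
out_separate = relu(pre)

print(np.allclose(out_fused, out_separate))  # True -- the two views agree
```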