0 - Activation Functions Flashcards
What are 4 desirable properties of activation functions?
1) Introduces *Non-Linearity*
2) Bounds output into a *Range* of values
3) Continuously Differentiable
4) Monotonic (for a single-layer model, a monotonic activation guarantees a convex error surface)
What is the Sigmoid Function, what does it look like, and when do we use it?
f(x) = 1 / (1 + e<sup>-x</sup>)
f'(x) = f(x)(1 - f(x))
The sigmoid function is used in logistic regression and in basic neural network implementations, and it is usually the first activation unit introduced. For deeper neural networks, however, sigmoid is not preferred because of several drawbacks.
Although the sigmoid function and its derivative are simple and cheap to compute, a major drawback is information loss due to the derivative having a short range: its maximum value is only 0.25, so gradients shrink as they are propagated back through many layers (this is the vanishing gradients problem).
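A minimal NumPy sketch of the sigmoid and its derivative (the function names are my own, chosen to mirror the `softmax` helper later in these cards); evaluating the derivative shows how quickly it shrinks away from zero:

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # f'(x) = f(x) * (1 - f(x)); its maximum value is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(np.array([-5.0, 0.0, 5.0])))  # ~[0.0066 0.25 0.0066]
```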
What is the RELU function, what does it look like, and when do we use it?
f(x) = max(x, 0)
A Rectified Linear Unit (ReLU) is a unit employing the rectifier: its output is 0 if the input is less than 0, and the raw input otherwise. That is, if the input is greater than 0, the output is equal to the input. The operation of ReLU is closer to the way our biological neurons work.
Most Deep Learning applications right now make use of ReLU instead of Logistic Activation functions.
ReLU is non-linear and has the following advantages:
1) Biological plausibility: One-sided, compared to the antisymmetry of tanh.
2) Sparse activation: For example, in a randomly initialized network, only about 50% of hidden units are activated (having a non-zero output)
3) Better gradient propagation: Fewer vanishing gradient problems compared to sigmoidal activation functions that saturate in both directions.
4) Efficient computation: Only comparison, addition and multiplication
5) Scale-invariant: max(0, ax) = a · max(0, x) for a ≥ 0
ReLU has the problems of not being zero-centered and of being non-differentiable at zero (it is differentiable everywhere else).
Another problem with ReLU is the dying ReLU problem, where some ReLU neurons essentially die for all inputs and remain inactive no matter what input is supplied; no gradient flows through them, and if a large number of dead neurons are present in a neural network, its performance is affected. This can be corrected by using Leaky ReLU, where the slope to the left of x = 0 is changed to a small non-zero value, causing a "leak" and extending the range of ReLU (see the sketch below).
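Below is a minimal NumPy sketch of ReLU and the Leaky ReLU fix just described; the function names and the leak slope `alpha=0.01` are illustrative choices, not values fixed by the card:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): zero for negative inputs, the input itself otherwise
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Same as ReLU for x > 0, but keeps a small slope alpha for x < 0,
    # so the unit still passes a gradient and never goes completely "dead"
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(leaky_relu(x))  # [-0.02  -0.005  0.     0.5    2.   ]
```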
What is the Softmax function, what does it look like, and when do we use it?
S(y<sub>i</sub>) = e<sup>y<sub>i</sub></sup> / Σ<sub>j</sub> e<sup>y<sub>j</sub></sup>
```python
import numpy as np

def softmax(x):
    # Exponentiate, then normalize so the outputs sum to 1 along axis 0
    return np.exp(x) / np.sum(np.exp(x), axis=0)
```
Softmax is a very interesting activation function because it not only maps each output into the [0, 1] range but also normalizes the outputs so that their total sum is 1. The output of softmax is therefore a probability distribution.
The softmax function is often used in the final layer of a neural network-based classifier. Such networks are commonly trained under a log loss (or cross-entropy) regime, giving a non-linear variant of multinomial logistic regression.
In conclusion, softmax is used for multi-class classification in a logistic regression model, whereas sigmoid is used for binary classification in a logistic regression model.
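As a quick usage check of the `softmax` defined above, the sketch below confirms that the outputs form a probability distribution (subtracting the maximum first is a common numerical-stability trick and an assumption of mine, not part of the card; it does not change the result):

```python
import numpy as np

def softmax_stable(x):
    # Subtracting the max before exponentiating avoids overflow for large
    # inputs and leaves the normalized result unchanged
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax_stable(logits)
print(probs)        # ~[0.659 0.242 0.099]
print(probs.sum())  # 1.0 -- a valid probability distribution
```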
What is the tanh(x) function, what does it look like, and when do we use it?
tanh(x) = (2 / (1 + e<sup>-2x</sup>)) - 1
- It is nonlinear in nature, so we can stack layers
- It is bound to the range (-1, 1)
- The gradient is stronger for tanh than for sigmoid (derivatives are steeper)
- Like sigmoid, tanh also has a vanishing gradient problem.
In practice, optimization is easier with tanh, so it is generally preferred over the sigmoid function. It is also common to use the tanh function in state-to-state transition models (recurrent neural networks).
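To illustrate the "stronger gradient" point above, here is a small NumPy comparison of the two derivatives (the helper names are my own); near x = 0 the tanh derivative reaches 1.0 while the sigmoid derivative peaks at 0.25:

```python
import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; maximum value 1.0 at x = 0
    return 1.0 - np.tanh(x) ** 2

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); maximum value 0.25
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

x = np.array([-2.0, 0.0, 2.0])
print(tanh_grad(x))     # ~[0.071 1.    0.071]
print(sigmoid_grad(x))  # ~[0.105 0.25  0.105]
```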