Class 5 6 7 8 Flashcards
Explain briefly the input, output, and hidden units of artificial neural networks.
The input units represent the input variables.
The output units represent the output variables.
Hidden units represent the relationship between the input and output units.
Explain deep feedforward neural networks.
A deep feedforward neural network is a composition of different functions,
f3(f2(f1(x))), where f1 and f2 are hidden layers and f3 is the output layer.
Each hidden layer is vector valued.
Width is the number of units in a layer, and depth is the number of layers.
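As a minimal sketch of this composition (the layer sizes and the tanh non-linearity below are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 2 inputs -> 4 hidden -> 4 hidden -> 1 output
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)   # f1: first hidden layer
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)   # f2: second hidden layer
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)   # f3: output layer

def f1(x): return np.tanh(W1 @ x + b1)          # vector-valued hidden layer, width 4
def f2(h): return np.tanh(W2 @ h + b2)          # vector-valued hidden layer, width 4
def f3(h): return W3 @ h + b3                   # linear output unit

x = np.array([0.5, -1.0])
y_hat = f3(f2(f1(x)))                           # depth 3: f3(f2(f1(x)))
print(y_hat)
```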
Why deep? Why non-linear?
We need a deeper representation because, for example, the XOR problem needs a more complex structure than a single perceptron to classify the points correctly. That is why we need hidden representations, i.e. deeper networks.
We need non-linearity because if we have one hidden layer with a linear function, or several hidden layers that are all linear, it makes no difference: being linear, they collapse into a single layer, since their composition can be written as one matrix multiplication. So to solve XOR we need not only a deeper network but also non-linearity.
A single perceptron cannot learn every boolean function (e.g. XOR), but a network of perceptrons can, by composing AND, OR and NOT; see the sketch below.
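A minimal sketch of the XOR argument: one hidden layer with a ReLU non-linearity separates the XOR points, while any purely linear network cannot. The specific weights are a standard hand-constructed textbook solution, assumed here for illustration:

```python
import numpy as np

# Hand-constructed XOR network: one ReLU hidden layer of width 2, then a linear output
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])

def xor_net(x):
    h = np.maximum(0.0, W @ x + c)   # non-linear hidden layer (ReLU)
    return w @ h                     # linear output

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))  # prints 0, 1, 1, 0
```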
Why does non-linearity make the network harder to train?
Non-linearity brings non-convexity, so we may not be able to reach a global minimum, and the result is very sensitive to the starting point (initialization).
We can model a network as a distribution P(y|x,model) and use maximum likelihood to train the network.
The loss function is the MSE:
MSE = 1/(2 N_tr) Σ_i (y_i - f(x_i))^2
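A minimal sketch of this loss over a small training set (the toy target and prediction values are illustrative):

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.5])     # targets y_i (toy values)
y_pred = np.array([0.8, 0.1, 1.2])     # network outputs f(x_i) (toy values)

N_tr = len(y_true)
mse = (1.0 / (2 * N_tr)) * np.sum((y_true - y_pred) ** 2)
print(mse)
```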
Linear, Bernoulli, multinoulli, softmax, sigmoid, Gaussian?
Linear units go with a Gaussian output distribution and the MSE loss; maximizing the likelihood corresponds to minimizing the MSE.
Sigmoid units are used for the Bernoulli distribution, which has two possible outcomes, so they are used for binary classification problems.
The sigmoid outputs values in [0, 1]; near 0 and 1 its gradient is close to 0, so it saturates.
sigmoid(net) = 1 / (1 + e^-net)
Softmax is used with the multinoulli distribution, for problems with k different classes.
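A minimal sketch of the two output units in plain NumPy (the shift by max(z) inside softmax is a standard numerical-stability trick, assumed here, not something stated in the notes):

```python
import numpy as np

def sigmoid(net):
    # Bernoulli output: squashes a single score into a probability in [0, 1]
    return 1.0 / (1.0 + np.exp(-net))

def softmax(z):
    # Multinoulli output: k scores -> k probabilities that sum to 1
    z = z - np.max(z)                  # numerical stability (assumed convention)
    e = np.exp(z)
    return e / np.sum(e)

print(sigmoid(0.0))                    # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))
```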
What is negative log likelihood? Explain it in detail.
Negative log likelihood is a loss function used in classification tasks: -log P(y | x).
It prevents gradient saturation: the log undoes the exponential in the output unit, so with a sigmoid output the loss only saturates when y = 1 and z is very positive, or y = 0 and z is very negative, i.e. when the prediction is already correct. It can be used with both sigmoid and softmax, but better results are obtained with softmax, because the softmax function is:
softmax(z_i) = exp(z_i) / Σ_j exp(z_j)
Combining softmax with the negative log likelihood we obtain the log-softmax term
log softmax(z_i) = z_i - log Σ_j exp(z_j)
Because the softmax sums the exponentials inside the log, the log Σ_j exp(z_j) term roughly tracks the largest z_j, so the loss does not saturate.
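A minimal sketch of the softmax + negative log likelihood combination (the max-subtraction is a standard log-sum-exp stability trick, assumed here):

```python
import numpy as np

def nll_softmax(z, target):
    # Negative log likelihood of the target class under softmax(z):
    # -log softmax(z)[target] = -(z[target] - log sum_j exp(z[j]))
    z = z - np.max(z)                       # log-sum-exp stability (assumed convention)
    log_sum = np.log(np.sum(np.exp(z)))
    return -(z[target] - log_sum)

z = np.array([2.0, -1.0, 0.5])              # toy logits
print(nll_softmax(z, target=0))             # small loss: class 0 has the largest logit
print(nll_softmax(z, target=1))             # large loss: class 1 has a small logit
```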
ReLU vs leaky ReLU
ReLU is an activation function generally used for the hidden units. ReLU does not saturate for positive inputs.
ReLU(x) = max(0, x)
However, for negative inputs the gradient of ReLU is exactly zero, so those units can die and stop learning.
To overcome this issue, leaky ReLU is used:
leaky_ReLU(x) = max(0, x) + alpha * min(0, x)
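A minimal sketch of the two activations (alpha = 0.01 is a common default, assumed here):

```python
import numpy as np

def relu(x):
    # Zero gradient for x < 0: units there can "die"
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for x < 0 keeps a non-zero gradient everywhere
    return np.maximum(0.0, x) + alpha * np.minimum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))         # [0.   0.   0.   1.5]
print(leaky_relu(x))   # [-0.02  -0.005  0.     1.5  ]
```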
What is universal approximation? Briefly explain it, then explain why it is not enough on its own.
The universal approximation theorem says that a feedforward network with one hidden layer is enough to REPRESENT an approximation of any function to an arbitrary degree of accuracy.
But representing is not enough: we also need DEPTH together with width, because with a single hidden layer we may need an exponential number of units, or the network may fail to generalize and overfit. Therefore having depth together with width is necessary.
What is the advantage of having depth in a NN?
We want to learn a function which is a composition of other functions.
Hidden layers fold the space to create mirror effects.
Greater depth leads to better generalization.
What is forward and back propagation? Is learning done in backpropagation?
Forward propagation: the input is propagated through the network, producing an output and a cost.
Backward propagation: information from the cost flows backwards through the network to compute the derivative of the cost with respect to each parameter.
Backpropagation is just the chain rule of calculus; it only computes the gradients, not the parameter updates, so the learning itself happens in the optimizer.
I train my neural network using stochastic gradient descent and compute the gradients with backpropagation.
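A minimal sketch of one SGD step with manual backpropagation on a one-hidden-layer network (the shapes, learning rate, and tanh non-linearity are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# One tanh hidden layer and a linear output; toy sizes
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
lr = 0.1                                   # assumed learning rate

x = np.array([0.5, -1.0])
y = np.array([1.0])

# Forward propagation: input -> output -> cost
h = np.tanh(W1 @ x + b1)
y_hat = W2 @ h + b2
cost = 0.5 * np.sum((y_hat - y) ** 2)

# Backward propagation: chain rule, from the cost back to each parameter
d_yhat = y_hat - y                         # dC/dy_hat
dW2 = np.outer(d_yhat, h)
db2 = d_yhat
d_h = W2.T @ d_yhat                        # dC/dh
d_pre = d_h * (1.0 - h ** 2)               # tanh'(a) = 1 - tanh(a)^2
dW1 = np.outer(d_pre, x)
db1 = d_pre

# SGD update: this is where the learning actually happens
W1 -= lr * dW1
b1 -= lr * db1
W2 -= lr * dW2
b2 -= lr * db2
print(cost)
```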