week 6 - intro to deep learning Flashcards
what are differences between a brain and neural networks?
The brain is much more computationally efficient, consumes much less energy
brain neurons have a greater level of non-linearity than neurons in MLPs. This means they are better able to represent higher dimensions of the data
Brains are typically more modular (meaning they contain specialist regions for distinct functions) and less simplified than neural networks
what does increasing the number of parameters in neural nets do?
the emergent capabilities of the model increase as the number of parameters increases
As parameters increase, the model is able to solve new types of tasks
what is the perceptron?
the most basic unit of neural networks
a signal comes in as a weighted sum of the inputs; if it passes a certain threshold the output is 1, and if it does not pass the threshold the output is 0
the threshold can be represented by a step function
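A minimal sketch of a perceptron, to make the threshold idea concrete. The names `step` and `perceptron` and the AND weights are made up for illustration, and "passing" the threshold is taken here to mean strictly greater than it.

```python
def step(z, threshold=0.0):
    """Step function: 1 if z passes the threshold, otherwise 0."""
    return 1 if z > threshold else 0

def perceptron(inputs, weights, bias):
    # Weighted sum of the inputs plus a bias, passed through the step function.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return step(z)

# Hypothetical weights that make the perceptron behave like a logical AND.
and_weights, and_bias = [1.0, 1.0], -1.5
and_outputs = [perceptron([a, b], and_weights, and_bias)
               for a in (0, 1) for b in (0, 1)]
print(and_outputs)  # inputs are visited in the order (0,0), (0,1), (1,0), (1,1)
```

Only the input (1, 1) gives a weighted sum above the threshold, so only that case outputs 1.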
what is the XOR problem?
say you have two inputs that could either be 1 or 0
using a linear model, you cannot create a function that outputs TRUE only if the two inputs are different
How does the MLP solve the XOR problem?
if you aggregate the output of two neurons, and put it as input into other neurons in another layer, you can solve the XOR problem
These extra neurons in the hidden layer apply non-linear transformations that transform the space, so that a straight line can pass between the points
The MLP introduces a hidden layer with neurons that apply non-linear transformations. This layer maps the original input into a new (often higher-dimensional) space where the problem becomes linearly separable.
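One way to see this is a tiny network with hand-picked weights, not learned ones. In this sketch (all weights and names are assumptions chosen for illustration) one hidden neuron acts like OR, the other like AND, and the output neuron fires when OR is on but AND is off.

```python
def step(z):
    # Step activation: 1 if the weighted sum is above 0, otherwise 0.
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    # Hidden layer: two neurons looking at the same two inputs.
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    # Output neuron: a linear threshold on the hidden outputs.
    return step(h_or - h_and - 0.5)

xor_outputs = [xor_mlp(a, b) for a in (0, 1) for b in (0, 1)]
print(xor_outputs)  # order of inputs: (0,0), (0,1), (1,0), (1,1)
```

In the hidden space the four points become (0,0), (1,0), (1,0) and (1,1), which a single straight line can separate.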
How is the MLP a universal function approximator?
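Any continuous function can be approximated as closely as we like (over a bounded range of inputs) by an MLP with a single, wide enough hidden layer and a non-linear activation function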
how does a model adjust the weights by itself
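It calculates the loss, which measures how far the predictions are from the true values
It minimises the loss using gradient descent: each weight is moved a small step in the direction that reduces the loss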
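A minimal sketch of that loop for a model with a single weight, assuming made-up data generated by y = 2x, a mean squared error loss, and a hypothetical learning rate of 0.01.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # made-up data following y = 2x

def loss(w):
    # Mean squared error of the predictions w * x.
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

w = 0.0               # start from an arbitrary weight
learning_rate = 0.01  # assumed step size
initial_loss = loss(w)

for _ in range(200):
    # Derivative of the loss with respect to w: mean of 2 * (w*x - y) * x.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Step against the gradient, i.e. downhill on the loss.
    w -= learning_rate * grad

final_loss = loss(w)
print(w, initial_loss, final_loss)
```

The weight moves towards 2, the value that generated the data, and the loss shrinks towards 0.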
what is the loss function for regression tasks and classification tasks?
regression tasks: L1 = mean absolute error, L2 = mean squared error
classification tasks: Cross entropy (the difference between the true and the predicted probability distributions)
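The three losses can be sketched in a few lines of plain Python. The function names and example numbers are assumptions for illustration; the small `eps` is only there to avoid taking log(0).

```python
import math

def mae(y_true, y_pred):
    # L1 loss: mean of the absolute errors.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    # L2 loss: mean of the squared errors.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(true_dist, pred_dist, eps=1e-12):
    # -sum(true probability * log(predicted probability)) over the classes.
    return -sum(t * math.log(p + eps) for t, p in zip(true_dist, pred_dist))

# Regression example: only the last prediction is off, by 2.
print(mae([1, 2, 3], [1, 2, 5]), mse([1, 2, 3], [1, 2, 5]))

# Classification example: the true class is the second one.
good = cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])
bad = cross_entropy([0, 1, 0], [0.7, 0.2, 0.1])
print(good, bad)
```

The prediction that puts more probability on the true class gets the lower cross entropy.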
what are pros and cons of L1 and L2 loss
L1 - penalizes all errors linearly
- robust to outliers because large errors are not exaggerated
L2 - penalizes large errors more because of squaring
- Less robust to outliers as large errors dominate the loss
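A small made-up example of the outlier effect: three errors of 1 and one error of 10. The sketch compares how much of each loss comes from the single large error.

```python
errors = [1.0, 1.0, 1.0, 10.0]  # made-up errors; the last one is the outlier

l1_terms = [abs(e) for e in errors]  # each error counted linearly
l2_terms = [e ** 2 for e in errors]  # each error squared

mae_value = sum(l1_terms) / len(errors)
mse_value = sum(l2_terms) / len(errors)

# Fraction of the total loss that is due to the outlier alone.
l1_share = l1_terms[-1] / sum(l1_terms)
l2_share = l2_terms[-1] / sum(l2_terms)
print(mae_value, mse_value, l1_share, l2_share)
```

The outlier accounts for roughly 77% of the L1 loss but roughly 97% of the L2 loss, which is what "large errors dominate" means here.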
why do we use the particular loss functions as described above
because they are differentiable (L1 everywhere except where the error is exactly 0), meaning a gradient can be calculated and gradient descent can be used
what is the derivative
at every specific point, the derivative is the slope of the function at that point: how much the output changes for a very small change in the input
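The slope at a point can be estimated numerically by nudging the input a little in both directions. A sketch, assuming the example function f(x) = x^2, whose exact derivative is 2x; the helper name `numerical_slope` is made up.

```python
def f(x):
    return x ** 2

def numerical_slope(func, x, h=1e-5):
    # Central difference: change in output divided by change in input.
    return (func(x + h) - func(x - h)) / (2 * h)

slope_at_3 = numerical_slope(f, 3.0)
print(slope_at_3)  # close to the exact slope 2 * 3 = 6
```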
what is the problem with using a step-function as the activation function
if we want a model to learn by itself, the step function can’t be used: its derivative is 0 everywhere except at the threshold, where it is not defined, so gradient descent gets no information about how to change the weights
Instead we use a sigmoid or a ReLU
why do we prefer ReLU over the sigmoid function
the sigmoid squashes its output between a lower value of 0 and an upper value of 1
ReLU doesn’t have an upper value. Once the input passes 0, the output is simply equal to the input (a linear function of it)
this lets the output carry more information. Instead of being confined between 0 and 1, the output is 0 or any positive value, with no upper limit
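A sketch comparing the two activations and their slopes. The function names are assumptions; the slope of ReLU at exactly 0 is taken to be 0 here, which is a common convention rather than a mathematical fact.

```python
import math

def sigmoid(z):
    # Output always lies between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    # 0 for negative inputs, the input itself for positive inputs.
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

for z in (-3.0, 0.5, 10.0):
    print(z, sigmoid(z), sigmoid_grad(z), relu(z), relu_grad(z))
```

For a large input such as 10, the sigmoid output is stuck just below 1 and its slope is close to 0, while ReLU still outputs 10 with a slope of 1.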
what is the main essence of backpropagation
if you're trying to adjust a weight that is very far from your output, that means that it is further removed from the loss function
backpropagation is a way to measure and adjust the weights of specific neurons, no matter how far back in the network they are
it shows the impact of each neuron on the loss function by looking only at the steps directly before and after it (the chain rule), instead of at the whole network at once
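A sketch of the idea on the smallest possible network: one input, one sigmoid hidden neuron, one linear output, and a squared-error loss. All values and names are made up for illustration. Each backward step only multiplies the gradient coming from the step after it by its own local derivative, and the result is checked against a numerical estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(w1, w2, x, target):
    # Forward pass: input -> hidden neuron -> output -> squared error.
    h = sigmoid(w1 * x)
    y = w2 * h
    return (y - target) ** 2

# Made-up input, target and weights.
x, target = 1.5, 1.0
w1, w2 = 0.4, -0.3

# Forward pass, keeping the intermediate values.
h = sigmoid(w1 * x)
y = w2 * h

# Backward pass: walk from the loss back towards the input.
dloss_dy = 2 * (y - target)
dloss_dw2 = dloss_dy * h           # weight right next to the output
dloss_dh = dloss_dy * w2           # pass the gradient back through w2
dloss_dz = dloss_dh * h * (1 - h)  # local derivative of the sigmoid
dloss_dw1 = dloss_dz * x           # weight further away from the output

# Numerical check using small nudges to each weight.
eps = 1e-6
numeric_dw1 = (loss_fn(w1 + eps, w2, x, target)
               - loss_fn(w1 - eps, w2, x, target)) / (2 * eps)
numeric_dw2 = (loss_fn(w1, w2 + eps, x, target)
               - loss_fn(w1, w2 - eps, x, target)) / (2 * eps)
print(dloss_dw1, numeric_dw1, dloss_dw2, numeric_dw2)
```

The gradient for `w1`, the weight furthest from the loss, is obtained without ever differentiating the whole network in one go.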
why are neural networks useful for data that is not linearly separable?
The multiple layers of neural networks allow data to be separated non-linearly
Even if the separation is very complex, with enough layers the neural network can approximate it, as a neural network can act as a universal function approximator. It does this by combining many activation functions with different weights and biases.
Essentially, the weights and biases transform the activation functions, allowing them to, when combined, approximate very complex functions.
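A small sketch of how shifted and scaled activations combine into a shape that no single one could make. Three ReLUs with hand-picked (hypothetical) weights and biases add up to a "tent" that rises from 0 to 1 and falls back to 0.

```python
def relu(z):
    return max(0.0, z)

def tent(x):
    # Each term is a ReLU with its own bias (the shift) and weight (the scale).
    return 1.0 * relu(x) - 2.0 * relu(x - 1.0) + 1.0 * relu(x - 2.0)

for x in (-1.0, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0):
    print(x, tent(x))
```

Adding more such pieces, with different shifts and scales, is one way to picture how a wide hidden layer can build up an approximation of a complicated curve.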