08_neural networks Flashcards
How do kNN, linear models and tree-based models really learn?
not iteratively
kNN: computes distances and compares the distribution of unseen data points with the distribution of seen data points
linear models: fitted to the seen data based on the task
tree-based models: identify and memorize patterns relevant to the task
With which three components does the human brain work?
Neurons (nerve cells)
Dendrites (connect neurons)
Axons (long-distance connections)
–> neurons are inter-connected forming a dense network
How is information passed through neurons in the human brain?
through electrical signals
connected neurons absorb the incoming signals and process them. some of them will fire, but not all.
–> cascade of signals
What do we need for neural networks to represent the deep cascade of the layers of neurons in a human brain?
input data, which is processed in the network's hidden layers to generate output data
What is a fully connected network?
a neural network where each neuron is connected to all neurons in the previous layer and all neurons in the following layer
How can a fully connected neural network be characterized?
- number of layers (depth)
- number of neurons in each layer
- number of input variables (= number of neurons in the first layer)
- number of output variables (= number of neurons in the final layer)
How does a fully connected neural network work?
- vector-valued input data is provided to the network, one value per neuron in the input layer
- all inputs are seen by each neuron in the next layer
- each neuron processes the incoming information, firing (1) under some conditions and staying silent (0) otherwise
- repeat
- output is generated in the final layer
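As an illustration, a minimal NumPy sketch of one forward pass through a small fully connected network; the layer sizes, random weights and the step activation are illustrative assumptions, not part of the original card:

```python
import numpy as np

def step(z):
    # fire (1) if the weighted input is positive, otherwise stay silent (0)
    return (z > 0).astype(float)

rng = np.random.default_rng(0)

# illustrative sizes: 3 inputs -> 4 hidden neurons -> 2 outputs
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

x = np.array([0.5, -1.0, 2.0])   # one value per neuron in the input layer
h = step(W1 @ x + b1)            # every input is seen by each hidden neuron
y = step(W2 @ h + b2)            # repeat: output generated in the final layer
print(y)
```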
How does a neural network act in general terms?
acts as a function approximator
- any mathematical function can be approximated
Can we implement artificial neural networks to learn specific tasks?
yes, through connectionism: everything is connected with everything
What are two problems we have to solve before we can implement artificial neural networks?
1) how to implement neurons?
2) how to train the network?
How does a general neuron work?
the number of inputs might differ from the number of outputs - so what function does the neuron compute?
takes in a vector of values, processes them and returns a binary signal based on its learned behavior, which is then passed on to all neurons in the following layer
What is part of the function of a perceptron?
input vector x
weight vector w
bias value b
–> the perceptron computes w · x + b; if the resulting value is greater than zero, the neuron fires, otherwise not
the step function is called the activation function: it introduces non-linearity into the output of the perceptron
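A minimal sketch of a single perceptron in Python, assuming NumPy; the AND-gate weights are hand-picked for illustration:

```python
import numpy as np

def perceptron(x, w, b):
    # weighted sum of the inputs plus bias; fire only if it exceeds zero
    return 1 if np.dot(w, x) + b > 0 else 0

# hypothetical weights implementing a logical AND of two binary inputs
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))
```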
What can a single perceptron be considered as?
a linear classifier
How do we train a perceptron?
the perceptron learning rule: weights are adjusted by a step size called the LEARNING RATE
by iteratively running this rule over the training data multiple times, the weights can be learned so that the model performs properly
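A short sketch of the perceptron learning rule, assuming NumPy and a toy linearly separable task (logical AND):

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    # classic perceptron learning rule; lr is the learning rate (step size)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):                 # run over the data multiple times
        for xi, yi in zip(X, y):
            pred = 1 if np.dot(w, xi) + b > 0 else 0
            w += lr * (yi - pred) * xi      # adjust weights by lr * error
            b += lr * (yi - pred)
    return w, b

# toy linearly separable data (logical AND), chosen for illustration
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print(w, b)  # the learned weights separate the two classes
```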
What is a major limitation of individual perceptrons?
inability to reproduce the logical exclusive-or (XOR) function!
- because single perceptrons are simply linear classifiers
multi-layer perceptrons concatenate layers of perceptrons, which makes them much more powerful
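To illustrate why the extra layer helps, a sketch of a two-layer perceptron that computes XOR; the weights are hand-chosen for illustration, not learned: one hidden unit computes OR, the other AND, and the output fires for OR-but-not-AND, which is exactly XOR:

```python
import numpy as np

def step(z):
    return (z > 0).astype(int)

for x in np.array([[0, 0], [0, 1], [1, 0], [1, 1]]):
    h = step(np.array([x[0] + x[1] - 0.5,      # hidden unit 1: OR
                       x[0] + x[1] - 1.5]))    # hidden unit 2: AND
    out = step(np.array([h[0] - h[1] - 0.5]))  # fires for OR and not AND
    print(x, int(out[0]))                      # prints the XOR truth table
```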
What does MLP stand for?
multi-layer perceptron
What are MLPs?
simple feed-forward neural networks (information traverses the graph in only one direction)
- fully-connected
- can learn more complex relations from data than single perceptrons, each layer adds NON-LINEARITIES that increase the model’s capacity
- modern MLPs utilize additional layers and other non-linear activation functions that support the learning process
What is the function behind a neuron?
the neuron fires if w · x + b > 0
What do artificial neurons compute?
dot-product between input vectors and learned weights
and produce an output signal that propagates through all deep layers
What is a perceptron?
simple artificial neuron that produces a binary output
What is a multi-layer perceptron?
an early fully-connected neural network
What does an activation function do?
defines when a neuron “fires”
non-linearity increases the model's capacity
What is a simple step function?
g(x) = 1 if x > 0, else 0
to define whether a neuron fires or not
What are advantages and disadvantages of the step function?
+ simple to implement
+ computationally inexpensive
- only binary (discrete) output
- no gradient
What is the sigmoid function?
σ(x) = exp(x) / (1 + exp(x))
What are advantages and disadvantages of the sigmoid function?
+ continuous non-linear function
+ gradient defined
- asymmetric output value range [0, 1]
- computationally expensive
What is the tanh function?
tanh(x) = sinh(x) / cosh(x)
What are advantages and disadvantages for the tanh function?
+ continuous non-linear function
+ gradient defined
+ symmetric output value range [-1, 1]
- computationally expensive
What is the ReLu function?
rectified linear unit function
ReLU(x) = x if x > 0, else 0 (i.e., max(0, x))
What are advantages of the ReLU function?
+ continuous non-linear function
+ gradient defined, and simple to compute
+ computationally inexpensive
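For comparison, minimal NumPy implementations of the four activation functions discussed above:

```python
import numpy as np

def step(x):
    return np.where(x > 0, 1.0, 0.0)  # binary output, no useful gradient

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # smooth, output in (0, 1)

def tanh(x):
    return np.tanh(x)                 # smooth, symmetric output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)         # cheap to compute, gradient is 0 or 1

x = np.linspace(-3, 3, 7)
for f in (step, sigmoid, tanh, relu):
    print(f.__name__, np.round(f(x), 2))
```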
Why is it important for the activation function to be differentiable?
we need the gradient of the loss with respect to the weights to be computable
therefore, the step function is not a good choice: its gradient is zero everywhere except at the jump, where it is undefined
Why is the ReLU used most often?
Sigmoid, Tanh and ReLU roughly lead to similar results, but the ReLU is computationally the most efficient
What should a good activation function be?
continuously differentiable
non-linear
computationally inexpensive
What enables deep neural networks to learn complex tasks?
the non-linearity of activation functions
What is the Least squares fitting in linear regression?
a convex optimization problem:
there is only one solution to the problem, and it is by definition the best solution
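A short NumPy sketch of this: because the problem is convex, the unique solution can be found in closed form, with no iterative weight updates (the toy data below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                # toy design matrix
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)  # noisy linear targets

# convexity means one global optimum; it can be computed directly
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 2))  # close to true_w
```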
How do we modify the neural networks weights to reduce the loss?
- random changes (possible but not very goal-oriented)
- backpropagation (we check for every single weight how changing it would affect the loss)
How can we modify each individual weight parameter?
based on computed gradients
w_i = w_i - α · ∂L/∂w_i (the gradient of the loss with respect to weight w_i)
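As a worked example, the update rule applied once to toy weight and gradient values:

```python
import numpy as np

alpha = 0.1                    # learning rate
w = np.array([0.8, -0.3])      # current weights (toy values)
grad = np.array([0.5, -0.2])   # dL/dw from backpropagation
w = w - alpha * grad           # w_i = w_i - alpha * dL/dw_i
print(w)                       # [0.75, -0.28]
```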
What is a learning rate?
α, the step size for the modifications to the weights
What is stochastic gradient descent?
iterative process, depends on the random selection of mini-batches
following the gradients in the weight space to the lowest loss value
–> allows us to find the minimum of the loss in an iterative process
What happens if we use a small learning rate?
it will take a long time to reach the global minimum; we could also get stuck in a local minimum
What happens if we use a large learning rate?
it is possible that we overshoot and miss the global minimum,
and convergence becomes unlikely
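A small illustration of the cards above: gradient descent on the toy loss L(w) = (w - 3)^2, whose gradient is 2·(w - 3) and whose minimum is at w = 3, run with three illustrative learning rates:

```python
def descend(lr, steps=25, w=0.0):
    # repeatedly follow the negative gradient of L(w) = (w - 3)**2
    for _ in range(steps):
        w -= lr * 2 * (w - 3)
    return w

print(descend(lr=0.01))  # small lr: still far from 3 after 25 steps
print(descend(lr=0.3))   # moderate lr: converges close to 3
print(descend(lr=1.1))   # too large: overshoots and diverges
```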
How do neural networks learn?
learn patterns from data to perform specific tasks
early layers extract low-level signals with spatial significance
later layers interpret these signals and provide semantic significance
–> end-to-end learning
What does Stochastic gradient descent (SGD) do?
it uses the gradients computed with backpropagation to update network weight parameters iteratively to reduce the model’s loss
What is key to a meaningful training process in neural networks?
ability to compute the gradient of the loss function
with respect to every single network weight parameter
this is achieved through a process called backpropagation
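A hand-sized sketch of backpropagation through a single sigmoid neuron with a squared-error loss, assuming NumPy; it shows how the chain rule yields the gradient for every weight:

```python
import numpy as np

x = np.array([0.5, -1.0])   # toy input
w = np.array([0.2, 0.4])    # toy weights
b, target = 0.1, 1.0

z = np.dot(w, x) + b        # forward pass
p = 1 / (1 + np.exp(-z))    # sigmoid activation
loss = (p - target) ** 2

# backward pass: chain rule, step by step
dloss_dp = 2 * (p - target)
dp_dz = p * (1 - p)                 # derivative of the sigmoid
grad_w = dloss_dp * dp_dz * x       # gradient w.r.t. every weight
grad_b = dloss_dp * dp_dz           # gradient w.r.t. the bias
print(grad_w, grad_b)
```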
What is the neural network training pipeline?
1) run 1 epoch: sample batches (input data x and target data y) from the training dataset, and for each batch:
- evaluate the model on the batch input data (prediction) in a forward pass
- compute the loss on the prediction and the target y
- compute the weight gradients with backpropagation
- modify the weights based on the gradients and the learning rate
- repeat for all batches
2) repeat for a number of epochs, monitoring training and validation loss + metrics
3) stop before overfitting sets in
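A hedged sketch of this pipeline in PyTorch; the model architecture, data and hyperparameters are toy placeholders, not part of the original card:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# toy model: a small fully connected network with a ReLU non-linearity
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(256, 4)  # toy training inputs
y = torch.randn(256, 1)  # toy targets
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

for epoch in range(10):          # step 2: repeat for a number of epochs
    for xb, yb in loader:        # step 1: sample mini-batches
        pred = model(xb)         # forward pass (prediction)
        loss = loss_fn(pred, yb) # loss on prediction and target
        optimizer.zero_grad()
        loss.backward()          # weight gradients via backpropagation
        optimizer.step()         # modify weights (gradients x learning rate)
    # in practice: also track a validation loss here and stop before overfitting
    print(epoch, loss.item())
```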
What do you see in the curves of the training and the validation loss in well-trained neural network models?
if the validation loss decreases more slowly than the training loss but does not start to rise again after some iterations, the model is well-trained