Lecture 5 - Neural Networks Flashcards
What is the curse of dimensionality in the context of linear models?
As the number of features (or dimensions) increases, the amount of data needed to properly train a linear model grows exponentially, making data sparse in high-dimensional spaces.
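A rough way to see the growth: if each input dimension is covered by a fixed grid of basis functions, the total count is grid_size ** n_dims. A minimal sketch (the grid size of 10 per dimension is an arbitrary assumption for illustration):

```python
# Illustration only: number of grid-style basis functions needed if each
# of n_dims input dimensions is covered by 10 basis functions (assumed value).
for n_dims in (1, 2, 5, 10):
    print(n_dims, 10 ** n_dims)   # 10, 100, 100000, 10000000000
```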
Why do linear models struggle in high-dimensional spaces?
They depend on predefined basis functions to represent the data, but in high-dimensional spaces, the number of basis functions explodes, making the model computationally expensive and prone to overfitting.
What are the main solutions to the curse of dimensionality for linear models?
- Support Vector Machines (SVMs)
- Neural Networks (NNs)
How do Support Vector Machines (SVMs) help address the curse of dimensionality?
SVMs define basis functions centered on relevant training points (support vectors), simplifying the model by using only relevant points to construct the decision boundary.
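A minimal sketch of this idea with scikit-learn (assumes scikit-learn is installed; the toy data is made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D binary classification data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# An RBF-kernel SVM centers its basis functions on the support vectors only,
# typically a subset of the training points.
clf = SVC(kernel="rbf").fit(X, y)
print("training points:", len(X))
print("support vectors used:", len(clf.support_vectors_))
```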
How do Neural Networks (NNs) help address the curse of dimensionality?
NNs use basis functions (the hidden units) whose parameters are learned and adapted during training, allowing the model to capture complex relationships flexibly.
How are the basis functions in Neural Networks structured and optimized?
The form of the basis functions (e.g., the number of hidden units and the choice of activation function) is fixed in advance, but their weights and biases are adaptive and optimized during training to discover useful patterns in the data.
What is the goal of a neural network?
The goal of a neural network is to build a more powerful representation of the data by applying multiple functional transformations in sequence.
What is the first step in constructing a neural network?
Start with a linear model that combines input features using weights and biases to produce a linear output.
How are new features (hidden units) created in a neural network?
New features are created by constructing linear combinations of the input features, followed by applying an activation function to introduce nonlinearity.
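A minimal NumPy sketch of one hidden layer: a linear combination of the inputs followed by a nonlinearity (the layer sizes and the tanh choice are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)            # one input with 3 features
W1 = rng.normal(size=(4, 3))      # weights: 4 hidden units x 3 inputs
b1 = np.zeros(4)                  # biases

a1 = W1 @ x + b1                  # linear combinations (pre-activations)
z1 = np.tanh(a1)                  # activation function -> hidden units
print(z1)
```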
Why are nonlinear activation functions applied in neural networks?
Nonlinear activation functions (e.g., ReLU or sigmoid) are applied to introduce nonlinearity, enabling the network to model complex relationships.
How are additional layers added in a neural network?
After the first layer, the outputs of the hidden units are combined and passed through additional layers, each applying new weights and activation functions.
How is the final output of a neural network computed?
The final output is computed by applying another activation function (e.g., sigmoid or softmax) to the output of the last layer.
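Putting the steps together, a sketch of a full forward pass through a two-layer network for binary classification (layer sizes and random initialization are assumptions for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
x = rng.normal(size=3)                          # input features

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer parameters
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output layer parameters

z1 = np.tanh(W1 @ x + b1)         # hidden layer: linear combination + nonlinearity
y = sigmoid(W2 @ z1 + b2)         # output layer: linear combination + sigmoid
print(y)                          # probability of the positive class
```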
What is the purpose of using multiple layers in a neural network?
Multiple layers enable the network to learn hierarchical representations, with each layer capturing increasingly complex patterns in the data.
Which activation function is typically used for regression in the output layer of a neural network?
The identity function is used for regression tasks, as it outputs continuous real-valued numbers.
Which activation function should be used for binary classification in the output layer?
The sigmoid function is used for binary classification, as it maps outputs to probabilities between 0 and 1.
Which activation function is suitable for multi-class classification in the output layer?
The softmax function is used for multi-class classification, as it converts logits into probabilities across multiple classes.
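A sketch of the two classification output activations (NumPy used for illustration; the softmax subtracts the max logit for numerical stability):

```python
import numpy as np

def sigmoid(a):
    # Maps a real-valued score to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # Maps a vector of logits to probabilities that sum to 1.
    e = np.exp(a - np.max(a))     # subtract max for numerical stability
    return e / e.sum()

print(sigmoid(0.5))                           # binary classification
print(softmax(np.array([2.0, 1.0, 0.1])))     # multi-class classification
```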
What happens when a linear function is used as the activation function in hidden layers?
A composition of linear layers is itself linear, so the whole network collapses to a single linear model that cannot capture complex relationships in the data, limiting the network’s capability.
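A quick numerical check of that collapse: two stacked linear layers are equivalent to a single linear layer with weight matrix W2 @ W1 (random matrices assumed for illustration; biases omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=3)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))

two_layers = W2 @ (W1 @ x)        # "deep" network with linear activations
one_layer = (W2 @ W1) @ x         # equivalent single linear model
print(np.allclose(two_layers, one_layer))   # True
```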
What is the role of the sigmoid function in hidden layers?
The sigmoid function smoothly squashes inputs to a range between 0 and 1, introducing nonlinearity.
Why is the tanh function commonly used in neural networks?
The tanh function squashes inputs to the range -1 to 1 and is zero-centered, which gives better gradient flow than the sigmoid; it was commonly used in earlier neural networks.
What are the advantages of using ReLU as an activation function?
ReLU outputs zero for negative inputs and the input itself for positive inputs. It is computationally efficient and helps avoid the vanishing gradient problem.
What is the difference between ReLU and Leaky ReLU?
Unlike ReLU, which outputs exactly zero for negative inputs, Leaky ReLU outputs a small negative slope (e.g., 0.01x) for negative inputs, preventing dead neurons and improving gradient flow.
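A sketch of the common hidden-layer activations as NumPy one-liners (the 0.01 Leaky ReLU slope is a typical but assumed default):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))       # squashes to (0, 1)

def tanh(a):
    return np.tanh(a)                     # squashes to (-1, 1), zero-centered

def relu(a):
    return np.maximum(0.0, a)             # zero for negatives, identity for positives

def leaky_relu(a, slope=0.01):
    return np.where(a > 0, a, slope * a)  # small slope keeps negative units alive

a = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(a), leaky_relu(a))
```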
output layer activation functions
- identity (regression)
- sigmoid (binary classification)
- softmax (multi-class classification)
hidden layer activation functions
- linear function (limited)
- nonlinear functions:
  - sigmoid
  - tanh
  - relu
  - leaky relu
What does it mean when we say that neural networks are universal function approximators?
It means that, given enough hidden units (and suitable data to train on), a neural network can approximate any continuous function to arbitrary accuracy.
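A minimal sketch of this in practice: fitting a small network to a nonlinear target (assumes scikit-learn is available; the target function, layer size, and iteration count are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Nonlinear target: y = sin(x) on [-3, 3].
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel()

# One hidden layer of tanh units; more units generally give a closer approximation.
net = MLPRegressor(hidden_layer_sizes=(50,), activation="tanh",
                   max_iter=5000, random_state=0).fit(X, y)
print("mean absolute error:", np.abs(net.predict(X) - y).mean())
```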