lecture 5 - neural networks Flashcards
What is the curse of dimensionality in the context of linear models?
As the number of features (or dimensions) increases, the amount of data needed to properly train a linear model grows exponentially, making data sparse in high-dimensional spaces.
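A minimal arithmetic sketch of why this happens (the bin count is an assumed illustration, not from the lecture): if each of D input dimensions is divided into k regions, a fixed grid of basis functions needs k^D cells, so the data needed to cover the space grows exponentially with D.

```python
# Illustrative only: k = 10 bins per dimension is an assumed value.
k = 10
for D in (1, 2, 3, 5, 10):
    print(f"D={D}: {k**D} cells needed to cover the input space")
```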
Why do linear models struggle in high-dimensional spaces?
They depend on predefined basis functions to represent the data, but in high-dimensional spaces, the number of basis functions explodes, making the model computationally expensive and prone to overfitting.
What are the main solutions to the curse of dimensionality for linear models?
- Support Vector Machines (SVMs)
- Neural Networks (NNs)
How do Support Vector Machines (SVMs) help address the curse of dimensionality?
SVMs define basis functions centered on relevant training points (support vectors), simplifying the model by using only relevant points to construct the decision boundary.
How do Neural Networks (NNs) help address the curse of dimensionality?
NNs use basis functions (the hidden units) that are learned and adapted during training, allowing the model to capture complex relationships flexibly.
How are the basis functions in Neural Networks structured and optimized?
The structure of basis functions (e.g., number of neurons and activation functions) is fixed, but their weights are adaptive and optimized during training to discover useful patterns in the data.
What is the goal of a neural network?
The goal of a neural network is to build a more powerful representation of the data by applying multiple functional transformations in sequence.
What is the first step in constructing a neural network?
Start with a linear model that combines input features using weights and biases to produce a linear output.
How are new features (hidden units) created in a neural network?
New features are created by constructing linear combinations of the input features, followed by applying an activation function to introduce nonlinearity.
Why are nonlinear activation functions applied in neural networks?
Nonlinear activation functions (e.g., ReLU or sigmoid) are applied to introduce nonlinearity, enabling the network to model complex relationships.
How are additional layers added in a neural network?
After the first layer, the outputs of the hidden units are combined and passed through additional layers, each applying new weights and activation functions.
How is the final output of a neural network computed?
The final output is computed by applying another activation function (e.g., sigmoid or softmax) to the output of the last layer.
What is the purpose of using multiple layers in a neural network?
Multiple layers enable the network to learn hierarchical representations, with each layer capturing increasingly complex patterns in the data.
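A minimal NumPy sketch of the construction described in the cards above, with assumed sizes (4 inputs, 3 hidden units, 1 output) and random untrained weights: a linear combination of the inputs, a tanh activation that forms the hidden units, then a second linear layer followed by a sigmoid output activation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)

# Assumed sizes: 4 input features, 3 hidden units, 1 output.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # first-layer weights and biases
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # second-layer weights and biases

def forward(x):
    a1 = W1 @ x + b1     # linear combinations of the input features
    z1 = np.tanh(a1)     # nonlinear activation -> hidden units (adaptive basis functions)
    a2 = W2 @ z1 + b2    # linear combination of the hidden units
    return sigmoid(a2)   # output activation (sigmoid, as for binary classification)

print(forward(rng.normal(size=4)))
```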
Which activation function is typically used for regression in the output layer of a neural network?
The identity function is used for regression tasks, as it outputs continuous real-valued numbers.
Which activation function should be used for binary classification in the output layer?
The sigmoid function is used for binary classification, as it maps outputs to probabilities between 0 and 1.
Which activation function is suitable for multi-class classification in the output layer?
The softmax function is used for multi-class classification, as it converts logits into probabilities across multiple classes.
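Minimal sketches of the three output activations named above (the example logits are arbitrary):

```python
import numpy as np

def identity(a):               # regression: return the real-valued score unchanged
    return a

def sigmoid(a):                # binary classification: squash to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):                # multi-class classification: probabilities summing to 1
    e = np.exp(a - np.max(a))  # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # roughly [0.66, 0.24, 0.10]
```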
What happens when a linear function is used as the activation function in hidden layers?
Stacked linear layers collapse into a single linear transformation, so the network remains a linear model and cannot capture complex relationships in the data, limiting its capability.
What is the role of the sigmoid function in hidden layers?
The sigmoid function smoothly squashes inputs to a range between 0 and 1, introducing nonlinearity.
Why is the tanh function commonly used in neural networks?
The tanh function squashes inputs between -1 and 1, is centered around zero, and was often used in early neural networks for better gradient flow than the sigmoid function.
What are the advantages of using ReLU as an activation function?
ReLU outputs zero for negative inputs and the input itself for positive inputs. It is computationally efficient and helps avoid the vanishing gradient problem.
What is the difference between ReLU and Leaky ReLU?
Unlike ReLU, which outputs exactly zero for negative inputs, Leaky ReLU applies a small negative slope there, preventing dead neurons and improving gradient flow.
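Minimal sketches of the hidden-layer activations discussed above; the Leaky ReLU slope of 0.01 is an assumed typical value, not stated in the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))       # squashes inputs to (0, 1)

def tanh(a):
    return np.tanh(a)                      # squashes inputs to (-1, 1), zero-centered

def relu(a):
    return np.maximum(0.0, a)              # zero for negative inputs, identity for positive

def leaky_relu(a, alpha=0.01):             # alpha: assumed small negative-side slope
    return np.where(a > 0, a, alpha * a)   # nonzero gradient for negative inputs

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(x))
```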
output layer activation functions
- identity (regression)
- sigmoid (binary classification)
- softmax (multi-class classification)
hidden layer activation functions
- linear function (limited)
- nonlinear functions:
  - sigmoid
  - tanh
  - ReLU
  - Leaky ReLU
What does it mean when we say that neural networks are universal function approximators?
It means that, given enough hidden units, layers, and training data, a neural network can approximate any continuous function to arbitrary accuracy.
What types of functions can neural networks approximate?
- quadratic functions
- sine waves
- absolute value functions
- step functions.
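A hedged sketch of approximating one of these functions (a sine wave) with scikit-learn's MLPRegressor; the hidden-layer size, activation, and iteration budget are assumed choices, not taken from the lecture.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# One-dimensional sine wave as the target function.
X = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)
y = np.sin(X).ravel()

# A single hidden layer of tanh units is enough to capture the oscillation.
net = MLPRegressor(hidden_layer_sizes=(20,), activation='tanh',
                   max_iter=10000, random_state=0)
net.fit(X, y)
print(net.score(X, y))   # R^2 should approach 1 once the pattern is learned
```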
What kind of shape does a neural network approximate for a quadratic function?
A neural network approximates a parabolic shape for a quadratic function.
How does a neural network approximate a sine wave function?
It reproduces the oscillatory pattern of the sine wave by superimposing several hidden-unit activations, each covering part of the input range.
What is the behavior of a neural network when approximating an absolute value function?
The network approximates a V-shaped function when approximating an absolute value function.
How does a neural network handle step functions?
Neural networks approximate step functions by producing very steep transitions between output values, simulating thresholding behavior.
Why might a simple linear decision boundary not work well for some datasets?
A simple linear decision boundary may not work well when the data is not linearly separable, requiring a more complex decision-making model like a neural network.
What simple functions do neural networks use as building blocks?
Neural networks use simple functions like AND, OR, and other logical gates as building blocks to model complex decision boundaries.
How does a neural network solve the XOR problem?
By combining two AND gates, each with one input inverted (x1 AND NOT x2, and NOT x1 AND x2), and then applying an OR gate to their outputs, a neural network can perfectly separate the points into two classes, solving the XOR problem.
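A minimal sketch of this construction with hard-threshold units and hand-set weights (the specific weight and bias values are illustrative; a trained network would find equivalent ones):

```python
import numpy as np

def step(z):
    # Hard threshold activation: 1 if z >= 0, else 0.
    return (z >= 0).astype(float)

def xor_net(x1, x2):
    # Hidden layer: h1 = x1 AND NOT x2, h2 = NOT x1 AND x2.
    x = np.array([x1, x2], dtype=float)
    W_hidden = np.array([[ 1.0, -1.0],    # h1
                         [-1.0,  1.0]])   # h2
    b_hidden = np.array([-0.5, -0.5])
    h = step(W_hidden @ x + b_hidden)
    # Output layer: OR of the two hidden units.
    return step(np.array([1.0, 1.0]) @ h - 0.5)

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", int(xor_net(a, b)))   # prints 0, 1, 1, 0
```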
What is the significance of solving the XOR problem in neural networks?
Solving the XOR problem demonstrates that neural networks can model complex, non-linear relationships, unlike simple linear models.
What was the contribution of Pitts and McCulloch (1943) to neural networks?
They developed the first mathematical model of biological neurons, showing that Boolean operations could be implemented by neuron-like nodes, which became one of the origins of automata theory.
What is Hebb’s rule (1949) in the context of neural networks?
Hebb’s rule states that the connection strength between two neurons increases when both neurons are activated simultaneously.
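A minimal illustrative sketch of Hebbian updating (the learning rate and activity values are assumed, not from the lecture): the weight between two units grows in proportion to the product of their activities, so it increases only when both are active at the same time.

```python
eta = 0.1                        # assumed learning rate
pre  = [1.0, 0.0, 1.0, 1.0]      # presynaptic activity over four time steps
post = [1.0, 1.0, 0.0, 1.0]      # postsynaptic activity over the same steps

w = 0.0
for x, y in zip(pre, post):
    w += eta * x * y             # grows only when both units fire together
print(w)                         # 0.2: two of the four steps had joint activity
```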
What was Rosenblatt’s contribution to neural networks in 1958?
Rosenblatt introduced the perceptron, a network of threshold nodes for pattern classification, and formulated the perceptron convergence theorem.
How did neural networks experience renewed interest in the 1980s and 1990s?
New techniques like backpropagation, physics-inspired models (Hopfield net, Boltzmann machines), and unsupervised learning led to impressive applications, such as character recognition and speech recognition.
What modern techniques have driven the “revolution” in neural networks since the 1990s?
Modern techniques include:
- support vector machines
- kernel methods
- ensemble methods (bagging, boosting, stacking)
- deep learning
- transparency methods like LIME and SHAP.
What are weight-space symmetries in neural networks?
Weight-space symmetries refer to the fact that many distinct weight configurations produce exactly the same input-output mapping, a redundancy that arises from the nature of the activation functions and the architecture of the network.
How does symmetry in the activation function create redundancy in the weight space?
Because tanh is an odd function (tanh(-a) = -tanh(a)), flipping the signs of all the weights feeding into a hidden unit together with the sign of its outgoing weight leaves the network output unchanged, so multiple weight configurations (differing only in these signs) produce exactly the same error during training.
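A small numerical check of this symmetry (the layer sizes and random weights are assumed): flipping the signs of all weights into one tanh hidden unit, together with the sign of its outgoing weight, leaves the network output, and therefore the training error, unchanged. With M hidden units there are 2^M such sign-flip configurations, and permuting the hidden units contributes a further factor of M!.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 3 inputs -> 4 tanh hidden units -> 1 linear output.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
w2, b2 = rng.normal(size=4), rng.normal()

def forward(W1, b1, w2, b2, x):
    return w2 @ np.tanh(W1 @ x + b1) + b2

x = rng.normal(size=3)

# Flip the signs of all weights (and the bias) into hidden unit 0, and of its outgoing weight.
W1f, b1f, w2f = W1.copy(), b1.copy(), w2.copy()
W1f[0] *= -1; b1f[0] *= -1; w2f[0] *= -1

# Because tanh(-a) = -tanh(a), both weight settings give exactly the same output.
print(forward(W1, b1, w2, b2, x))
print(forward(W1f, b1f, w2f, b2, x))
```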