Error-Based Learning Flashcards
What defines a parameterized model?
A parameterized model is defined by a set of parameters; different combinations of parameter values produce different candidate models, and learning aims to find the combination that gives the best-performing model.
What is the difference between parameters and hyper-parameters in a model?
Hyper-parameters are set before training begins. Parameters are learned during training.
How are model parameters adapted or fit?
Model parameters are iteratively adjusted (fitted) so that the model captures the relationship between the descriptive features and the target feature in the training set.
What guides the model fitting process?
Model fitting is guided by an error (loss) function that measures how far the model’s predictions are from the ground-truth target values in the training set.
What indicates model convergence in the context of an error function?
Convergence is indicated when the error stops decreasing, i.e. it approaches its lower limit (ideally close to zero) and further updates no longer reduce it meaningfully.
Why does the L2 function (sum of squared error) use the sum of squared errors rather than just the sum of errors?
To ensure that positive and negative errors don’t cancel each other out.
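Written out (a sketch in generic notation assumed here rather than taken from the cards: d_i is the i-th training instance, t_i its target, M_w the model with weights w; the 1/2 factor is an optional convention that simplifies the derivative):

    L_2(\mathbb{M}_{\mathbf{w}}, \mathcal{D}) = \frac{1}{2} \sum_{i=1}^{n} \left( t_i - \mathbb{M}_{\mathbf{w}}(\mathbf{d}_i) \right)^2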
What does each combination of weights w[0] and w[1] correspond to on an error surface?
Each combination corresponds to a sum of squared errors value, defining a point on the error surface above the x-y plane in weight space.
What is weight space in the context of modelling error surfaces?
Weight space is the space of all possible weight combinations; with two weights w[0] and w[1] it is the x-y plane, with one axis per weight.
What does the error surface represent in a model’s weight space?
The error surface represents the sum of squared errors for each combination of weights, with height indicating the error value.
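A minimal sketch of computing such an error surface over a grid of weight values, assuming a hypothetical simple linear model prediction = w0 + w1 * x and a small made-up dataset (all names and numbers are illustrative, not from the original cards):

    import numpy as np

    # Hypothetical toy training set: one descriptive feature x and a target t.
    x = np.array([1.0, 2.0, 3.0, 4.0])
    t = np.array([3.1, 4.9, 7.2, 8.8])

    # Grid of candidate weights: w0 (intercept) and w1 (slope) define weight space.
    w0_grid, w1_grid = np.meshgrid(np.linspace(-5, 5, 101), np.linspace(-5, 5, 101))

    # Sum of squared errors for every (w0, w1) combination: the error surface.
    predictions = w0_grid[..., None] + w1_grid[..., None] * x
    sse = ((t - predictions) ** 2).sum(axis=-1)

    # The lowest point on the surface approximates the best-fitting weights.
    best = np.unravel_index(sse.argmin(), sse.shape)
    print("best w0 ~", w0_grid[best], "best w1 ~", w1_grid[best], "SSE:", sse[best])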
Where is the model that best fits the training data located on the error surface?
It’s located at the lowest point on the error surface, which corresponds to the minimum sum of squared errors.
What two key properties of error surfaces help in finding the optimal combination of weights?
Error surfaces for linear models are convex (bowl-shaped) and have a single global minimum, which makes it easy to locate the optimal weights.
Why are error surfaces for linear models typically convex with a global minimum?
Because the model is linear in the weights, the sum of squared errors is a quadratic function of the weights; the convex shape is therefore determined by the linearity of the model, not by the properties of the data.
What is the method called for finding the best set of weights by minimizing the error surface?
Least squares optimization.
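For a linear model, the least-squares weights can also be written in closed form (standard notation assumed here, not taken from the cards: design matrix X with a leading column of 1s, target vector t, and X^T X assumed invertible):

    \mathbf{w}^{*} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{t}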
What is gradient descent?
It is an algorithm that uses a guided search from a random starting position to iteratively move toward the global minimum of the error surface.
How does gradient descent work?
Gradient descent uses the slope of the error surface to take small steps in the direction that reduces error, moving closer to the global minimum with each step.
What is the learning rate (alpha) in gradient descent?
It is a parameter that determines the size of the adjustments made to weights at each iteration of the algorithm.
What does the error delta function do in gradient descent?
It calculates the adjustment (delta value) for each weight based on the gradient of the error surface, ensuring movement toward the global minimum.
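For multivariable linear regression with the sum-of-squared-errors function above, the weight update is commonly written as (same generic notation as before; d_i[j] is the value of feature j in instance i, with d_i[0] = 1 for the intercept weight):

    \mathbf{w}[j] \leftarrow \mathbf{w}[j] + \alpha \sum_{i=1}^{n} \left( t_i - \mathbb{M}_{\mathbf{w}}(\mathbf{d}_i) \right) \mathbf{d}_i[j]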
What is batch gradient descent?
It is a form of gradient descent where each weight adjustment is made based on the sum of squared errors across the entire training set.
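A minimal batch gradient descent sketch for multivariable linear regression (hypothetical function and variable names; assumes the feature matrix already includes a leading column of 1s for the intercept):

    import numpy as np

    def batch_gradient_descent(X, t, alpha=0.01, iterations=1000):
        """Fit linear-regression weights by batch gradient descent on the SSE."""
        w = np.random.uniform(-0.2, 0.2, X.shape[1])  # small random starting point in weight space
        for _ in range(iterations):
            errors = t - X @ w                # prediction error for every training instance
            delta = X.T @ errors              # error delta: one adjustment per weight, summed over the whole training set
            w = w + alpha * delta             # step down the error surface
        return w

    # Usage with a hypothetical toy dataset (first column of 1s is the intercept term).
    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
    t = np.array([3.1, 4.9, 7.2, 8.8])
    print(batch_gradient_descent(X, t))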
What is the inductive bias of batch gradient descent in multivariable linear regression?
Its preference bias favours models that minimize the sum of squared errors, and its restriction bias limits the hypothesis space to models that are linear combinations of the descriptive features.
Why is a single random starting point used in gradient descent?
Because the error surface is convex with a single global minimum, the algorithm is guaranteed to reach that minimum from any starting position, so a single random starting point suffices and multiple restarts are unnecessary.
How do the learning rate and initial weights impact the gradient descent algorithm?
They influence the speed and accuracy of convergence.
What happens if the learning rate is:
a) too small?
b) too large?
a) Gradient descent converges very slowly (tiny changes to weights at each iteration).
b) It can cause large jumps across the error surface, potentially missing the global minimum and causing instability.
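As an illustration, the batch gradient descent sketch above could be run with different learning rates (hypothetical values; the exact thresholds depend on the data scale). On the toy data, the smallest rate converges very slowly and the largest typically blows up (overflow), illustrating instability:

    for alpha in (0.0001, 0.01, 0.2):   # too small / reasonable / too large for this toy data
        w = batch_gradient_descent(X, t, alpha=alpha, iterations=1000)
        print(alpha, w)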
What is the ideal learning rate behavior in gradient descent?
A well-chosen learning rate converges quickly to the global minimum without overshooting or instability.
How does normalization of features affect the selection of initial weights?
Normalization makes it easier to select initial weights, because with normalized features all weights operate on a similar scale, so sensible starting values can be drawn from a small, well-defined range.