Error-Based Learning Flashcards
What defines a parameterized model?
A parameterized model is defined by a set of parameters; different combinations of parameter values produce different candidate models, and learning aims to find the combination that gives the best-performing model.
What is the difference between parameters and hyper-parameters in a model?
Hyper-parameters are set before training begins. Parameters are learned during training.
How are model parameters adapted or fit?
Model parameters are iteratively adjusted (fitted) so that the model captures the relationship between the descriptive features and the target feature in the training set.
What guides the model fitting process?
Model fitting is guided by an error (loss) function that measures how far the model’s predictions are from the ground-truth target values in the training set.
What indicates model convergence in the context of an error function?
Convergence is indicated when the error stops decreasing, i.e. it approaches its lower limit (ideally close to zero) and further updates no longer reduce it meaningfully.
Why does the L2 function (sum of squared error) use the sum of squared errors rather than just the sum of errors?
To ensure that positive and negative errors don’t cancel each other out.
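Written out (a sketch in generic notation assumed here rather than taken from the cards: d_i is the i-th training instance, t_i its target, M_w the model with weights w; the 1/2 factor is an optional convention that simplifies the derivative):

    L_2(\mathbb{M}_{\mathbf{w}}, \mathcal{D}) = \frac{1}{2} \sum_{i=1}^{n} \left( t_i - \mathbb{M}_{\mathbf{w}}(\mathbf{d}_i) \right)^2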
What does each combination of weights w[0] and w[1] correspond to on an error surface?
Each combination corresponds to a sum of squared errors value, defining a point on the error surface above the x-y plane in weight space.
What is weight space in the context of modelling error surfaces?
Weight space is the space of all possible weight combinations; with two weights w[0] and w[1] it is the x-y plane, with one axis per weight.
What does the error surface represent in a model’s weight space?
The error surface represents the sum of squared errors for each combination of weights, with height indicating the error value.
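A minimal sketch of computing such an error surface over a grid of weight values, assuming a hypothetical simple linear model prediction = w0 + w1 * x and a small made-up dataset (all names and numbers are illustrative, not from the original cards):

    import numpy as np

    # Hypothetical toy training set: one descriptive feature x and a target t.
    x = np.array([1.0, 2.0, 3.0, 4.0])
    t = np.array([3.1, 4.9, 7.2, 8.8])

    # Grid of candidate weights: w0 (intercept) and w1 (slope) define weight space.
    w0_grid, w1_grid = np.meshgrid(np.linspace(-5, 5, 101), np.linspace(-5, 5, 101))

    # Sum of squared errors for every (w0, w1) combination: the error surface.
    predictions = w0_grid[..., None] + w1_grid[..., None] * x
    sse = ((t - predictions) ** 2).sum(axis=-1)

    # The lowest point on the surface approximates the best-fitting weights.
    best = np.unravel_index(sse.argmin(), sse.shape)
    print("best w0 ~", w0_grid[best], "best w1 ~", w1_grid[best], "SSE:", sse[best])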
Where is the model that best fits the training data located on the error surface?
It’s located at the lowest point on the error surface, which corresponds to the minimum sum of squared errors.
What two key properties of error surfaces help in finding the optimal combination of weights?
Error surfaces for linear models are convex (bowl-shaped) and have a single global minimum, which makes it easy to locate the optimal weights.
Why are error surfaces for linear models typically convex with a global minimum?
Because the model is linear in the weights, the sum of squared errors is a quadratic function of the weights; the convex shape is therefore determined by the linearity of the model, not by the properties of the data.
What is the method called for finding the best set of weights by minimizing the error surface?
Least squares optimization.
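For a linear model, the least-squares weights can also be written in closed form (standard notation assumed here, not taken from the cards: design matrix X with a leading column of 1s, target vector t, and X^T X assumed invertible):

    \mathbf{w}^{*} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{t}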
What is gradient descent?
It is an algorithm that uses a guided search from a random starting position to iteratively move toward the global minimum of the error surface.
How does gradient descent work?
Gradient descent uses the slope of the error surface to take small steps in the direction that reduces error, moving closer to the global minimum with each step.
What is the learning rate (alpha) in gradient descent?
It is a parameter that determines the size of the adjustments made to weights at each iteration of the algorithm.
What does the error delta function do in gradient descent?
It calculates the adjustment (delta value) for each weight based on the gradient of the error surface, ensuring movement toward the global minimum.
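For multivariable linear regression with the sum-of-squared-errors function above, the weight update is commonly written as (same generic notation as before; d_i[j] is the value of feature j in instance i, with d_i[0] = 1 for the intercept weight):

    \mathbf{w}[j] \leftarrow \mathbf{w}[j] + \alpha \sum_{i=1}^{n} \left( t_i - \mathbb{M}_{\mathbf{w}}(\mathbf{d}_i) \right) \mathbf{d}_i[j]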
What is batch gradient descent?
It is a form of gradient descent where each weight adjustment is made based on the sum of squared errors across the entire training set.
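A minimal batch gradient descent sketch for multivariable linear regression (hypothetical function and variable names; assumes the feature matrix already includes a leading column of 1s for the intercept):

    import numpy as np

    def batch_gradient_descent(X, t, alpha=0.01, iterations=1000):
        """Fit linear-regression weights by batch gradient descent on the SSE."""
        w = np.random.uniform(-0.2, 0.2, X.shape[1])  # small random starting point in weight space
        for _ in range(iterations):
            errors = t - X @ w                # prediction error for every training instance
            delta = X.T @ errors              # error delta: one adjustment per weight, summed over the whole training set
            w = w + alpha * delta             # step down the error surface
        return w

    # Usage with a hypothetical toy dataset (first column of 1s is the intercept term).
    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
    t = np.array([3.1, 4.9, 7.2, 8.8])
    print(batch_gradient_descent(X, t))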
What is the inductive bias of batch gradient descent in multivariable linear regression?
Its preference bias favours models that minimize the sum of squared errors, and its restriction bias limits the hypothesis space to models that are linear combinations of the descriptive features.
Why is a single random starting point used in gradient descent?
Because the error surface is convex with a single global minimum, the algorithm is guaranteed to reach that minimum from any starting position, so a single random starting point suffices and multiple restarts are unnecessary.
How do the learning rate and initial weights impact the gradient descent algorithm?
They influence the speed and accuracy of convergence.
What happens if the learning rate is:
a) too small?
b) too large?
a) Gradient descent converges very slowly (tiny changes to weights at each iteration).
b) It can cause large jumps across the error surface, potentially missing the global minimum and causing instability.
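As an illustration, the batch gradient descent sketch above could be run with different learning rates (hypothetical values; the exact thresholds depend on the data scale). On the toy data, the smallest rate converges very slowly and the largest typically blows up (overflow), illustrating instability:

    for alpha in (0.0001, 0.01, 0.2):   # too small / reasonable / too large for this toy data
        w = batch_gradient_descent(X, t, alpha=alpha, iterations=1000)
        print(alpha, w)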
What is the ideal learning rate behavior in gradient descent?
A well-chosen learning rate converges quickly to the global minimum without overshooting or instability.
How does normalization of features affect the selection of initial weights?
Normalization makes it easier to select initial weights, because with normalized features all weights operate on a similar scale, so sensible starting values can be drawn from a small, well-defined range.