Model Identification and Data Analysis Flashcards
[Linear Classification] What’s a Perceptron?
A perceptron is a linear classifier: it decides to which of two groups a given point belongs. It is given by the expression:
h(x, w) = sign(w^T x)
h(x, w) = sign(weights_transposed · x_variables)
In a 2D case where the points are given by x1 and x2, the perceptron would be something like:
y_hat = sign(w0 + w1·x1 + w2·x2)
Where w0 corresponds to the Bias.
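As a minimal sketch of the 2D case above (the weight and point values here are assumed, not from a specific exercise):

```python
import numpy as np

def perceptron_predict(w, x):
    """Classify a point using h(x, w) = sign(w^T x); x includes x0 = 1 for the bias."""
    return int(np.sign(w @ x))

# Assumed example weights: w0 (bias) = -1, w1 = 2, w2 = 1
w = np.array([-1.0, 2.0, 1.0])
x = np.array([1.0, 0.5, 0.5])     # x0 = 1, then the point (x1, x2) = (0.5, 0.5)
label = perceptron_predict(w, x)  # sign(-1 + 2*0.5 + 1*0.5) = sign(0.5) = +1
```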
[Linear Classification] Given a graph with the separating line, how do you get the expression for the perceptron?
Considering a 2D case, the perceptron would have a form:
h(x1, x2) = 1, if w0 + w1x1 + w2x2>0
h(x1, x2) = -1, Otherwise.
To choose values for w0, w1 and w2, we substitute the points where the line intersects the axes. At the x1-axis intercept (a, 0) and the x2-axis intercept (0, b):
w0 + w1·a = 0
w0 + w2·b = 0
This yields a relation between the weights, e.g. something like:
w0 = w1 = -2w2.
Then check the sign of h at some point off the line to fix the orientation, and pick any weights satisfying the relation.
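A quick numeric check of this procedure, assuming (hypothetically) a line that crosses the x1-axis at (2, 0) and the x2-axis at (0, 4):

```python
import numpy as np

# From the intercepts: w0 + 2*w1 = 0 and w0 + 4*w2 = 0.
# Choosing w0 = -4 gives w1 = 2 and w2 = 1 (boundary: -4 + 2*x1 + x2 = 0).
w = np.array([-4.0, 2.0, 1.0])

def h(x1, x2):
    return int(np.sign(w @ np.array([1.0, x1, x2])))

# Checking a point off the line fixes the orientation:
# the origin gives sign(-4) = -1; points above the line give +1.
```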
[Linear Classification] What’s the Perceptron Learning Algorithm (PLA) Update Rule?
The PLA Update Rule is given by:
w(t+1) = w(t) + y(t) x(t)
where:
w(t) is the vector containing the current weights of the perceptron (w0, w1, … wn).
x(t) is the vector containing the coordinates of the misclassified point (x0, x1, … xn).
y(t) is the real class of the point.
And w(t+1) is the updated Weight Vector.
[Linear Classification] How does one use the PLA?
First, evaluate a point with the current perceptron, using the point’s coordinates. If the point is correctly classified, the PLA doesn’t change w.
If the point is misclassified, we apply the Update Rule and get a new weight vector.
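The procedure above can be sketched as follows (the data here is a tiny assumed separable example):

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm; rows of X already include x0 = 1."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        # Find points the current perceptron gets wrong.
        wrong = [i for i in range(len(y)) if np.sign(X[i] @ w) != y[i]]
        if not wrong:           # all points correctly classified: stop
            break
        i = wrong[0]
        w = w + y[i] * X[i]     # the PLA update rule w(t+1) = w(t) + y(t) x(t)
    return w

# Assumed toy data: two linearly separable points.
X = np.array([[1.0, 2.0, 2.0], [1.0, -1.0, -1.5]])
y = np.array([1, -1])
w = pla(X, y)
```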
[Linear Classification] Why might Generalization be an issue with the PLA?
One doesn’t really know how well the algorithm will classify new points. The generalization theorem doesn’t really help with identifying just how well the perceptron generalizes.
[Linear Regression] What happens to the training and testing errors when the training set is increased? Why?
Training Error increases, because with more examples to fit, it becomes harder for the model to stay close to every point.
Testing Error decreases, because there is more information, and therefore we can develop a better model.
More training examples leads to better Generalization.
[Linear Regression] How is the Ordinary Least Squares formula derived?
1.- Define problem: find y_hat = Xw that minimizes the error between the predicted and actual values of y.
2.- Squared loss (cost function):
L(w) = sum_{i=1}^{N} (y_i - x_i^T w)^2
or
L(w) = (y - Xw)^T (y - Xw)
3.- Expand the expression:
L(w) = y^T y - 2 w^T X^T y + w^T X^T X w
4.- Minimize the expression with respect to w:
dL(w)/dw = -2 X^T y + 2 X^T X w = 0
5.- Solve for w:
X^T X w = X^T y
w_hat = (X^T X)^{-1} X^T y
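A minimal numeric check of the closed form, on assumed noiseless data from y = 1 + 2x (so the exact weights should be recovered):

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])      # first column of ones plays the role of the bias
y = np.array([3.0, 5.0, 7.0])   # generated from y = 1 + 2*x

# w_hat = (X^T X)^{-1} X^T y; solving the normal equations X^T X w = X^T y
# directly is numerically preferable to forming the explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
```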
[Linear Regression] What’s the Ordinary Least Squares (OLS) formula?
w_hat = (X^T X)^{-1} X^T y
[Linear Regression] When using OLS, what are the conditions necessary for w_hat to be a minimum point?
1.- Gradient of Ein(w_hat) = 0
2.- Hessian of Ein is positive definite (the second derivative of Ein with respect to w, 2 X^T X, must be positive definite)
[Linear Regression] How would you describe the generalization of the OLS algorithm?
Bias-variance trade-off: OLS tends to have low variance, reducing the risk of overfitting. However, for complex data, linear regression may not generalize well to unseen data.
OLS generally generalizes better with more data.
Mean Square Error can be used on the validation data to provide a measure of generalization error.
[Logistic Classification] Why shouldn’t the Gradient Descent Algorithm Step size be too big or too small?
If the step size is too small, training the model may take too long. If it is too big, we risk overshooting and going “up the valley”, and maybe never reaching a useful model.
[Logistic Classification] What is the purpose of using the Gradient Descent Technique?
The goal is to find the minimum of a convex loss function. We want to reach the point where the error is smallest, so we roll down the valley towards the minimum.
[Logistic Classification] What is the output of a logistic regression model?
The output can be used as a probability, as it is a number between 0 and 1.
[Logistic Classification] What is the relationship between Logistic Regression and Neural Networks?
Both are widely used in binary classification.
Logistic regression can be seen as a simplified neural network consisting of a single layer.
Neural networks incorporate multiple layers with non-linear activation functions, allowing them to learn complex representations of the inputs.
Both can use gradient descent to minimize the loss function, but neural networks need more resources.
[Logistic Classification] What is the formula for the output of the Logistic Regression?
h(s) = e^s / (1+e^s)
equivalent to:
h(s) = 1/(1+e^-s)
Where s is the linear combination of the input features, usually given by s = w^T x.
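As a small sketch of the two equivalent forms:

```python
import numpy as np

def logistic(s):
    """h(s) = 1 / (1 + e^{-s}), equivalently e^s / (1 + e^s)."""
    return 1.0 / (1.0 + np.exp(-s))

# s = 0 (i.e. w^T x = 0) gives h(s) = 0.5, the decision boundary.
```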
[Logistic Classification] What’s the classification boundary of a Logistic Classifier?
Where the predicted probability switches from below 0.5 to above 0.5. This is given at the point where s=0.
[Gradient Descent] How do we know when to stop the Gradient Descent algorithm?
We can choose one of the following options:
1. Set a threshold for ||∇Ein(w(t))||
2. Set an upper bound on the number of iterations
3. Set Threshold on Ein
4. A combination of the above
[Logistic Optimization] What’s the formula for Gradient Descent?
w(t+1) = w(t) - η ∇Ein(w(t))
Where:
η is the step size (learning rate) of each update,
∇Ein(w(t)) is the gradient of the Cost Function at the current weights,
w(t) are the current weights of the model,
w(t+1) are the updated weights of the model.
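A sketch of the update loop, using a gradient-norm threshold as the stopping rule and an assumed toy cost E(w) = ||w - 1||^2 (gradient 2(w - 1), minimum at w = 1):

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, tol=1e-8, max_iters=10000):
    """Iterate w(t+1) = w(t) - eta * grad(w(t)) until ||grad|| < tol."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad(w)
        if np.linalg.norm(g) < tol:   # stopping criterion on the gradient norm
            break
        w = w - eta * g
    return w

# Assumed toy cost: E(w) = ||w - 1||^2, whose minimum is at w = (1, 1).
w_star = gradient_descent(lambda w: 2.0 * (w - 1.0), w0=[5.0, -3.0])
```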
[Gradient Descent] Explain Stochastic Gradient Descent. How is it Different from Normal Gradient Descent?
w(t+1) = w(t) - η ∇e(w(t); x_i, y_i)
Similar to normal gradient descent, but here the gradient is computed on a single randomly selected training sample (x_i, y_i) instead of the whole dataset. In general, it converges much faster than normal GD, but the error function might not be minimized as precisely as with GD.
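A sketch of SGD for the squared loss on a linear model, with assumed noiseless data from y = 2x (the per-sample gradient of (y_i - w^T x_i)^2 is -2 (y_i - w^T x_i) x_i):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_linear(X, y, eta=0.01, epochs=200):
    """SGD on squared loss: one randomly ordered sample per update."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):   # random sample order each epoch
            err = X[i] @ w - y[i]
            w = w - eta * 2.0 * err * X[i]  # per-sample gradient step
    return w

# Assumed toy data generated from y = 2*x with no noise.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w = sgd_linear(X, y)
```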
[Bias-Variance] What is the Bias Error?
E_x[ (h_bar(x) - f(x))^2 ]
Where:
h_bar(x) is the average model over all possible datasets (NOT COMPUTABLE),
f(x) is the target function.
Bias is constant with respect to the Dataset. It represents how close a model h within the selected class H can get to the target function.
[Bias-Variance] What is the Variance Error?
E_x{ E_D[ (h_D(x) - h_bar(x))^2 ] }
Constant with respect to f. It represents the deviation of the computed model h_D(x) with respect to the average one, h_bar(x).
[Bias-Variance] What happens when variance = 0? why?
The Bias gets very high, because zero variance means the model does not depend on the dataset at all: the learned hypothesis is effectively always the same, and in general that single model is far from the target function.
[Model Selection] What are the 2 different approaches to choose the complexity of the model?
- Regularization
- Cross-Validation
[Model Selection] Explain the objective of Regularization.
Model the mismatch between Ein and Eout and define a more appropriate cost.
[Model Selection] What are the two options to perform regularization?
- Constrained optimization problem with a budget C:
min_w Ein(w), subject to: ||w||_2^2 <= C
- Unconstrained optimization problem with regularization penalty λ:
min_w Eaug(w), where:
Eaug(w) = Ein(w) + (λ/N) ||w||_2^2
[Model Selection] What is the formula for Regularized Least Squares (RLS)?
w_hat_reg = (X^T X + λI)^{-1} X^T y
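A minimal sketch of the closed form (data assumed; λ = 0 should recover plain OLS, and a larger λ should shrink the weights):

```python
import numpy as np

def ridge(X, y, lam):
    """Regularized least squares: w = (X^T X + lam*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Assumed noiseless data from y = 1 + 2*x.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([3.0, 5.0, 7.0])

w_ols = ridge(X, y, lam=0.0)      # recovers the OLS solution (1, 2)
w_shrunk = ridge(X, y, lam=10.0)  # larger lam pulls the weights towards 0
```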
[Model Selection] What is the Objective of Cross-Validation?
Try to model Eout directly, without changing the cost to minimize but changing the way we employ the data.
Essentially dividing the data-set into a Training Set and a Validation Set.
[Model Selection] What are the steps in order to do Cross-Validation?
- Divide Data Set into a Training set (size N-k) and a Validation set (size k).
- Learn the model using the Training set: Ein(w) = 1/(N-k) sum_{i=1}^{N-k} (y_i - w^T x_i)^2 => minimizing this produces w_hat
- Validate the model using the Validation set and the w_hat produced by the training:
Eval(w_hat) = 1/k sum_{i=N-k+1}^{N} (y_i - w_hat^T x_i)^2 => this can be taken as an estimate of the Eout of the model
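The two steps above can be sketched as follows, with an assumed noiseless dataset from y = 1 + 2x (so the validation error is essentially zero):

```python
import numpy as np

def train_and_validate(X, y, k):
    """Train OLS on the first N-k examples, validate on the last k."""
    X_tr, y_tr = X[:-k], y[:-k]
    X_val, y_val = X[-k:], y[-k:]
    w_hat = np.linalg.solve(X_tr.T @ X_tr, X_tr.T @ y_tr)  # training step
    e_val = np.mean((y_val - X_val @ w_hat) ** 2)          # estimate of Eout
    return w_hat, e_val

# Assumed noiseless data from y = 1 + 2*x, with N = 10 and k = 2.
X = np.array([[1.0, float(x)] for x in range(1, 11)])
y = 1.0 + 2.0 * X[:, 1]
w_hat, e_val = train_and_validate(X, y, k=2)
```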
[Model Selection] In Cross-Validation, what is the trade-off in the selection of k?
A small k produces a large N-k, which means a better w_hat, but a poor estimate of Eout(w_hat).
A large k produces a good estimate of Eout(w_hat), but a small N-k, which means a worse w_hat.
[Model Selection] In Cross-Validation, what is the rule of thumb to select k?
k = N/5
[Model Selection] How do we combine Regularization and Cross-Validation to tune λ?
Eaug(w) = Ein(w) + (λ/N) ||w||_2^2
1st we Train the model using the training set and a fixed λ to obtain w-hat.
2nd using the validation set we choose the optimal λ-hat.
3rd we use the Test set to estimate Eout, using the w-hat and λ-hat.
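The steps above can be sketched as a grid search over λ (the split and the candidate grid here are assumed; the data is noiseless from y = 1 + 2x, so λ = 0 should win):

```python
import numpy as np

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    return np.mean((y - X @ w) ** 2)

# Assumed training / validation split, noiseless data from y = 1 + 2*x.
X_tr = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_tr = np.array([3.0, 5.0, 7.0])
X_val = np.array([[1.0, 4.0], [1.0, 5.0]])
y_val = np.array([9.0, 11.0])

grid = [0.0, 0.1, 1.0, 10.0]      # assumed candidate lambdas
errors = {lam: mse(X_val, y_val, ridge(X_tr, y_tr, lam)) for lam in grid}
lam_hat = min(errors, key=errors.get)  # step 2: lambda minimizing the validation error
w_hat = ridge(X_tr, y_tr, lam_hat)     # weights for the chosen lambda
```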
[Neural Networks] How is a Neural Network of non-linear combinations optimized?
Using Gradient Descent.
w(t+1) = w(t) - η ∇Ein(w(t))