Week 2: Learning Linear & Logistic Regression Flashcards
What is the loss function?
A measure that expresses the difference between the actual outcome y_i and the value of the regression plane θ^T x_i for one observation i.
State the loss function for linear & logistic regression respectively.
Linear regr.: squared error loss,
L(y, ŷ) = (ŷ - y)^2.
Log regr.: cross-entropy loss,
L(y, g(x)) = -ln g(x) for y = 1, -ln(1 - g(x)) for y = -1.
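A minimal Python sketch of this loss (my own illustration, not from the course material; it assumes g is an already-computed probability that y = +1):

    import math

    def cross_entropy_loss(y, g):
        # y in {-1, +1}; g is the predicted probability that y = +1
        if y == 1:
            return -math.log(g)
        else:
            return -math.log(1 - g)

    # Confident correct prediction -> small loss; confident wrong -> large loss
    print(cross_entropy_loss(1, 0.9))   # ~0.105
    print(cross_entropy_loss(-1, 0.9))  # ~2.303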
Which loss function do we actually use when training a classifier? Why not the misclassification?
The cross-entropy loss. Instead of formulating the loss only in terms of the hard class prediction ŷ, we also incorporate the predicted class probability g(x).
We do not use the misclassification loss because:
1) Using cross-entropy can result in a model that generalizes better from the training data, since the final prediction ŷ does not reveal all aspects of the classifier; training on g(x) amounts to pushing the decision boundary further away from the training data points.
2) Misclassification would give us a piecewise constant cost function, which is impossible to minimize numerically, as its gradient is zero everywhere (except where it is undefined); see the sketch below.
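As a sketch of point 2 (my own illustration): for a fixed correct label, the misclassification loss stays flat as the predicted probability g improves, so its gradient is zero, while the cross-entropy loss keeps changing and therefore provides a usable gradient:

    import math

    def misclassification_loss(y, g):
        # Hard 0/1 loss: predict +1 if g >= 0.5, else -1
        y_hat = 1 if g >= 0.5 else -1
        return 0.0 if y_hat == y else 1.0

    def cross_entropy_loss(y, g):
        return -math.log(g) if y == 1 else -math.log(1 - g)

    # For y = +1, sweep the predicted probability g:
    for g in (0.6, 0.7, 0.8, 0.9):
        # misclassification is flat at 0, cross-entropy keeps decreasing
        print(g, misclassification_loss(1, g), round(cross_entropy_loss(1, g), 3))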
What is the cross-entropy loss function and when is it used?
Commonly used in classification tasks, especially for binary and multiclass classification. It measures the dissimilarity between predicted class probabilities and true class probabilities.
How do we measure closeness in OLS regression?
By residual sum of squares (RSS). Pick values of theta such that RSS is minimized on the training data.
How do we write RSS (not in matrix notation)?
RSS = Σ_{i=1}^n (y_i - ŷ_i)^2 = Σ_{i=1}^n e_i^2
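A quick sketch with illustrative numbers (not course data):

    # RSS = sum of squared residuals e_i = y_i - y_hat_i
    y     = [3.0, 1.5, 4.0]
    y_hat = [2.5, 2.0, 3.5]
    rss = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    print(rss)  # 0.25 + 0.25 + 0.25 = 0.75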
What do we generally model with both linear and logistic regression?
The conditional expectation of y given x.
What is the perspective on linear regression in ML (compared to classical statistics)?
The emphasis is on learning the function itself rather than on estimating the parameters (e.g., for inference).
Formula for squared error loss?
L(y_i, ŷ_i) = (y_i - θ^T x_i)^2
Dimensions of X and Y in the training sample?
X has dim [n x (p+1)], Y has dim [n x 1].
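A small numpy sketch of these shapes (toy numbers, my own illustration), with the leading column of ones for the intercept:

    import numpy as np

    n, p = 5, 2                              # 5 observations, 2 input variables
    X_raw = np.random.rand(n, p)             # raw inputs, shape (n, p)
    X = np.hstack([np.ones((n, 1)), X_raw])  # prepend the intercept column
    y = np.random.rand(n, 1)

    print(X.shape)  # (5, 3) = [n x (p+1)]
    print(y.shape)  # (5, 1) = [n x 1]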
Is the default vector a column or a row vector?
Column
Difference between loss and cost functions?
The loss function measures the dissimilarity between the observed output y_i and the regression plane's prediction for a single observation i, while the cost function measures the same dissimilarity averaged over all observations in the training sample T.
Is linear regression parametric or non-parametric?
Parametric
Is logistic regression parametric or non-parametric?
Parametric
At what rate does the loss function grow as the difference between y and the prediction ŷ(x; θ) increases?
Quadratically.
What function and which error do we use in linear regression to solve the minimization problem?
Cost function (with squared error loss for learning)
What is the cost function?
It represents the overall performance of a machine learning model across all data points and is used to update the model’s parameters during training. It is calculated as the average loss over all data points in T.
It is then typically used in the optimization process, such as gradient descent, to adjust the model’s parameters to minimize the cost.
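A minimal gradient-descent sketch for the least-squares cost (my own illustration; the data, learning rate, and iteration count are arbitrary choices):

    import numpy as np

    # Toy data: y = 1 + 2*x plus noise
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=(100, 1))
    X = np.hstack([np.ones((100, 1)), x])          # [n x (p+1)] with intercept
    y = 1 + 2 * x + 0.1 * rng.standard_normal((100, 1))

    theta = np.zeros((2, 1))
    lr = 0.5
    for _ in range(500):
        grad = 2 / len(y) * X.T @ (X @ theta - y)  # gradient of the average squared error
        theta -= lr * grad

    print(theta.ravel())  # approximately [1, 2]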
What are the normal equations?
The equations obtained by taking the first derivative of the cost function with respect to θ and setting it equal to zero; solving them gives the OLS/ML estimate θ̂.
Formula for the normal equations (vector notation)?
X^T X θ = X^T y, with solution θ̂ = (X^T X)^(-1) X^T y.
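A numpy sketch of solving the normal equations (toy data; np.linalg.solve is used rather than forming the inverse explicitly, which is numerically preferable):

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.hstack([np.ones((50, 1)), rng.standard_normal((50, 2))])
    y = X @ np.array([[0.5], [1.0], [-2.0]]) + 0.1 * rng.standard_normal((50, 1))

    # Normal equations: (X^T X) theta = X^T y
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(theta_hat.ravel())  # approximately [0.5, 1.0, -2.0]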
Illustrate the squared error loss function with a graph!
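In lieu of a drawn graph, a matplotlib sketch that produces one (the parabola in the residual is the key feature):

    import numpy as np
    import matplotlib.pyplot as plt

    residual = np.linspace(-3, 3, 200)  # y_hat - y
    plt.plot(residual, residual ** 2)
    plt.xlabel("y_hat - y")
    plt.ylabel("squared error loss")
    plt.title("Squared error loss grows quadratically in the residual")
    plt.show()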
What is another name for the cost function using squared error loss?
Least squares cost
How do loss and cost enter the learning problem?
We define the loss per observation in T, and we learn by minimizing the cost, i.e., the average loss over the training sample.
What does it mean to maximize the (log) likelihood function?
To find the values of θ that make observing the training outputs y as likely as possible.
Why do we use the log likelihood instead of the likelihood?
Because the log turns the product of the individual likelihood contributions into a sum, which is much easier to compute and differentiate. Since the log is a monotonically increasing function, the maximum of the likelihood is at the same θ as the maximum of the log-likelihood.
Write the log-likelihood function for a linear regression model with normally distributed noise terms.
Assuming i.i.d. noise terms ε_i ~ N(0, σ²):
ln ℓ(θ, σ²) = -(n/2) ln(2πσ²) - (1/(2σ²)) Σ_{i=1}^n (y_i - θ^T x_i)^2
What are dummy variables?
Binary (0/1) input variables used to encode a categorical variable: for a variable with two values, a single dummy is 1 for one value and 0 for the other.
What is one-hot encoding?
A technique used in machine learning and data preprocessing to represent categorical data as binary vectors: each category or label is represented by a binary vector in which exactly one element is "hot" (set to 1) while all others are "cold" (set to 0). This allows ML algorithms, which typically require numerical input, to work with categorical data. One-hot encoding ensures that there is no inherent ordinal relationship between categories, making it suitable for various types of categorical data, such as color names, product categories, or city names.
What is the easiest way to deal with categorical input variables for linear regression?
To create dummy variables (for two values) or one-hot encoded variables (for more than two values).
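A pandas sketch of both cases (the "color" and "sex" columns are hypothetical examples):

    import pandas as pd

    # More than two values: one-hot encoding, one 0/1 column per category
    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
    print(pd.get_dummies(df["color"]))

    # Exactly two values: a single dummy column suffices
    df2 = pd.DataFrame({"sex": ["M", "F", "M"]})
    print(pd.get_dummies(df2["sex"], drop_first=True))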