Week 2: Learning Linear & Logistic Regression Flashcards
What is the loss function?
A measure that expresses the difference between the actual outcome y_i and the value of the regression plane θ^T x_i for one observation i.
State the loss function for linear & logistic regression respectively.
Linear regr.: squared error loss,
L(y, ŷ) = (ŷ - y)^2.
Log regr.: cross-entropy loss,
L(y, g(x)) = -ln g(x) for y = 1, -ln(1 - g(x)) for y = -1.
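A minimal Python sketch of this loss (my own illustration, not from the course material; it assumes g is an already-computed probability that y = +1):

    import math

    def cross_entropy_loss(y, g):
        # y in {-1, +1}; g is the predicted probability that y = +1
        if y == 1:
            return -math.log(g)
        else:
            return -math.log(1 - g)

    # Confident correct prediction -> small loss; confident wrong -> large loss
    print(cross_entropy_loss(1, 0.9))   # ~0.105
    print(cross_entropy_loss(-1, 0.9))  # ~2.303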
Which loss function do we actually use when training a classifier? Why not the misclassification?
The cross-entropy loss. Instead of formulating the loss only in terms of the hard class prediction ŷ, we also incorporate the predicted class probability g(x).
We do not use the misclassification loss because:
1) Using cross-entropy can result in a model that generalizes better from the training data, since the final prediction ŷ does not reveal all aspects of the classifier; training on g(x) amounts to pushing the decision boundary further away from the training data points.
2) Misclassification would give us a piecewise constant cost function, which is impossible to minimize numerically, as its gradient is zero everywhere (except where it is undefined); see the sketch below.
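As a sketch of point 2 (my own illustration): for a fixed correct label, the misclassification loss stays flat as the predicted probability g improves, so its gradient is zero, while the cross-entropy loss keeps changing and therefore provides a usable gradient:

    import math

    def misclassification_loss(y, g):
        # Hard 0/1 loss: predict +1 if g >= 0.5, else -1
        y_hat = 1 if g >= 0.5 else -1
        return 0.0 if y_hat == y else 1.0

    def cross_entropy_loss(y, g):
        return -math.log(g) if y == 1 else -math.log(1 - g)

    # For y = +1, sweep the predicted probability g:
    for g in (0.6, 0.7, 0.8, 0.9):
        # misclassification is flat at 0, cross-entropy keeps decreasing
        print(g, misclassification_loss(1, g), round(cross_entropy_loss(1, g), 3))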
What is the cross-entropy loss function and when is it used?
Commonly used in classification tasks, especially for binary and multiclass classification. It measures the dissimilarity between predicted class probabilities and true class probabilities.
How do we measure closeness in OLS regression?
By residual sum of squares (RSS). Pick values of theta such that RSS is minimized on the training data.
How do we write RSS (not in matrix notation)?
RSS = Σ_{i=1}^n (y_i - ŷ_i)^2 = Σ_{i=1}^n e_i^2
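A quick sketch with illustrative numbers (not course data):

    # RSS = sum of squared residuals e_i = y_i - y_hat_i
    y     = [3.0, 1.5, 4.0]
    y_hat = [2.5, 2.0, 3.5]
    rss = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    print(rss)  # 0.25 + 0.25 + 0.25 = 0.75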
What do we generally model with both linear and logistic regression?
The conditional expectation of y given x.
What is the perspective on linear regression in ML (compared to classical statistics)?
The emphasis is on learning the function itself rather than on estimating the parameters (e.g., for inference).
Formula for squared error loss?
L(y_i, ŷ_i) = (y_i - θ^T x_i)^2
Dimensions of X and Y in the training sample?
X has dim [n x (p+1)], Y has dim [n x 1].
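A small numpy sketch of these shapes (toy numbers, my own illustration), with the leading column of ones for the intercept:

    import numpy as np

    n, p = 5, 2                              # 5 observations, 2 input variables
    X_raw = np.random.rand(n, p)             # raw inputs, shape (n, p)
    X = np.hstack([np.ones((n, 1)), X_raw])  # prepend the intercept column
    y = np.random.rand(n, 1)

    print(X.shape)  # (5, 3) = [n x (p+1)]
    print(y.shape)  # (5, 1) = [n x 1]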
Is the default vector a column or a row vector?
Column
Difference between loss and cost functions?
The loss function measures the dissimilarity between the observed output y_i and the regression plane's prediction for a single observation i, while the cost function measures the same dissimilarity averaged over all observations in the training sample T.
Is linear regression parametric or non-parametric?
Parametric
Is logistic regression parametric or non-parametric?
Parametric
At what rate does the loss function grow as the difference between y and the prediction ŷ(x; θ) increases?
Quadratically.
What function and which error do we use in linear regression to solve the minimization problem?
Cost function (with squared error loss for learning)
What is the cost function?
It represents the overall performance of a machine learning model across all data points and is used to update the model’s parameters during training. It is calculated as the average loss over all data points in T.
It is then typically used in the optimization process, such as gradient descent, to adjust the model’s parameters to minimize the cost.
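A minimal gradient-descent sketch for the least-squares cost (my own illustration; the data, learning rate, and iteration count are arbitrary choices):

    import numpy as np

    # Toy data: y = 1 + 2*x plus noise
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=(100, 1))
    X = np.hstack([np.ones((100, 1)), x])          # [n x (p+1)] with intercept
    y = 1 + 2 * x + 0.1 * rng.standard_normal((100, 1))

    theta = np.zeros((2, 1))
    lr = 0.5
    for _ in range(500):
        grad = 2 / len(y) * X.T @ (X @ theta - y)  # gradient of the average squared error
        theta -= lr * grad

    print(theta.ravel())  # approximately [1, 2]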
What are the normal equations?
The equations obtained by taking the first derivative of the cost function with respect to θ and setting it equal to zero; solving them gives the OLS/ML estimate θ̂.
Formula for the normal equations (vector notation)?
X^T X θ = X^T y, with solution θ̂ = (X^T X)^(-1) X^T y.
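A numpy sketch of solving the normal equations (toy data; np.linalg.solve is used rather than forming the inverse explicitly, which is numerically preferable):

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.hstack([np.ones((50, 1)), rng.standard_normal((50, 2))])
    y = X @ np.array([[0.5], [1.0], [-2.0]]) + 0.1 * rng.standard_normal((50, 1))

    # Normal equations: (X^T X) theta = X^T y
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    print(theta_hat.ravel())  # approximately [0.5, 1.0, -2.0]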
Illustrate the squared error loss function with a graph!
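In lieu of a drawn graph, a matplotlib sketch that produces one (the parabola in the residual is the key feature):

    import numpy as np
    import matplotlib.pyplot as plt

    residual = np.linspace(-3, 3, 200)  # y_hat - y
    plt.plot(residual, residual ** 2)
    plt.xlabel("y_hat - y")
    plt.ylabel("squared error loss")
    plt.title("Squared error loss grows quadratically in the residual")
    plt.show()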
What is another name for the cost function using squared error loss?
Least squares cost
How do loss and cost enter the learning problem?
We define the loss per observation in T, and we learn by minimizing the cost, i.e., the average loss over the training sample.
What does it mean to maximize the (log) likelihood function?
To find the values of θ that make observing the training outputs y as likely as possible.
Why do we use the log likelihood instead of the likelihood?
Because the log turns the product of the individual likelihood contributions into a sum, which is much easier to compute and differentiate. Since the log is a monotonically increasing function, the maximum of the likelihood is at the same θ as the maximum of the log-likelihood.
Write the log-likelihood function for a linear regression model with normally distributed noise terms.
Assuming i.i.d. noise terms ε_i ~ N(0, σ²):
ln ℓ(θ, σ²) = -(n/2) ln(2πσ²) - (1/(2σ²)) Σ_{i=1}^n (y_i - θ^T x_i)^2
What are dummy variables?
Binary (0/1) input variables used to encode a categorical variable: for a variable with two values, a single dummy is 1 for one value and 0 for the other.
What is one-hot encoding?
A technique used in machine learning and data preprocessing to represent categorical data as binary vectors: each category or label is represented by a binary vector in which exactly one element is "hot" (set to 1) while all others are "cold" (set to 0). This allows ML algorithms, which typically require numerical input, to work with categorical data. One-hot encoding ensures that there is no inherent ordinal relationship between categories, making it suitable for various types of categorical data, such as color names, product categories, or city names.
What is the easiest way to deal with categorical input variables for linear regression?
To create dummy variables (for two values) or one-hot encoded variables (for more than two values).
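A pandas sketch of both cases (the "color" and "sex" columns are hypothetical examples):

    import pandas as pd

    # More than two values: one-hot encoding, one 0/1 column per category
    df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
    print(pd.get_dummies(df["color"]))

    # Exactly two values: a single dummy column suffices
    df2 = pd.DataFrame({"sex": ["M", "F", "M"]})
    print(pd.get_dummies(df2["sex"], drop_first=True))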