Week 2: Learning Linear & Logistic Regression Flashcards

1
Q

What is the loss function?

A

A measure of the discrepancy between the actual outcome y_i and the regression plane θ^T x_i for a single observation i.

2
Q

State the loss function for linear & logistic regression respectively.

A

Linear regr.: squared error loss,
L(y, y.hat) = (y.hat - y)^2

Log regr.: cross-entropy loss,
L(y, g(x)) = -ln g(x) for y = 1, -ln(1 - g(x)) for y = -1

3
Q

Which loss function do we actually use when training a classifier? Why not the misclassification loss?

A

The cross-entropy loss. Instead of formulating the loss only in terms of the hard class prediction y.hat, it also incorporates the predicted class probability g(x).

We do not use the misclassification loss because:
1) using cross-entropy can give a model that generalizes better from the training data, since the hard prediction y.hat does not reveal all aspects of the classifier (using the probabilities pushes the decision boundary further away from the training data points);

2) the misclassification loss gives a piecewise constant cost function, which is unusable for numerical optimization since its gradient is zero everywhere (except where it is undefined).
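A minimal numeric sketch of point 2 (illustrative only; assumes a 1-D logistic model g(x) = sigmoid(θ·x), labels y in {-1, +1}, and a single training point): as θ varies, the misclassification loss stays piecewise constant while the cross-entropy loss changes smoothly, so only the latter provides a useful gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One training point (x = 1, y = +1); sweep the single parameter theta.
x, y = 1.0, 1
for theta in [-1.0, -0.5, 0.5, 1.0, 2.0]:
    g = sigmoid(theta * x)                      # predicted P(y = +1 | x)
    yhat = 1 if g >= 0.5 else -1                # hard class prediction
    miscls = 0.0 if yhat == y else 1.0          # piecewise constant in theta
    xent = -np.log(g) if y == 1 else -np.log(1.0 - g)  # smooth in theta
    print(f"theta={theta:+.1f}  misclassification={miscls:.0f}  cross-entropy={xent:.3f}")
```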

4
Q

What is the cross-entropy loss function and when is it used?

A

Commonly used in classification tasks, especially for binary and multiclass classification. It measures the dissimilarity between predicted class probabilities and true class probabilities.
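A small sketch of that dissimilarity measure (the arrays y_true and p_pred are made-up illustrative data): with one-hot true labels and predicted class probabilities, the cross-entropy is the negative log of the probability assigned to the correct class, averaged over observations.

```python
import numpy as np

# Multiclass example: 3 observations, 3 classes.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])            # one-hot encoded true classes
p_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.2, 0.3, 0.5]])      # predicted class probabilities

# Cross-entropy: minus the sum over classes of true prob * log predicted prob,
# averaged over the observations.
cross_entropy = -np.mean(np.sum(y_true * np.log(p_pred), axis=1))
print(cross_entropy)   # lower is better; 0 only if the correct class gets probability 1
```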

5
Q

How do we measure closeness in OLS regression?

A

By the residual sum of squares (RSS): pick the values of theta that minimize the RSS on the training data.

6
Q

How do we write the RSS (not in matrix notation)?

A

RSS = sum_i (y_i - y_i.hat)^2 = sum_i e_i^2
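A quick numpy check of the same formula (the data values are illustrative):

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])        # observed outcomes
y_hat = np.array([2.5, 0.0, 2.0, 8.0])     # fitted values from the regression

residuals = y - y_hat                      # e_i = y_i - y_i.hat
rss = np.sum(residuals ** 2)               # RSS = sum of squared residuals
print(rss)                                 # 1.5
```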

7
Q

What do we generally model with both linear and logistic regression?

A

The conditional expectation of y given x.

8
Q

What is the perspective on linear regression in ML (compared to classical statistics)?

A

The emphasis is on learning the function itself rather than on estimating and interpreting the parameters (e.g. for inference).

9
Q

Formula for squared error loss?

A

L(y_i, y_i.hat) = (y_i - θ^T x_i)^2

10
Q

Dimensions of X and Y in the training sample?

A

X has dimension [n x (p+1)] (n observations, p inputs plus a column of ones for the intercept); Y has dimension [n x 1].
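A small sketch of how these shapes arise (the data here is simulated for illustration): stacking a column of ones next to the p inputs gives X of shape (n, p+1), while the outputs form a column vector of shape (n, 1).

```python
import numpy as np

n, p = 5, 2
rng = np.random.default_rng(0)

inputs = rng.normal(size=(n, p))            # raw input matrix, n x p
X = np.column_stack([np.ones(n), inputs])   # prepend intercept column -> n x (p+1)
Y = rng.normal(size=(n, 1))                 # outputs as a column vector, n x 1

print(X.shape, Y.shape)                     # (5, 3) (5, 1)
```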

11
Q

Is the default vector a column or a row vector?

A

Column

12
Q

Difference between loss and cost functions?

A

The loss function measures the dissimilarity between the observed output y_i and the p-dimensional regression plane for a single observation, while the cost function measures the same dissimilarity averaged over all observations in the training sample.

13
Q

Is linear regression parametric or non-parametric?

A

Parametric

14
Q

Is logistic regression parametric or non-parametric?

A

Parametric

15
Q

At what rate does the loss function grow as the difference between y and the prediction yhat(x;theta) increases?

A

Quadratically.

16
Q

What function and which error do we use in linear regression to solve the minimization problem?

A

The cost function, using the squared error loss.

17
Q

What is the cost function?

A

It represents the overall performance of a machine learning model across all data points and is used to update the model’s parameters during training. It is calculated as the average loss over all data points in the training set T.

It is then typically used in the optimization process, such as gradient descent, to adjust the model’s parameters to minimize the cost.
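A minimal gradient-descent sketch for linear regression with the least-squares cost (the simulated data, variable names, and learning rate are all illustrative assumptions): the cost is the average squared error loss over the training set, and each step moves θ along the negative gradient of that cost.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # n x (p+1), with intercept
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true + 0.1 * rng.normal(size=n)                 # simulated training data

def cost(theta):
    # Least-squares cost: average squared error loss over all n observations in T.
    return np.mean((y - X @ theta) ** 2)

theta = np.zeros(p + 1)
learning_rate = 0.1
for _ in range(500):
    gradient = -2.0 / n * X.T @ (y - X @ theta)   # gradient of the cost w.r.t. theta
    theta -= learning_rate * gradient             # gradient descent update

print(theta, cost(theta))   # theta should end up close to theta_true
```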

18
Q

What are the normal equations?

A

The equations obtained by taking the first derivative of the cost function with respect to θ and setting it equal to zero; solving them gives the OLS/ML estimate θ.hat.

19
Q

Formula for normal equations (vector not.)?

A

X^T X θ.hat = X^T y, which (when X^T X is invertible) gives θ.hat = (X^T X)^(-1) X^T y.
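A numpy sketch of solving the normal equations (on made-up data): solving X^T X θ = X^T y directly, rather than forming the inverse explicitly, gives the same θ.hat as a standard least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

# Solve the normal equations X^T X theta = X^T y for theta.hat.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent (and numerically preferable) least-squares solver:
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_hat, theta_lstsq))   # True
```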

20
Q

Illustrate the squared error loss function with a graph!

A
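A minimal matplotlib sketch of such a graph (the squared error loss is a parabola in the error y.hat - y, with its minimum 0 at y.hat = y):

```python
import numpy as np
import matplotlib.pyplot as plt

error = np.linspace(-3, 3, 200)          # y.hat - y
loss = error ** 2                        # squared error loss

plt.plot(error, loss)
plt.xlabel("y.hat - y")
plt.ylabel("squared error loss")
plt.title("Squared error loss grows quadratically with the error")
plt.show()
```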
21
Q

What is another name for the cost function using squared error loss?

A

Least squares cost

22
Q

Learn with the loss on T and minimize with the cost?

A

Yes: the loss is defined for one observation at a time, and learning on the training data T means minimizing the cost, i.e. the loss averaged over all observations in T.
23
Q

What does it mean to maximize the (log) likelihood function?

A

To find the values of theta that make the observed y as likely as possible.

24
Q

Why do we use the log likelihood instead of the likelihood?

A

Because the log turns the product of the individual likelihood terms into a sum, which is much easier to compute and differentiate. Since the log is a monotonically increasing function, the maximum of the log likelihood is attained at the same theta as the maximum of the likelihood.
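A tiny numeric sketch of why the sum of logs is easier to work with than the product (the numbers are illustrative): multiplying many small likelihood terms underflows to zero in floating point, while summing their logs stays well behaved, and both are maximized by the same theta since the log is monotone.

```python
import numpy as np

# 1000 individual likelihood contributions, each a small probability value.
likelihood_terms = np.full(1000, 1e-3)

product = np.prod(likelihood_terms)          # underflows to 0.0 in floating point
log_sum = np.sum(np.log(likelihood_terms))   # about -6907.8, no underflow

print(product, log_sum)
```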

25
Q

Write the log likelihood function for a linear regression model with normally distributed noise terms.

A

For y_i = θ^T x_i + ε_i with ε_i ~ N(0, σ^2):

ln L(θ) = -(n/2) ln(2π σ^2) - (1/(2σ^2)) sum_i (y_i - θ^T x_i)^2

Maximizing this in θ is therefore equivalent to minimizing the sum of squared errors.
26
Q

What are dummy variables?

A

Binary 0/1 input variables used to encode a categorical variable; a variable with two categories is represented by a single dummy that is 1 for one category and 0 for the other.
27
Q

What is one-hot encoding?

A

One-hot encoding is a technique used in machine learning and data preprocessing to represent categorical data as binary vectors. In this encoding, each category or label is represented by a binary vector where only one element is “hot” (set to 1) while all others are “cold” (set to 0). This encoding allows ML algorithms to work with categorical data, as they typically require numerical input. One-hot encoding ensures that there is no inherent ordinal relationship between categories, making it suitable for various types of categorical data, such as color names, product categories, or city names.
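A short pandas sketch (the column name and values are made up for illustration): pd.get_dummies turns a categorical column into one binary column per category, with exactly one 1 per row.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One column per category; exactly one entry per row is 1 ("hot").
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(one_hot)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           0            1          0
# 2           1            0          0
# 3           0            1          0
```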

28
Q

What is the easiest way to deal with categorical input variables for linear regression?

A

To create dummy variables (for a variable with two values) or one-hot encoded variables (for more than two values).