CHAP 7 : Logistic Regression Flashcards

1
Q

What is logistic regression?

A

It is a supervised learning algorithm: a classification algorithm that assigns data to a discrete set of classes.

2
Q

Give example(s) of classification problems.

A
  1. Email classification : spam or not spam
  2. Financial data analysis : fraud / not fraud
  3. Credit analysis : approve or deny credit
  4. Marketing : will buy or won't buy
  • basically binary classification (only 2 classes)
3
Q

What is the logistic function for logistic regression (analogous to the best-fit line equation of linear regression)?

A

y hat = g(W.X^T),
g(z) = 1/(1 + e^-z),

thus y hat = 1/(1 + e^-(W.X^T))
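
A minimal NumPy sketch of this hypothesis function (the names sigmoid, predict_proba, W and X are illustrative, not from the notes):

    import numpy as np

    def sigmoid(z):
        # g(z) = 1 / (1 + e^-z): squashes any real value into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(W, X):
        # y hat = g(W . X^T): linear combination of the features passed through the sigmoid
        return sigmoid(X @ W)

    # toy example: 3 samples, each with a bias term and 2 features
    X = np.array([[1.0, 0.5, 2.0],
                  [1.0, -1.0, 0.3],
                  [1.0, 2.0, -0.5]])
    W = np.array([0.1, 0.8, -0.4])
    print(predict_proba(W, X))   # three probabilities in (0, 1)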

4
Q

What is the name for the logistic function?

A

Sigmoid function

5
Q

From the values generated by the sigmoid function, how do the values get classified into class 0 or 1 by the classifier?

A

If the value is < 0.5, the predicted class is 0; if the value is >= 0.5, the predicted class is 1.
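
A short continuation of the earlier sketch showing this thresholding step (the probability values are made up):

    import numpy as np

    probs = np.array([0.12, 0.50, 0.93])   # sigmoid outputs
    labels = (probs >= 0.5).astype(int)    # < 0.5 -> class 0, >= 0.5 -> class 1
    print(labels)                          # [0 1 1]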

6
Q

What is the error function in logistic regression?

A

E(W) = 1/(2N) * summation_i (y(i) - yhat(i))^2 – refer to notes

7
Q

Why can't we use the same error function (average MSE) as linear regression for logistic regression?

A

With the sigmoid plugged in, the MSE error surface is non-convex: there will be many local minima and the algorithm may get stuck in a local minimum.

8
Q

What is the cost function for logistic regression?

A

cost (yhat(x), y) =
-log(yhat(x)) if y = 1;
-log(1-yhat(x)) if y = 0.

See notes [we can rewrite the error function using the cost function]
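
A small sketch of that rewrite, using the standard combined form -[y*log(yhat) + (1-y)*log(1-yhat)], which reduces to the two cases above (function names are illustrative):

    import numpy as np

    def cost(y_hat, y):
        # -log(y_hat) when y = 1, -log(1 - y_hat) when y = 0, written as one expression
        eps = 1e-12                          # avoid log(0)
        y_hat = np.clip(y_hat, eps, 1 - eps)
        return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    def error(y_hat, y):
        # E(W): average cost over the N training examples
        return np.mean(cost(y_hat, y))

    y     = np.array([1, 0, 1])
    y_hat = np.array([0.9, 0.2, 0.6])
    print(error(y_hat, y))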

9
Q

How does the gradient descent algorithm work for logistic regression?

A
  1. Initialise W with random values or zeros
  2. Loop till convergence
    for each w(j) in W do :
      w(j) = w(j) + L * (1/N) * summation_i (y(i) - yhat(x(i))) * x(j)(i), where j indexes the jth feature (column) and i indexes the ith training example

see notes for equation.
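
A compact sketch of that loop (the learning rate L, the fixed iteration count standing in for "loop till convergence", and the toy data are all assumptions):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient_descent(X, y, L=0.1, n_iters=1000):
        N, d = X.shape
        W = np.zeros(d)                      # step 1: initialise W with zeros
        for _ in range(n_iters):             # step 2: loop (here a fixed number of iterations)
            y_hat = sigmoid(X @ W)
            # vectorised form of: w(j) = w(j) + L * (1/N) * sum_i (y(i) - yhat(x(i))) * x(j)(i)
            W += L * (X.T @ (y - y_hat)) / N
        return W

    # toy data: a bias column plus one feature
    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([0, 0, 1, 1])
    print(gradient_descent(X, y))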

10
Q

What is a confusion matrix?

A

A confusion matrix is a performance measurement for machine learning classification.

It presents a table layout of the different outcomes of the prediction and results of a classification problem and helps visualize its outcomes.

Values : True positive, true negative, false positive, false negative.
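
A hedged scikit-learn example (the label arrays are made up; for binary 0/1 labels the matrix is laid out as [[TN, FP], [FN, TP]]):

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    # rows are actual classes, columns are predicted classes
    print(confusion_matrix(y_true, y_pred))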

11
Q

What is the difference between the training dataset and the validation dataset?

A

The training dataset is a set of examples used for learning, that is, to fit the parameters of the classifier. The validation dataset contains different samples used to evaluate the trained model.

[The validation dataset is useful when it comes to hyper-parameter tuning and model selection. The validation examples included in this set will be used to find the optimal values for the hyper-parameters of the model under consideration.]
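
One common way to carve out such a validation set, sketched with scikit-learn's train_test_split (the dataset and the 80/20 split are arbitrary choices):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)

    # hold out 20% of the examples as a validation set; the rest is used to fit the parameters
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    print(X_train.shape, X_val.shape)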

12
Q

From the confusion matrix, there are 4 other metrics to evaluate classification output. What are they?

A
  1. Precision
  2. Recall (sensitivity)
  3. F1 score
  4. Support
13
Q

What is precision, how is it calculated?

A

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. (High precision relates to a low false positive rate.)

Precision = TP/(TP+FP)

TP: true positive ; FP : False positive

14
Q

What is recall and how is it calculated?

A

It is the ratio of correctly predicted positive observations to all observations in the actual class.

Recall = TP / (TP + FN)

FN: False negatives

15
Q

What is F1 score and how is it calculated?

A

F1 Score is the harmonic mean of Precision and Recall (often described as a weighted average of the two)

  • F1 Score = 2*(Recall * Precision) / (Recall + Precision)
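
Putting the formulas from cards 13-15 together in a small sketch (the TP/FP/FN counts are made up):

    TP, FP, FN = 40, 10, 20

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)
    f1        = 2 * (recall * precision) / (recall + precision)   # harmonic mean of the two
    print(precision, recall, f1)
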
16
Q

What is support?

A

Support is the number of actual occurrences of the class in the specified dataset, i.e., the number of occurrences of each class in the original y values of the dataset.
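
All four metrics (precision, recall, F1 score and support) are reported per class by scikit-learn's classification_report; a hedged example with toy labels:

    from sklearn.metrics import classification_report

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    # support is the count of each class in y_true
    print(classification_report(y_true, y_pred))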

17
Q

You are training a logistic regression model and you find that your training error is close to 0, but the testing error is very high. What can be done to improve this situation? Note: This situation is applicable to all the machine learning problems and not specific to logistic regression.

A
  1. Increase the training data size
  2. Train on a combination of your training data and your test data but test only on your test data.

(Since we are facing overfitting, we can increase the training data size to combat this. Also, if we train on our test data our test loss will definitely improve dramatically, but you should never do this in practice because it defeats the purpose of testing and will make performance worse when the model is deployed and used on new data.)

18
Q

What are 2 kinds of validation methods used?

A
  1. K-fold cross validation
  2. Leave-one-out cross validation
19
Q

How does k fold cross validation work?

A

It splits the data into k folds, trains on k-1 folds, and tests on the one fold that was left out. It repeats this for all k folds and averages the results across them.
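
A hedged scikit-learn sketch of k-fold cross-validation (k = 5, the model settings and the dataset are arbitrary choices):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    model = LogisticRegression(max_iter=5000)

    # 5 folds: train on 4 folds, validate on the held-out fold, repeat for each fold
    scores = cross_val_score(model, X, y, cv=5)
    print(scores, scores.mean())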

20
Q

What is the advantage of k fold cross validation?

A

The advantage is that all observations are used for both training and validation, and each observation is used once for validation.

21
Q

What is leave-one-out cross validation?

A

A variant of k-Fold CV is Leave-one-out Cross-Validation (LOOCV). LOOCV uses each sample in the data as a separate test set while all remaining samples form the training set. This variant is identical to k-fold CV when k = n (number of observations).
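
The same idea with scikit-learn's LeaveOneOut splitter, a sketch under the same assumptions as the k-fold example above (with n observations the model is fitted n times, so this can be slow):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    # each of the n samples is used once as a single-example test set
    scores = cross_val_score(model, X, y, cv=LeaveOneOut())
    print(scores.mean())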