L5: Logistic Regression Flashcards by Sandie Huang

In the context of machine learning, “induction” refers to the process of learning patterns, rules, or models from data. It involves generalizing from specific examples to make predictions or decisions about new, unseen data.

TRUE/FALSE

TRUE

What is the problem of induction? How does it relate to training and testing ML models? Select all correct options

A) it can lead to overfitting
B) very general, simple rules might be more robust across time and change
C) complex and detailed models may become outdated as conditions and interdependencies change

All options are correct:

A) it can lead to overfitting
B) very general, simple rules might be more robust across time and change
C) complex and detailed models may become outdated as conditions and interdependencies change

Why is linear regression a form of supervised learning?

You are giving the model “the ground truth”

The term “ground truth” is used in the context of training supervised ML models. During the training phase, the model learns from the input features and their corresponding ground truth labels.

TRUE/FALSE

TRUE

In predictive models, we care less about interpreting the model _____. We just want to know how well the model ______

Fill in the blanks

In predictive models, we care less about interpreting the model COEFFICIENTS. We just want to know how well the model PERFORMS IN PREDICTION

Which of the following statements are true about stratified random sampling?
A) It helps ensure similar representative distributions in train/test sets
B) When subpopulations within an overall population vary, it can be advantageous to sample each subpopulation independently - i.e., the sample incl. representation from each subgroup
C) it helps mitigate the “luck” component when splitting data into testing and training , e.g., that you don’t draw a particularly unlucky/ lucky train set that captures relations that just happened to be in that part of the sample
D) It helps solve dataset imbalance

All options are true
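
As an illustration, a stratified train/test split can be sketched as follows. This is a minimal Python sketch, not how caret implements it; the function name `stratified_split`, the 30/70 class mix, and the 40% test fraction are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                          # random draw within the class
        n_test = round(len(idx) * test_fraction)  # same fraction from every class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [1] * 30 + [0] * 70  # hypothetical data: 30% positives
train_idx, test_idx = stratified_split(labels, test_fraction=0.4)
train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
```

Both `train_share` and `test_share` stay at the overall 30% positive share, which is the point of stratifying on the outcome.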

When using K-fold cross-validation, which of the following statements are not true?
A) it splits the training data into K folds, where K-1 is used for training and remainder for testing
B) it is a technique for evaluating predictive model performance
C) the model is trained and evaluated k times, using a different fold as the validation set each time
D) the highest performing fold comprises the model’s generalisation performance

Wrong: D
Performance metrics from each fold are averaged to estimate the model’s generalisation performance
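
The averaging step can be sketched in Python as below. The "model" here is just the training mean and the outcome values are made up; both are assumptions chosen to keep the sketch short, and `k_fold_indices` and `cross_validate` are hypothetical helper names.

```python
def k_fold_indices(n, k):
    """Assign indices 0..n-1 to k folds in round-robin order."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(y, k):
    """Return per-fold MSE of a mean-only model and the average across folds."""
    folds = k_fold_indices(len(y), k)
    fold_mse = []
    for val_idx in folds:
        held_out = set(val_idx)
        train = [y[i] for i in range(len(y)) if i not in held_out]
        prediction = sum(train) / len(train)  # "model" = mean of the K-1 training folds
        errors = [(y[i] - prediction) ** 2 for i in val_idx]
        fold_mse.append(sum(errors) / len(errors))
    # The generalisation estimate is the average over folds, not the best fold
    return fold_mse, sum(fold_mse) / k

y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 10.0, 12.0, 11.0]  # hypothetical outcomes
fold_mse, cv_estimate = cross_validate(y, k=5)
```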

What is the trade-off when choosing number of folds?

The higher the K, the more data each model is trained on and the more times the model is fitted and evaluated. This typically gives a more reliable performance estimate, but at the cost of more computational power and time.

What is the purpose of setting “seed”?

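A “seed” fixes the starting point of the random number generator, so each seed corresponds to one particular sequence of random draws. Setting a specific seed in R ensures that anything involving randomness (e.g., the train/test split) gives the same result every time you run the same code on the same data.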
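
The same idea can be shown with Python's standard `random` module (used here only as an illustration; it is an analogue of setting a seed in R, not the R call itself). The seed value 42 is arbitrary.

```python
import random

random.seed(42)                               # fix the starting point
first = [random.random() for _ in range(3)]

random.seed(42)                               # same seed again
second = [random.random() for _ in range(3)]

same_draws = first == second                  # identical sequences of "random" numbers
```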

In our project, we use the double splitting procedure. Explain what this is.

Hint: has something to do with which part of the dataset K-fold cross validation is performed on

We first split the data into a training set and a test set, keeping a spare 40% of the data as the test set.
K-fold cross-validation then takes place in the training dataset alone, and the held-out test set is used afterwards to assess model performance.

Which of the following statements are true about the LOOCV cross validation method?

A) stands for leave-one-out cross-validation
B) if n datapoints, the model is trained on n-1 datapoints
C) Model is built using only data from training set
D) Model is then used to predict the response value of the one observation left out from training set
E) the procedure is repeated n times (n = total number of observations in dataset)
F) every repetition (n times), a different observation is left out of the train set
G) the mean squared error is then calculated as the avg. of the MSE of all the test runs
H) very useful for large datasets

All options are correct except H:

H) very useful for large datasets

Instead, LOOCV is useful for small datasets. For the hotel demand dataset, with 100k+ observations, this cross-validation technique would require 100k+ repetitions, with one datapoint held out every time. This would take too long.
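
A minimal Python sketch of the procedure, assuming a mean-only "model" and three made-up observations (both are illustrative assumptions; `loocv_mse` is a hypothetical name):

```python
def loocv_mse(y):
    """Leave-one-out CV for a mean-only model: n fits, one held-out point each."""
    n = len(y)
    squared_errors = []
    for i in range(n):
        train = y[:i] + y[i + 1:]             # n-1 training points
        prediction = sum(train) / len(train)  # fit on the training points only
        squared_errors.append((y[i] - prediction) ** 2)
    return sum(squared_errors) / n            # average over the n test runs

mse = loocv_mse([1.0, 2.0, 3.0])  # squared errors 2.25, 0.0, 2.25 -> 1.5
```

The loop runs once per observation, which is why the cost grows with the size of the dataset.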

Logistic regression is similar to linear regression. But what contexts render the former more useful?

Logistic regression is similar to linear regression, but used when trying to predict a binary outcome, i.e., for a categorical response variable (1/0 or yes/no)

A standard linear model doesn’t work for probabilities, which are bounded between 0 and 1

TRUE/FALSE

TRUE.

Indeed, it is in these cases that logistic regression is more useful

Explain what “odds” represent

Odds = P/(1-P)

Odds are a ratio that quantifies the relationship between the likelihood of an event happening and the likelihood of it not happening

What is the logit mathematically?

The logit is the logarithm of the odds:
Log(odds) = Log(P/(1-P))

In logistic regression, the logit is given by:
Ln(P/(1-P)) = β0 + β1 x1 + β2 x2 + … + βq xq = Log(odds)

Sigmoid curve represents the graphical relationship between _____ and _____
A) probability of success and logit
B) probability of success and sigmoid function
C) probability of success and odds

A)
Sigmoid curve represents the graphical relationship between probability of success and logit

What shape is the sigmoid curve?

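S-shaped: as the probability of success approaches 0, the logit becomes increasingly negative; at a probability of 0.5, the logit is 0; and as the probability approaches 1, the logit becomes increasingly positive

See p. 66 in the lecture notes for a visualisation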

What does the “e” mean in the logistic regression function?
p = 1/(1 + e^(-(β0 + β1 x1 + β2 x2 + … + βq xq)))

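“e” is Euler's number (approx. 2.718), the base of the natural logarithm. Because the logit is a natural logarithm of the odds, raising e to the power of the linear predictor undoes that logarithm, which lets us describe how the probability of an event grows or decays as a function of the model’s input features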
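
The function can be written out in a few lines of Python. This is a sketch only; the name `predict_probability` and the coefficient values are assumptions for illustration.

```python
import math

def predict_probability(intercept, coefs, x):
    """p = 1 / (1 + e^(-(b0 + b1*x1 + ... + bq*xq)))."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))  # linear predictor (the logit)
    return 1.0 / (1.0 + math.exp(-z))

p_zero_logit = predict_probability(0.0, [1.0], [0.0])             # logit 0 gives p = 0.5
p_example = predict_probability(-1.0, [0.8, -0.5], [2.0, 1.0])    # hypothetical coefficients
```

Whatever the inputs, the result stays strictly between 0 and 1.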

What does it mean to say the effect of an independent variable is linear/non-linear?
Select all correct options:
A) it relates to the change in the dependent variable from a change in an independent variable
B) when linear, a one-unit change in IV leads to a constant change in the DV
C) when linear, the coefficients of the independent variables remain constant
D) when non-linear, the change in the DV from a one-unit change is not constant
E) non-linear effects are usually represented graphically by a straight line

A) it relates to the change in the dependent variable from a change in an independent variable
B) when linear, a one-unit change in IV leads to a constant change in the DV
C) when linear, the coefficients of the independent variables remain constant
D) when non-linear, the change in the DV from a one-unit change is not constant

FALSE: E) non-linear effects are usually represented graphically by a straight line
Instead: non-linear effects can take various forms, incl. quadratic, exponential, logarithmic, etc.

Which of the following options constitute reasons for logistic regression being superior in binary classification relative to linear reg.?
A) it models the probability of an event occurring and transforms this probability using the logistic function to ensure predictions fall between 0 and 1
B) it is specifically designed for binary outcomes
C) it provides meaningful and interpretable results in terms of odds ratios and probabilities
D) linear regression is just as good as logistic regression for binary classification

A) it models the probability of an event occurring and transforms this probability using the logistic function to ensure predictions fall between 0 and 1
B) it is specifically designed for binary outcomes
C) it provides meaningful and interpretable results in terms of odds ratios and probabilities

D is wrong

What is a logit and why do we use it?
A) it is a critical component in log. reg.
B) it enables the modelling of the relationship between predictors and the prob. of a binary outcome
C) it transforms probabilities into a linear space, facilitating the estimation of coefficients
D) the coefficients represent the change in the log odds of the outcome for a one-unit change in the given predictor

All options are true
A) it is a critical component in log. reg.
B) it enables the modelling of the relationship between predictors and the prob. of a binary outcome
C) it transforms probabilities into a linear space, facilitating the estimation of coefficients
D) the coefficients represent the change in the log odds of the outcome for a one-unit change in the given predictor

How are odds, probabilities, and log odds related? If an event has a 20% probability of occurring, what are the odds of it occurring?

In log. reg., log odds are modelled linearly with predictor variables. The logit function facilitates the transformation between probabilities, odds, and log odds in a mathematically convenient way.

With a 20% probability, the odds are 0.2/(1-0.2) = 0.25, i.e., 1 to 4.

Transform probability (p) to odds

odds = p/(1-p)

Transform odds to logit

Logit = log(odds) = log(p/(1-p))

Transform odds to probability (p)

p = odds / (1+odds)
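
The three transformations can be checked with a short Python sketch. The function names are hypothetical, and the natural logarithm is assumed for the logit.

```python
import math

def prob_to_odds(p):
    return p / (1 - p)

def odds_to_logit(odds):
    return math.log(odds)  # natural log of the odds

def odds_to_prob(odds):
    return odds / (1 + odds)

p = 0.2
odds = prob_to_odds(p)        # approximately 0.25, i.e. 1 to 4
logit = odds_to_logit(odds)   # negative, because p < 0.5
p_back = odds_to_prob(odds)   # round trip returns approximately 0.2
```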

Logistic regression can only be used for predictive tasks. TRUE/FALSE

FALSE. Log. reg. can be used for the purpose of "profiling", which is an explanatory modelling task that involves interpreting the estimated coefficients for each predictor in the model.

Log. reg. can also be used for ranking/discrimination, which entails finding a certain percentage of the sample that are most likely to be outcome 1.

Logistic regression can be used for which tasks?
A) explanation/ profiling
B) prediction
C) ranking/ discrimination
D) all of the above

D) all of the above

Which of the following does not constitute an advantage of log. reg.?
A) It is fairly easy to interpret with coefficients representing the change in log odds (output) associated with a one-unit change in the corresponding predictor
B) It provides a clear interpretation of the effect of each feature
C) It allows the user to explain the reasons behind the given outcome - i.e., based on the coefficients and predictor value
D) It can easily be scaled to complex prediction problems with hundreds and thousands of predictor variables

D) is not an advantage: log. reg. does not easily scale to complex prediction problems with hundreds and thousands of predictor variables.

A), B) and C) are advantages:
A) It is fairly easy to interpret with coefficients representing the change in log odds (output) associated with a one-unit change in the corresponding predictor
B) It provides a clear interpretation of the effect of each feature
C) It allows the user to explain the reasons behind the given outcome - i.e., based on the coefficients and predictor value

The basic idea of logistic regression is that we fit a model to our predictors that is _____ in terms of logits, and _____ in terms of probabilities. Fill in the blanks:
A) linear and linear
B) linear and non-linear
C) non-linear and linear
D) non-linear and non-linear

B) The basic idea of logistic regression is that we fit a model to our predictors that is LINEAR in terms of logits, and NON-LINEAR in terms of probabilities.

Based on the coefficient of a given predictive variable, what do you do to make it interpretable?

The logit function provides the logit based on the model's intercept, all its predictor variables and their corresponding coefficients, so each coefficient is expressed in log odds. By exponentiating a coefficient, you convert it into an odds ratio: the factor by which the odds change for a one-unit increase in that predictor. If the odds ratio is > 1, it suggests a positive effect on the probability of an event, and if it is < 1, it suggests a negative effect.

In general, an odds ratio > 1 suggests a ____ effect on the probability of an event, while an odds ratio < 1 suggests a ____ effect. Plug in the missing words

In general, an odds ratio > 1 suggests a POSITIVE effect on the probability of an event, while an odds ratio < 1 suggests a NEGATIVE effect.
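
A small Python sketch of the exponentiation step. The coefficient values 0.7 and -0.7 are made up for illustration, and `odds_ratio` is a hypothetical helper name.

```python
import math

def odds_ratio(coefficient):
    """Exponentiate a log-odds coefficient to get the multiplicative change in odds."""
    return math.exp(coefficient)

positive_effect = odds_ratio(0.7)   # above 1: odds increase per one-unit increase
negative_effect = odds_ratio(-0.7)  # below 1: odds decrease per one-unit increase
no_effect = odds_ratio(0.0)         # exactly 1: odds unchanged
```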

The estimated coefficient represents the expected change in logits (i.e., log odds) of delay for a one-unit change in departure time. TRUE/FALSE

TRUE

The effect of an input on lead time is relative to the input value of lead time. This statement reflects what concept in logistic regression?

Non-linearity.

The basic idea is that, at very low and very high values of the explanatory variable, a one-unit increase will have an increasingly small effect on the output. This is by design, because we must bound the output between 0 and 1 if we want to interpret the result as a probability.
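
The shrinking effect in the tails can be seen numerically. The sketch below assumes a single predictor with intercept 0 and coefficient 1; these values and the name `effect_of_one_unit` are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def effect_of_one_unit(x, b0=0.0, b1=1.0):
    """Change in predicted probability when x increases by one unit."""
    return sigmoid(b0 + b1 * (x + 1)) - sigmoid(b0 + b1 * x)

middle = effect_of_one_unit(-0.5)  # around the centre of the curve
upper_tail = effect_of_one_unit(5.0)
lower_tail = effect_of_one_unit(-6.0)
```

The coefficient is the same in all three calls, yet the change in probability is much larger in the middle of the curve than in either tail.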

Caret uses stratified random sampling to maintain similar outcome distributions of the response variable in the training and testing sets. TRUE/ FALSE

TRUE. Caret does this automatically

Define hyperparameter tuning

Hyperparameter tuning refers to choosing the configuration settings that are external to the model itself and not learned from the data during training, so as to maximise the predictive performance of the model

Which of the following statements are NOT true about maximum likelihood estimation (MLE)?
A) it is used to estimate coefficients in log. reg.
B) the goal of MLE is to find the values of the coefficients that maximise the likelihood of observing the actual outcomes given the model
C) it is used on linear regression
D) it results in a best fitting sigmoid-shaped curve to the data rather than a best fitting line

WRONG:
C) it is used on linear regression (linear regression coefficients are typically estimated with ordinary least squares)

CORRECT:
A) it is used to estimate coefficients in log. reg.
B) the goal of MLE is to find the values of the coefficients that maximise the likelihood of observing the actual outcomes given the model
D) it results in a best fitting sigmoid-shaped curve to the data rather than a best fitting line

Why is the process of finding the optimal model coefficients different in log. reg. from lin. reg.?

In log. reg., the outcome is binary (0 or 1) and follows a binomial distribution, so the residuals (the differences between predicted and actual values) are not normally distributed as they are assumed to be in lin. reg. Thus, the coefficients are estimated using MLE, resulting in a best fitting sigmoid-shaped curve instead of a line
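
To make "maximising the likelihood" concrete, the Python sketch below computes the log-likelihood of two candidate coefficient pairs on a tiny made-up dataset. It only compares candidates; it does not run a full optimiser, and the data and the name `log_likelihood` are assumptions.

```python
import math

def log_likelihood(b0, b1, xs, ys):
    """Sum of log P(observed outcome) under a one-predictor logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

xs = [-2.0, -1.0, 0.5, 1.0, 2.0, -0.5]  # hypothetical predictor values
ys = [0, 0, 0, 1, 1, 1]                 # hypothetical binary outcomes

ll_flat = log_likelihood(0.0, 0.0, xs, ys)   # no slope: every prediction is 0.5
ll_slope = log_likelihood(0.0, 1.0, xs, ys)  # positive slope fits these data better
```

MLE searches over coefficient values for the pair with the highest log-likelihood; here the candidate with a positive slope already beats the flat one.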

The metrics of interest when you care about the POSITIVE outcome more than another are:
A) Negative predictive value
B) Sensitivity
C) Specificity
D) Positive predictive value

The metrics of interest when you care about the POSITIVE outcome more than another are:
B) Sensitivity (i.e., recall or true positive rate)
D) Positive predictive value

The metrics of interest when you care about the NEGATIVE outcome more than another are:
A) Negative predictive value
B) Sensitivity
C) Specificity
D) Positive predictive value

A) Negative predictive value
C) Specificity

What is the formula for sensitivity/ true positive rate/ recall?

Hint: it measures the proportion of actual positive instances that were correctly predicted by the model, out of all actual positive instances

Sensitivity = TP/(TP+FN)

What is the formula for specificity/ true negative rate?

Hint: it measures the proportion of actual negative instances that were correctly predicted by the model, out of all actual negative instances

Specificity = TN/ (TN+FP)

What is the formula for negative predictive value?

Hint: it measures the proportion of actual negative instances out of the total negatives that the model predicted

Negative predictive value: NPV = TN/(TN+FN)

What is the formula for positive predictive value?

Hint: it measures the proportion of actual positive instances out of the total positives that the model predicted

Positive predictive value: PPV = TP/(TP+FP)

What does accuracy measure?

Accuracy measures the % of correct predictions of the model - it treats true positive and true negative equally
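
The confusion matrix metrics on these cards can be computed together in a few lines of Python. The counts below (40 TP, 20 FP, 30 TN, 10 FN) are made up for illustration, and `confusion_metrics` is a hypothetical helper name.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the card formulas from the four confusion matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),              # TP / (TP + FN)
        "specificity": tn / (tn + fp),              # TN / (TN + FP)
        "ppv": tp / (tp + fp),                      # TP / (TP + FP)
        "npv": tn / (tn + fn),                      # TN / (TN + FN)
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

m = confusion_metrics(tp=40, fp=20, tn=30, fn=10)
# sensitivity 0.8, specificity 0.6, ppv approx. 0.667, npv 0.75, accuracy 0.7
```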

The ROC and AUC are dependent on the chosen cut-off threshold/ value. TRUE/FALSE

FALSE: the ROC curve plots a sequence of confusion matrix statistics over ALL cutoff values. The AUC summarises classifier performance over all threshold values
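
One way to see that the AUC needs no cut-off is to compute it directly from rankings: the share of (positive, negative) pairs in which the positive case gets the higher predicted score. The Python sketch below uses this pairwise definition with made-up scores; ties counted as half is a common convention assumed here.

```python
def auc(scores, labels):
    """Share of positive/negative pairs where the positive case is ranked higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.4, 0.3]  # hypothetical predicted probabilities
labels = [1, 0, 1, 0]
value = auc(scores, labels)    # 3 of the 4 pairs are ranked correctly -> 0.75
```

No threshold appears anywhere in the calculation; only the ordering of the scores matters.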

What is the difference between discrimination and calibration?
A) discrimination involves only correctly ranking true positive cases above true negative cases
B) calibration involves only correctly ranking true positive cases above true negative cases
C) discrimination requires predicting the probability itself accurately rather than ranked probabilities
D) calibration requires predicting the probability itself accurately rather than ranked probabilities

A) discrimination involves only correctly ranking true positive cases above true negative cases
D) calibration requires predicting the probability itself accurately rather than ranked probabilities

Which of the following statements are true about the ROC? Select all correct
A) it shows how sensitivity and 1-specificity vary as a function of every given cutoff threshold
B) in the bottom left corner, the threshold is very low, indicating that both sensitivity and 1-specificity are near 0

TRUE:
A) it shows how sensitivity and 1-specificity vary as a function of every given cutoff threshold

FALSE:
B) in the bottom left corner, the threshold is very low, indicating that both sensitivity and 1-specificity are near 0
Instead: in the bottom left corner, the threshold is very HIGH, indicating that both sensitivity and 1-specificity are near 0

Which of the following statements are NOT true about the AUC?
A) it represents the % of times where it correctly ranks a true positive case over a true negative case
B) the higher the AUC, the higher the true positive rate relative to false positive rate
C) AUC = 0.5 indicates that the model's ability to rank true positive over true negative cases is equal to tossing a coin
D) AUC = 1 indicates the best classifier possible with an ROC curve occupying a lot of space in the top left region
E) AUC = 100 = true positive rate of 0%

WRONG:
E) AUC = 100 = true positive rate of 0%
Instead: AUC = 100 = TPR of 100%. But note, in practice, this is nearly impossible

Which of the following statements are TRUE about gains curves?
A) it gives a visualisation, plotting # cases ordered by predicted probabilities for being positive (x) against cumulative pct. of the positive class instances captured by the model (y)
B) the visualisation is useful when comparing the model to a benchmark (e.g., naive model guessing based on mean)
C) it is useful under resource constrained conditions where the user cannot afford to take action for all predicted positives, only top x%
D) within x% subset records, we hope to have correctly identified most or all of the true positives
E) the gains curve tells you how much better your model does at skimming the cream/ rank ordering the positives vs. negatives relative to benchmark
F)

Which of the following are true about Kappa?
A) it is a measure for assessing the inter-rater agreement in binary classification problems
B) it ranges from 0-1
C) it is useful when dealing with imbalanced class distribution in the data
D) higher kappa values indicate better agreement between predicted and actual classifications vs. chance
E) it does not take into account the relative preference of capturing true positives vs. true negatives

TRUE:
A) it is a measure for assessing the inter-rater agreement in binary classification problems
C) it is useful when dealing with imbalanced class distribution in the data
D) higher kappa values indicate better agreement between predicted and actual classifications vs. benchmark
E) it does not take into account the relative preference of capturing true positives vs. true negatives

FALSE:
B) it actually ranges from -1 to 1
1: perfect agreement
0: agreement equivalent to chance
<0: agreement worse than chance

Which of the following are true about No information rate?
A) it represents the accuracy a model would achieve by always predicting the most prevalent class in the dataset
B) NIR thus provides a baseline accuracy against which the model's performance can be compared
C) NIR is calculated as the proportion of the majority class in the dataset
D) If 70% of the instances are actually positive and 30% are negative, the NIR would be 0.4

TRUE:
A) it represents the accuracy a model would achieve by always predicting the most prevalent class in the dataset
B) NIR thus provides a baseline accuracy against which the model's performance can be compared
C) NIR is calculated as the proportion of the majority class in the dataset

FALSE:
D) If 70% of the instances are actually positive and 30% are negative, the NIR would be 0.7 --> this means that the benchmark will always predict positive, and it will be right 70% of the time

Accuracy is often compared to the No Information Rate to determine how much better the model's performance is than what could be achieved by simply predicting the majority class. If Accuracy > NIR, it suggests that the model is providing meaningful prediction beyond what would be expected by always predicting the majority class. TRUE/FALSE

TRUE

Which of the following are true about a decile-wise lift chart?
A) a variation of gains curve
B) the dataset is divided into ten equal-sized groups/ deciles based on the predicted probabilities generated by the model
C) visual representation that compares the performance of a predictive model to a baseline model
D) x-axis plots the deciles with each point on the x-axis corresponding to 1/10, ranked by highest likelihood of cancellation/positive in percentile
E) Y-axis plots the lift: cumulative pct. of the positive class instances within each decile. It shows how well the model identifies positive instances within each segment
F) interpretation: if 1st decile shows 2.7, it means that the model gets 2.7x more true positives than the naive/ majority guessing model in the first decile (most likely to cancel decile)

All answers are TRUE: For recap, see following Rmd under variation of gains curves: decile-wise lift charts
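
A rough Python sketch of a decile-wise lift calculation. It assumes lift is defined as the positive rate within a decile divided by the overall positive rate (one common definition; the Rmd may compute it differently), and the 100 scored records are made up so that the 20 positives have the highest scores.

```python
def decile_lift(scores, labels):
    """Lift per decile: positive rate in the decile / overall positive rate."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)
    overall_rate = sum(labels) / n
    lifts = []
    for d in range(10):
        group = order[d * n // 10:(d + 1) * n // 10]  # next tenth of the ranked records
        rate = sum(labels[i] for i in group) / len(group)
        lifts.append(rate / overall_rate)
    return lifts

scores = [i / 100 for i in range(100)]            # hypothetical predicted probabilities
labels = [1 if i >= 80 else 0 for i in range(100)]  # top 20 records are positive
lifts = decile_lift(scores, labels)               # first two deciles 5.0, the rest 0.0
```

A lift of 5.0 in the first decile means that decile contains five times as many positives as a same-sized random draw would be expected to contain.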

In logistic regression, we exponentiate the learned coefficients corresponding to a given predictor to calculate the effect of a predictor on the odds of the outcome TRUE/FALSE

TRUE

Logistic regression can be used for which three tasks?

Prediction
Profiling (explanatory tasks)
Ranking

The AUC tells us how well our classifier ranks true positive cases over true negative cases, and thus provides a measure of the overall discriminative performance of the model, not taking into account the cost of false positives relative to false negatives. TRUE/FALSE

TRUE

Which of the following statements are true regarding the difference/ similarity between "gain" and "lift"?
A) "gain" refers to the improvement in performance achieved by the predictive model compared to a baseline random model
B) "lift" quantifies how many times better the model is at identifying positive instances compared to a random baseline model
C) both measures are commonly visualised using gain charts or lift charts
D) they are especially useful in business settings where there is a limited budget to act on predictions

All are true:
A) "gain" refers to the improvement in performance achieved by the predictive model compared to a baseline random model
B) "lift" quantifies how many times better the model is at identifying positive instances compared to a random baseline model
C) both measures are commonly visualised using gain charts or lift charts
D) they are especially useful in business settings where there is a limited budget to act on predictions

Which of the following statements are TRUE about Logistic regression?
A) it is a statistical method used for predicting the probability of a binary outcome.
B) in simple terms, it helps us answer yes-or-no questions.
C) it employs a parametric approach, imposing structure of a problem by examining the relationship between inputs and output
D) it is used to predict probabilities of binary outcomes and make binary classifications (1/0). I.e., it predicts the probability that an input belongs to one of the two categories

All are true:
A) it is a statistical method used for predicting the probability of a binary outcome.
B) in simple terms, it helps us answer yes-or-no questions.
C) it employs a parametric approach, imposing structure of a problem by examining the relationship between inputs and output
D) it is used to predict probabilities of binary outcomes and make binary classifications (1/0). I.e., it predicts the probability that an input belongs to one of the two categories

LOGISTIC REGRESSION is also used when the outcome variable is continuous and can take any numerical value. It predicts a straight-line relationship between the input features and the continuous outcome. For example, it's suitable for predicting things like temperature, stock prices, or house prices. TRUE/FALSE

FALSE. Instead: LINEAR REGRESSION is used when the outcome variable is continuous and can take any numerical value. It predicts a straight-line relationship between the input features and the continuous outcome. For example, it's suitable for predicting things like temperature, stock prices, or house prices.

(59 cards)