Week 6 DSE Flashcards

1
Q

What do we do before we create a machine learning model?

A

Visualise data

2
Q

How to use box plots to tell which variables are important?

A

Look at the medians across the groups

They must be far apart for the variable to be a useful predictor
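
A minimal R sketch of this idea, assuming the Default data set from the ISLR package used in the later cards (the choice of variable is illustrative):

library(ISLR)                     # provides the Default data set
# If the medians of balance differ a lot between the default groups,
# balance is likely a useful predictor.
boxplot(balance ~ default, data = Default,
        xlab = "default", ylab = "balance")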

3
Q

What must we do whenever we have a categorical variable?

A

Convert it to numeric
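
A hedged sketch of what this looks like in R (toy data; the names are illustrative). model.matrix() shows the 0/1 dummy columns that lm()/glm() build from a factor:

# Hypothetical data frame with a categorical column
df = data.frame(default = c("No", "Yes", "No"),
                balance = c(500, 1500, 800))
df$default = factor(df$default)           # store the categorical variable as a factor
model.matrix(~ default + balance, df)     # the 0/1 dummy coding used internally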

4
Q

What is the range of results from the logistic model?

A

A probability, between 0 and 1

5
Q

More often than not we care about the ______ and ______ of the slope, b1.

A

sign

relative magnitude

6
Q

When do we use t-values and when z-values, and why? (linear models vs glm)

A

z-values: glm. Because you CANNOT use the least-squares method; logistic regression is fitted by maximum likelihood, and the resulting estimates approximately follow a normal distribution.

t-values: not glm, because you use the least-squares method.
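
A rough illustration with the Default data (the model formulas are just examples): summary() of an lm fit reports t-values, while summary() of a glm fit reports z-values.

library(ISLR)
lm_fit  = lm(balance ~ income, data = Default)                        # least squares
glm_fit = glm(default ~ balance, data = Default, family = binomial)   # maximum likelihood
summary(lm_fit)    # coefficient table shows "t value"
summary(glm_fit)   # coefficient table shows "z value"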

7
Q

What does glm stand for?

A

generalised LINEAR model

8
Q

What must you do in R to specify use of a logistic model?

A

glm(default ~ balance, data = Default, family = binomial)

NEED TO SPECIFY family = binomial
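
A brief usage sketch of the fitted object (standard R functions only; glm_fit is just a name):

library(ISLR)
glm_fit = glm(default ~ balance, data = Default, family = binomial)
coef(glm_fit)       # intercept and slope, on the log-odds scale
summary(glm_fit)    # full coefficient table with z-values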

9
Q

How do I represent scaling a certain independent variable in R?

A

Use I() (so the arithmetic operation is carried out as written inside the formula)

glm(default ~ balance + I(income/1000) + student, data = Default, family = binomial)

10
Q

What is sensitivity?

A

measures a classifier’s ability to identify positive status

p(tested POSITIVE | total that are actually positive)

how good we are at identifying positive cases out of all that are actually positive

11
Q

What is specificity?

A

measures a classifier’s ability to identify negative status

p(tested NEGATIVE| total that are actually negative)

how good we are at identifying negative patients correctly

True negative rate
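
A minimal sketch computing both sensitivity and specificity from a confusion matrix in R, using the Default example from the other cards (the 0.5 threshold is illustrative):

library(ISLR)
glm_fit  = glm(default ~ balance, data = Default, family = binomial)
glm_prob = predict(glm_fit, type = "response")
glm_pred = ifelse(glm_prob >= 0.5, "Yes", "No")
cm = table(predicted = glm_pred, actual = Default$default)   # confusion matrix

cm["Yes", "Yes"] / sum(cm[, "Yes"])   # sensitivity = TP / (TP + FN)
cm["No",  "No"]  / sum(cm[, "No"])    # specificity = TN / (TN + FP)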

12
Q

What is the false positive rate?

A

Fraction of cases that are ACTUALLY NEGATIVE that are wrongly classified as POSITIVE

13
Q

What is the true positive rate?

A

Fraction of cases that are ACTUALLY POSITIVE that are correctly classified as POSITIVE

sensitivity

14
Q

What happens as the decision threshold increases?

A

FPR decreases. TPR decreases.
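
A sketch of this effect on the Default example (the rates() helper and the 0.2 / 0.8 thresholds are illustrative, not from the course):

library(ISLR)
glm_fit  = glm(default ~ balance, data = Default, family = binomial)
glm_prob = predict(glm_fit, type = "response")
actual   = Default$default == "Yes"

rates = function(threshold) {
  pred = glm_prob >= threshold
  c(TPR = mean(pred[actual]),     # true positive rate
    FPR = mean(pred[!actual]))    # false positive rate
}
rates(0.2)   # lower threshold
rates(0.8)   # higher threshold: both TPR and FPR decrease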

15
Q

What do we use to find the optimal decision threshold?

A

Draw the ROC curve, which plots the classifier's TPR against its FPR at different thresholds.

Prefer the one with the biggest area under the curve (AUC).

16
Q

What are the axes on ROC curves?

A

Y-axis: true positive rate (TPR)

X-axis: false positive rate (FPR)

17
Q

What is the range of values for AUC?

A

0 to 1

The best classifier is the one with AUC closest to 1
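
A hedged sketch of drawing an ROC curve and getting the AUC, assuming the pROC package is available (the course may use a different package):

library(ISLR)
library(pROC)                                  # assumption: pROC is installed
glm_fit  = glm(default ~ balance, data = Default, family = binomial)
glm_prob = predict(glm_fit, type = "response")

roc_obj = roc(Default$default, glm_prob)       # sweeps over all thresholds
plot(roc_obj)                                  # the ROC curve
auc(roc_obj)                                   # area under the curve, between 0 and 1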

18
Q

What does model validation require?

A

Use a separate test data set to see whether the model is good

Split the data into a training set and a test set

19
Q

How to measure the accuracy of a model?

A

Training error

20
Q

What does test error quantify?

A

It quantifies the predictive ability of the model on new data.

21
Q

How does training error compare to test error?

A

Training error is overly optimistic (it UNDERESTIMATES the test error)

because the model has already seen the training data, so its error there says little about predicting new data

22
Q

What are the approaches to model validation?

A

validation-set approach

K-fold cross validation

Leave-one-out cross validation

23
Q

What is the most commonly applied model validation method?

A

k-fold cross validation

24
Q

GENERALLY, what are model validation approaches like?

A

holding out subset(s) of the training observations from the model fitting process, and then applying the classifier to these held out observations.

25
Q

What is a disadvantage of the validation-set approach?

A

There is a randomisation step here: the error rate will be different if you divide your data differently.

26
Q

Explain the validation-set approach.

A

Randomly divide the full data set into two parts: one used for training and one held out for validation (testing).
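
A minimal sketch of this split in R (the 50/50 split, the seed, and the model formula are illustrative):

library(ISLR)
set.seed(1101)
train_id = sample(nrow(Default), nrow(Default) / 2)      # random half for training
Default_train = Default[train_id, ]
Default_test  = Default[-train_id, ]

glm_fit  = glm(default ~ balance, data = Default_train, family = binomial)
glm_prob = predict(glm_fit, newdata = Default_test, type = "response")
glm_pred = ifelse(glm_prob >= 0.5, "Yes", "No")
mean(glm_pred != Default_test$default)                   # validation-set (test) error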

27
Q

Explain the k-fold cross-validation approach.

A
  1. Randomly split the full data set into K folds of equal size.
  2. Training set: the other K-1 folds; test set: 1 fold.
  3. Iterate the process K times and then calculate the average test error (sketched below).
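
A rough sketch of these three steps written out by hand in R (K = 5, the formula, and the 0.5 threshold are illustrative; cv.glm on a later card does this for you):

library(ISLR)
set.seed(1101)
K = 5
folds = sample(rep(1:K, length.out = nrow(Default)))     # step 1: assign each row to a fold
errors = numeric(K)
for (k in 1:K) {
  fit  = glm(default ~ balance, data = Default[folds != k, ],
             family = binomial)                           # step 2: train on the other K-1 folds
  prob = predict(fit, newdata = Default[folds == k, ], type = "response")
  pred = ifelse(prob >= 0.5, "Yes", "No")
  errors[k] = mean(pred != Default$default[folds == k])   # error on the held-out fold
}
mean(errors)                                              # step 3: average test error
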
28
Q

Explain leave-one-out cross-validation.

A

When k = n

For each of the n observations in turn:
test set = that 1 observation
training set = the other n-1 observations

Average the n test errors

29
Q

How to predict multiple observations in R?

A

Need to use the c() function

df_new = data.frame(student = c("Yes", "No"),
                    balance = c(1500, 1500), income = c(40000, 40000))

predict(glm_fit, newdata = df_new, type = "response")

30
Q

What does type = "response" mean?

A

Return predicted probabilities (rather than the default log-odds / linear predictor)

31
Q

How to generate a confusion matrix in R?

A

table(glm_prob >= 0.5, Default$default)

where glm_prob is the vector of predicted probabilities (the output of predict() with type = "response"), not the model itself

32
Q

How to write code for k-fold cross-validation with K = 5?

A

library(boot)   # cv.glm comes from the boot package
set.seed(1101)
cv.glm(Default, glm_fit3, K = 5)$delta[1]

$delta[1] is the estimated test error

33
Q

How to write leave-one-out cross-validation?

A

cv.glm(Default, glm_fit3)$delta[1] # LOOCV

34
Q

How to get the training error in R?

A

glm_pred = ifelse(glm_prob_train > 0.5, "Yes", "No")
mean(glm_pred != Default_train$default)   # training error
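
Here glm_prob_train and Default_train are assumed to come from an earlier training split and fit, e.g. something like this (a sketch; the names follow the other cards):

glm_fit = glm(default ~ balance, data = Default_train, family = binomial)
glm_prob_train = predict(glm_fit, type = "response")   # predicted probabilities on the training data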