Week 6 DSE Flashcards

1
Q

What do we do before we create a machine learning model?

A

Visualise data

2
Q

How to use box plots to tell which variables are important?

A

Look at the medians across the groups

They must be far apart for the variable to be a useful predictor
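
A minimal R sketch of this idea, assuming the Default data set from the ISLR package used in the later cards (the choice of variable is illustrative):

library(ISLR)                     # provides the Default data set
# If the medians of balance differ a lot between the default groups,
# balance is likely a useful predictor.
boxplot(balance ~ default, data = Default,
        xlab = "default", ylab = "balance")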

3
Q

What must we do whenever we have a categorical variable?

A

Convert it to numeric
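
A hedged sketch of what this looks like in R (toy data; the names are illustrative). model.matrix() shows the 0/1 dummy columns that lm()/glm() build from a factor:

# Hypothetical data frame with a categorical column
df = data.frame(default = c("No", "Yes", "No"),
                balance = c(500, 1500, 800))
df$default = factor(df$default)           # store the categorical variable as a factor
model.matrix(~ default + balance, df)     # the 0/1 dummy coding used internally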

4
Q

What is the range of results from the logistic model?

A

A probability, between 0 and 1

5
Q

More often than not we care about the ______ and ______ of the slope, b1.

A

sign

relative magnitude

6
Q

When do we use t-values and when z-values, and why? (linear models vs glm)

A

z-values: glm. Because you CANNOT use the least-squares method; logistic regression is fitted by maximum likelihood, and the resulting estimates approximately follow a normal distribution.

t-values: not glm, because you use the least-squares method.
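
A rough illustration with the Default data (the model formulas are just examples): summary() of an lm fit reports t-values, while summary() of a glm fit reports z-values.

library(ISLR)
lm_fit  = lm(balance ~ income, data = Default)                        # least squares
glm_fit = glm(default ~ balance, data = Default, family = binomial)   # maximum likelihood
summary(lm_fit)    # coefficient table shows "t value"
summary(glm_fit)   # coefficient table shows "z value"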

7
Q

What does glm stand for?

A

generalised LINEAR model

8
Q

What must you do in R to specify use of a logistic model?

A

glm(default ~ balance, data = Default, family = binomial)

NEED TO SPECIFY family = binomial
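
A brief usage sketch of the fitted object (standard R functions only; glm_fit is just a name):

library(ISLR)
glm_fit = glm(default ~ balance, data = Default, family = binomial)
coef(glm_fit)       # intercept and slope, on the log-odds scale
summary(glm_fit)    # full coefficient table with z-values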

9
Q

How do I represent scaling a certain independent variable in R?

A

Use I() (so the arithmetic operation is carried out as written inside the formula)

glm(default ~ balance + I(income/1000) + student, data = Default, family = binomial)

10
Q

What is sensitivity?

A

measures a classifier’s ability to identify positive status

p(tested POSITIVE | total that are actually positive)

how good we are at identifying positive cases out of all that are actually positive

11
Q

What is specificity?

A

measures a classifier’s ability to identify negative status

p(tested NEGATIVE| total that are actually negative)

how good we are at identifying negative patients correctly

True negative rate
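
A minimal sketch computing both sensitivity and specificity from a confusion matrix in R, using the Default example from the other cards (the 0.5 threshold is illustrative):

library(ISLR)
glm_fit  = glm(default ~ balance, data = Default, family = binomial)
glm_prob = predict(glm_fit, type = "response")
glm_pred = ifelse(glm_prob >= 0.5, "Yes", "No")
cm = table(predicted = glm_pred, actual = Default$default)   # confusion matrix

cm["Yes", "Yes"] / sum(cm[, "Yes"])   # sensitivity = TP / (TP + FN)
cm["No",  "No"]  / sum(cm[, "No"])    # specificity = TN / (TN + FP)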

12
Q

What is the false positive rate?

A

Fraction of cases that are ACTUALLY NEGATIVE that are wrongly classified as POSITIVE

13
Q

What is the true positive rate?

A

Fraction of cases that are ACTUALLY POSITIVE that are correctly classified as POSITIVE

sensitivity

14
Q

What happens as the decision threshold increases?

A

FPR decreases. TPR decreases.
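
A sketch of this effect on the Default example (the rates() helper and the 0.2 / 0.8 thresholds are illustrative, not from the course):

library(ISLR)
glm_fit  = glm(default ~ balance, data = Default, family = binomial)
glm_prob = predict(glm_fit, type = "response")
actual   = Default$default == "Yes"

rates = function(threshold) {
  pred = glm_prob >= threshold
  c(TPR = mean(pred[actual]),     # true positive rate
    FPR = mean(pred[!actual]))    # false positive rate
}
rates(0.2)   # lower threshold
rates(0.8)   # higher threshold: both TPR and FPR decrease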

15
Q

What do we use to find the optimal decision threshold?

A

Draw the ROC curve, which plots the classifier's TPR against its FPR at different thresholds.

Prefer the one with the biggest area under the curve (AUC).

16
Q

What are the axes on ROC curves?

A

Y-axis: true positive rate (TPR)

X-axis: false positive rate (FPR)

17
Q

What is the range of values for AUC?

A

0 to 1

The best classifier is the one with AUC closest to 1
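
A hedged sketch of drawing an ROC curve and getting the AUC, assuming the pROC package is available (the course may use a different package):

library(ISLR)
library(pROC)                                  # assumption: pROC is installed
glm_fit  = glm(default ~ balance, data = Default, family = binomial)
glm_prob = predict(glm_fit, type = "response")

roc_obj = roc(Default$default, glm_prob)       # sweeps over all thresholds
plot(roc_obj)                                  # the ROC curve
auc(roc_obj)                                   # area under the curve, between 0 and 1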

18
Q

What does model validation require?

A

Use a separate test data set to see whether the model is good

Split the data into a training set and a test set

19
Q

How to measure the accuracy of a model?

A

Training error

20
Q

What does test error quantify?

A

It quantifies the predictive ability of the model on new data.

21
Q

How does training error compare to test error?

A

Training error is overly optimistic (it UNDERESTIMATES the test error)

because the model has already seen the training data, so its error there says little about predicting new data

22
Q

What are the approaches to model validation?

A

validation-set approach

K-fold cross validation

Leave-one-out cross validation

23
Q

What is the most commonly applied model validation method?

A

k-fold cross validation

24
Q

GENERALLY, what are model validation approaches like?

A

holding out subset(s) of the training observations from the model fitting process, and then applying the classifier to these held out observations.

25
Q

What is a disadvantage of the validation-set approach?

A

There is a randomisation step here: the error rate will be different if you divide your data differently.

26
Q

Explain the validation-set approach.

A

Randomly divide the full data set into two parts: one used for training and one held out for validation (testing).
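
A minimal sketch of this split in R (the 50/50 split, the seed, and the model formula are illustrative):

library(ISLR)
set.seed(1101)
train_id = sample(nrow(Default), nrow(Default) / 2)      # random half for training
Default_train = Default[train_id, ]
Default_test  = Default[-train_id, ]

glm_fit  = glm(default ~ balance, data = Default_train, family = binomial)
glm_prob = predict(glm_fit, newdata = Default_test, type = "response")
glm_pred = ifelse(glm_prob >= 0.5, "Yes", "No")
mean(glm_pred != Default_test$default)                   # validation-set (test) error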

27
Q

Explain the k-fold cross-validation approach.

A
  1. Randomly split the full data set into K folds of equal size.
  2. Training set: the other K-1 folds; test set: 1 fold.
  3. Iterate the process K times and then calculate the average test error (sketched below).
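
A rough sketch of these three steps written out by hand in R (K = 5, the formula, and the 0.5 threshold are illustrative; cv.glm on a later card does this for you):

library(ISLR)
set.seed(1101)
K = 5
folds = sample(rep(1:K, length.out = nrow(Default)))     # step 1: assign each row to a fold
errors = numeric(K)
for (k in 1:K) {
  fit  = glm(default ~ balance, data = Default[folds != k, ],
             family = binomial)                           # step 2: train on the other K-1 folds
  prob = predict(fit, newdata = Default[folds == k, ], type = "response")
  pred = ifelse(prob >= 0.5, "Yes", "No")
  errors[k] = mean(pred != Default$default[folds == k])   # error on the held-out fold
}
mean(errors)                                              # step 3: average test error
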
28
Q

Explain leave-one-out cross-validation.

A

When k = n

For each of the n observations in turn:
test set = that 1 observation
training set = the other n-1 observations

Average the n test errors

29
Q

How to predict multiple observations in R?

A

Need to use the c() function

df_new = data.frame(student = c("Yes", "No"),
                    balance = c(1500, 1500), income = c(40000, 40000))

predict(glm_fit, newdata = df_new, type = "response")

30
Q

What does type = "response" mean?

A

Return predicted probabilities (rather than the default log-odds / linear predictor)

31
Q

How to generate a confusion matrix in R?

A

table(glm_prob >= 0.5, Default$default)

where glm_prob is the vector of predicted probabilities (the output of predict() with type = "response"), not the model itself

32
Q

How to write code for k-fold cross-validation with K = 5?

A

library(boot)   # cv.glm comes from the boot package
set.seed(1101)
cv.glm(Default, glm_fit3, K = 5)$delta[1]

$delta[1] is the estimated test error

33
Q

How to write leave-one-out cross-validation?

A

cv.glm(Default, glm_fit3)$delta[1] # LOOCV

34
Q

How to get the training error in R?

A

glm_pred = ifelse(glm_prob_train > 0.5, "Yes", "No")
mean(glm_pred != Default_train$default)   # training error
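
Here glm_prob_train and Default_train are assumed to come from an earlier training split and fit, e.g. something like this (a sketch; the names follow the other cards):

glm_fit = glm(default ~ balance, data = Default_train, family = binomial)
glm_prob_train = predict(glm_fit, type = "response")   # predicted probabilities on the training data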