Week 6 DSE Flashcards
What do we do before we create a machine learning model?
Visualise data
How do we use box plots to tell which variables are important?
Look at the median of the variable in each response class
The medians must be far apart
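A minimal sketch of the box plot check, on simulated data rather than the course data set. The names x_useful and x_noise are hypothetical: one variable is built so its medians differ between classes, the other so they do not.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
sim <- data.frame(
  class    = rep(c("No", "Yes"), each = 100),
  x_useful = c(rnorm(100, mean = 0), rnorm(100, mean = 3)),  # medians far apart
  x_noise  = rnorm(200)                                      # same distribution in both classes
)

# One box plot per variable, split by class
boxplot(x_useful ~ class, data = sim, main = "x_useful by class")
boxplot(x_noise ~ class, data = sim, main = "x_noise by class")

# The same comparison in numbers: median of each variable per class
med_useful <- tapply(sim$x_useful, sim$class, median)
med_noise  <- tapply(sim$x_noise, sim$class, median)
print(med_useful)
print(med_noise)
```

The variable whose boxes sit at clearly different heights is the better candidate predictor.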
What must we do whenever we have a categoric variable?
convert to numeric
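A small sketch of two ways to do the conversion, using a made-up student vector. Note that glm() also does this automatically when the variable is stored as a factor.

```r
# Hypothetical categorical variable
student <- c("Yes", "No", "No", "Yes")

# Option 1: recode by hand into a 0/1 dummy variable
student_num <- ifelse(student == "Yes", 1, 0)

# Option 2: store it as a factor and let R build the dummy column
student_fac <- factor(student)
dummy <- model.matrix(~ student_fac)[, "student_facYes"]

print(student_num)
print(unname(dummy))
```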
What is the range of the output from the logistic model?
A probability, between 0 and 1
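To see why the output stays between 0 and 1, here is a short sketch with made-up coefficients b0 and b1 (assumptions, not fitted values). plogis() is R's built-in logistic function.

```r
# Hypothetical coefficients, chosen only for illustration
b0 <- -3
b1 <- 0.5

x <- c(-10, 0, 6, 10)
z <- b0 + b1 * x        # linear predictor, can be any real number
p <- plogis(z)          # logistic function: exp(z) / (1 + exp(z))

print(round(p, 4))
```

However large or small the linear predictor gets, the logistic function squeezes it into a probability.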
More often than not we care about the ______ and ______ of the slope, b1.
sign
relative magnitude
When do we use the z-value and when the t-value, and why?
z-value: glm (logistic). You CAN'T use the least squares method. You use maximum likelihood, whose estimates are approximately normally distributed in large samples
t-value: not glm (linear and multiple linear regression), because you use the least squares method
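A quick way to check which statistic R reports is to look at the coefficient table of each fit. The sketch below uses simulated data (hypothetical, not the Default data set).

```r
# Simulated data for illustration
set.seed(1)
x     <- rnorm(200)
y_num <- 1 + 2 * x + rnorm(200)              # numeric response for lm
y_bin <- rbinom(200, 1, plogis(-0.5 + x))    # 0/1 response for glm

lm_fit  <- lm(y_num ~ x)                       # fitted by least squares
glm_fit <- glm(y_bin ~ x, family = binomial)   # fitted by maximum likelihood

# Column names of the coefficient tables
lm_cols  <- colnames(coef(summary(lm_fit)))
glm_cols <- colnames(coef(summary(glm_fit)))
print(lm_cols)
print(glm_cols)
```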
What does glm stand for?
generalised LINEAR model
What must you do in R to specify the use of a logistic model?
glm(default ~ balance, data = Default, family = binomial)
NEED TO SPECIFY family = binomial
How do I represent scaling a certain independent variable in R?
Use I() (it makes R treat the contents as an arithmetic operation inside the formula)
glm(default ~ balance + I(income/1000) + student, data = Default, family = binomial)
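A sketch of what the scaling does, on simulated data (the income values and coefficients here are assumptions). Dividing income by 1000 inside I() should multiply its coefficient by 1000 and leave the fitted probabilities unchanged.

```r
# Simulated data for illustration
set.seed(1)
n <- 1000
sim <- data.frame(income = rnorm(n, mean = 40000, sd = 10000))
sim$y <- rbinom(n, 1, plogis(-2 + 0.00005 * sim$income))

fit_raw    <- glm(y ~ income, data = sim, family = binomial)
fit_scaled <- glm(y ~ I(income / 1000), data = sim, family = binomial)

b_raw    <- unname(coef(fit_raw)[2])      # effect per 1 unit of income
b_scaled <- unname(coef(fit_scaled)[2])   # effect per 1000 units of income
print(c(raw = b_raw, scaled = b_scaled))
```

The scaled coefficient is easier to read because it is not a tiny number.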
What is sensitivity?
measures a classifier’s ability to identify positive status
P(predicted POSITIVE | actually positive)
how good we are at identifying positive cases out of all that are actually positive
What is specificity?
measures a classifier’s ability to identify negative status
P(predicted NEGATIVE | actually negative)
how good we are at identifying negative cases correctly
Same as the true negative rate
What is the false positive rate?
Fraction of cases that are ACTUALLY NEGATIVE, wrongly classified as POSITIVE
What is the true positive rate?
Fraction of cases that are ACTUALLY POSITIVE, that are correctly classified as POSITIVE
Same as sensitivity
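The four rates can be read off a confusion matrix. Below is a worked sketch with made-up labels (10 actual positives, 90 actual negatives). The counts are illustrative only.

```r
# Hypothetical labels: 10 actual "Yes", 90 actual "No"
actual <- factor(c(rep("Yes", 10), rep("No", 90)), levels = c("No", "Yes"))
predicted <- factor(c(rep("Yes", 6), rep("No", 4),     # of the 10 positives: 6 caught, 4 missed
                      rep("Yes", 9), rep("No", 81)),   # of the 90 negatives: 9 false alarms
                    levels = c("No", "Yes"))

cm <- table(predicted, actual)
TP <- as.numeric(cm["Yes", "Yes"])
FN <- as.numeric(cm["No", "Yes"])
FP <- as.numeric(cm["Yes", "No"])
TN <- as.numeric(cm["No", "No"])

sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
fpr         <- FP / (FP + TN)   # false positive rate = 1 - specificity

print(cm)
print(c(sensitivity = sensitivity, specificity = specificity, fpr = fpr))
```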
What happens as decision threshold increases?
FPR decreases. TPR decreases (fewer cases are classified as positive)
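A small sketch of the threshold effect, using simulated scores (hypothetical data). The helper rates() is not from any package; it is defined here for illustration.

```r
# Simulated predicted probabilities: positives tend to score higher
set.seed(2)
y     <- rbinom(500, 1, 0.3)
score <- plogis(rnorm(500, mean = ifelse(y == 1, 1, -1)))

# Hypothetical helper: TPR and FPR at a given decision threshold
rates <- function(threshold) {
  pred <- score >= threshold
  c(TPR = mean(pred[y == 1]), FPR = mean(pred[y == 0]))
}

low  <- rates(0.3)
high <- rates(0.7)
print(rbind(threshold_0.3 = low, threshold_0.7 = high))
```

Raising the threshold can only shrink the set of cases called positive, so neither rate can go up.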
What do we use to choose the optimal decision threshold?
Draw the ROC curve: each point on it is the (FPR, TPR) pair at a different threshold
To compare models (not thresholds), find the one with the biggest area under the curve (AUC).
What are on the axes on ROC curves?
Y axis: True positive rate
X axis: False positive rate
What is the range of values for AUC?
0 to 1
Best one is the one closest to 1
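A base-R sketch of an ROC curve and its AUC on simulated scores (hypothetical data). The helper auc_rank() is defined here and uses the rank (Mann-Whitney) formula for AUC. Packages exist for this too, but none is assumed.

```r
# Simulated predicted probabilities: positives tend to score higher
set.seed(2)
y     <- rbinom(500, 1, 0.3)
score <- plogis(rnorm(500, mean = ifelse(y == 1, 1, -1)))

# TPR and FPR over a grid of thresholds
thresholds <- seq(0, 1, by = 0.01)
tpr <- sapply(thresholds, function(t) mean(score[y == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(score[y == 0] >= t))

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)   # diagonal = random guessing

# Hypothetical helper: AUC as the probability that a random positive
# gets a higher score than a random negative
auc_rank <- function(score, y) {
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc <- auc_rank(score, y)
print(auc)
```

An AUC near 0.5 corresponds to the dashed diagonal, i.e. a classifier no better than guessing.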
What does model validation require?
Use a separate test data set to see whether the model is good
split into train and test set
How to measure the accuracy of a model?
Training error
What does test error quantify?
quantifies the predictive ability of the model on new data.
How does training error compare to test error?
training error - overly optimistic (UNDERESTIMATES THE TEST ERROR)
because the model has already seen the training data, so it does not show how well the model predicts new data
What are the approaches to model validation?
validation-set approach
K-fold cross validation
Leave-one-out cross validation
What is the most commonly applied model validation method?
k-fold cross validation
GENERALLY, what are model validation approaches like?
holding out subset(s) of the training observations from the model fitting process, and then applying the classifier to these held out observations.
What is a disadvantage of validation-set approaches?
There is a randomization part here, where the error rate will be different if you divide your data differently
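Explain what the validation-set approach is
Randomly divide the full data set into 2 parts: fit the model on one part (training set) and compute the error on the other (validation set)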
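A minimal sketch of the validation-set approach on simulated data. The data frame sim, its coefficients, and the helper err() are assumptions for illustration, not the Default data set.

```r
# Simulated data for illustration
set.seed(1101)
n <- 1000
sim <- data.frame(balance = rnorm(n, mean = 800, sd = 400))
is_yes <- rbinom(n, 1, plogis(-6 + 0.006 * sim$balance)) == 1
sim$default <- factor(ifelse(is_yes, "Yes", "No"), levels = c("No", "Yes"))

# Randomly split the rows in half
train_idx <- sample(n, n / 2)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]

# Fit on the training half only
fit <- glm(default ~ balance, data = train, family = binomial)

# Hypothetical helper: misclassification rate at threshold 0.5
err <- function(d) {
  prob <- predict(fit, newdata = d, type = "response")
  pred <- ifelse(prob > 0.5, "Yes", "No")
  mean(pred != d$default)
}

train_err <- err(train)   # error on data the model has seen
test_err  <- err(test)    # error on held-out data
print(c(train = train_err, test = test_err))
```

Changing the seed changes the split, and with it the test error estimate.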
Explain what is the k-fold cross validation approach
- Randomly split the full data set into K folds of equal size.
- training set: K-1 folds, test set: 1 fold
- Iterate the process K times (each fold is the test set once) and then calculate the average test error
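The steps above can be sketched by hand in base R. This uses simulated data and the misclassification rate as the error measure; both are assumptions for illustration.

```r
# Simulated data for illustration
set.seed(1101)
n <- 500
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * sim$x))

K <- 5
fold <- sample(rep(1:K, length.out = n))   # random fold label per row, equal sizes

fold_err <- numeric(K)
for (k in 1:K) {
  # Train on the K-1 other folds, test on fold k
  fit  <- glm(y ~ x, data = sim[fold != k, ], family = binomial)
  prob <- predict(fit, newdata = sim[fold == k, ], type = "response")
  fold_err[k] <- mean((prob > 0.5) != sim$y[fold == k])
}

cv_err <- mean(fold_err)   # average test error over the K folds
print(fold_err)
print(cv_err)
```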
Explain leave-one-out cross validation
K-fold cross validation with K = n
for i in 1..n:
    test set = observation i (1 obs)
    training set = the other n-1 obs
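The loop can be written out directly in R. This is a sketch on a small simulated data set, since the model is refitted n times; the misclassification rate is an assumed error measure.

```r
# Small simulated data set for illustration
set.seed(1)
n <- 100
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(sim$x))

wrong <- logical(n)
for (i in 1:n) {
  # Train on all rows except i, test on row i alone
  fit  <- glm(y ~ x, data = sim[-i, ], family = binomial)
  prob <- predict(fit, newdata = sim[i, , drop = FALSE], type = "response")
  wrong[i] <- (prob > 0.5) != sim$y[i]
}

loocv_err <- mean(wrong)   # share of held-out observations misclassified
print(loocv_err)
```

There is no random split here, so rerunning the loop on the same data gives the same estimate.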
How to predict for multiple observations in R?
need to use the c() function to give one value per observation
df_new = data.frame(student = c("Yes", "No"),
                    balance = c(1500, 1500), income = c(40000, 40000))
predict(glm_fit, newdata = df_new, type = "response")
What does type = "response" mean?
return predicted probabilities (the default returns log-odds)
How to generate confusion matrix in R?
table(glm_prob >= 0.5, Default$default)
where glm_prob holds the predicted probabilities, i.e. the output of predict(), not the model itself
How to write code for K-fold cross validation with K = 5?
library(boot)  # cv.glm() comes from the boot package
set.seed(1101)
cv.glm(Default, glm_fit3, K = 5)$delta[1]
Returns the estimated test error
How to write leave-one-out cross validation?
cv.glm(Default, glm_fit3)$delta[1] # LOOCV
How to get the training error in R?
glm_pred = ifelse(glm_prob_train > 0.5, "Yes", "No")
mean(glm_pred != Default_train$default) # training error