Review Flashcards
We come up with a prediction in terms of probability, then use it to decide
whether the outcome is 0 or 1 (a categorical value)
The transformation is nonlinear, so we use
odds, which is the ratio probability / (1 - probability)
In a logistic regression, the outcome we are predicting is the
log odds
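The relationship between probability, odds, and log odds can be sketched in a few lines. This is only an illustration: the function names and the default threshold of 0.5 are assumptions, not something taken from the cards.

```python
import math

def odds(p):
    """Odds of the event: probability / (1 - probability)."""
    return p / (1.0 - p)

def log_odds(p):
    """Log odds (logit), the quantity a logistic regression models linearly."""
    return math.log(odds(p))

def probability_from_log_odds(z):
    """Invert the logit to map log odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(p, threshold=0.5):
    """Turn a probability into a 0/1 prediction (threshold is an assumed default)."""
    return 1 if p > threshold else 0

p = 0.8
z = log_odds(p)                      # log of (0.8 / 0.2)
back = probability_from_log_odds(z)  # recovers roughly 0.8
label = classify(back)
```

The nonlinear step is the inverse logit: equal steps in log odds do not give equal steps in probability.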
The baseline's goal is to predict
whether an observation will be 0 or 1
The baseline predicts
the most frequent outcome
Which data set do we use to find the outcome of the baseline model?
training
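A minimal sketch of the baseline idea: take the most frequent outcome in the training set and predict it for every test observation. The label lists here are made-up illustration data and the helper names are hypothetical.

```python
from collections import Counter

def baseline_prediction(train_labels):
    """Most frequent outcome in the training labels."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(predictions, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(1 for p, a in zip(predictions, actual) if p == a)
    return correct / len(actual)

# Made-up labels, for illustration only.
train = [0, 0, 1, 0, 1, 0, 0]
test = [0, 1, 0, 0]

base = baseline_prediction(train)                 # learned from training data
baseline_accuracy = accuracy([base] * len(test), test)
```

Any model worth keeping should beat `baseline_accuracy` on the same test data.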
How to build a regression tree
Split on the independent variables (IVs) and predict the most frequent outcome in each split
How to come up with a prediction
Count the number of each outcome per split
We choose how to define splits but use the definition
consistently throughout the model
How to decide where to split
First decide on the objective: minimize error (the points that are misclassified) or maximize accuracy. Then try different split points and select the one that minimizes error or maximizes accuracy
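The search over split points can be sketched for a single numeric variable. This assumes the objective is the count of misclassified points and that candidate cuts are the midpoints between neighboring values; both choices, the function names, and the data are illustrative assumptions.

```python
from collections import Counter

def misclassified(labels):
    """Points that disagree with the most frequent outcome in a group."""
    if not labels:
        return 0
    majority_count = Counter(labels).most_common(1)[0][1]
    return len(labels) - majority_count

def best_split(xs, ys):
    """Try each midpoint between sorted unique x values; keep the lowest error."""
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        errors = misclassified(left) + misclassified(right)
        if best is None or errors < best[1]:
            best = (cut, errors)
    return best

# Made-up data: small x values are mostly 0, large ones mostly 1.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
cut, errors = best_split(xs, ys)
```

A full tree repeats this search inside each resulting group, which is why the result is greedy rather than guaranteed optimal.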
In most cases the algorithm isn't exact, so it returns
the best tree found, not necessarily the optimal one
The ANOVA class is for predicting a variable that is
unbounded (continuous), whereas classification is for a probability between 0 and 1
A continuous prediction is defined without a threshold. How?
By the most frequent value or averages
A classification problem deals with the probability of an outcome that is either 0 or 1, and when we use probability we always have
a threshold, specificity, sensitivity -> ROC curve
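How the threshold trades sensitivity against specificity can be sketched as follows. The probabilities and labels are made-up illustration data, and the rule "predict 1 when the probability is above the threshold" is an assumed convention.

```python
def sensitivity_specificity(probs, actual, threshold):
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for p, a in zip(preds, actual) if p == 1 and a == 1)
    fn = sum(1 for p, a in zip(preds, actual) if p == 0 and a == 1)
    tn = sum(1 for p, a in zip(preds, actual) if p == 0 and a == 0)
    fp = sum(1 for p, a in zip(preds, actual) if p == 1 and a == 0)
    return tp / (tp + fn), tn / (tn + fp)

# Made-up predicted probabilities and true outcomes.
probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
actual = [1, 1, 0, 1, 0, 0]

# Each threshold gives one (sensitivity, specificity) pair;
# sweeping the threshold traces out the ROC curve.
roc_points = {t: sensitivity_specificity(probs, actual, t) for t in (0.2, 0.5, 0.7)}
```

Lowering the threshold raises sensitivity and lowers specificity, and raising it does the reverse.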
A single regression tree has high variance, so its predictions will have
high variability
How to fix the high variability issue
Build multiple trees in a random forest, then make the prediction based on the most frequent outcome across the trees
- Build 5 trees; once we have predictions from these trees, make the prediction based on the most frequent outcome -> can use the idea of thresholds too. How?
If 3/5 trees predict 1, then P(y = 1) is 0.6; compare this number to the threshold
If the threshold is 0.7, predict 0 because 0.6 <= 0.7
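The five-tree example can be written out directly. This sketch assumes each tree returns a 0/1 prediction and that the forest predicts 1 only when the vote share is strictly above the threshold, matching the 0.6 <= 0.7 case; the function names are hypothetical.

```python
def vote_probability(tree_predictions):
    """Share of trees predicting 1, used as an estimate of P(y = 1)."""
    return sum(tree_predictions) / len(tree_predictions)

def forest_predict(tree_predictions, threshold=0.5):
    """Predict 1 only if the vote share is strictly above the threshold."""
    return 1 if vote_probability(tree_predictions) > threshold else 0

votes = [1, 0, 1, 1, 0]          # 3 of 5 trees predict 1
p = vote_probability(votes)      # 3 / 5

strict = forest_predict(votes, threshold=0.7)    # vote share is not above 0.7
majority = forest_predict(votes, threshold=0.5)  # plain majority vote
```

With a threshold of 0.5 and an odd number of trees, this reduces to predicting the most frequent outcome.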
What do we randomly select when we build many trees?
A subset of variables. The forest uses random rows and columns to make a new, random data frame, builds a tree from it, and repeats this for multiple trees to make a prediction on the outcome
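The random row and column selection might look like the sketch below. It assumes rows are drawn with replacement (a bootstrap sample) and that one variable subset is drawn per tree, as the card describes; many implementations instead redraw the variable subset at every split. The column names, data, and function name are made up for illustration.

```python
import random

def random_subframe(rows, feature_names, n_features, rng):
    """Bootstrap the rows and keep only a random subset of feature columns."""
    chosen = rng.sample(feature_names, n_features)   # columns, without replacement
    sampled = [rng.choice(rows) for _ in rows]       # rows, with replacement
    frame = []
    for row in sampled:
        new_row = {name: row[name] for name in chosen}
        new_row["y"] = row["y"]                      # always keep the outcome
        frame.append(new_row)
    return chosen, frame

# Made-up data frame as a list of dicts; "y" is the outcome column.
rows = [
    {"age": 25, "income": 40, "visits": 3, "y": 0},
    {"age": 47, "income": 85, "visits": 1, "y": 1},
    {"age": 33, "income": 52, "visits": 7, "y": 0},
    {"age": 58, "income": 91, "visits": 2, "y": 1},
]
features = ["age", "income", "visits"]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
frames = [random_subframe(rows, features, 2, rng) for _ in range(5)]
```

Each entry in `frames` would be the training data for one tree, and the trees' predictions are then combined by voting.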
N tree (the number of trees) is
a parameter, usually between 200 and 500
Adjusting the parameter doesn't really help because
the default is good, so just run the algorithm
A random forest is a generalization of regression trees and usually performs better, but the disadvantage is that it is
less interpretable