Advanced Regression Flashcards
What contexts can trees be used in?
-classification
-regression
-decision making
-logistic regression
Trees in regression
split the data into branches, where each branch gets its own regression that fits its part of the data better than one overall regression would
How it’s done in practice
-not a full regression in each leaf
-a simple regression using only a constant term:
-y = a0
-a0 = (sum of yi over the data points in the node) / (number of data points in the node) = the average response in the node
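A minimal sketch of this in Python (assuming scikit-learn is available; the data is made up): DecisionTreeRegressor builds exactly this kind of tree, where each leaf predicts the average response of its training points.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy data: one predictor, response jumps from ~2 to ~8 at x = 5
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.where(X[:, 0] < 5, 2.0, 8.0) + rng.normal(0, 0.5, size=200)

# each leaf predicts the mean response of the training points
# that land in it -- the constant term a0 from the card above
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(tree.predict([[3.0], [7.0]]))  # roughly [2.0, 8.0]
```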
Branching method
grow the tree using half the data:
-for each leaf
-calculate the variance of the responses in the leaf
-try splitting on each factor; find the split with the biggest variance decrease, and make that split if the decrease is more than a threshold (see the sketch after this card)
-repeat until no split decreases variance by more than the threshold
then, using the other half of the data, for each branching point:
-calculate the estimation error with and without the branch
-if the branch increases the error, remove (prune) it
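A rough sketch of the split search on one factor (plain NumPy; best_split is a hypothetical helper, not the full algorithm): scan candidate thresholds and keep the one with the largest variance decrease.

```python
import numpy as np

def best_split(x, y, min_frac=0.05):
    """Return (threshold, variance decrease) for the best split on
    one factor, or None if no split passes the size check."""
    n = len(y)
    base = np.var(y)                  # variance before splitting
    best = None
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        # rule of thumb: keep at least 5% of the data on each side
        if min(len(left), len(right)) < min_frac * n:
            continue
        # data-size-weighted variance after the split
        after = (len(left) * np.var(left) + len(right) * np.var(right)) / n
        gain = base - after
        if best is None or gain > best[1]:
            best = (t, gain)
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0] * 5)
y = np.array([1.0, 1.0, 1.0, 9.0, 9.0, 9.0] * 5)
print(best_split(x, y))  # splits at 3.0 with variance decrease 16.0
```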
Other branching methods
key idea
-use a metric related to the model's quality
-find the best factor to branch with
-check: did this branch really improve the model?
-if not, prune the branch back
rejecting a potential branch
-low improvement benefit
-one side of the branch has too few data points
-rule of thumb: each leaf should contain at least 5% of the original data
overfitting can be costly; make sure the benefit of each branch is greater than its cost
random forest: differences from basic branching
-introduce randomness
-generate many different trees
-each with different strengths and weaknesses
-the average of many trees may be better than any single tree
How do we introduce randomness into the trees of a random forest?
1. bootstrapping
-give each tree a slightly different data set (see the sketch after this card)
-if we start with n data points, each tree gets n data points drawn with replacement, so its sample can contain multiple copies of one point and no copies of another
2. branching
-at each branch, don't consider every factor
-randomly choose a small set X of factors (commonly 1 + log(n) of them, where n is the total number of factors)
-choose the best factor within X to branch on
-don't prune the trees
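A minimal sketch of the bootstrapping step (NumPy; the sizes are made up): draw n row indices with replacement for each tree.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10            # number of data points
n_trees = 3

for t in range(n_trees):
    # with replacement: some rows repeat, others are left out entirely
    idx = rng.integers(0, n, size=n)
    print(f"tree {t}: rows {sorted(idx)}")
```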
Results of random forest
-each tree in the forest has slightly different data
-end up with many different trees (usually 500-1000) - a random forest
-each tree may give us a different regression model (which to choose?)
How do you pick the tree in the random forest?
You don't use a single one!
-if it's a regression tree, use the average predicted response across all the trees in your forest
-if it's a classification tree, use the mode (the most common predicted class)
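A usage sketch (assuming scikit-learn; synthetic data): RandomForestRegressor grows many trees on bootstrap samples with random factor subsets, then averages their predictions, as described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 2))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=300)

# 500 trees; the forest's prediction is the average over the trees
forest = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
print(forest.predict([[5.0, 5.0]]))  # close to 2*5 + 5 = 15
```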
Benefits of random forest
-better overall estimates: individual trees may overfit, but they don't all overfit the same way
-averaging across trees somewhat neutralizes the overfitting
drawbacks of random forest
-harder to explain/interpret the results
-doesn't tell us how the variables interact, or why a certain sequence of branches is helpful or meaningful, because every tree's branches are different
random forest good in what situation?
-as a default / "black box" model
-when there's no good reason to try something else
-not good for detailed insight into what's going on
Explainability/interpretability
how easy or difficult it is to know how models create their output
ex of explainability: linear regression
yi = a0 + sum(j = 1 to n) aj*xij
how is the value of y affected by different values of the predictors?
the baseline is a0, and each coefficient aj controls the "weight" of its variable, i.e. how much it impacts the response
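A small illustration (scikit-learn; the data and coefficients are made up): after fitting, the intercept is the baseline a0 and each coefficient is the per-unit weight of its predictor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, size=200)

model = LinearRegression().fit(X, y)
print(model.intercept_)  # baseline a0, about 1.0
print(model.coef_)       # variable weights, about [3.0, -2.0]
```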
ex of explainability: linear regression tree
-all of the results are conditional on the branch a data point falls into
-even harder with a random forest
-it does give the relative importance of each variable for branching
-but not how each variable affects the response, so it's not precise
explainability tradeoff
-the more we know about how a model gets its results, the better we can understand why it works
-and the better we can explain it to decision makers, so our recommendations get implemented
-sometimes explainability is a legal requirement
-less explainable models sometimes give better results
use of logistic regression
-when the response is a probability between 0 and 1
logistic regression model
Standard linear regression
y = a0 + a1x1 + a2x2 + … + ajxj
logistic regression
p = the probability of the event you want to observe
log(p/(1-p)) = a0 + a1x1 + a2x2 + … + ajxj
p = 1 / (1 + e^-(a0 + a1x1 + a2x2 + … + ajxj))
if a0 + a1x1 + a2x2 + … + ajxj goes to -infinity, then p goes to 0
if a0 + a1x1 + a2x2 + … + ajxj goes to infinity, then p goes to 1
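A quick numeric check of the logit/sigmoid relationship (plain NumPy; the coefficients are made up):

```python
import numpy as np

def sigmoid(z):
    # logistic function: maps the linear term z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

a0, a1 = -1.0, 2.0                      # made-up coefficients
x = np.array([-10.0, 0.0, 0.5, 10.0])
z = a0 + a1 * x                         # the linear part of the model
print(sigmoid(z))  # ~0 for very negative z, 0.5 at z = 0, ~1 for large z
```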
logistic regression curve
-looks like an s
-coefficients can change the shape
-data points are all at 0 or 1
logistic vs linear regression
-similarities
-transformations of input data
-consider interaction terms
-variable selection
-logistic regression trees
-random logistic regression forests
log. reg. differences
-takes longer to compute
-no closed-form solution; the coefficients must be fit numerically
-model quality is measured differently
logistic vs linear regression measures of model quality
lin. reg.
-r squared - fraction of variance explained by the model
log reg.
pseudo r squared value
-not really measuring fraction of variance
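One common variant is McFadden's pseudo R-squared (an assumption here, since the card doesn't name which one): it compares the fitted model's log-likelihood to that of an intercept-only model, rather than measuring explained variance.

```python
def mcfadden_r2(loglike_model, loglike_null):
    # 1 minus the ratio of log-likelihoods (fitted vs. intercept-only);
    # not a fraction of variance explained
    return 1.0 - loglike_model / loglike_null

# made-up log-likelihoods for illustration
print(mcfadden_r2(-120.0, -200.0))  # 0.4
```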
logistic regression classifications
thresholding
-answer yes if probability p is at least some number
-otherwise no
ex
if p >= 0.7, give the loan; otherwise don't
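The same rule in code (NumPy; the 0.7 cutoff is from the example above):

```python
import numpy as np

p = np.array([0.91, 0.42, 0.70, 0.65])  # predicted probabilities
give_loan = p >= 0.7                    # threshold the probabilities
print(give_loan)  # [ True False  True False]
```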
receiver operating characteristic (ROC) curve
y-axis = sensitivity = (true positives) / (true positives + false negatives)
x-axis = 1 - specificity, where specificity = (true negatives) / (true negatives + false positives)
area under the curve (AUC) - the probability that the model scores a random "yes" point higher than a random "no" point (ex: loans)
AUC = 0.5 means the model is just guessing
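A quick sketch with scikit-learn (the labels and scores are made up):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                    # actual yes/no outcomes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]  # model probabilities

# AUC: probability a random "yes" is scored above a random "no"
print(roc_auc_score(y_true, y_score))                # 0.9375 here

fpr, tpr, _ = roc_curve(y_true, y_score)  # x = 1 - specificity, y = sensitivity
```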
value of roc/auc
-gives a quick-and-dirty estimate of quality
-but does not differentiate between the costs of false negatives and false positives
-to find the highest-value solution, use a confusion matrix
confusion matrix
shows, for each data point, whether the model classified it correctly or got it confused
-TP - point is in the category, and the model correctly says it is
-FP - point is not in the category, but the model says it is
-TN - point is not in the category, and the model correctly says it isn't
-FN - point is in the category, but the model says it isn't
confusion matrix guidelines
positive - model says it's in the category
negative - model says it's not in the category
true - model got it right
false- model got it wrong
sensitivity
fraction of category members that are correctly identified
sensitivity = TP / (TP + FN)
specificity
fraction of non-category members that are correctly identified
specificity = TN / (TN + FP)
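A compact sketch tying the two together (plain Python; the counts are made up):

```python
# made-up counts from a confusion matrix
tp, fp, tn, fn = 40, 10, 35, 15

sensitivity = tp / (tp + fn)  # fraction of category members caught
specificity = tn / (tn + fp)  # fraction of non-members correctly rejected
print(sensitivity, specificity)  # ~0.73, ~0.78
```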
cost of lost productivity
how much each box in the confusion matrix costs (TP and TN are usually 0, since correct answers cost nothing)
-the total expected cost depends on the percentage of each response you're expecting
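A minimal expected-cost calculation (the costs are made up; TP and TN cost 0, as noted above):

```python
# hypothetical per-error costs
cost_fp = 500    # e.g., acting when we shouldn't have
cost_fn = 2000   # e.g., missing a real case

# error counts from the confusion matrix
fp, fn = 10, 15
total_cost = fp * cost_fp + fn * cost_fn
print(total_cost)  # 35000
```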
Poisson regression
-use when we think the response follows a Poisson distribution
f(z) = (lambda^z)(e^-lambda) / z!
-the regression estimates lambda as a function of the predictors x
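A minimal sketch with statsmodels (synthetic count data; the true coefficients 0.5 and 1.0 are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(0, 2, size=500)
# Poisson response whose rate depends on x: lambda(x) = exp(0.5 + 1.0*x)
y = rng.poisson(np.exp(0.5 + 1.0 * x))

X = sm.add_constant(x)  # adds the intercept column
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)       # roughly [0.5, 1.0]
```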
regression splines
splines - functions made of polynomial pieces that connect smoothly to each other
-allows us to fit different functions to different parts of the data set
-smooth connections between the pieces ensure we don't get drastically different answers for nearby points
-order k: the polynomial pieces are order k
-ex: multivariate adaptive regression splines (MARS), also called "earth"
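A bare-bones regression spline fit (NumPy; the knot location and data are made up): order-1 polynomial pieces joined continuously via a truncated-power basis, fit by least squares.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.where(x < 5, x, 10 - x) + rng.normal(0, 0.2, size=200)  # "tent" shape

# basis: 1, x, and max(0, x - knot); the max() term lets the slope
# change at the knot while keeping the fitted curve connected
knot = 5.0
B = np.column_stack([np.ones_like(x), x, np.maximum(0.0, x - knot)])
coef, *_ = np.linalg.lstsq(B, y, rcond=None)
print(coef)  # roughly [0, 1, -2]: slope 1 before the knot, -1 after
```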
Bayesian regression
start with
-data
-a prior estimate of how the regression coefficients and the random error are distributed
ex: predict how tall a child will be as an adult based on
-data
-an expert's opinion as the starting (prior) distribution
use Bayes' theorem to update the estimate
-most helpful when we don't have much data
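A compact sketch of the update for Bayesian linear regression with a Gaussian prior and known noise variance (NumPy; all the numbers, including the expert's prior, are made up): the posterior blends the prior with the data, and with little data the prior dominates.

```python
import numpy as np

# expert's prior: adult height ~ 120 + 0.5 * child height (made up)
mu0 = np.array([120.0, 0.5])   # prior mean of the coefficients
S0 = np.diag([100.0, 0.25])    # prior covariance (uncertainty)
sigma2 = 9.0                   # assumed known noise variance

# only a few observations, so the prior carries real weight
X = np.array([[1.0, 100.0], [1.0, 110.0], [1.0, 120.0]])
y = np.array([172.0, 176.0, 181.0])

# standard conjugate update for the coefficient posterior
S0_inv = np.linalg.inv(S0)
Sn = np.linalg.inv(S0_inv + X.T @ X / sigma2)   # posterior covariance
mun = Sn @ (S0_inv @ mu0 + X.T @ y / sigma2)    # posterior mean
print(mun)  # a compromise between the prior and the least-squares fit
```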