Advanced Regression Flashcards
What contexts can trees be used in?
-classification
-decision making
-logistic regression
Trees in regression
split the data into branches, where each branch gets its own regression model that fits its subset of the data better
How it’s done in practice
-not a full regression
-simple regression using only a constant term
-y = a0
a0 = (sum of yi over data points i in the node) / (# data points in the node) = average response in the node
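The constant-leaf formula above can be sketched in a few lines of Python; the `node_responses` values are a hypothetical example:

```python
# Each leaf predicts a constant: the average response of the
# training points that fall into that node (y = a0).
node_responses = [3.0, 5.0, 4.0, 6.0]  # hypothetical y values in one node
a0 = sum(node_responses) / len(node_responses)  # average response in node
```

Every point that lands in this node gets the same prediction, a0.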
Branching method
start with half the data
-for each leaf
-calculate variance
-split on each factor; find the split with the biggest variance decrease, and make that split (if the decrease is more than a threshold)
repeat until no split decreases variance more than threshold
then, using the other half of the data, for each branching point:
-calculate estimation error with and without branching
-if branching increases error, remove branching
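The split-search step above can be sketched as follows. This is a simplified single-factor version (a real tree scans every factor); `best_split` and its threshold parameter are names introduced here for illustration:

```python
def variance(ys):
    """Population variance of a list of responses."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_split(points, threshold=0.0):
    """Greedy search for the split value that most reduces variance.

    points: list of (x, y) pairs for one factor.
    Returns the split value, or None if no split beats the threshold.
    """
    ys = [y for _, y in points]
    base = variance(ys) * len(ys)  # total squared error before splitting
    best, best_drop = None, threshold
    for x_split, _ in points:
        left = [y for x, y in points if x <= x_split]
        right = [y for x, y in points if x > x_split]
        if not left or not right:
            continue  # reject: one side of the branch would be empty
        drop = base - (variance(left) * len(left) + variance(right) * len(right))
        if drop > best_drop:
            best, best_drop = x_split, drop
    return best
```

For data like [(1, 1), (2, 1), (3, 10), (4, 10)], the search picks x <= 2, which separates the low responses from the high ones and drives both sides' variance to zero.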
Other branching methods
key idea
-use a metric related to the model's quality
-find the best factor to branch with
-check: did this really improve the model?
if not, prune the branch back
rejecting a potential branch
-low improvement benefit
-one side of the branch has too few data points
-rule of thumb: each leaf contains at least 5% of the original data
overfitting can be costly; make sure the benefit of each branch is greater than the cost
random forest differences from branching
-introduce randomness
-generate many different trees
-different strengths and weaknesses
-average may be better than a single tree?
How do we introduce randomness to random trees?
- bootstrapping process
give each tree a slightly different set of data
if we start with n data points, each tree gets n points sampled with replacement, so it may have multiple copies of one point and none of another
2. branching
-choose 1 factor at a time
-randomly choose a small set X of factors (commonly 1 + log(n) of them, where n is the total number of factors)
-choose the best factor within X to branch on
-don't prune the tree
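Both sources of randomness above can be sketched together; `bootstrap_sample` and `random_factor_subset` are illustrative names, and the 1 + log(n) count follows the rule of thumb in the notes:

```python
import math
import random

def bootstrap_sample(data, rng):
    """Draw n points with replacement: some points appear
    multiple times, others not at all."""
    return [rng.choice(data) for _ in range(len(data))]

def random_factor_subset(n_factors, rng):
    """At each split, consider only a small random subset of the
    factors (roughly 1 + log(n) of them)."""
    k = 1 + int(math.log(n_factors))
    return rng.sample(range(n_factors), k)

rng = random.Random(0)           # fixed seed so the sketch is repeatable
data = list(range(10))           # hypothetical data points
sample = bootstrap_sample(data, rng)       # one tree's training data
subset = random_factor_subset(20, rng)     # factors considered at one split
```

Because each tree sees a different bootstrap sample and a different factor subset at each split, no two trees in the forest end up identical.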
Results of random forest
-each tree in the forest has slightly different data
-end up with many different trees (usually 500-1000) (the random forest)
-each tree may give us a different regression model (which to choose?)
How do you pick the tree in the random forest?
You don’t use a single one!
-if it's a regression tree, use the average predicted response across all the trees in your forest
-if it's a classification tree, use the mode (most common predicted response)
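The two aggregation rules above can be sketched with the standard library; the prediction lists are hypothetical outputs from a five-tree forest:

```python
from statistics import mean, mode

# Hypothetical per-tree predictions from a small forest of 5 trees.
regression_preds = [2.1, 2.4, 1.9, 2.0, 2.6]
classification_preds = ["yes", "no", "yes", "yes", "no"]

forest_regression = mean(regression_preds)          # average across trees
forest_classification = mode(classification_preds)  # most common response
```

No single tree's answer is used directly; the forest's answer is the average (regression) or the majority vote (classification).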
Benefits of random forest
better overall estimates: some trees may overfit, but they don't all overfit the same way
-averaging across trees somewhat neutralizes the over-fitting
drawback of random forest
-harder to explain/interpret the results
-doesn't tell us how the variables interact or why a certain sequence of branches is helpful or meaningful, because all the trees' branches are different
random forest good in what situation?
-good as a black-box / default model
-when there's no good reason to try something else
-not good for detailed insight into what's going on
Explainability/interpretability
how easy or difficult it is to know how models create their output
ex of explainability: linear regression
y = a0 + sum over j = 1 to n of aj * xij
how is the value of y affected by different values of the predictors?
the baseline is a0, and each coefficient controls the "weight" of each variable, i.e. how much it impacts the response
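The coefficient-as-weight reading above can be shown directly; the fitted values a0, a1, a2 are hypothetical:

```python
# Hypothetical fitted linear model: y = a0 + a1*x1 + a2*x2
a0, a1, a2 = 10.0, 2.0, -0.5

def predict(x1, x2):
    return a0 + a1 * x1 + a2 * x2

# Explainability: raising x1 by 1 raises y by exactly a1,
# holding x2 fixed -- the coefficient IS the effect.
delta = predict(3.0, 4.0) - predict(2.0, 4.0)
```

This direct coefficient-to-effect reading is exactly what gets lost once predictions are conditional on which branch of a tree a point falls into.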
ex of explainability: linear regression tree
all of the results are conditional based on the branch they are in
-even harder with random forest
-it does give the relative branching importance of each variable
-but not how each variable matters, so it's not precise