predictive analytics; prediction Flashcards
confidence interval
quantifies the uncertainty surrounding a mean, i.e. the average over a large number of observations
prediction interval
quantifies the uncertainty surrounding the prediction of a single quantity; wider than the confidence interval at the same level
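A minimal sketch of the difference between the two intervals, for the simplest case of a sample mean. It assumes roughly normal data and uses a z quantile rather than a t quantile, which is a simplification that is only reasonable for larger samples. The function name `intervals` and the data are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def intervals(sample, level=0.95):
    """Return (confidence_interval, prediction_interval) for a sample.

    Assumption: roughly normal data, z quantile instead of t quantile.
    """
    n = len(sample)
    m = mean(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 denominator)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    ci_half = z * s / sqrt(n)          # uncertainty about the mean
    pi_half = z * s * sqrt(1 + 1 / n)  # uncertainty about one new value
    return (m - ci_half, m + ci_half), (m - pi_half, m + pi_half)

# Hypothetical data, made up for illustration
sample = [12.1, 9.8, 11.4, 10.6, 10.9, 9.5, 11.8, 10.2, 10.7, 11.0]
conf, pred = intervals(sample)
print("confidence interval:", conf)
print("prediction interval:", pred)
```

The confidence interval shrinks as n grows; the prediction interval does not shrink to zero, because a single new value still varies around the mean.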
machine learning definition
study of algorithms applied to data, focussing on prediction and classification
machine learning characteristics
- no explicit formulae as in regression; instead, rules that produce yhat (the predicted value)
- translated into computer code and automated
- predictions generated quickly and repeated without intervention
- reduces human error in choosing variables
- challenges of interpretability and bias
regression process
- hypothesis
- select variables
- train
- test
- select best model
- refine
problems of machine learning
1) good for prediction, not for explanation
2) black box (we don’t know why the machine chooses its variables)
3) ethical (sub-optimal outcomes like discrimination)
black swan
extreme outlier that has a disproportionate effect on the fitted model (can drive overfitting)
overfitting
- we only see a particular subset of the data; the truth is something bigger that we don’t observe
- the model tries so hard to explain what we see that it is poor at explaining what we don’t see
avoiding overfitting
- split data into training and testing
- training: subset of the data, e.g. 80%, used to estimate the formula
- testing: remaining 20% of the data, used to test how well the model predicts observations it has not seen
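The split above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the data are simulated, and `train_test_split`, `fit_line`, `fit_memoriser` and `mse` are hypothetical helper names. The "memoriser" overfits on purpose by repeating the outcome of the nearest training point.

```python
import random

def train_test_split(rows, test_share=0.2, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_share)
    return rows[n_test:], rows[:n_test]

def fit_line(train):
    """Least-squares fit of y = a + b * x on (x, y) pairs."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_memoriser(train):
    """Overfit on purpose: predict the y of the nearest training x."""
    return lambda x: min(train, key=lambda row: abs(row[0] - x))[1]

def mse(model, rows):
    """Mean squared error of a model on (x, y) pairs."""
    return sum((y - model(x)) ** 2 for x, y in rows) / len(rows)

# Hypothetical data: y = 2 + 3x plus random noise
rng = random.Random(1)
data = [(x, 2 + 3 * x + rng.gauss(0, 2)) for x in range(50)]
train, test = train_test_split(data)  # 80% train, 20% test

for name, fit in [("line", fit_line), ("memoriser", fit_memoriser)]:
    model = fit(train)
    print(name, "train MSE:", round(mse(model, train), 2),
          "test MSE:", round(mse(model, test), 2))
```

The memoriser has zero training error by construction, so training error alone would make it look perfect. Its error on the held-out 20% is typically much worse, which is what the test set is there to reveal.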
R squared
never decreases when a variable is added, even an irrelevant one
adjusted R squared
- penalises for number of variables
- asks whether an extra variable adds enough explanatory power to justify including it
- can go down if an irrelevant variable is included
- bigger is better = better fit after accounting for the number of variables
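A short sketch of the penalty, using the standard formula adjusted R squared = 1 - (1 - R squared)(n - 1)/(n - p - 1), where n is the number of observations and p the number of variables. The function name and the numbers are hypothetical, chosen only to show the direction of the effect.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R squared for n observations and p explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical case: a fourth, nearly irrelevant variable lifts
# R squared only from 0.800 to 0.801 on 50 observations.
before = adjusted_r_squared(0.800, n=50, p=3)
after = adjusted_r_squared(0.801, n=50, p=4)
print(round(before, 4), round(after, 4))
```

R squared went up slightly, but the adjusted value goes down, because the gain is too small to offset the penalty for the extra variable.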
mean squared error
- common means of selecting a model
- smaller is better = less error
- best model = one with the lowest MSE
- high R squared implies low MSE (on the same data)
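A minimal sketch of selecting a model by MSE, and of why a high R squared goes with a low MSE on the same data: MSE = (1 - R squared) x (total sum of squares) / n. The observed values and the two sets of predictions are made up for illustration.

```python
def mse(y, yhat):
    """Mean squared error: average of squared prediction errors."""
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def r_squared(y, yhat):
    """Share of the variation in y that the predictions account for."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Hypothetical observed values and predictions from two candidate models
y = [3.0, 5.0, 7.0, 9.0, 11.0]
candidates = {
    "model_a": [3.5, 4.5, 7.5, 8.5, 11.5],
    "model_b": [5.0, 5.0, 7.0, 9.0, 9.0],
}

scores = {name: mse(y, yhat) for name, yhat in candidates.items()}
best = min(scores, key=scores.get)  # the model with the lowest MSE

for name, yhat in candidates.items():
    print(name, "MSE:", scores[name], "R squared:", r_squared(y, yhat))
print("selected:", best)
```

Here model_a has the lower MSE and the higher R squared, so both measures rank the candidates the same way. That agreement holds when both are computed on the same data; for choosing between models, the comparison is more informative on test data.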
regression tree
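- a model produced by machine learning that predicts a numeric outcome
- segments the inputs into mutually exclusive and exhaustive regions
- branches connect decision nodes down to terminal nodes (leaves); each leaf gives the prediction for its region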
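A minimal sketch of the smallest possible regression tree: one decision node and two leaves (a "stump"). It assumes a single input variable with at least two distinct values, and it predicts the mean outcome in each region. The names `fit_stump` and `predict` and the data are hypothetical; real trees repeat the same split search within each region.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean of the values."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_stump(xs, ys):
    """Find the single split on x that minimises the total squared error.

    Returns (threshold, left_mean, right_mean).
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (cost, threshold, mean(left), mean(right))
    return best[1:]

def predict(stump, x):
    """Follow the branch for x and return the prediction of its leaf."""
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

# Hypothetical data with a jump between x = 3 and x = 4
xs = [1, 2, 3, 4, 5, 6]
ys = [10.0, 11.0, 9.0, 20.0, 21.0, 19.0]
stump = fit_stump(xs, ys)
print(stump)              # (threshold, left leaf, right leaf)
print(predict(stump, 2))  # falls in the left region
print(predict(stump, 5))  # falls in the right region
```

Every x falls in exactly one of the two regions (below the threshold or not), which is what mutually exclusive and exhaustive means here.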