MT1 Flashcards
Prediction
Given input X, we are interested in predicting the output, Y.
Complicated models are good at prediction, but hard to understand.
100% Prediction: We care more about prediction accuracy, and will sacrifice interpretability for it.
Inference
Given input X, we are interested in understanding its relationship with Y.
100% Inference: We care more about interpretability, and will sacrifice accuracy for it.
Estimating f
- Gather data from a subset of the population of interest (through experimentation, observation, etc.), because it is usually impossible to sample the entire true population.
We now have a set of TRAINING data where predictor X and response Y are BOTH known. The true relationship f between X and Y will never be known, but we want to get as close as possible.
- We want to predict what future unknown Y values will be based on given X values.
- Using the gathered data, we can try out different models, refine them through testing, and pick the one that minimizes the residual error; we then use that model to predict future values.
- We can split the original data set into a training set and a testing set, and evaluate the chosen model on the testing set (a sketch of this workflow follows below).
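A minimal sketch of this train/test workflow, assuming scikit-learn is available; the simulated data and the linear model choice are just for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Simulated gathered data: X and Y are both known, the true f is not.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2, size=100)   # noise plays the role of irreducible error

# Hold out part of the data as a testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit a candidate model on the training set, then check it on the testing set.
model = LinearRegression().fit(X_train, y_train)
print("testing MSE:", mean_squared_error(y_test, model.predict(X_test)))
```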
Parameters
Quantities such as the mean, standard deviation, and proportions are important values called the "parameters" of the TRUE population.
Since we will never know these true parameters, we calculate estimates of them from the sample data (the subset) taken from the population. These estimates are called "statistics".
Statistics are estimates of the parameters.
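A tiny illustration of the parameter/statistic distinction, assuming NumPy; the "population" here is simulated, since in practice we never observe it:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend this is the entire true population (in reality we never see it all).
population = rng.normal(loc=50, scale=10, size=1_000_000)
true_mean = population.mean()        # a PARAMETER of the true population

# We only get to observe a sample (subset) drawn from it.
sample = rng.choice(population, size=100, replace=False)
sample_mean = sample.mean()          # a STATISTIC: our estimate of the parameter

print(f"parameter (true mean):   {true_mean:.2f}")
print(f"statistic (sample mean): {sample_mean:.2f}")
```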
Parametric vs Non-parametric
Parametric: procedures rely on assumptions about the shape of the distribution of the underlying population from which the sample was taken. Most common assumption is that population is normally distributed. Generally better at inference.
Non-parametric: procedures make no assumptions about the underlying population. The model structure is determined by the data. Generally better at prediction.
CAVEAT: "connect the dots" is a perfect non-parametric fit to the training data, but gives horrible predictions.
Response variable
The response variable Y will generally be either categorical (color, shape, etc.) or numeric.
MSE
MEAN squared error: for each observation in the training data, take the distance from the response value Y to the predicted response value (on the prediction line) at the given X value, square it, and average those squared distances over all observations.
We want to find a line that minimizes the MSE for FUTURE predictions. THIS IS WHAT MAKES A GOOD MODEL!!!!!! Minimize the mean squared error for FUTURE observations.
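In symbols (the standard definition, not spelled out above), with ŷᵢ the model's prediction for observation i:

MSE = (1/n) * Σᵢ (yᵢ − ŷᵢ)²

A minimal NumPy sketch of the same calculation:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mse([3.0, 5.0, 7.0], [2.5, 5.5, 8.0]))   # 0.5
```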
Overfitting
Adding flexibility to the model (e.g. going from linear to quadratic regression) will always decrease the MSE on the training data, but not necessarily the TESTING MSE.
e.g. connect the dots fits the training data perfectly (0 MSE), but does horribly on future observations.
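A sketch of this effect, assuming scikit-learn; the simulated data and polynomial degrees are made up for the example:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=60)   # smooth true relationship plus noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=2)

# Training MSE keeps dropping as flexibility grows; testing MSE eventually climbs back up.
for degree in (1, 3, 9, 15):
    fit = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, fit.predict(X_tr))
    test_mse = mean_squared_error(y_te, fit.predict(X_te))
    print(f"degree {degree:2d}: training MSE {train_mse:.3f}, testing MSE {test_mse:.3f}")
```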
Irreducible error
The inherent natural variability of the true population of interest.
Error due to squared bias
This is a REducible error.
The inability of a statistical method to capture the TRUE relationship in the data.
If the average of a model's predictions across different testing data is substantially different from the TRUE response values, that model is said to have high bias.
e.g. if we fit a linear model to data whose true relationship is quadratic, it will have a higher MSE; it has high bias.
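A small demonstration of that example, assuming scikit-learn; the data are simulated from a quadratic truth:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=300)   # true relationship is quadratic

# A straight line cannot bend to follow the curve: it misses systematically (high bias).
linear = LinearRegression().fit(X, y)
for x0 in (-2.0, 0.0, 2.0):
    pred = linear.predict([[x0]])[0]
    print(f"x = {x0:+.1f}: true f(x) = {x0 ** 2:.1f}, linear prediction = {pred:.2f}")
```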
Error due to variance
This is a REducible error.
The amount by which the MSE of a model fit varies across data sets.
- We have a set of training data.
- We choose a statistical method, and apply it to that training data, which generates a model fit representing a relationship (hopefully the true relationship), and a resulting MSE from that fit.
- We then apply that model to a new set of testing data, which results in a new MSE for the predictions.
- The difference between the MSE of the training fit and the MSE on the testing data is called the variance.
- If the MSE difference is very high, the model has high variance, and if the MSE difference is very low, the model has low variance.
Variance is only concerned with how much the MSE of our chosen model fit varies between different data sets, NOT with how accurate its predictions are.
If we fit a highly flexible model (e.g. a high-degree polynomial) to data whose true relationship is linear or close to linear (not flexible), it will fit the training data very well: the prediction line will pass through, or very close to, the true response values (MSE ~0, aka low bias). But once we apply that model to a new data set, its predictions may sometimes be good (low MSE), and sometimes the true response values will no longer fall close to the line (since the line was so specific to the training data), resulting in a much larger MSE. The MSE varies a lot, meaning it is hard to predict how well this model will fit future data sets.
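A sketch of that behavior, assuming scikit-learn; the data-generating function, sample sizes, and polynomial degrees are invented for the illustration:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)

def draw_dataset(n=30):
    """Simulate a fresh data set whose true relationship is linear."""
    X = rng.uniform(0, 5, size=(n, 1))
    y = 2.0 * X[:, 0] + rng.normal(0, 1, size=n)
    return X, y

X_test, y_test = draw_dataset(200)   # one fixed testing set

# Refit each model on many different training sets and watch how much its testing MSE varies.
for degree, label in [(1, "linear (low variance)"), (12, "degree-12 (high variance)")]:
    test_mses = []
    for _ in range(20):
        X_tr, y_tr = draw_dataset()
        fit = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
        test_mses.append(mean_squared_error(y_test, fit.predict(X_test)))
    print(f"{label}: spread of testing MSE (std) = {np.std(test_mses):.2f}")
```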
Quick Bias vs Variance
Bias: The difference in MSE between the model fit and the true relationship. Concerned with the accuracy of the model.
Variance: The difference in MSE of the model fit across different data sets. Concerned with consistency of MSE across predictions.
Overfitting
Overfitting is when a highly flexible model (e.g. quadratic) is chosen to fit training data whose true relationship is not very flexible (e.g. linear). This results in low bias, high variance.
Underfitting
Underfitting is when a low-flexibility model (e.g. linear or a low-degree polynomial) is chosen to fit data whose true relationship is highly flexible. This results in high bias, low variance.
Classification
When Y is a categorical variable, we must use classification techniques. Mean squared error no longer applies, so we are concerned with error rates.
The error rate is the proportion of observations our model classifies incorrectly. We are more interested in the error rate on the testing set than on the training set.
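Computing it is a one-liner, assuming NumPy arrays of true and predicted labels (the labels below are made up):

```python
import numpy as np

y_true = np.array(["cat", "dog", "dog", "cat", "dog"])
y_pred = np.array(["cat", "dog", "cat", "cat", "cat"])

# Error rate: the fraction of observations classified incorrectly.
error_rate = np.mean(y_true != y_pred)
print(error_rate)   # 0.4
```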
Bayes Classifier
The Bayes Classifier is the true relationship of the data when the response variable is categorical.
It is the f that we are attempting to estimate, and has no reducible error, only irreducible error.
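In standard notation (a known fact, not spelled out above): the Bayes classifier assigns an observation with predictor value x0 to the class j for which the conditional probability Pr(Y = j | X = x0) is largest, and its (irreducible) error rate is called the Bayes error rate.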
K-nearest Neighbors
This is a simple, non-parametric (no assumptions on underlying data) and lazy (minimal or no training phase) classification algorithm that attempts to estimate the Bayes classifier.
When a new data point is added to a data set, the algorithm looks at the K nearest data points around the new point. The majority class among those K neighbors wins, and the new point is predicted to be that class.
KNN predicts discrete values in classification.
It can also be used for regression, by finding the K nearest neighbors of a new data point and outputting the average of those neighbors' (numeric) response values.
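A minimal sketch of both uses with scikit-learn; the toy data and the choice K = 3 are invented for the example:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Toy training data: two predictors, one categorical response, one numeric response.
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y_class = np.array(["red", "red", "red", "blue", "blue", "blue"])
y_value = np.array([1.0, 1.2, 0.9, 5.8, 6.1, 6.0])

new_point = [[2, 2]]

# Classification: the majority class among the K = 3 nearest neighbors wins.
knn_clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_class)
print(knn_clf.predict(new_point))   # ['red']

# Regression: output the average response value of the K = 3 nearest neighbors.
knn_reg = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_value)
print(knn_reg.predict(new_point))   # ~[1.03]
```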
Regression
Regression (analysis) is a SET of statistical methods used to estimate the relationship between a response variable and one or more predictor variables. More specifically, regression estimates the average value of the response Y as one predictor varies and all other predictors are held constant.
It is primarily used for prediction or forecasting, but it can also show which predictors have the greatest influence on the response variable, as well as characterize the probability distribution of the response.
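A minimal sketch of a multiple linear regression fit, assuming scikit-learn; the data are simulated and the coefficient values are chosen only for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two predictors; each fitted coefficient estimates the average change in Y when that
# predictor increases by one unit while the other predictor is held constant.
rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(200, 2))
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1, size=200)

model = LinearRegression().fit(X, y)
print("intercept:", round(model.intercept_, 2))    # close to 4.0
print("coefficients:", np.round(model.coef_, 2))   # close to [3.0, -2.0]
```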