MT1 Flashcards

1
Q

Prediction

A

Given input X, we are interested in predicting the output, Y.

Complicated models are good at prediction, but hard to understand.

100% Prediction: We care more about prediction accuracy, and will sacrifice interpretability for that.

2
Q

Inference

A

Given input X, we are interested in understanding its relationship with Y.

100% Inference: We care more about interpretability, and will sacrifice accuracy for that.

3
Q

Estimating f

A
  1. Gather data from a subset of the population of interest (through experimentation, observation, etc.), because it is (usually) impossible to sample the entire true population.

We now have a set of TRAINING data where predictor X and response Y are BOTH known. The true relationship f between X and Y will never be known, but we want to get as close as possible.

  2. We want to predict what future unknown Y values will be, based on given X values.
  3. Using the gathered data, we can try out different models to see which minimizes the residual error, refine the fit through testing, and use that model to predict future values.
  4. We can split the original data set into training and testing sets, and evaluate the chosen model on the testing set.
4
Q

Parameters

A

Quantities such as the mean, standard deviation, and proportions are important values called the "parameters" of the TRUE population.

Since we will never know these true parameters, we calculate estimates of them from the sample data (subset) taken from the population. These estimates are called "statistics".

Statistics are estimates of the parameters.
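A minimal sketch of the idea in Python (the population, sample size, and numbers are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# The TRUE population (unknowable in practice) with true parameters mu=170, sigma=10
population = rng.normal(loc=170, scale=10, size=1_000_000)

# The data we actually gather: a sample (subset) of the population
sample = rng.choice(population, size=100, replace=False)

# Statistics: estimates of the true parameters, computed from the sample
print("sample mean (estimates mu):   ", sample.mean())
print("sample std  (estimates sigma):", sample.std(ddof=1))
```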

5
Q

Parametric vs Non-parametric

A

Parametric: procedures rely on assumptions about the shape of the distribution of the underlying population from which the sample was taken. The most common assumption is that the population is normally distributed. Generally better at inference.

Non-parametric: procedures make no assumptions about the underlying population; the model structure is determined by the data. Generally better at prediction.

CAVEAT: "Connect the dots" is a perfect non-parametric fit to the training data, but a horrible predictor.

6
Q

Response variable

A

Response variable Y will generally be either categorical (color, shape, etc.) or numeric.

7
Q

MSE

A

MEAN squared error: the squared distance from each response value Y in the training data to the predicted response value (on the prediction line) at the given X value, averaged over all observations.

We want to find a line that minimizes the MSE for FUTURE predictions. THIS IS WHAT MAKES A GOOD MODEL!!!!!! Minimize the mean squared error for FUTURE observations.
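As a formula:

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{f}(x_i)\big)^2
\]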

8
Q

Overfitting

A

Adding flexibility to the model (e.g. moving from linear to quadratic regression) will always decrease the MSE on the training data, but not necessarily the TESTING MSE.

e.g. "Connect the dots" fits the training data perfectly (0 MSE) but does horribly on future observations.

9
Q

Irreducible error

A

The inherent natural variability of the true population of interest.

10
Q

Error due to squared bias

A

This is a REducible error.

The inability of a statistical method to capture the TRUE relationship in the data.

If the average of a model's predictions across different testing data sets is substantially different from the TRUE response values, that model is said to have high bias.

e.g. If we fit a linear model to data whose true relationship is quadratic, it will have a higher MSE: it has high bias.

11
Q

Error due to variance

A

This is a REducible error.

The amount by which the MSE of a model fit varies across data sets.

  1. We have a set of training data.
  2. We choose a statistical method and apply it to that training data, which generates a model fit representing a relationship (hopefully the true relationship), and a resulting MSE for that fit.
  3. We then apply that model to a new set of testing data, which results in a new MSE for the predictions.
  4. The difference between the training MSE and the testing MSE is called the variance.
  5. If the MSE difference is very high, the model has high variance; if the MSE difference is very low, the model has low variance.

Variance is only concerned with how much the MSE of our chosen model fit varies between different data sets, NOT with how accurate its predictions are.

If we fit a highly flexible (e.g. high-degree polynomial) model to data whose true relationship is linear or close to linear (not flexible), it will fit the training data very well: the prediction line will go through, or be very close to, the true response values (MSE ~0, aka low bias). But once we apply that model to new data sets, sometimes its predictions may be good (low MSE), and sometimes the true response values will not fall close to that line anymore (since the line was so specific to the training data), resulting in a much larger MSE. The MSE will vary a lot, meaning it is hard to predict how well this model will fit future data sets.
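A small simulation sketch of this idea (the data-generating process, noise level, and polynomial degrees are all invented for illustration): fit a rigid and a flexible model to many training sets drawn from the same linear truth, then compare how much their test MSE varies.

```python
import numpy as np

rng = np.random.default_rng(1)
x_test = np.linspace(0, 1, 50)
y_test_true = 2 * x_test + 1                       # the TRUE relationship is linear

def test_mses(degree, n_sets=200):
    """Fit a polynomial of the given degree to many fresh training sets; return test MSEs."""
    mses = []
    for _ in range(n_sets):
        x = rng.uniform(0, 1, 20)
        y = 2 * x + 1 + rng.normal(0, 0.3, 20)     # training data = truth + noise
        coefs = np.polyfit(x, y, degree)
        preds = np.polyval(coefs, x_test)
        y_test = y_test_true + rng.normal(0, 0.3, 50)
        mses.append(np.mean((preds - y_test) ** 2))
    return np.array(mses)

for d in (1, 9):                                   # rigid vs flexible
    m = test_mses(d)
    print(f"degree {d}: mean test MSE {m.mean():.3f}, spread across sets {m.std():.3f}")
```

The degree-9 fit shows a much larger spread in test MSE across data sets: high variance.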

12
Q

Quick Bias vs Variance

A

Bias: The difference in MSE between the model fit and the true relationship. Concerned with the accuracy of the model.

Variance: The difference in MSE of the model fit across different data sets. Concerned with the consistency of the MSE across predictions.
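These two reducible errors combine with the irreducible error in the standard decomposition of the expected test MSE at a point x0:

\[
E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon)
\]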

13
Q

Overfitting

A

Overfitting is when a highly flexible model (e.g. quadratic) is chosen to fit training data whose true relationship is not very flexible (e.g. linear). This results in low bias, high variance.

14
Q

Underfitting

A

Underfitting is when a low-flexibility model (e.g. linear or low-degree polynomial) is chosen to fit data whose true relationship is highly flexible. This results in high bias, low variance.

15
Q

Classification

A

When Y is a categorical variable, we must use classification techniques. Mean squared error no longer applies, so we are concerned with error rates.

The error rate is the proportion of observations our model incorrectly classifies. We are more interested in the error rate of the testing set, rather than the training set.
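As a formula, with I an indicator that is 1 when the prediction is wrong:

\[
\text{Error rate} = \frac{1}{n}\sum_{i=1}^{n} I\big(y_i \neq \hat{y}_i\big)
\]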

16
Q

Bayes Classifier

A

The Bayes Classifier is the true relationship of the data when the response variable is categorical: it assigns each observation to its most probable class given X.

It is the f that we are attempting to estimate, and has no reducible error, only irreducible error.

17
Q

K-nearest Neighbors

A

This is a simple, non-parametric (no assumptions on underlying data) and lazy (minimal or no training phase) classification algorithm that attempts to estimate the Bayes classifier.

When a new data point is added to a data set, the algorithm looks at the K nearest data points around the new point. The majority class among those K neighbors wins, and the new point is predicted as that class.

KNN predicts discrete values in classification.

It can also be used for regression, by finding the K nearest neighbors of a new data point and outputting the average response of those K points.
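A minimal KNN classifier sketch in plain numpy (the function name and data are invented for illustration, not taken from any particular library):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new as the majority class among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["red", "red", "blue", "blue"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "red"
```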

18
Q

Regression

A

Regression (analysis) is a SET of statistical methods used to estimate the relationship between a response variable and one or more predictor variables. More specifically, regression estimates the average value of the response Y when one predictor varies and all other predictors are held constant.

It is primarily used for prediction or forecasting, but can also show which predictors have the greatest influence on the response variable, and can characterize the distribution of the response at given predictor values.

19
Q

KNN K=1

A

Each data point's closest neighbor is itself, which results in overfitting. The training data will have ZERO misclassifications because the classification boundary will separate all classes from each other (low bias), but the misclassification rate on testing data will vary widely (high variance).

Small values of K tend to favor the classes in the point's immediate area.

20
Q

KNN K=n

A

The majority class from the training data will always be predicted, because there won’t be any classification boundaries.

All testing data will be predicted as the majority class from the training data (High bias), but we will always know what the prediction will be (low variance).

Large values of K tend to favor majority classes.

21
Q

P-value

A

The p-value is a measure of the strength of evidence against the null hypothesis.

It is the probability of observing a result at least as extreme as the one observed, if the NULL hypothesis were true.

Coin Flip:
Ho = fair coin
Ha = two-tailed coin.

We get 6T in a row, which has about a 1.6% probability of happening with a fair coin. That is very rare! The p-value is about 0.016.

For this example, it is so unlikely to have happened randomly that we DON'T THINK IT WAS RANDOM. Reject the NULL.
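The arithmetic behind that number:

\[
P(6\text{T in a row} \mid \text{fair coin}) = \left(\tfrac{1}{2}\right)^{6} = \tfrac{1}{64} \approx 0.016
\]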

22
Q

Linear (and multi) regression assumption

A

1–There exists some approximately linear relationship between Y and X (no other kind of relationship).
2–The distribution of the errors has constant variance.
3–The errors are normally distributed.
4–The errors are independent of each other.
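Assumptions 2-4 are statements about the error term in the usual model:

\[
Y = \beta_0 + \beta_1 X + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2), \text{ independent}
\]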

23
Q

Hypothesis testing steps.

A

1– Choose the null and alternative hypotheses.
2– Decide on assumptions, e.g. the significance level.
3– Compute the test statistic and/or p-value.
4– Decision: reject or fail to reject the null hypothesis.
5– Interpretation back in the original context. What to do with the decision? Plain English.

24
Q

How to interpret ^B2

A

This is the estimated slope of the second predictor variable in a multiple linear regression (more than one predictor variable).
It is interpreted as the average increase/decrease in the response variable per one-unit increase in that predictor, holding all other predictors constant.
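In symbols, for a fitted model with p predictors:

\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\]

so B̂2 is the average change in ŷ per one-unit increase in x2, with the other x's held fixed.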

25
Q

Testing multiple reg model ‘usefulness’

A

Hypotheses:
Ho: B1=B2=…=Bp = 0 (all true slopes = 0)
Ha: Not all Bj are equal to 0 (at least one true slope is nonzero).

Same assumptions as linear regression.
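The test uses the F-statistic (TSS = total sum of squares, RSS = residual sum of squares, n observations, p predictors):

\[
F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}
\]

Values of F much larger than 1 are evidence against Ho.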

26
Q

Forward Selection

A

A method of variable selection: Start with only the intercept and no predictors. Fit a simple linear regression for each predictor, and add the one with the lowest RSS to the model. Repeat, adding one variable at a time, until our chosen 'best model' criterion stops improving.
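A sketch of one forward-selection step in Python (the helper names are invented, and the stopping criterion is left to the caller):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rss(model, X, y):
    """Residual sum of squares of a fitted model."""
    resid = y - model.predict(X)
    return float(resid @ resid)

def forward_step(X, y, selected):
    """Try adding each unused predictor; return the index giving the lowest RSS."""
    best_rss, best_j = np.inf, None
    for j in range(X.shape[1]):
        if j in selected:
            continue
        cols = selected + [j]
        model = LinearRegression().fit(X[:, cols], y)
        r = rss(model, X[:, cols], y)
        if r < best_rss:
            best_rss, best_j = r, j
    return best_j, best_rss

# Example: the first step should pick the one truly informative predictor
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 3 * X[:, 2] + rng.normal(size=50)
print(forward_step(X, y, selected=[]))  # expect index 2
```

You would call forward_step repeatedly, stopping when your chosen criterion (adjusted R², AIC, CV error, etc.) stops improving.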

27
Q

Backward Selection

A

Start with all the predictors. Remove the predictor with the largest p-value, refit, and continue until all remaining predictors are considered significant.

28
Q

Mixed selection

A

Start with no predictors. Proceed as in forward selection, but after each addition remove any variable whose p-value has risen past some threshold. Continue until all variables in the model are significant.

29
Q

Linear to Logistic

A
Coding the categories of a response variable numerically and feeding them into a continuous model assumes a natural order, or hierarchy, to the response variable.
For multiclass categorical data, this is super inappropriate.
30
Q

Logistic regression

A

LR predicts whether something is true or false, rather than a continuous value. It fits an S-shaped curve from 0 to 1: responses > .5 will be classified as 1, and responses ≤ .5 as 0.
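The S shape is the logistic function, which maps any linear combination of the predictors into (0, 1):

\[
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
\]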

31
Q

R^2

A

A measure of how much of the variance in the response data is explained by a given predictor.

Mouse weight: We have the weights of mice, and we want to find a predictor that explains the variation in the mouse weights (i.e. what is a good predictor of mouse weight?). We calculate the variation around the mean line of the mouse weights, and then the variation around our fitted regression line.

Using size as a predictor, the R^2 is .81, meaning that the size of the mouse accounts for 81% of the variability in the data.
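As a formula, in the variation-around-the-line terms used in this example:

\[
R^2 = \frac{\mathrm{SS(mean)} - \mathrm{SS(fit)}}{\mathrm{SS(mean)}}
\]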

32
Q

Discriminant Analysis

A

DA focuses on maximizing the separability among known categories by reducing dimensionality.

33
Q

Discriminant

A

A characteristic that enables classes or categories to be distinguished from each other.

34
Q

Linear Discriminant Analysis

A

LDA maximizes the separability of known categories by creating a new axis, so the dimensions can be reduced.

If we just have one variable predicting the categories of interest, then it is a number line with the categories spread across it, and we look for a value that maximizes the separation of the categories.

We can do better if we add another predicting variable, but now it is in 2D. We can't just ignore one variable and project onto the other's axis, because we'd lose that info.

LDA creates an axis (a line through the data) at whichever angle maximizes the separation, and projects all the data onto it. Now we have a number line with better separation.

The goal is to maximize the distance between the category means while minimizing the scatter (the spread within each category); imagine squishing the data points together until maximum separation.

For 3 or more categories, find the centroid of all the data, and maximize each category mean's distance from that centroid, while minimizing the scatter within each category.
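A short sketch with scikit-learn's LinearDiscriminantAnalysis (the two-class 2D data here is made up):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)),   # class A
               rng.normal([3, 3], 1, (50, 2))])  # class B
y = np.array(["A"] * 50 + ["B"] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)  # 2D points projected onto the discriminant axis
print(X_1d.shape)               # (100, 1): the "number line with better separation"
```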

35
Q

Logloss

A

A misclassification measure for probabilistic classifiers: it penalizes confident wrong predictions by taking the negative log-likelihood of the true labels under the predicted probabilities.
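For binary classification, with p_i the predicted probability that observation i is class 1:

\[
\mathrm{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\Big]
\]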

36
Q

Euclidean Distance

A

“As the crow flies” straight-line distance. Problematic for unsupervised models.

This is because: if we are interested in the heights and annual salaries of people, a $61 change is minuscule, but a 61 cm change in height is HUGE, and it is misleading to call such points on the graph “equidistant”.

The data needs to be standardized: scale the data to have mean = 0 and variance = 1, so that the variables are scaled appropriately.
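A quick sketch of standardization with scikit-learn (the height/salary numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# columns: height (cm), annual salary ($) -- wildly different scales
X = np.array([[160.0, 40_000.0],
              [175.0, 55_000.0],
              [180.0, 90_000.0],
              [165.0, 40_061.0]])

# Raw Euclidean distance would treat a $61 gap the same as a 61 cm gap.
X_std = StandardScaler().fit_transform(X)  # each column: mean 0, variance 1
print(X_std.round(2))
# Distances are now measured relative to each variable's own spread,
# so a $61 salary gap is (correctly) tiny and a 15 cm height gap is large.
```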

37
Q

Manhattan/City block distance

A

Measures distance by only moving along the axes.

38
Q

Mahalanobis distance

A

MD takes into account the covariance structure of the data.

It is a standardized Euclidean distance that accounts for the correlations between variables, not just their scales.
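As a formula, with S the sample covariance matrix:

\[
d_M(x, y) = \sqrt{(x - y)^{\top} S^{-1} (x - y)}
\]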

39
Q

Binary distance

A

Ignore 0-0 matches; count the 0-1 and 1-0 mismatches for the numerator,

and the 0-1, 1-0, and 1-1 pairs for the denominator.
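As a formula, where n_ab counts the positions at which one observation has value a and the other has value b:

\[
d = \frac{n_{01} + n_{10}}{n_{01} + n_{10} + n_{11}}
\]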

40
Q

Gowers distance

A

Since data often has a mix of variable types, Gower's distance computes a pairwise distance for each variable separately.

Ensure that each variable's distance is standardized between 0 and 1, then sum them up!

41
Q

Clustering

A

Clustering is a form of unsupervised learning. Its goal is to find groups in which observations are more similar to observations in their own group, and more dissimilar to observations in other groups.

42
Q

Hierarchical Clustering

A
1–Start with every observation in its own group (n groups).
2–Join the 2 closest observations/groups (now n-1 groups).
3–Recalculate the distances.
4–Repeat steps 2 and 3 until there is only one group.

Answers the question "what would k groups look like?", but doesn't tell us HOW MANY groups are in the data or what they look like.

Cons: n x n distance matrices must be calculated, which is very computationally time-consuming for large samples. Sensitive to the choice of distance and linkage type.
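A sketch with scipy's hierarchical-clustering routines (random two-blob data for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(5, 1, (20, 2))])

Z = linkage(X, method="single")                   # repeatedly merge the closest groups
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 groups
print(labels)
```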

43
Q

Types of linkage

A

Used for calculating distances in hierarchical clustering when a group consists of more than one observation.

Single linkage: distance between the closest members of the two groups. (Complete linkage uses the farthest members; average linkage uses the average of all pairwise distances.)

44
Q

K-means clustering

A

The user specifies the number of groups (k) they are looking for.

1–Randomly select k points in the data; these are the initial centroids.
2–Assign every observation the class of its nearest centroid. We now have k groups.
3–Calculate the mean of each group. These are the new centroids.
4–Repeat steps 2 and 3 until nothing changes.

This works by iteratively minimizing the within-group sum of squared distances between observations and their centroid.

Pros: Computationally efficient on large data sets. –Only n x k distance matrices are needed. –Often provides clearer groups than hierarchical clustering.

Cons: Since the initial centroids are randomly selected, results can be different each time (a local optimum rather than the global one).
–Groups are found NO MATTER WHAT. There might be no groups at all, but it still finds some.
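A sketch of the same idea with scikit-learn's KMeans (note that n_init re-runs the random initialization several times, which mitigates the local-optimum con):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(5, 1, (20, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)    # group assignment for every observation
print(km.inertia_)   # the within-group sum of squared distances being minimized
```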

45
Q

Cross validation

A

Regression: The main goal is to minimize the model's mean squared error (MSE). CV estimates the LONG RUN MSE.

Classification:
minimize/balance classification errors.

The most obvious option is to randomly split our data into training/testing. But what %? We want to both fit and test our model on as much data as possible.

CV is used as a systematic approach for selecting among possible models. Very important!!!

46
Q

LOOCV

A

Leave-one-out cross validation is a systematic way to create multiple validation sets.

Create n training sets of size n-1, where each set has one observation removed. This also leaves us with n validation sets of size 1.

We then predict the ith left-out value using the ith model.

Pros: Less bias in estimating the long run error (doesn't overestimate as much). –Assuming a deterministic model fit, the LOOCV error estimate is deterministic (never changes).
Cons: Requires n model fits, a problem for large n.

47
Q

K Fold

A

Randomly subdivide the data into K equal, non-overlapping sets. Each set in turn serves as the validation/testing set, with the rest as the training set. Then calculate the MSE, averaged across the K folds.

Pros: (k=5 or 10 are common choices) Less model fitting than LOOCV (k fits vs n). –Less variance.
Cons: More bias than LOOCV (smaller sample size at each model fit).
–Non-deterministic estimate, since the validation sets are randomly selected; different results each time (bias/variance trade-off).

LOOCV is deterministic because every single observation will be removed for exactly one set, every time. Not random.
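A sketch of both schemes with scikit-learn (the model and data are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (100, 1))
y = 2 * X[:, 0] + rng.normal(0, 0.3, 100)

model = LinearRegression()
kfold = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True),
                        scoring="neg_mean_squared_error")
loocv = cross_val_score(model, X, y, cv=LeaveOneOut(),
                        scoring="neg_mean_squared_error")
print("5-fold MSE:", -kfold.mean())  # k model fits; randomness from the split
print("LOOCV MSE: ", -loocv.mean())  # n model fits; deterministic
```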