MT1 Flashcards

1
Q

Prediction

A

Given input X, we are interested in predicting the output, Y.

Complicated models are good at prediction, but hard to understand.

100% Prediction: We care more about prediction accuracy, and will sacrifice interpretability for that.

2
Q

Inference

A

Given input X, we are interested in understanding its relationship with Y.

100% Inference: We care more about interpretability, and will sacrifice accuracy for that.

3
Q

Estimating f

A
  1. Gather data from a subset of the population of interest (through experimentation, observation, etc.), because it is (usually) impossible to sample the entire true population.

We now have a set of TRAINING data where predictor X and response Y are BOTH known. The true relationship f between X and Y will never be known, but we want to get as close as possible.

  2. We want to predict what future unknown Y values will be, based on given X values.
  3. Using the gathered data, we can try out different models to see which minimizes the residual error, refine the fit through testing, and use that model to predict future values.
  4. We can split the original data set into training and testing sets, and evaluate the chosen model on the testing set.
4
Q

Parameters

A

Quantities such as the mean, standard deviation, and proportions are important values called the "parameters" of the TRUE population.

Since we will never know these true parameters, we calculate estimates of them from the sample data (subset) taken from the population. These estimates are called "statistics".

Statistics are estimates of the parameters.
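A minimal sketch of the idea in Python (the population, sample size, and numbers are all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# The TRUE population (unknowable in practice) with true parameters mu=170, sigma=10
population = rng.normal(loc=170, scale=10, size=1_000_000)

# The data we actually gather: a sample (subset) of the population
sample = rng.choice(population, size=100, replace=False)

# Statistics: estimates of the true parameters, computed from the sample
print("sample mean (estimates mu):   ", sample.mean())
print("sample std  (estimates sigma):", sample.std(ddof=1))
```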

5
Q

Parametric vs Non-parametric

A

Parametric: procedures rely on assumptions about the shape of the distribution of the underlying population from which the sample was taken. The most common assumption is that the population is normally distributed. Generally better at inference.

Non-parametric: procedures make no assumptions about the underlying population; the model structure is determined by the data. Generally better at prediction.

CAVEAT: "Connect the dots" is a perfect non-parametric fit to the training data, but a horrible predictor.

6
Q

Response variable

A

Response variable Y will generally be either categorical (color, shape, etc.) or numeric.

7
Q

MSE

A

MEAN squared error: the squared distance from each response value Y in the training data to the predicted response value (on the prediction line) at the given X value, averaged over all observations.

We want to find a line that minimizes the MSE for FUTURE predictions. THIS IS WHAT MAKES A GOOD MODEL!!!!!! Minimize the mean squared error for FUTURE observations.
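As a formula:

\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - \hat{f}(x_i)\big)^2
\]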

8
Q

Overfitting

A

Adding flexibility to the model (e.g. moving from linear to quadratic regression) will always decrease the MSE on the training data, but not necessarily the TESTING MSE.

e.g. "Connect the dots" fits the training data perfectly (0 MSE) but does horribly on future observations.

9
Q

Irreducible error

A

The inherent natural variability of the true population of interest.

10
Q

Error due to squared bias

A

This is a REducible error.

The inability of a statistical method to capture the TRUE relationship in the data.

If the average of a model's predictions across different testing data sets is substantially different from the TRUE response values, that model is said to have high bias.

e.g. If we fit a linear model to data whose true relationship is quadratic, it will have a higher MSE: it has high bias.

11
Q

Error due to variance

A

This is a REducible error.

The amount by which the MSE of a model fit varies across data sets.

  1. We have a set of training data.
  2. We choose a statistical method and apply it to that training data, which generates a model fit representing a relationship (hopefully the true relationship), and a resulting MSE for that fit.
  3. We then apply that model to a new set of testing data, which results in a new MSE for the predictions.
  4. The difference between the training MSE and the testing MSE is called the variance.
  5. If the MSE difference is very high, the model has high variance; if the MSE difference is very low, the model has low variance.

Variance is only concerned with how much the MSE of our chosen model fit varies between different data sets, NOT with how accurate its predictions are.

If we fit a highly flexible (e.g. high-degree polynomial) model to data whose true relationship is linear or close to linear (not flexible), it will fit the training data very well: the prediction line will go through, or be very close to, the true response values (MSE ~0, aka low bias). But once we apply that model to new data sets, sometimes its predictions may be good (low MSE), and sometimes the true response values will not fall close to that line anymore (since the line was so specific to the training data), resulting in a much larger MSE. The MSE will vary a lot, meaning it is hard to predict how well this model will fit future data sets.
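A small simulation sketch of this idea (the data-generating process, noise level, and polynomial degrees are all invented for illustration): fit a rigid and a flexible model to many training sets drawn from the same linear truth, then compare how much their test MSE varies.

```python
import numpy as np

rng = np.random.default_rng(1)
x_test = np.linspace(0, 1, 50)
y_test_true = 2 * x_test + 1                       # the TRUE relationship is linear

def test_mses(degree, n_sets=200):
    """Fit a polynomial of the given degree to many fresh training sets; return test MSEs."""
    mses = []
    for _ in range(n_sets):
        x = rng.uniform(0, 1, 20)
        y = 2 * x + 1 + rng.normal(0, 0.3, 20)     # training data = truth + noise
        coefs = np.polyfit(x, y, degree)
        preds = np.polyval(coefs, x_test)
        y_test = y_test_true + rng.normal(0, 0.3, 50)
        mses.append(np.mean((preds - y_test) ** 2))
    return np.array(mses)

for d in (1, 9):                                   # rigid vs flexible
    m = test_mses(d)
    print(f"degree {d}: mean test MSE {m.mean():.3f}, spread across sets {m.std():.3f}")
```

The degree-9 fit shows a much larger spread in test MSE across data sets: high variance.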

12
Q

Quick Bias vs Variance

A

Bias: The difference in MSE between the model fit and the true relationship. Concerned with the accuracy of the model.

Variance: The difference in MSE of the model fit across different data sets. Concerned with the consistency of the MSE across predictions.
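These two reducible errors combine with the irreducible error in the standard decomposition of the expected test MSE at a point x0:

\[
E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon)
\]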

13
Q

Overfitting

A

Overfitting is when a highly flexible model (e.g. quadratic) is chosen to fit training data whose true relationship is not very flexible (e.g. linear). This results in low bias, high variance.

14
Q

Underfitting

A

Underfitting is when a low-flexibility model (e.g. linear or low-degree polynomial) is chosen to fit data whose true relationship is highly flexible. This results in high bias, low variance.

15
Q

Classification

A

When Y is a categorical variable, we must use classification techniques. Mean squared error no longer applies, so we are concerned with error rates.

The error rate is the proportion of observations our model incorrectly classifies. We are more interested in the error rate of the testing set, rather than the training set.
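As a formula, with I an indicator that is 1 when the prediction is wrong:

\[
\text{Error rate} = \frac{1}{n}\sum_{i=1}^{n} I\big(y_i \neq \hat{y}_i\big)
\]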

16
Q

Bayes Classifier

A

The Bayes Classifier is the true relationship of the data when the response variable is categorical: it assigns each observation to its most probable class given X.

It is the f that we are attempting to estimate, and has no reducible error, only irreducible error.

17
Q

K-nearest Neighbors

A

This is a simple, non-parametric (no assumptions on underlying data) and lazy (minimal or no training phase) classification algorithm that attempts to estimate the Bayes classifier.

When a new data point is added to a data set, the algorithm looks at the K nearest data points around the new point. The majority class among those K neighbors wins, and the new point is predicted as that class.

KNN predicts discrete values in classification.

It can also be used for regression, by finding the K nearest neighbors of a new data point and outputting the average response of those K points.
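A minimal KNN classifier sketch in plain numpy (the function name and data are invented for illustration, not taken from any particular library):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new as the majority class among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["red", "red", "blue", "blue"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "red"
```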

18
Q

Regression

A

Regression (analysis) is a SET of statistical methods used to estimate the relationship between a response variable and one or more predictor variables. More specifically, regression estimates the average value of the response Y when one predictor varies and all other predictors are held constant.

It is primarily used for prediction or forecasting, but can also show which predictors have the greatest influence on the response variable, and can characterize the distribution of the response at given predictor values.

19
Q

KNN K=1

A

Each data point's closest neighbor is itself, which results in overfitting. The training data will have ZERO misclassifications because the classification boundary will separate all classes from each other (low bias), but the misclassification rate on testing data will vary widely (high variance).

Small values of K tend to favor the classes in the point's immediate area.

20
Q

KNN K=n

A

The majority class from the training data will always be predicted, because there won’t be any classification boundaries.

All testing data will be predicted as the majority class from the training data (High bias), but we will always know what the prediction will be (low variance).

Large values of K tend to favor majority classes.

21
Q

P-value

A

The p-value is a measure of the strength of evidence against the null hypothesis.

It is the probability of observing a result at least as extreme as the one observed, if the NULL hypothesis were true.

Coin Flip:
Ho = fair coin
Ha = two-tailed coin.

We get 6T in a row, which has about a 1.6% probability of happening with a fair coin. That is very rare! The p-value is about 0.016.

For this example, it is so unlikely to have happened randomly that we DON'T THINK IT WAS RANDOM. Reject the NULL.
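The arithmetic behind that number:

\[
P(6\text{T in a row} \mid \text{fair coin}) = \left(\tfrac{1}{2}\right)^{6} = \tfrac{1}{64} \approx 0.016
\]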

22
Q

Linear (and multi) regression assumption

A

1–There exists some approximately linear relationship between Y and X (no other kind of relationship).
2–The distribution of the errors has constant variance.
3–The errors are normally distributed.
4–The errors are independent of each other.
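Assumptions 2-4 are statements about the error term in the usual model:

\[
Y = \beta_0 + \beta_1 X + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2), \text{ independent}
\]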

23
Q

Hypothesis testing steps.

A

1– Choose the null and alternative hypotheses.
2– Decide on assumptions, e.g. the significance level.
3– Compute the test statistic and/or p-value.
4– Decision: reject or fail to reject the null hypothesis.
5– Interpretation back in the original context. What to do with the decision? Plain English.

24
Q

How to interpret ^B2

A

This is the estimated slope of the second predictor variable in a multiple linear regression (more than one predictor variable).
It is interpreted as the average increase/decrease in the response variable per one-unit increase in that predictor, holding all other predictors constant.
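In symbols, for a fitted model with p predictors:

\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p
\]

so B̂2 is the average change in ŷ per one-unit increase in x2, with the other x's held fixed.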

25
Q

Testing multiple reg model ‘usefulness’

A

Hypotheses:
Ho: B1=B2=…=Bp = 0 (all true slopes = 0)
Ha: Not all Bj are equal to 0 (at least one true slope is nonzero).

Same assumptions as linear regression.
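The test uses the F-statistic (TSS = total sum of squares, RSS = residual sum of squares, n observations, p predictors):

\[
F = \frac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}
\]

Values of F much larger than 1 are evidence against Ho.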

26
Q

Forward Selection

A

A method of variable selection: Start with only the intercept and no predictors. Fit a simple linear regression for each predictor, and add the one with the lowest RSS to the model. Repeat, adding one variable at a time, until our chosen 'best model' criterion stops improving.
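A sketch of one forward-selection step in Python (the helper names are invented, and the stopping criterion is left to the caller):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rss(model, X, y):
    """Residual sum of squares of a fitted model."""
    resid = y - model.predict(X)
    return float(resid @ resid)

def forward_step(X, y, selected):
    """Try adding each unused predictor; return the index giving the lowest RSS."""
    best_rss, best_j = np.inf, None
    for j in range(X.shape[1]):
        if j in selected:
            continue
        cols = selected + [j]
        model = LinearRegression().fit(X[:, cols], y)
        r = rss(model, X[:, cols], y)
        if r < best_rss:
            best_rss, best_j = r, j
    return best_j, best_rss

# Example: the first step should pick the one truly informative predictor
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 3 * X[:, 2] + rng.normal(size=50)
print(forward_step(X, y, selected=[]))  # expect index 2
```

You would call forward_step repeatedly, stopping when your chosen criterion (adjusted R², AIC, CV error, etc.) stops improving.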

27
Q

Backward Selection

A

Start with all the predictors. Remove the predictor with the largest p-value, refit, and continue until all remaining predictors are considered significant.

28
Q

Mixed selection

A

Start with no predictors. Proceed as in forward selection, but after each addition remove any variable whose p-value has risen past some threshold. Continue until all variables in the model are significant.

29
Q

Linear to Logistic

A
Coding the categories of a response variable numerically and feeding them into a continuous model assumes a natural order, or hierarchy, to the response variable.
For multiclass categorical data, this is super inappropriate.
30
Q

Logistic regression

A

LR predicts whether something is true or false, rather than a continuous value. It fits an S-shaped curve from 0 to 1: responses > .5 will be classified as 1, and responses ≤ .5 as 0.
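The S shape is the logistic function, which maps any linear combination of the predictors into (0, 1):

\[
p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}
\]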

31
Q

R^2

A

A measure of how much of the variance in the response data is explained by a given predictor.

Mouse weight: We have the weights of mice, and we want to find a predictor that explains the variation in the mouse weights (i.e. what is a good predictor of mouse weight?). We calculate the variation around the mean line of the mouse weights, and then the variation around our fitted regression line.

Using size as a predictor, the R^2 is .81, meaning that the size of the mouse accounts for 81% of the variability in the data.
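As a formula, in the variation-around-the-line terms used in this example:

\[
R^2 = \frac{\mathrm{SS(mean)} - \mathrm{SS(fit)}}{\mathrm{SS(mean)}}
\]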

32
Q

Discriminant Analysis

A

DA focuses on maximizing the separability among known categories by reducing dimensionality.

33
Q

Discriminant

A

A characteristic that enables classes or categories to be distinguished from each other.

34
Q

Linear Discriminant Analysis

A

LDA maximizes the separability of known categories by creating a new axis, so the dimensions can be reduced.

If we just have one variable predicting the categories of interest, then it is a number line with the categories spread across it, and we look for a value that maximizes the separation of the categories.

We can do better if we add another predicting variable, but now it is in 2D. We can't just ignore one variable and project onto the other's axis, because we'd lose that info.

LDA creates an axis (a line through the data) at whichever angle maximizes the separation, and projects all the data onto it. Now we have a number line with better separation.

The goal is to maximize the distance between the category means while minimizing the scatter (the spread within each category); imagine squishing the data points together until maximum separation.

For 3 or more categories, find the centroid of all the data, and maximize each category mean's distance from that centroid, while minimizing the scatter within each category.
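A short sketch with scikit-learn's LinearDiscriminantAnalysis (the two-class 2D data here is made up):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)),   # class A
               rng.normal([3, 3], 1, (50, 2))])  # class B
y = np.array(["A"] * 50 + ["B"] * 50)

lda = LinearDiscriminantAnalysis(n_components=1)
X_1d = lda.fit_transform(X, y)  # 2D points projected onto the discriminant axis
print(X_1d.shape)               # (100, 1): the "number line with better separation"
```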

35
Q

Logloss

A

A misclassification measure for probabilistic classifiers: it penalizes confident wrong predictions by taking the negative log-likelihood of the true labels under the predicted probabilities.
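For binary classification, with p_i the predicted probability that observation i is class 1:

\[
\mathrm{LogLoss} = -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log p_i + (1 - y_i)\log(1 - p_i)\Big]
\]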

36
Q

Euclidean Distance

A

“As the crow flies” straight-line distance. Problematic for unsupervised models.

This is because: if we are interested in the heights and annual salaries of people, a $61 change is minuscule, but a 61 cm change in height is HUGE, and it is misleading to call such points on the graph “equidistant”.

The data needs to be standardized: scale the data to have mean = 0 and variance = 1, so that the variables are scaled appropriately.
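A quick sketch of standardization with scikit-learn (the height/salary numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# columns: height (cm), annual salary ($) -- wildly different scales
X = np.array([[160.0, 40_000.0],
              [175.0, 55_000.0],
              [180.0, 90_000.0],
              [165.0, 40_061.0]])

# Raw Euclidean distance would treat a $61 gap the same as a 61 cm gap.
X_std = StandardScaler().fit_transform(X)  # each column: mean 0, variance 1
print(X_std.round(2))
# Distances are now measured relative to each variable's own spread,
# so a $61 salary gap is (correctly) tiny and a 15 cm height gap is large.
```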

37
Q

Manhattan/City block distance

A

Measures distance by only moving along the axes.

38
Q

Mahalanobis distance

A

MD takes into account the covariance structure of the data.

It is a standardized Euclidean distance that accounts for the correlations between variables, not just their scales.
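As a formula, with S the sample covariance matrix:

\[
d_M(x, y) = \sqrt{(x - y)^{\top} S^{-1} (x - y)}
\]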

39
Q

Binary distance

A

Ignore 0-0 matches; count the 0-1 and 1-0 mismatches for the numerator,

and the 0-1, 1-0, and 1-1 pairs for the denominator.
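As a formula, where n_ab counts the positions at which one observation has value a and the other has value b:

\[
d = \frac{n_{01} + n_{10}}{n_{01} + n_{10} + n_{11}}
\]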

40
Q

Gowers distance

A

Since data often has a mix of variable types, Gower's distance computes a pairwise distance for each variable separately.

Ensure that each variable's distance is standardized between 0 and 1, then sum them up!

41
Q

Clustering

A

Clustering is a form of unsupervised learning. Its goal is to find groups in which observations are more similar to observations in their own group, and more dissimilar to observations in other groups.

42
Q

Hierarchical Clustering

A
1–Start with every observation in its own group (n groups).
2–Join the 2 closest observations/groups (now n-1 groups).
3–Recalculate the distances.
4–Repeat steps 2 and 3 until there is only one group.

Answers the question "what would k groups look like?", but doesn't tell us HOW MANY groups are in the data or what they look like.

Cons: n x n distance matrices must be calculated, which is very computationally time-consuming for large samples. Sensitive to the choice of distance and linkage type.
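A sketch with scipy's hierarchical-clustering routines (random two-blob data for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(5, 1, (20, 2))])

Z = linkage(X, method="single")                   # repeatedly merge the closest groups
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 groups
print(labels)
```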

43
Q

Types of linkage

A

Used for calculating distances in hierarchical clustering when a group consists of more than one observation.

Single linkage: distance between the closest members of the two groups. (Complete linkage uses the farthest members; average linkage uses the average of all pairwise distances.)

44
Q

K-means clustering

A

The user specifies the number of groups (k) they are looking for.

1–Randomly select k points in the data; these are the initial centroids.
2–Assign every observation the class of its nearest centroid. We now have k groups.
3–Calculate the mean of each group. These are the new centroids.
4–Repeat steps 2 and 3 until nothing changes.

This works by iteratively minimizing the within-group sum of squared distances between observations and their centroid.

Pros: Computationally efficient on large data sets. –Only n x k distance matrices are needed. –Often provides clearer groups than hierarchical clustering.

Cons: Since the initial centroids are randomly selected, results can be different each time (a local optimum rather than the global one).
–Groups are found NO MATTER WHAT. There might be no groups at all, but it still finds some.
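A sketch of the same idea with scikit-learn's KMeans (note that n_init re-runs the random initialization several times, which mitigates the local-optimum con):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),
               rng.normal(5, 1, (20, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)    # group assignment for every observation
print(km.inertia_)   # the within-group sum of squared distances being minimized
```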

45
Q

Cross validation

A

Regression: The main goal is to minimize the model's mean squared error (MSE). CV estimates the LONG RUN MSE.

Classification:
minimize/balance classification errors.

The most obvious option is to randomly split our data into training/testing. But what %? We want to both fit and test our model on as much data as possible.

CV is used as a systematic approach for selecting among possible models. Very important!!!

46
Q

LOOCV

A

Leave-one-out cross validation is a systematic way to create multiple validation sets.

Create n training sets of size n-1, where each set has one observation removed. This also leaves us with n validation sets of size 1.

We then predict the ith left-out value using the ith model.

Pros: Less bias in estimating the long run error (doesn't overestimate as much). –Assuming a deterministic model fit, the LOOCV error estimate is deterministic (never changes).
Cons: Requires n model fits, a problem for large n.

47
Q

K Fold

A

Randomly subdivide the data into K equal, non-overlapping sets. Each set in turn serves as the validation/testing set, with the rest as the training set. Then calculate the MSE, averaged across the K folds.

Pros: (k=5 or 10 are common choices) Less model fitting than LOOCV (k fits vs n). –Less variance.
Cons: More bias than LOOCV (smaller sample size at each model fit).
–Non-deterministic estimate, since the validation sets are randomly selected; different results each time (bias/variance trade-off).

LOOCV is deterministic because every single observation will be removed for exactly one set, every time. Not random.
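A sketch of both schemes with scikit-learn (the model and data are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (100, 1))
y = 2 * X[:, 0] + rng.normal(0, 0.3, 100)

model = LinearRegression()
kfold = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True),
                        scoring="neg_mean_squared_error")
loocv = cross_val_score(model, X, y, cv=LeaveOneOut(),
                        scoring="neg_mean_squared_error")
print("5-fold MSE:", -kfold.mean())  # k model fits; randomness from the split
print("LOOCV MSE: ", -loocv.mean())  # n model fits; deterministic
```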