Week 6: Regression and Classification Flashcards

Question

When selecting a level of smoothness for non-parametric methods, what is the trade-off?

Answer 1

Low levels of smoothness can lead to overfitting

Answer 2

f\* will not be a perfect estimate for f and will cause some error. This error is reducible because we can potentially improve the accuracy of f\*

Answer 3

Y = B0 + B1X1 + B2X2 + … + BpXp + e

Answer 4

Measures the proportion of variability in Y that can be explained using X. R^2 = (TSS-RSS)/TSS

Answer 5

TP/(TP+FP)

Answer 6

e: cannot be predicted using X, therefore the error introduced by e cannot be reduced

Answer 7

Compute the standard error of B0 and B1

Answer 8

2 variables: correlation matrix. Multiple variables: variance inflation factor (VIF). A value exceeding 5 or 10 is problematic
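A minimal R sketch of the VIF idea, assuming simulated predictors (x1, x2, x3 and the helper `vif_manual` are hypothetical names, not from the cards). It uses the standard definition VIF = 1 / (1 - R^2), where R^2 comes from regressing one predictor on the others.

```r
# Simulated data, invented for illustration: x2 is nearly a copy of x1.
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)
x3 <- rnorm(100)

# Hypothetical helper: regress one predictor on the rest and convert R^2 to a VIF.
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ others))$r.squared
  1 / (1 - r2)
}

vif_x1 <- vif_manual(x1, cbind(x2, x3))  # large, because x2 almost duplicates x1
vif_x3 <- vif_manual(x3, cbind(x1, x2))  # close to 1, x3 is unrelated to the others
```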

Answer 9

Change the threshold at which an observation is assigned to a class - default is 0.5
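As a rough illustration, the threshold can be treated as a parameter of the classification step. The probabilities, labels and the `classify` helper below are hypothetical, invented only to show the effect.

```r
# Hypothetical predicted probabilities and true labels.
pred_prob <- c(0.10, 0.35, 0.45, 0.60, 0.90)
truth     <- c("No", "No", "Yes", "Yes", "Yes")

# Assign "Yes" when the predicted probability exceeds the threshold.
classify <- function(p, threshold = 0.5) ifelse(p > threshold, "Yes", "No")

n_default <- sum(classify(pred_prob) == "Yes")                   # threshold 0.5
n_low     <- sum(classify(pred_prob, threshold = 0.3) == "Yes")  # more positives
n_high    <- sum(classify(pred_prob, threshold = 0.7) == "Yes")  # fewer positives

# Confusion matrix at the lower threshold
table(true = truth, predicted = classify(pred_prob, threshold = 0.3))
```

Lowering the threshold flags more observations as positive; raising it flags fewer.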

Answer 10

Given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0, represented by N0. It then estimates the conditional probability for class j as the fraction of points in N0 whose response values equal j. KNN classifies the test observation x0 to the class with the largest probability.
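The steps above can be sketched in base R. This is an illustrative sketch, not a library implementation: `knn_classify`, the toy training data and the use of Euclidean distance are all assumptions.

```r
# Hypothetical helper: classify x0 by majority vote among its K nearest neighbours.
knn_classify <- function(train_x, train_y, x0, k) {
  d     <- sqrt(rowSums(sweep(train_x, 2, x0)^2))  # Euclidean distance to each row
  n0    <- order(d)[seq_len(k)]                    # indices of the K closest points
  probs <- table(train_y[n0]) / k                  # estimated Pr(Y = j | X = x0)
  names(probs)[which.max(probs)]                   # class with the largest probability
}

# Toy training data, invented for illustration.
train_x <- matrix(c(1, 1,  1, 2,  2, 1,  6, 6,  6, 7,  7, 6), ncol = 2, byrow = TRUE)
train_y <- c("A", "A", "A", "B", "B", "B")

pred_near_a <- knn_classify(train_x, train_y, x0 = c(2, 2), k = 3)
pred_near_b <- knn_classify(train_x, train_y, x0 = c(6, 5), k = 3)
```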

Answer 11

Tend to overfit. Use them as a basic building block for ensembles

Answer 12

false positive rate vs true positive rate. The area under the curve gives the overall performance of a classifier

Answer 13

Hypothesis test. Compute the t-statistic from the coefficient estimate and its standard error. The p-value is the probability of observing a value equal to or larger than |t| in absolute value, assuming the coefficient is 0. A small p-value indicates that it is unlikely to observe such a substantial association between the predictor and the response due to chance. If the p-value is small, reject the hypothesis that the coefficient is 0
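A small sketch of that test for a simple linear regression, assuming hypothetical toy data (x and y are invented). It computes the t-statistic and two-sided p-value by hand and compares them with what `summary()` reports.

```r
# Toy data, invented for illustration.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.2, 3.8, 6.1, 8.3, 9.7, 12.2)

fit <- lm(y ~ x)
tab <- coef(summary(fit))  # columns: Estimate, Std. Error, t value, Pr(>|t|)

# t-statistic: estimate divided by its standard error
t_stat <- tab["x", "Estimate"] / tab["x", "Std. Error"]

# Two-sided p-value from a t-distribution with n - 2 degrees of freedom
p_val <- 2 * pt(-abs(t_stat), df = df.residual(fit))
```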

Answer 14

It is still a linear model

Answer 15

fewer true positives and fewer false positives (i.e. fewer positives overall)

Answer 16

Assigns an observation X=x to the class for which pk(x) is the largest (assigns each observation to the most likely class given predictor values). Produces the lowest possible test error rate, called the Bayes error rate

Answer 17

Residual standard error: RSE = sqrt(RSS / (n - 2)). It is the average amount that the response will deviate from the true regression line. It is an absolute measure of the lack of fit of the model
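As a quick check of the formula with the divisor read as n - 2, the RSE can be computed by hand and compared with the value R reports. The data below are hypothetical.

```r
# Toy data, invented for illustration.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.2, 3.8, 6.1, 8.3, 9.7, 12.2)

fit <- lm(y ~ x)
n   <- length(y)
rss <- sum(residuals(fit)^2)   # residual sum of squares
rse <- sqrt(rss / (n - 2))     # residual standard error for simple regression

# summary() stores the same quantity as `sigma`
rse_reported <- summary(fit)$sigma
```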

Answer 18

the estimated standard errors will be too low -\> unwarranted sense of confidence in model

Answer 19

pred\_prob \> 0.5, labels = c("No", "yes") table(true=, predicted = pred\_lr)

Answer 20

as p increases (more dimensions), a given observation has no nearby neighbours

Answer 21

An unusual value for xi. For multiple regression: a point that is unusual in terms of the full set of predictors

Answer 22

Mean squared error: MSE = (1/n) SUM(yi - predicted yi)^2

Answer 23

- fk(x) is normal
- there is a common variance across all K classes

Answer 24

TSS measures the total variance in the response Y before the regression is performed. RSS measures the variability that is left unexplained after performing the regression. TSS - RSS measures the amount of variability in the response that is explained by performing the regression
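The three quantities can be computed directly from a fitted model. This is a minimal sketch on hypothetical toy data; it also shows that (TSS - RSS)/TSS matches the R^2 reported by `summary()`.

```r
# Toy data, invented for illustration.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

tss <- sum((y - mean(y))^2)    # total variability in the response
rss <- sum(residuals(fit)^2)   # variability left unexplained by the regression
r2  <- (tss - rss) / tss       # proportion of variability explained

r2_reported <- summary(fit)$r.squared
```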

Answer 25

E(MSE) = bias-squared + variance + Var(e)

Answer 26

increasing X by one unit changes the log odds by B1

Answer 27

Cor(Y, Y\*)

Answer 28

predicting class one if Pr (Y = 1 | X = x0) \> 0.5

Answer 29

B0\* ± 2 · SE(B0\*), where 2 approximates the 97.5% quantile of a t-distribution with n-2 degrees of freedom
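A short sketch of the interval using the exact t quantile instead of the rule-of-thumb 2. The data are hypothetical; `confint()` is used only as a cross-check.

```r
# Toy data, invented for illustration.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.2, 3.8, 6.1, 8.3, 9.7, 12.2)

fit <- lm(y ~ x)
tab <- coef(summary(fit))

est <- tab["(Intercept)", "Estimate"]
se  <- tab["(Intercept)", "Std. Error"]
tq  <- qt(0.975, df = length(y) - 2)   # 97.5% quantile, n - 2 degrees of freedom

ci_manual  <- c(est - tq * se, est + tq * se)
ci_confint <- as.numeric(confint(fit, "(Intercept)", level = 0.95))
```

With only six observations the quantile is noticeably larger than 2, so the rule of thumb is best kept for larger samples.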

Answer 30

Use the least squares approach to minimise RSS (residual sum of squares)

Answer 31

small training MSE but large test MSE

Answer 32

One less than the number of levels, because there is a baseline level with no dummy variable
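To see the count in practice, R's `model.matrix()` can be inspected for a factor predictor. The `region` factor below is hypothetical, and the sketch assumes R's default treatment contrasts.

```r
# Hypothetical qualitative predictor with three levels.
region <- factor(c("East", "West", "South", "East"))

mm <- model.matrix(~ region)

n_levels  <- nlevels(region)
n_dummies <- ncol(mm) - 1   # drop the intercept column

# The first level (alphabetically) acts as the baseline and gets no column.
colnames(mm)
```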

Answer 33

Automatically outputs the log odds. To change it: predict(lr\_mod, type="response")

Answer 34

A very low K value means the decision boundary is overly flexible and finds patterns in the data that don’t correspond to the Bayes decision boundary. This classifier has low bias but very high variance. A very high K value means the classifier becomes less flexible and produces a decision boundary that is close to linear. This is a low-variance but high-bias classifier

Answer 35

Develop an agent that improves its performance based on interactions with the environment

Answer 36

AutoML performs all the steps of comparing different models using cross-validation, choosing parameters automatically fit\_aml

Answer 37

Use dummy variables. 0 for one, 1 for other. Or -1 and 1

Answer 38

bias initially decreases faster than variance increases, so the MSE declines. But at some point increasing flexibility has more impact on the variance, so the MSE increases.

Answer 39

Given a value for K and a prediction point x0, KNN regression first identifies the K training observations that are closest to x0, represented by N0. Then it estimates f(x0) using the average of all the training responses in N0
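The same procedure, sketched in base R for a single predictor. The helper `knn_regress` and the toy data are hypothetical, and absolute difference is assumed as the distance.

```r
# Hypothetical helper: average the responses of the K nearest training points.
knn_regress <- function(train_x, train_y, x0, k) {
  n0 <- order(abs(train_x - x0))[seq_len(k)]  # indices of the K closest observations
  mean(train_y[n0])                           # estimate of f(x0)
}

# Toy data, invented for illustration.
train_x <- c(1, 2, 3, 4, 5)
train_y <- c(2, 4, 6, 8, 10)

# The three nearest x values to 2.9 are 3, 2 and 4, with responses 6, 4 and 8.
est_k3 <- knn_regress(train_x, train_y, x0 = 2.9, k = 3)
est_k1 <- knn_regress(train_x, train_y, x0 = 2.9, k = 1)
```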

Answer 40

irreducible and reducible error

Answer 41

- Prediction intervals are used to answer the question: how much will Y vary from Y\*?
- Prediction intervals are always wider than confidence intervals because they incorporate both the reducible error and irreducible error

Answer 42

the proportion of actual negatives that are identified correctly = TN/(TN+FP)

Answer 43

(TP+TN)/(TP+FP+TN+FN)
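A small worked example of the confusion-matrix ratios that appear in these cards, using hypothetical counts. The denominator of accuracy is all four cells: TP, FP, TN and FN.

```r
# Hypothetical confusion-matrix counts.
tp <- 40; fp <- 10; tn <- 45; fn <- 5

accuracy    <- (tp + tn) / (tp + fp + tn + fn)  # share of all predictions that are correct
precision   <- tp / (tp + fp)                   # share of predicted positives that are real
specificity <- tn / (tn + fp)                   # share of actual negatives found
npv         <- tn / (tn + fn)                   # share of predicted negatives that are real
```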

Answer 44

- Additivity assumption: that the association between a predictor X and the response Y does not depend on the values of the other predictors
- the error terms e1, e2, … are uncorrelated
- the error terms have a constant variance, Var(ei) = sigma squared

Answer 45

ensemble of weak learners. Make trees that are too simple and make more of them for observations with big residuals, then average them

Answer 46

the error that is introduced by approximating a real-life problem which may be very complicated, by a simpler model. In general, more flexible methods result in less bias

Answer 47

There is a dataset, a set of people trying to find a prediction rule, and a referee. The referee runs the prediction rule against a testing dataset, which is sequestered behind a Chinese wall. The referee objectively and automatically reports the score achieved by the submitted rule. Results in a declining error rate

Answer 48

p(X)/(1-p(X)) = e^(B0 + B1X)

Answer 49

Recursive partitioning. Find the split that makes observations as similar as possible on the outcome within that split. Do that again with each resulting group. Stop when a stopping criterion is reached

Answer 50

When there is a small number of observations per predictor

Answer 51

log(p(X)/(1-p(X))) = B0 + B1X
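The relationship between probability, odds and log odds can be checked numerically. The coefficient values b0 and b1 below are hypothetical, chosen only for illustration.

```r
# Hypothetical coefficients for a one-predictor logistic model.
b0 <- -3
b1 <- 0.5

p    <- function(x) plogis(b0 + b1 * x)   # p(X) = e^(B0+B1X) / (1 + e^(B0+B1X))
odds <- function(x) p(x) / (1 - p(x))     # p(X) / (1 - p(X))

log_odds_4 <- log(odds(4))                # equals b0 + b1 * 4
delta      <- log(odds(5)) - log(odds(4)) # a one-unit increase adds b1 to the log odds
ratio      <- odds(5) / odds(4)           # and multiplies the odds by exp(b1)
```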

Answer 52

- Forward selection: begin with the null model. Then fit p simple linear regressions and add to the null model the variable that results in the lowest RSS. Continue adding variables until some stopping rule is satisfied
- Backward selection: start with all variables, remove the variable with the largest p-value, continue removing variables until a stopping rule is reached
- Mixed selection: combination of forward and backward. Start with no variables in the model. Add the variable that provides the best fit. Continue to add variables one by one. If at any point the p-value for one of the variables in the model rises above a certain threshold, then remove that variable from the model. Continue until all the variables in the model have a sufficiently low p-value and all variables outside the model would have a large p-value if added to the model

Answer 53

F1 score = 2 x (precision x recall) / (precision + recall), the harmonic mean of precision and recall. It is more informative than accuracy when class distributions are uneven
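A sketch of the F1 score as it is usually defined, the harmonic mean of precision and recall, on hypothetical counts with a rare positive class. Recall is taken as TP/(TP+FN).

```r
# Hypothetical counts for an imbalanced problem: 20 positives among 1000 observations.
tp <- 5; fp <- 5; fn <- 15; tn <- 975

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Accuracy looks excellent because negatives dominate, while F1 stays low.
accuracy <- (tp + tn) / (tp + fp + tn + fn)
```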

Answer 54

if we include an interaction in a model we should also include the main effects, even if the p-values associated with their coefficients are not significant

Answer 55

seq(0, 40, length.out = 1000)

Answer 56

P(A|B) = P(B|A) . P(A) / P(B)
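A worked numeric example of the formula. All probabilities are hypothetical, and P(B) is obtained from the law of total probability.

```r
# Hypothetical inputs: A is rare, B is a noisy signal of A.
p_a             <- 0.01   # P(A)
p_b_given_a     <- 0.90   # P(B | A)
p_b_given_not_a <- 0.05   # P(B | not A)

# Law of total probability for the denominator
p_b <- p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem
p_a_given_b <- p_b_given_a * p_a / p_b
```

Even with a strong signal, P(A|B) stays well below P(B|A) here because A is rare.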

Answer 57

Y = B0 + B1X

Answer 58

A model is perfectly calibrated if for any probability value p, a prediction of a class with confidence p is correct 100\*p percent of the time

Answer 59

Advantage: it has less bias because the training set is bigger. Disadvantage: time consuming to implement

Answer 60

e is a random error term which is independent of X and has mean 0

Answer 61

For each observation xi, there is an associated response measurement yi

Answer 62

models the probability that Y belongs to a particular category

Answer 63

- validation estimate of the test-error rate can be highly variable depending on which observations are included in the training set and which are included in the validation set
- not as many observations in the training set

Answer 64

ifelse(student=="Yes", 1, 0)

Answer 65

Assume that X = (X1, …, Xp) is drawn from a multivariate normal distribution with a class-specific mean vector and a common covariance matrix

Answer 66

Presence of a funnel shape in the residual plot. Transform Y to log(Y) or sqrt(Y)

Answer 67

TN/(TN+FN)

Answer 68

- only has to be fit k times compared to n times in LOOCV
- variability in the test error estimate is lower than when using the validation set approach

Answer 69

predict(model, newdata = data)

Answer 70

For each observation we observe a vector of measurements xi but no response yi. Seek to understand relationship between observations

Answer 71

predicted probability versus observed proportion, should be a straight line with slope 1

Answer 72

Classifying a response variable with more than 2 classes

Answer 73

Plot the residuals as a function of time. Adjacent residuals may have similar values if they are correlated

Answer 74

Do not make explicit assumptions about the functional form of f

Answer 75

Plot the standardised residuals. Those with absolute values greater than 3 may be outliers

Answer 76

Collinearity reduces the accuracy of the estimates of the regression coefficients, and causes the standard error to grow

Answer 77

Instead of selecting a baseline class, treat all K classes symmetrically. Estimate coefficients for all K classes

Answer 78

randomly divide the available set of observations into two parts, a training set and a validation set. The model is fit on the training set and the fitted model is used to predict the responses for the observations in the validation set

Answer 79

A single observation (x1, y1) is used for the validation set and the remaining observations make up the training set. Find the MSE and repeat this approach n times and get the average of the n MSE estimates
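The loop described above can be sketched in base R for a simple linear model. The data frame `dat` and its values are hypothetical.

```r
# Toy data, invented for illustration.
dat <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.2, 3.8, 6.1, 8.3, 9.7, 12.2)
)

n     <- nrow(dat)
mse_i <- numeric(n)

for (i in seq_len(n)) {
  fit  <- lm(y ~ x, data = dat[-i, ])                      # train on all but observation i
  pred <- predict(fit, newdata = dat[i, , drop = FALSE])   # predict the held-out point
  mse_i[i] <- (dat$y[i] - pred)^2                          # squared error for that point
}

cv_loocv <- mean(mse_i)   # LOOCV estimate of the test MSE
```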

Answer 80

Advantage: have the potential to accurately fit a wider range of possible shapes for f. Disadvantage: a very large number of observations is required to obtain an accurate estimate for f

Answer 81

bagged trees with feature sampling. Make trees that are too complex and average over bootstrapped samples to cancel out the overfitting parts

Answer 82

predict(model, newdata = data, interval = "confidence")

Answer 83

Z-statistic

Answer 84

ei = yi - yi\*: the difference between the observed and predicted response

(113 cards)