Statistical Learning Methods Flashcards

1
Q

mean

A

arithmetic mean: the sum of the values divided by the number of values

-> denoted by the Greek letter μ (“mu”) for a population

2
Q

median

A

The median of a set of data is the middlemost number in the set. The median is also the number that is halfway into the set. To find the median, the data should first be arranged in order from least to greatest. If there is an even number of items in the data set, then the median is found by taking the mean (average) of the two middlemost numbers.

3
Q

standard deviation

A

Deviation just means how far from the normal

  • is just the square root of the variance
  • while the variance gives you a rough idea of spread, the standard deviation is more concrete, giving you exact distances from the mean
  • The standard deviation is an especially useful measure of variability when the distribution is normal or approximately normal, because the proportion of the distribution within a given number of standard deviations of the mean can be calculated. For example, 68% of the distribution is within one standard deviation of the mean and approximately 95% of the distribution is within two standard deviations of the mean. Therefore, if you had a normal distribution with a mean of 50 and a standard deviation of 10, then 68% of the distribution would be between 50 - 10 = 40 and 50 + 10 = 60. Similarly, about 95% of the distribution would be between 50 - 2 x 10 = 30 and 50 + 2 x 10 = 70.

=> a measure of the spread of the values around the mean
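
A minimal numpy sketch of the worked example above (a normal distribution with mean 50 and standard deviation 10); the simulated fractions of the 68/95 rule are approximate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the card's example: a normal distribution with
# mean 50 and standard deviation 10.
x = rng.normal(loc=50, scale=10, size=100_000)

mu = x.mean()
sigma = x.std()        # population formula (divide by n)
s = x.std(ddof=1)      # sample formula (divide by n - 1)

# Fraction of values within one and two standard deviations of the mean;
# for a normal distribution these should be close to 68% and 95%.
within_1sd = np.mean(np.abs(x - mu) < sigma)
within_2sd = np.mean(np.abs(x - mu) < 2 * sigma)

print(f"mean = {mu:.2f}, sd = {sigma:.2f}")
print(f"within 1 sd: {within_1sd:.3f}  (expect ~0.68)")
print(f"within 2 sd: {within_2sd:.3f}  (expect ~0.95)")
```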

4
Q

Sample

A

a selection taken from a bigger Population

5
Q

normal distribution

A

It is a continuous, bell-shaped distribution (single peak) which is symmetric about its mean and can take on values from negative infinity to positive infinity

Each normal curve is characterized by two parameters (is completely described by):

  • the mean
  • the standard deviation (its symbol is the greek letter sigma)
6
Q

variance

A

measures how far a data set is spread out.
The technical definition is “The average of the squared differences from the mean,” but all it really does is to give you a very general idea of the spread of your data.
A value of zero means that there is no variability: All the numbers in the data set are the same.

7
Q

population

A

a sample is a part of a population. A population is a whole: it’s every member of a group.
A population is the opposite of a sample, which is a fraction or percentage of a group. Sometimes it’s possible to survey every member of a group; if you do manage to survey everyone, it is called a census. A classic example is the U.S. Census, where it’s the law that you have to respond.
In most cases, it’s impractical to survey everyone. In addition, sometimes people either don’t want to respond or forget to respond, leading to incomplete censuses. Incomplete censuses become samples by definition.

8
Q

p-value

A

probability value: the probability of obtaining a sample result at least as extreme as the one observed, assuming the null hypothesis is true -> if very small, then H_0 should most likely be rejected
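
A small illustration (assuming scipy is available) of obtaining a p-value from a one-sample t-test on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical sample; H_0: the population mean is 0.
sample = rng.normal(loc=0.5, scale=1.0, size=30)

# The p-value is the probability of a test statistic at least this
# extreme if H_0 were true.
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("small p-value -> reject H_0 at the 5% level")
else:
    print("no evidence against H_0 at the 5% level")
```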

9
Q

test statistic

A

a quantity computed from the sample whose distribution under the null hypothesis is known (e.g. a t-, F- or chi-squared statistic); it is compared against that distribution to decide whether to reject H_0

10
Q

Gaussian distribution

A

= Normal distribution = bell curve

11
Q

simple linear regression

A

predicts a quantitative response Y on the basis of a single predictor variable X, assuming an approximately linear relationship: Y ≈ β_0 + β_1X

12
Q

least squares method

A

estimate coefficients β_0, β_1, …, β_p s.t. RSS is minimized (i.e. has the smallest possible value)
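
A minimal sketch of the closed-form least squares estimates for simple linear regression on synthetic data, directly minimizing RSS:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from a known line: y = 3 + 2x + noise.
x = rng.uniform(0, 10, size=100)
y = 3 + 2 * x + rng.normal(scale=1.0, size=100)

# Closed-form least squares estimates:
#   beta_1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   beta_0 = y_bar - beta_1 * x_bar
x_bar, y_bar = x.mean(), y.mean()
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0 = y_bar - beta_1 * x_bar

# RSS is the quantity these estimates minimize.
rss = np.sum((y - (beta_0 + beta_1 * x)) ** 2)
print(f"beta_0 = {beta_0:.3f}, beta_1 = {beta_1:.3f}, RSS = {rss:.2f}")
```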

13
Q

intercept

A

β_0 in Y = β_0 + β_1X + e: the expected value of Y when X = 0, i.e. where the regression line crosses the y-axis

14
Q

Linear models

A

Linear models describe a continuous response variable as a function of one or more predictor variables. They can help you understand and predict the behavior of complex systems or analyze experimental, financial, and biological data. Linear regression is a statistical method used to create a linear model.

15
Q

residuals

A

e_i = y_i − ŷ_i: the difference between the i-th observed response value and the value predicted by the model

16
Q

qualitative response

A

a response that takes values in one of K classes or categories (e.g. yes/no, brand A/B/C)

17
Q

quantitative response

A

a response that takes numerical values (e.g. age, income, price)

18
Q

regression

A

the task of predicting a quantitative response

19
Q

classification

A

the task of predicting a qualitative (categorical) response

20
Q

prediction

A

using fˆ to estimate the response Y for new observations of the predictors; fˆ can be treated as a black box, since accuracy matters more than its exact form

21
Q

inference

A
  • understanding the association between Y and X_1, …, X_p: how does the response change as the predictors change? Here fˆ cannot be treated as a black box, because we need to know its exact form
  • when inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods
  • in some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest: if we seek to develop an algorithm to predict the price of a stock, our sole requirement is that it predict accurately; interpretability is not a concern
22
Q

prediction vs inference

A

prediction: estimate Y for given X as accurately as possible; fˆ is treated as a black box
inference: understand how Y changes as a function of X_1, …, X_p; the exact form of fˆ matters

23
Q

parametric

A

reduces the problem of estimating function f down to one of estimating a set of parameters

24
Q

overfitting (the data)

A

fitting a too flexible model can lead to overfitting the data, i.e. the model follows the errors, or noise, too closely

- As model flexibility increases, training MSE will decrease, but the test MSE may not. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data.
- This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function f. When we overfit the training data, the test MSE will be very large because the supposed patterns that the method found in the training data simply don’t exist in the test data. Note that regardless of whether or not overfitting has occurred, we almost always expect the training MSE to be smaller than the test MSE because most statistical learning methods either directly or indirectly seek to minimize the training MSE. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.
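
A small synthetic illustration of this: as polynomial degree (flexibility) grows, training MSE keeps falling while test MSE eventually rises again:

```python
import numpy as np

rng = np.random.default_rng(3)

# The true f is quadratic; more flexible fits chase the noise.
def f(x):
    return 1.0 + 0.5 * x - 0.3 * x**2

x_train = rng.uniform(-3, 3, 50)
y_train = f(x_train) + rng.normal(scale=1.0, size=50)
x_test = rng.uniform(-3, 3, 200)
y_test = f(x_test) + rng.normal(scale=1.0, size=200)

for degree in (1, 2, 5, 10):
    coefs = np.polyfit(x_train, y_train, deg=degree)
    mse_train = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    mse_test = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree:2d}: train MSE = {mse_train:.2f}, test MSE = {mse_test:.2f}")
```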

25
Q

parametric methods

A

two-step, model-based approach:
1) make an assumption about the functional form of f, e.g. that f is linear: f(X) = β_0 + β_1X_1 + … + β_pX_p
2) fit (train) the model, i.e. estimate the parameters β_0, β_1, …, β_p, for example by least squares

26
Q

non-parametric methods

A
  • do not make explicit assumptions about the functional form of f
  • by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f
  • a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f
  • We will often obtain more accurate predictions using a less flexible method. This phenomenon, which may seem counterintuitive at first glance, has to do with the potential for overfitting in highly flexible methods
27
Q

supervised learning

A

We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference).

28
Q

unsupervised learning

A

for every observation i = 1, …, n, we observe a vector of measurements x_i but no associated response y_i. It is not possible to fit a linear regression model, since there is no response variable to predict. In this setting we are in some sense working blind; the situation is referred to as unsupervised because we lack a response variable that can supervise our analysis

  • > two types of unsupervised learning:
    1) principal components analysis
    2) clustering

=> in unsupervised learning, there is no way to check our work because we don’t know the true answer; the problem is unsupervised

29
Q

variables

A

can be characterized as either quantitative or qualitative (also known as categorical)

30
Q

regression problems

A

problems with a quantitative response

31
Q

classification problems

A

problems with a qualitative response

32
Q

degrees of freedom

A

a quantity that summarizes the flexibility of a curve
- a more restricted and hence smoother curve has fewer degrees of freedom than a wiggly curve

33
Q

variance of a statistical learning method

A

Variance refers to the amount by which fˆ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different fˆ.
But ideally the estimate for f should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in fˆ.
In general, more flexible statistical methods have higher variance

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease

34
Q

bias of a statistical learning method

A

bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

  • It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f
  • Generally, more flexible methods result in less bias.

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease

35
Q

training error rate vs test error rate

A

training error rate: the proportion of mistakes made when fˆ is applied to the training observations; test error rate: the average error rate on observations that were not used in training -> the training error rate is typically lower than, and can badly underestimate, the test error rate

36
Q

logistic function

A

p(X) = e^(β_0 + β_1X) / (1 + e^(β_0 + β_1X))

-> an S-shaped function that produces outputs between 0 and 1 for all values of X; used by logistic regression to model probabilities

37
Q

maximum likelihood

A

the method used to fit logistic regression: choose estimates β̂_0, β̂_1 that maximize the likelihood function, i.e. that make the observed data as probable as possible under the model

38
Q

student distribution vs normal distribution

A

the Student t-distribution is bell-shaped like the normal distribution but has heavier tails; it has a degrees-of-freedom parameter, and as the degrees of freedom grow it approaches the normal distribution

39
Q

t-test

A

a hypothesis test that compares a t-statistic to the Student t-distribution, e.g. to test whether a mean or a regression coefficient differs significantly from a hypothesized value (typically 0)

40
Q

RSS

A
  • > residual sum of squares / sum of squared residuals: RSS = Σ (y_i − ŷ_i)²
  • > the least squares regression seeks coefficients β_0, β_1, …, β_p s.t. the sum of squared residuals is as small as possible
41
Q

R-square

A

R² = 1 − RSS/TSS: the proportion of the variance in Y explained by the model; always lies between 0 and 1, independent of the scale of Y

42
Q

MSE

A

mean squared error: MSE = (1/n) Σ (y_i − fˆ(x_i))²; the most commonly used measure of quality of fit in the regression setting
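
A minimal sketch computing the MSE and the R² of the previous card from hypothetical observed values and predictions:

```python
import numpy as np

# Hypothetical observed responses and model predictions.
y = np.array([3.1, 4.0, 5.2, 6.1, 7.3])
y_hat = np.array([3.0, 4.2, 5.0, 6.3, 7.0])

mse = np.mean((y - y_hat) ** 2)        # mean squared error
rss = np.sum((y - y_hat) ** 2)         # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
r_squared = 1 - rss / tss              # proportion of variance explained

print(f"MSE = {mse:.4f}, R^2 = {r_squared:.4f}")
```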

43
Q

t_0 statistic

A

the observed value of the test statistic in a t-test, e.g. t_0 = (x̄ − μ_0) / (s/√n) for a one-sample test of H_0: μ = μ_0

44
Q

bias vs. variance

A

variance: how much fˆ would change if it were estimated on a different training set; bias: the error introduced by approximating a complicated real-life problem with a much simpler model -> flexible methods: high variance, low bias; inflexible methods: low variance, high bias

45
Q

stratification

A

dividing a population into homogeneous subgroups (strata) and sampling from each stratum so that every subgroup is properly represented; in cross-validation, stratified splits preserve the class proportions within each fold

46
Q

t-statistic vs t-test

A

the t-statistic is the number computed from the sample (e.g. t = β̂_1 / SE(β̂_1)); the t-test is the procedure that compares this statistic to the t-distribution to obtain a p-value

47
Q

confidence interval

A

a range of values that, over repeated sampling, contains the true parameter value with a given probability; e.g. an approximate 95% confidence interval for β_1 is β̂_1 ± 2 · SE(β̂_1)
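
A small sketch (assuming scipy) of a t-based 95% confidence interval for a mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sample = rng.normal(loc=10.0, scale=2.0, size=25)

n = sample.size
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean

# 95% confidence interval using the t-distribution with n - 1 df.
t_crit = stats.t.ppf(0.975, df=n - 1)
lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"95% CI for the mean: [{lower:.2f}, {upper:.2f}]")
```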

48
Q

covariance

A

a measure of the joint variability of two variables: cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]; positive when they tend to increase together, negative when one tends to decrease as the other increases

49
Q

Support Vector machines (SVM)

A
  • classifier
50
Q

PCA

A

principal component analysis: an unsupervised approach for deriving a low-dimensional set of features from a large set of variables (see the “Principal component analysis (PCA)” card)

51
Q

why normalize data?

A

so that variables measured on different scales contribute comparably; methods based on distances or on variance (e.g. k-nearest neighbours, PCA) are otherwise dominated by the variables with the largest scales

52
Q

rejection of outliers

A
  • Value incompatible with the variable domain
  • Value too large or too small according to the domain
  • Value larger or smaller than the mean +/- 3 times the standard deviation (see the sketch below)
  • Coding problem (a string instead of a real value)
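
A minimal sketch of the mean +/- 3 standard deviations rule from the third bullet, on synthetic data with two planted outliers:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 100), [8.0, -9.5]])  # two planted outliers

mu, sigma = x.mean(), x.std()

# Flag values outside mean +/- 3 standard deviations.
mask = np.abs(x - mu) > 3 * sigma
print("flagged as outliers:", x[mask])
print("kept:", (~mask).sum(), "of", x.size, "values")
```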
53
Q

10-CV

A

10-fold cross-validation: randomly divide the observations into 10 folds; repeatedly fit the model on 9 folds and compute the test error on the held-out fold; average the 10 resulting error estimates
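
A hand-rolled sketch of 10-fold CV for simple linear regression on synthetic data (numpy only; sklearn's KFold would do the splitting for you):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 100)
y = 3 + 2 * x + rng.normal(scale=1.0, size=100)

k = 10
indices = rng.permutation(x.size)      # shuffle, then split into 10 folds
folds = np.array_split(indices, k)

fold_mses = []
for held_out in folds:
    train = np.setdiff1d(indices, held_out)
    b1, b0 = np.polyfit(x[train], y[train], deg=1)  # fit on the other 9 folds
    y_pred = b0 + b1 * x[held_out]                  # predict the held-out fold
    fold_mses.append(np.mean((y[held_out] - y_pred) ** 2))

print(f"10-fold CV estimate of the test MSE: {np.mean(fold_mses):.3f}")
```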

54
Q

bias-variance trade-off

A

p 36

as we use more flexible methods, the variance will increase and the bias will decrease

55
Q

covariance

A

sample covariance: cov(x, y) = (1/(n − 1)) Σ (x_i − x̄)(y_i − ȳ); measures how two variables vary together

56
Q

normal distribution

A

A normal distribution is fully described with just two parameters: its mean (μ) and standard deviation (σ).

57
Q

outliers

A

e.g. when a value is larger or smaller than the mean +/- 3 times the standard deviation

58
Q

population regression line

A

the best linear approximation to the true relationship between X and Y as in
Y = β_0 + β_1X + e.

the least squares plane is an estimate for the true population regression plane

59
Q

RSE

A

residual standard error

RSE = sqrt(RSS / (n − 2)) in simple linear regression: an estimate of the standard deviation of the error term e, i.e. roughly the average amount by which the response deviates from the true regression line

60
Q

t-statistic

A

t = β̂_1 / SE(β̂_1): measures the number of standard errors that β̂_1 is away from 0; used to test H_0: β_1 = 0

61
Q

normalize data

A

rescale variables to a common scale, e.g. standardization z = (x − mean) / sd, or min-max scaling to [0, 1]
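
A minimal sketch of both rescalings on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=170, scale=12, size=10)   # e.g. heights in cm

# z-score standardization: mean 0, standard deviation 1.
z = (x - x.mean()) / x.std()

# min-max scaling: values rescaled to [0, 1].
scaled = (x - x.min()) / (x.max() - x.min())

print("z-scores:", np.round(z, 2))
print("min-max :", np.round(scaled, 2))
```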

62
Q

variability

A

(German “Schwankungen”: fluctuations)
Variability refers to how “spread out” a group of scores is
-> refers to how spread out a distribution is

There are four frequently used measures of variability: the range, interquartile range, variance, and standard deviation
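
A minimal numpy sketch computing all four measures for a small data set:

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

data_range = x.max() - x.min()             # range
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                              # interquartile range
variance = x.var(ddof=1)                   # sample variance
sd = x.std(ddof=1)                         # sample standard deviation

print(f"range = {data_range}, IQR = {iqr}, variance = {variance:.2f}, sd = {sd:.2f}")
```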

63
Q

correlation matrix

A

a symmetric table whose (i, j) entry is the correlation between the i-th and j-th variables; the diagonal entries are all 1

64
Q

F-statistic

A

used in multiple regression to test H_0: β_1 = β_2 = … = β_p = 0 (no predictor is related to the response): F = ((TSS − RSS)/p) / (RSS/(n − p − 1)); values much larger than 1 provide evidence against H_0

65
Q

quantitative vs qualitative predictors

A

quantitative predictors take numerical values; qualitative predictors take values in one of K classes and enter a regression model via dummy variables

66
Q

outlier

A

An outlier is a point for which y_i is far from the value predicted by the model. Outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection.

67
Q

cross-validation

A

a family of approaches for estimating the test error by holding out part of the training data: fit the model on one part and evaluate it on the held-out part; variants include the validation set approach, leave-one-out CV, and k-fold CV

68
Q

validation set approach

A

randomly split the available observations into a training set and a validation (hold-out) set; fit the model on the training set and estimate the test error on the validation set -> simple, but the error estimate can be highly variable, and only a subset of the observations is used to fit the model

69
Q

maximal margin classifier

A
  • In order to construct a classifier based upon a separating hyperplane, we must have a reasonable way to decide which of the infinite possible separating hyperplanes to use
    -> A natural choice is the maximal margin hyperplane, which is the separating hyperplane that is farthest from the training observations. That is, we can compute the distance from each training observation to a given separating hyperplane; the smallest such distance is the minimal distance from the observations to the hyperplane, and is known as the margin. The maximal margin hyperplane is the separating hyperplane for which the margin is largest, i.e. the hyperplane that has the farthest minimum distance to the training observations. We can then classify a test observation based on which side of the maximal margin hyperplane it lies -> maximal margin classifier
  • > the maximal margin hyperplane depends directly on only a small subset of the observations!
  • > the generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier

-> the distance of an observation from the hyperplane can be seen as a measure of our confidence that the observation was correctly classified

70
Q

hyperplane

A

a hyperplane is a flat affine subspace of dimension p − 1 in a p-dimensional space

  • > it divides the p-dimensional space into two halves
  • > determine on which side a point lies by calculating the sign of the left-hand side of (9.2)
71
Q

concept of separating hyperplane

A

the goal is to develop a classifier based on the training data that will correctly classify the test observation using its feature measurements

  • In order to construct a classifier based upon a separating hyperplane, we must have a reasonable way to decide which of the infinite possible separating hyperplanes to use
    -> A natural choice is the maximal margin hyperplane, which is the separating hyperplane that is farthest from the training observations: compute the distance from each training observation to a given separating hyperplane; the smallest such distance is known as the margin. The maximal margin hyperplane is the separating hyperplane for which the margin is largest. We can then classify a test observation based on which side of the maximal margin hyperplane it lies -> maximal margin classifier
72
Q

support vectors

A
  • > Observations that lie directly on the margin, or on the wrong side of the margin for their class. These observations do affect the support vector classifier.
  • they are called “support” vectors since they are vectors in p-dimensional space and they support the maximal margin hyperplane, in the sense that if these points were moved slightly, the maximal margin hyperplane would move as well
  • > for SVM only the support vectors are relevant as for the other training observations alpha_i is zero
73
Q

soft margin

A

the margin used by the support vector classifier: it may be violated by some training observations (they can lie on the wrong side of the margin, or even of the hyperplane), in exchange for greater robustness and better classification of most of the remaining observations

74
Q

support vector classifier

A
  • > the generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier
  • > also called “soft margin classifier”
  • > The hyperplane is chosen to correctly separate most of the training observations into the two classes, but may misclassify a few observations
  • the fact that the support vector classifier’s decision rule is based only on a potentially small subset of the training observations (the support vectors) means that it is quite robust to the behavior of observations that are far away from the hyperplane
  • > is a linear classifier
  • > is a natural approach for classification in the two-class setting, if the boundary between two classes is linear
  • > what to do with a non-linear class boundary?

-> note that the support vector classifier is equivalent to a SVM using a polynomial kernel of degree d = 1

75
Q

decision rule

A

the rule by which a classifier assigns an observation to a class; e.g. for hyperplane-based classifiers, classify a test observation according to the sign of f(x) = β_0 + β_1x_1 + … + β_px_p (i.e. which side of the hyperplane it lies on)

76
Q

Type I error

A

a false positive conclusion
The rejection of the hypothesis H_0 when H_0 is in fact true.

To be precise: if you reject H_0 (say no to H_0) although in reality H_0 is true, then by definition this is a false positive (a Type I error), not a false negative.

Practically, the type I error can be interpreted as the probability of deciding that a significant effect is present (reject H_0) when it isn’t (H_0 true).

Why?
The sample tends to demonstrate a significant effect, but this is due to random variability: the sampling provides an extreme (but still possible) sample.

77
Q

Type II error

A

a false negative conclusion
The acceptance of the hypothesis H_0 when H_0 is in fact false (and so H_1 is true).

Namely: if you accept H_0 (say yes to H_0) although in reality H_0 is false, then by definition this is not a false positive but a false negative (a Type II error).

Practically, the type II error can be interpreted as the probability of not detecting a significant effect (accept H_0) when one exists (H_0 false).

Why?
The true effect (H_1) is too close to the H_0 effect.
The effect is too small to be detected.
The sample size is too small to detect the difference.

78
Q

LDA

A

linear discriminant analysis
-> approximates the Bayes classifier

  • why another method besides LR?
  • > if n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is more stable than the logistic regression model
  • > LDA is more popular for more than 2 response classes

-> the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean vector and a common variance σ^2, and plugging estimates for these parameters into the Bayes classifier

79
Q

LR

A

logistic regression

  • > logistic regression models the probability that Y belongs to a particular category
  • > uses logistic function
  • > to fit that model use a method called maximum likelihood
80
Q

classifiers

A
  • logistic regression
  • linear discriminant analysis
  • k-nearest neighbours
  • support vector classifier
81
Q

supervised vs. unsupervised learning

A

supervised: regression and classification
- > goal: predict Y using X_1, X_2,…,X_p given p features and n observations

unsupervised: only a set of features X_1, X_2,…,X_p available measured on n observations.
- > goal: not interested in prediction (don’t have Y), goal is to discover interesting things about the measurements on X_1, X_2,…,X_p

two types of unsupervised learning:

1) principal components analysis
2) clustering

82
Q

parametric vs non-parametric methods

A

parametric method: boils the problem of estimating f down to estimating a set of parameters (after assuming a functional form for f)

non-parametric method: makes no explicit assumption about the functional form of f; can fit a wider range of shapes, but needs far more observations to obtain an accurate estimate of f

83
Q

P(default = Yes|balance)

A

the probability of default given a value for balance

-> model it using a function that gives outputs between 0 and 1 for all values of X -> use the logistic function
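
A small sketch of modelling P(default = Yes | balance) with the logistic function; the coefficient values below are illustrative stand-ins (roughly the ballpark of ISLR’s Default example), not fitted here:

```python
import numpy as np

def logistic(x, beta_0, beta_1):
    """Logistic function: maps any real input to a probability in (0, 1)."""
    e = np.exp(beta_0 + beta_1 * x)
    return e / (1 + e)

# Illustrative coefficients; the S-shape, not the numbers, is the point.
beta_0, beta_1 = -10.65, 0.0055

for balance in (500, 1000, 2000, 3000):
    p = logistic(balance, beta_0, beta_1)
    print(f"balance = {balance}: P(default = Yes | balance) = {p:.4f}")
```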

84
Q

logistic function

A

p(X) = e^(β_0 + β_1X) / (1 + e^(β_0 + β_1X)); gives outputs between 0 and 1 for all values of X

85
Q

Bayes theorem

A

Pr(A|B) = Pr(B|A) · Pr(A) / Pr(B)

-> lets you invert conditional probabilities: the probability of a hypothesis given the observed data can be computed from the probability of the data given the hypothesis
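
A classic worked example of Bayes’ theorem (a test for a rare condition; all probabilities here are hypothetical):

```python
# Bayes' theorem: Pr(A|B) = Pr(B|A) * Pr(A) / Pr(B)
p_disease = 0.01             # Pr(A): prior probability of the condition
p_pos_given_disease = 0.95   # Pr(B|A): sensitivity of the test
p_pos_given_healthy = 0.05   # false positive rate

# Pr(B) by the law of total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: probability of the condition given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"Pr(disease | positive test) = {p_disease_given_pos:.3f}")  # ~0.161
```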

86
Q

Bayes classifer

A

cond. probability: Pr(Y=j | X = x_0)
- it assigns each observation to the most likely class given its predictor values, i.e. it assigns the test observation with predictor vector x_0 to the class j for which Pr(Y=j | X = x_0) is largest
- in a two-class problem, the Bayes classifier corresponds to predicting class 1 if Pr(Y=1 | X = x_0) > 0.5, and class 2 otherwise.

=> minimizes the probability of misclassification
-> the Bayes classifier is a useful benchmark in statistical classification

87
Q

Bayes error rate

A
the Bayes classifier produces the lowest possible test error rate, called the Bayes error rate.
Since the Bayes classifier will always choose the class j for which Pr(Y=j | X = x_0) is largest, the error rate at X = x_0 will be 
1 - max_j Pr(Y=j | X = x_0)
88
Q

linear vs non-linear classifier

A
  • a linear classifier has a linear decision boundary
  • a non-linear classifier has a non-linear decision boundary

89
Q

how to address non-linearity?

A

e.g. move beyond linearity using polynomial regression, step functions, splines, or generalized additive models (cf. chap. 7); in the SVM setting, enlarge the feature space using kernels

90
Q

support vector machine

A
  • extension of support vector classifier
  • > enlarge the feature space of support vector classifier to allow for non-linear boundaries between classes using KERNELS

=> When the support vector classifier is combined with a non-linear kernel such as (9.22) p 352, the resulting classifier is known as a support vector machine

  • mainly for binary classification i.e. classification in the two-class setting
  • > the concept of separating hyperplanes upon which SVMs are based does not lend itself naturally to more than two classes
  • but two approaches for the K-class case exist:
    1) one-versus-one
    2) one-versus-all
91
Q

kernel

A

A kernel is a function that quantifies the similarity of two observations.

  • linear kernel
  • polynomial kernel -> leads to a much more flexible decision boundary
  • radial kernel

=> When the support vector classifier is combined with a non-linear kernel such as (9.22) p 352, the resulting classifier is known as a support vector machine
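
A short scikit-learn sketch (assuming sklearn is installed) comparing kernels on data with a non-linear class boundary; the radial (RBF) kernel should clearly beat the linear one here:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: no separating hyperplane exists in the
# original feature space.
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, degree=3, C=1.0).fit(X_train, y_train)
    print(f"{kernel:6s} kernel: test accuracy = {clf.score(X_test, y_test):.2f}")
```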

92
Q

ROC curve

A

plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) as the classification threshold is varied; the area under the ROC curve (AUC) summarizes the classifier’s overall performance

p. 148

93
Q

deviance

A

a measure of fit for models estimated by maximum likelihood (−2 times the maximized log-likelihood, up to a constant); it plays the role that RSS plays for least squares: the smaller the deviance, the better the fit

94
Q

Principal component analysis (PCA)

A

PCA is a popular approach for deriving a low-dimensional set of features from a large set of variables

refers to the process by which principal components are computed, and the subsequent use of these components in understanding the data. PCA is an unsupervised approach, since it involves only a set of features X_1, X_2, …, X_p, and no associated response Y

  • dimensionality reduction
  • pre-processing step for linear regression
  • used to extract patterns or aid interpretation

Unsupervised approach:
no target signal
more difficult to evaluate
but can be useful to understand the data!
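
A short scikit-learn sketch of PCA on the built-in iris data, standardizing first because PCA is sensitive to the scale of the variables:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize, then project the 4-dimensional iris measurements
# onto the first 2 principal components.
X = load_iris().data
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)

print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
print("first two score vectors:\n", np.round(scores[:2], 3))
```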

95
Q

LDA vs QDA

A

Why does it matter whether or not we assume that the K classes share a common covariance matrix? In other words, why would one prefer LDA to QDA, or vice-versa? The answer lies in the bias-variance trade-off.

LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable.