Statistical Learning Methods Flashcards

1
Q

mean

A

arithmetic mean: the sum of the values divided by the number of values

-> denoted by the Greek letter μ (“mu”) for a population

2
Q

median

A

The median of a set of data is the middlemost number in the set. The median is also the number that is halfway into the set. To find the median, the data should first be arranged in order from least to greatest. If there is an even number of items in the data set, then the median is found by taking the mean (average) of the two middlemost numbers.

3
Q

standard deviation

A

Deviation just means how far from the normal

  • is just the square root of the variance
  • while the variance gives you a rough idea of spread, the standard deviation is more concrete, giving you exact distances from the mean
  • The standard deviation is an especially useful measure of variability when the distribution is normal or approximately normal, because the proportion of the distribution within a given number of standard deviations of the mean can be calculated. For example, 68% of the distribution is within one standard deviation of the mean and approximately 95% of the distribution is within two standard deviations of the mean. Therefore, if you had a normal distribution with a mean of 50 and a standard deviation of 10, then 68% of the distribution would be between 50 - 10 = 40 and 50 + 10 = 60. Similarly, about 95% of the distribution would be between 50 - 2 x 10 = 30 and 50 + 2 x 10 = 70.

=> a measure of the spread of the values around the mean
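
A minimal numpy sketch of the worked example above (a normal distribution with mean 50 and standard deviation 10); the simulated fractions of the 68/95 rule are approximate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the card's example: a normal distribution with
# mean 50 and standard deviation 10.
x = rng.normal(loc=50, scale=10, size=100_000)

mu = x.mean()
sigma = x.std()        # population formula (divide by n)
s = x.std(ddof=1)      # sample formula (divide by n - 1)

# Fraction of values within one and two standard deviations of the mean;
# for a normal distribution these should be close to 68% and 95%.
within_1sd = np.mean(np.abs(x - mu) < sigma)
within_2sd = np.mean(np.abs(x - mu) < 2 * sigma)

print(f"mean = {mu:.2f}, sd = {sigma:.2f}")
print(f"within 1 sd: {within_1sd:.3f}  (expect ~0.68)")
print(f"within 2 sd: {within_2sd:.3f}  (expect ~0.95)")
```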

4
Q

Sample

A

a selection taken from a bigger Population

5
Q

normal distribution

A

It is a continuous, bell-shaped distribution (single peak) which is symmetric about its mean and can take on values from negative infinity to positive infinity

Each normal curve is characterized by two parameters (is completely described by):

  • the mean
  • the standard deviation (its symbol is the greek letter sigma)
6
Q

variance

A

measures how far a data set is spread out.
The technical definition is “The average of the squared differences from the mean,” but all it really does is to give you a very general idea of the spread of your data.
A value of zero means that there is no variability: All the numbers in the data set are the same.

7
Q

population

A

a sample is a part of a population. A population is a whole: it’s every member of a group.
A population is the opposite of a sample, which is a fraction or percentage of a group. Sometimes it’s possible to survey every member of a group; if you do manage to survey everyone, it is called a census. A classic example is the U.S. Census, where it’s the law that you have to respond.
In most cases, it’s impractical to survey everyone. In addition, sometimes people either don’t want to respond or forget to respond, leading to incomplete censuses. Incomplete censuses become samples by definition.

8
Q

p-value

A

probability value: the probability of obtaining a sample result at least as extreme as the one observed, assuming the null hypothesis is true -> if very small, then H_0 should most likely be rejected
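
A small illustration (assuming scipy is available) of obtaining a p-value from a one-sample t-test on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical sample; H_0: the population mean is 0.
sample = rng.normal(loc=0.5, scale=1.0, size=30)

# The p-value is the probability of a test statistic at least this
# extreme if H_0 were true.
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("small p-value -> reject H_0 at the 5% level")
else:
    print("no evidence against H_0 at the 5% level")
```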

9
Q

test statistic

A

a quantity computed from the sample whose distribution under the null hypothesis is known (e.g. a t-, F- or chi-squared statistic); it is compared against that distribution to decide whether to reject H_0

10
Q

Gaussian distribution

A

= Normal distribution = bell curve

11
Q

simple linear regression

A

predicts a quantitative response Y on the basis of a single predictor variable X, assuming an approximately linear relationship: Y ≈ β_0 + β_1X

12
Q

least squares method

A

estimate coefficients β_0, β_1, …, β_p s.t. RSS is minimized (i.e. has the smallest possible value)
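
A minimal sketch of the closed-form least squares estimates for simple linear regression on synthetic data, directly minimizing RSS:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from a known line: y = 3 + 2x + noise.
x = rng.uniform(0, 10, size=100)
y = 3 + 2 * x + rng.normal(scale=1.0, size=100)

# Closed-form least squares estimates:
#   beta_1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   beta_0 = y_bar - beta_1 * x_bar
x_bar, y_bar = x.mean(), y.mean()
beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0 = y_bar - beta_1 * x_bar

# RSS is the quantity these estimates minimize.
rss = np.sum((y - (beta_0 + beta_1 * x)) ** 2)
print(f"beta_0 = {beta_0:.3f}, beta_1 = {beta_1:.3f}, RSS = {rss:.2f}")
```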

13
Q

intercept

A

β_0 in Y = β_0 + β_1X + e: the expected value of Y when X = 0, i.e. where the regression line crosses the y-axis

14
Q

Linear models

A

Linear models describe a continuous response variable as a function of one or more predictor variables. They can help you understand and predict the behavior of complex systems or analyze experimental, financial, and biological data. Linear regression is a statistical method used to create a linear model.

15
Q

residuals

A

e_i = y_i − ŷ_i: the difference between the i-th observed response value and the value predicted by the model

16
Q

qualitative response

A

a response that takes values in one of K classes or categories (e.g. yes/no, brand A/B/C)

17
Q

quantitative response

A

a response that takes numerical values (e.g. age, income, price)

18
Q

regression

A

the task of predicting a quantitative response

19
Q

classification

A

the task of predicting a qualitative (categorical) response

20
Q

prediction

A

using fˆ to estimate the response Y for new observations of the predictors; fˆ can be treated as a black box, since accuracy matters more than its exact form

21
Q

inference

A
  • understanding the association between Y and X_1, …, X_p: how does the response change as the predictors change? Here fˆ cannot be treated as a black box, because we need to know its exact form
  • when inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods
  • in some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest: if we seek to develop an algorithm to predict the price of a stock, our sole requirement is that it predict accurately; interpretability is not a concern
22
Q

prediction vs inference

A

prediction: estimate Y for given X as accurately as possible; fˆ is treated as a black box
inference: understand how Y changes as a function of X_1, …, X_p; the exact form of fˆ matters

23
Q

parametric

A

reduces the problem of estimating function f down to one of estimating a set of parameters

24
Q

overfitting (the data)

A

fitting a too flexible model can lead to overfitting the data, i.e. the model follows the errors, or noise, too closely

- As model flexibility increases, training MSE will decrease, but the test MSE may not. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data.
- This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function f. When we overfit the training data, the test MSE will be very large because the supposed patterns that the method found in the training data simply don’t exist in the test data. Note that regardless of whether or not overfitting has occurred, we almost always expect the training MSE to be smaller than the test MSE because most statistical learning methods either directly or indirectly seek to minimize the training MSE. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.
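
A small synthetic illustration of this: as polynomial degree (flexibility) grows, training MSE keeps falling while test MSE eventually rises again:

```python
import numpy as np

rng = np.random.default_rng(3)

# The true f is quadratic; more flexible fits chase the noise.
def f(x):
    return 1.0 + 0.5 * x - 0.3 * x**2

x_train = rng.uniform(-3, 3, 50)
y_train = f(x_train) + rng.normal(scale=1.0, size=50)
x_test = rng.uniform(-3, 3, 200)
y_test = f(x_test) + rng.normal(scale=1.0, size=200)

for degree in (1, 2, 5, 10):
    coefs = np.polyfit(x_train, y_train, deg=degree)
    mse_train = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    mse_test = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {degree:2d}: train MSE = {mse_train:.2f}, test MSE = {mse_test:.2f}")
```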

25
Q

parametric methods

A

two-step, model-based approach:
1) make an assumption about the functional form of f, e.g. that f is linear: f(X) = β_0 + β_1X_1 + … + β_pX_p
2) fit (train) the model, i.e. estimate the parameters β_0, β_1, …, β_p, for example by least squares

26
Q

non-parametric methods

A
  • do not make explicit assumptions about the functional form of f
  • by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f
  • a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f
  • We will often obtain more accurate predictions using a less flexible method. This phenomenon, which may seem counterintuitive at first glance, has to do with the potential for overfitting in highly flexible methods
27
Q

supervised learning

A

We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference).

28
Q

unsupervised learning

A

for every observation i = 1, …, n, we observe a vector of measurements x_i but no associated response y_i. It is not possible to fit a linear regression model, since there is no response variable to predict. In this setting we are in some sense working blind; the situation is referred to as unsupervised because we lack a response variable that can supervise our analysis

  • > two types of unsupervised learning:
    1) principal components analysis
    2) clustering

=> in unsupervised learning, there is no way to check our work because we don’t know the true answer; the problem is unsupervised

29
Q

variables

A

can be characterized as either quantitative or qualitative (also known as categorical)

30
Q

regression problems

A

problems with a quantitative response

31
Q

classification problems

A

problems with a qualitative response

32
Q

degrees of freedom

A

a quantity that summarizes the flexibility of a curve
- a more restricted and hence smoother curve has fewer degrees of freedom than a wiggly curve

33
Q

variance of a statistical learning method

A

Variance refers to the amount by which fˆ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different fˆ.
But ideally the estimate for f should not vary too much between training sets. However, if a method has high variance then small changes in the training data can result in large changes in fˆ.
In general, more flexible statistical methods have higher variance

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease

34
Q

bias of a statistical learning method

A

bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

  • It is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f
  • Generally, more flexible methods result in less bias.

As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease

35
Q

training error rate vs test error rate

A

training error rate: the proportion of mistakes made when fˆ is applied to the training observations; test error rate: the average error rate on observations that were not used in training -> the training error rate is typically lower than, and can badly underestimate, the test error rate

36
Q

logistic function

A

p(X) = e^(β_0 + β_1X) / (1 + e^(β_0 + β_1X))

-> an S-shaped function that produces outputs between 0 and 1 for all values of X; used by logistic regression to model probabilities

37
Q

maximum likelihood

A

the method used to fit logistic regression: choose estimates β̂_0, β̂_1 that maximize the likelihood function, i.e. that make the observed data as probable as possible under the model

38
Q

student distribution vs normal distribution

A

the Student t-distribution is bell-shaped like the normal distribution but has heavier tails; it has a degrees-of-freedom parameter, and as the degrees of freedom grow it approaches the normal distribution

39
Q

t-test

A

a hypothesis test that compares a t-statistic to the Student t-distribution, e.g. to test whether a mean or a regression coefficient differs significantly from a hypothesized value (typically 0)

40
Q

RSS

A
  • > residual sum of squares / sum of squared residuals: RSS = Σ (y_i − ŷ_i)²
  • > the least squares regression seeks coefficients β_0, β_1, …, β_p s.t. the sum of squared residuals is as small as possible
41
Q

R-square

A

R² = 1 − RSS/TSS: the proportion of the variance in Y explained by the model; always lies between 0 and 1, independent of the scale of Y

42
Q

MSE

A

mean squared error: MSE = (1/n) Σ (y_i − fˆ(x_i))²; the most commonly used measure of quality of fit in the regression setting
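
A minimal sketch computing the MSE and the R² of the previous card from hypothetical observed values and predictions:

```python
import numpy as np

# Hypothetical observed responses and model predictions.
y = np.array([3.1, 4.0, 5.2, 6.1, 7.3])
y_hat = np.array([3.0, 4.2, 5.0, 6.3, 7.0])

mse = np.mean((y - y_hat) ** 2)        # mean squared error
rss = np.sum((y - y_hat) ** 2)         # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
r_squared = 1 - rss / tss              # proportion of variance explained

print(f"MSE = {mse:.4f}, R^2 = {r_squared:.4f}")
```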

43
Q

t_0 statistic

A

the observed value of the test statistic in a t-test, e.g. t_0 = (x̄ − μ_0) / (s/√n) for a one-sample test of H_0: μ = μ_0

44
Q

bias vs. variance

A

variance: how much fˆ would change if it were estimated on a different training set; bias: the error introduced by approximating a complicated real-life problem with a much simpler model -> flexible methods: high variance, low bias; inflexible methods: low variance, high bias

45
Q

stratification

A

dividing a population into homogeneous subgroups (strata) and sampling from each stratum so that every subgroup is properly represented; in cross-validation, stratified splits preserve the class proportions within each fold

46
Q

t-statistic vs t-test

A

the t-statistic is the number computed from the sample (e.g. t = β̂_1 / SE(β̂_1)); the t-test is the procedure that compares this statistic to the t-distribution to obtain a p-value

47
Q

confidence interval

A

a range of values that, over repeated sampling, contains the true parameter value with a given probability; e.g. an approximate 95% confidence interval for β_1 is β̂_1 ± 2 · SE(β̂_1)
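
A small sketch (assuming scipy) of a t-based 95% confidence interval for a mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sample = rng.normal(loc=10.0, scale=2.0, size=25)

n = sample.size
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean

# 95% confidence interval using the t-distribution with n - 1 df.
t_crit = stats.t.ppf(0.975, df=n - 1)
lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"95% CI for the mean: [{lower:.2f}, {upper:.2f}]")
```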

48
Q

covariance

A

a measure of the joint variability of two variables: cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]; positive when they tend to increase together, negative when one tends to decrease as the other increases

49
Q

Support Vector machines (SVM)

A
  • classifier
50
Q

PCA

A

principal component analysis: an unsupervised approach for deriving a low-dimensional set of features from a large set of variables (see the “Principal component analysis (PCA)” card)

51
Q

why normalize data?

A

so that variables measured on different scales contribute comparably; methods based on distances or on variance (e.g. k-nearest neighbours, PCA) are otherwise dominated by the variables with the largest scales

52
Q

rejection of outliers

A
  • Value incompatible with the variable domain
  • Value too large or too small according to the domain
  • Value larger or smaller than the mean +/- 3 times the standard deviation (see the sketch below)
  • Coding problem (a string instead of a real value)
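
A minimal sketch of the mean +/- 3 standard deviations rule from the third bullet, on synthetic data with two planted outliers:

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0, 1, 100), [8.0, -9.5]])  # two planted outliers

mu, sigma = x.mean(), x.std()

# Flag values outside mean +/- 3 standard deviations.
mask = np.abs(x - mu) > 3 * sigma
print("flagged as outliers:", x[mask])
print("kept:", (~mask).sum(), "of", x.size, "values")
```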
53
Q

10-CV

A

10-fold cross-validation: randomly divide the observations into 10 folds; repeatedly fit the model on 9 folds and compute the test error on the held-out fold; average the 10 resulting error estimates
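
A hand-rolled sketch of 10-fold CV for simple linear regression on synthetic data (numpy only; sklearn's KFold would do the splitting for you):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 100)
y = 3 + 2 * x + rng.normal(scale=1.0, size=100)

k = 10
indices = rng.permutation(x.size)      # shuffle, then split into 10 folds
folds = np.array_split(indices, k)

fold_mses = []
for held_out in folds:
    train = np.setdiff1d(indices, held_out)
    b1, b0 = np.polyfit(x[train], y[train], deg=1)  # fit on the other 9 folds
    y_pred = b0 + b1 * x[held_out]                  # predict the held-out fold
    fold_mses.append(np.mean((y[held_out] - y_pred) ** 2))

print(f"10-fold CV estimate of the test MSE: {np.mean(fold_mses):.3f}")
```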

54
Q

bias-variance trade-off

A

p 36

as we use more flexible methods, the variance will increase and the bias will decrease

55
Q

covariance

A

sample covariance: cov(x, y) = (1/(n − 1)) Σ (x_i − x̄)(y_i − ȳ); measures how two variables vary together

56
Q

normal distribution

A

A normal distribution is fully described with just two parameters: its mean (μ) and standard deviation (σ).

57
Q

outliers

A

e.g. when a value is larger or smaller than the mean +/- 3 times the standard deviation

58
Q

population regression line

A

the best linear approximation to the true relationship between X and Y as in
Y = β_0 + β_1X + e.

the least squares plane is an estimate for the true population regression plane

59
Q

RSE

A

residual standard error

RSE = sqrt(RSS / (n − 2)) in simple linear regression: an estimate of the standard deviation of the error term e, i.e. roughly the average amount by which the response deviates from the true regression line

60
Q

t-statistic

A

t = β̂_1 / SE(β̂_1): measures the number of standard errors that β̂_1 is away from 0; used to test H_0: β_1 = 0

61
Q

normalize data

A

rescale variables to a common scale, e.g. standardization z = (x − mean) / sd, or min-max scaling to [0, 1]
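
A minimal sketch of both rescalings on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(loc=170, scale=12, size=10)   # e.g. heights in cm

# z-score standardization: mean 0, standard deviation 1.
z = (x - x.mean()) / x.std()

# min-max scaling: values rescaled to [0, 1].
scaled = (x - x.min()) / (x.max() - x.min())

print("z-scores:", np.round(z, 2))
print("min-max :", np.round(scaled, 2))
```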

62
Q

variability

A

(German “Schwankungen”: fluctuations)
Variability refers to how “spread out” a group of scores is
-> refers to how spread out a distribution is

There are four frequently used measures of variability: the range, interquartile range, variance, and standard deviation
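
A minimal numpy sketch computing all four measures for a small data set:

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

data_range = x.max() - x.min()             # range
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                              # interquartile range
variance = x.var(ddof=1)                   # sample variance
sd = x.std(ddof=1)                         # sample standard deviation

print(f"range = {data_range}, IQR = {iqr}, variance = {variance:.2f}, sd = {sd:.2f}")
```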

63
Q

correlation matrix

A

a symmetric table whose (i, j) entry is the correlation between the i-th and j-th variables; the diagonal entries are all 1

64
Q

F-statistic

A

used in multiple regression to test H_0: β_1 = β_2 = … = β_p = 0 (no predictor is related to the response): F = ((TSS − RSS)/p) / (RSS/(n − p − 1)); values much larger than 1 provide evidence against H_0

65
Q

quantitative vs qualitative predictors

A

quantitative predictors take numerical values; qualitative predictors take values in one of K classes and enter a regression model via dummy variables

66
Q

outlier

A

An outlier is a point for which y_i is far from the value predicted by the model. Outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection.

67
Q

cross-validation

A

a family of approaches for estimating the test error by holding out part of the training data: fit the model on one part and evaluate it on the held-out part; variants include the validation set approach, leave-one-out CV, and k-fold CV

68
Q

validation set approach

A

randomly split the available observations into a training set and a validation (hold-out) set; fit the model on the training set and estimate the test error on the validation set -> simple, but the error estimate can be highly variable, and only a subset of the observations is used to fit the model

69
Q

maximal margin classifier

A
  • In order to construct a classifier based upon a separating hyperplane, we must have a reasonable way to decide which of the infinite possible separating hyperplanes to use
    -> A natural choice is the maximal margin hyperplane, which is the separating hyperplane that is farthest from the training observations. That is, we can compute the distance from each training observation to a given separating hyperplane; the smallest such distance is the minimal distance from the observations to the hyperplane, and is known as the margin. The maximal margin hyperplane is the separating hyperplane for which the margin is largest, i.e. the hyperplane that has the farthest minimum distance to the training observations. We can then classify a test observation based on which side of the maximal margin hyperplane it lies -> maximal margin classifier
  • > the maximal margin hyperplane depends directly on only a small subset of the observations!
  • > the generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier

-> the distance of an observation from the hyperplane can be seen as a measure of our confidence that the observation was correctly classified

70
Q

hyperplane

A

a hyperplane is a flat affine subspace of dimension p − 1 in a p-dimensional space

  • > it divides the p-dimensional space into two halves
  • > determine on which side a point lies by calculating the sign of the left-hand side of (9.2)
71
Q

concept of separating hyperplane

A

the goal is to develop a classifier based on the training data that will correctly classify the test observation using its feature measurements

  • In order to construct a classifier based upon a separating hyperplane, we must have a reasonable way to decide which of the infinite possible separating hyperplanes to use
    -> A natural choice is the maximal margin hyperplane, which is the separating hyperplane that is farthest from the training observations: compute the distance from each training observation to a given separating hyperplane; the smallest such distance is known as the margin. The maximal margin hyperplane is the separating hyperplane for which the margin is largest. We can then classify a test observation based on which side of the maximal margin hyperplane it lies -> maximal margin classifier
72
Q

support vectors

A
  • > Observations that lie directly on the margin, or on the wrong side of the margin for their class. These observations do affect the support vector classifier.
  • they are called “support” vectors since they are vectors in p-dimensional space and they support the maximal margin hyperplane, in the sense that if these points were moved slightly, the maximal margin hyperplane would move as well
  • > for SVM only the support vectors are relevant as for the other training observations alpha_i is zero
73
Q

soft margin

A

the margin used by the support vector classifier: it may be violated by some training observations (they can lie on the wrong side of the margin, or even of the hyperplane), in exchange for greater robustness and better classification of most of the remaining observations

74
Q

support vector classifier

A
  • > the generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier
  • > also called “soft margin classifier”
  • > The hyperplane is chosen to correctly separate most of the training observations into the two classes, but may misclassify a few observations
  • the fact that the support vector classifier’s decision rule is based only on a potentially small subset of the training observations (the support vectors) means that it is quite robust to the behavior of observations that are far away from the hyperplane
  • > is a linear classifier
  • > is a natural approach for classification in the two-class setting, if the boundary between two classes is linear
  • > what to do with a non-linear class boundary?

-> note that the support vector classifier is equivalent to a SVM using a polynomial kernel of degree d = 1

75
Q

decision rule

A

the rule by which a classifier assigns an observation to a class; e.g. for hyperplane-based classifiers, classify a test observation according to the sign of f(x) = β_0 + β_1x_1 + … + β_px_p (i.e. which side of the hyperplane it lies on)

76
Q

Type I error

A

a false positive conclusion
The rejection of the hypothesis H_0 when H_0 is in fact true.

To be precise: if you reject H_0 (say no to H_0) although in reality H_0 is true, then by definition this is a false positive (a Type I error), not a false negative.

Practically, the type I error can be interpreted as the probability of deciding that a significant effect is present (reject H_0) when it isn’t (H_0 true).

Why?
The sample tends to demonstrate a significant effect, but this is due to random variability: the sampling provides an extreme (but still possible) sample.

77
Q

Type II error

A

a false negative conclusion
The acceptance of the hypothesis H_0 when H_0 is in fact false (and so H_1 is true).

Namely: if you accept H_0 (say yes to H_0) although in reality H_0 is false, then by definition this is not a false positive but a false negative (a Type II error).

Practically, the type II error can be interpreted as the probability of not detecting a significant effect (accept H_0) when one exists (H_0 false).

Why?
The true effect (H_1) is too close to the H_0 effect.
The effect is too small to be detected.
The sample size is too small to detect the difference.

78
Q

LDA

A

linear discriminant analysis
-> approximates the Bayes classifier

  • why another method besides LR?
  • > if n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is more stable than the logistic regression model
  • > LDA is more popular for more than 2 response classes

-> the LDA classifier results from assuming that the observations within each class come from a normal distribution with a class-specific mean vector and a common variance σ^2, and plugging estimates for these parameters into the Bayes classifier

79
Q

LR

A

logistic regression

  • > logistic regression models the probability that Y belongs to a particular category
  • > uses logistic function
  • > to fit that model use a method called maximum likelihood
80
Q

classifiers

A
  • logistic regression
  • linear discriminant analysis
  • k-nearest neighbours
  • support vector classifier
81
Q

supervised vs. unsupervised learning

A

supervised: regression and classification
- > goal: predict Y using X_1, X_2,…,X_p given p features and n observations

unsupervised: only a set of features X_1, X_2,…,X_p available measured on n observations.
- > goal: not interested in prediction (don’t have Y), goal is to discover interesting things about the measurements on X_1, X_2,…,X_p

two types of unsupervised learning:

1) principal components analysis
2) clustering

82
Q

parametric vs non-parametric methods

A

parametric method: boils the problem of estimating f down to estimating a set of parameters (after assuming a functional form for f)

non-parametric method: makes no explicit assumption about the functional form of f; can fit a wider range of shapes, but needs far more observations to obtain an accurate estimate of f

83
Q

P(default = Yes|balance)

A

the probability of default given a value for balance

-> model it using a function that gives outputs between 0 and 1 for all values of X -> use the logistic function
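
A small sketch of modelling P(default = Yes | balance) with the logistic function; the coefficient values below are illustrative stand-ins (roughly the ballpark of ISLR’s Default example), not fitted here:

```python
import numpy as np

def logistic(x, beta_0, beta_1):
    """Logistic function: maps any real input to a probability in (0, 1)."""
    e = np.exp(beta_0 + beta_1 * x)
    return e / (1 + e)

# Illustrative coefficients; the S-shape, not the numbers, is the point.
beta_0, beta_1 = -10.65, 0.0055

for balance in (500, 1000, 2000, 3000):
    p = logistic(balance, beta_0, beta_1)
    print(f"balance = {balance}: P(default = Yes | balance) = {p:.4f}")
```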

84
Q

logistic function

A

p(X) = e^(β_0 + β_1X) / (1 + e^(β_0 + β_1X)); gives outputs between 0 and 1 for all values of X

85
Q

Bayes theorem

A

Pr(A|B) = Pr(B|A) · Pr(A) / Pr(B)

-> lets you invert conditional probabilities: the probability of a hypothesis given the observed data can be computed from the probability of the data given the hypothesis
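
A classic worked example of Bayes’ theorem (a test for a rare condition; all probabilities here are hypothetical):

```python
# Bayes' theorem: Pr(A|B) = Pr(B|A) * Pr(A) / Pr(B)
p_disease = 0.01             # Pr(A): prior probability of the condition
p_pos_given_disease = 0.95   # Pr(B|A): sensitivity of the test
p_pos_given_healthy = 0.05   # false positive rate

# Pr(B) by the law of total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: probability of the condition given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"Pr(disease | positive test) = {p_disease_given_pos:.3f}")  # ~0.161
```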

86
Q

Bayes classifer

A

cond. probability: Pr(Y=j | X = x_0)
- it assigns each observation to the most likely class given its predictor values, i.e. it assigns the test observation with predictor vector x_0 to the class j for which Pr(Y=j | X = x_0) is largest
- in a two-class problem, the Bayes classifier corresponds to predicting class 1 if Pr(Y=1 | X = x_0) > 0.5, and class 2 otherwise.

=> minimizes the probability of misclassification
-> the Bayes classifier is a useful benchmark in statistical classification

87
Q

Bayes error rate

A
the Bayes classifier produces the lowest possible test error rate, called the Bayes error rate.
Since the Bayes classifier will always choose the class j for which Pr(Y=j | X = x_0) is largest, the error rate at X = x_0 will be 
1 - max_j Pr(Y=j | X = x_0)
88
Q

linear vs non-linear classifier

A
  • a linear classifier has a linear decision boundary
  • a non-linear classifier has a non-linear decision boundary

89
Q

how to address non-linearity?

A

e.g. move beyond linearity using polynomial regression, step functions, splines, or generalized additive models (cf. chap. 7); in the SVM setting, enlarge the feature space using kernels

90
Q

support vector machine

A
  • extension of support vector classifier
  • > enlarge the feature space of support vector classifier to allow for non-linear boundaries between classes using KERNELS

=> When the support vector classifier is combined with a non-linear kernel such as (9.22) p 352, the resulting classifier is known as a support vector machine

  • mainly for binary classification i.e. classification in the two-class setting
  • > the concept of separating hyperplanes upon which SVMs are based does not lend itself naturally to more than two classes
  • but two approaches for the K-class case exist:
    1) one-versus-one
    2) one-versus-all
91
Q

kernel

A

A kernel is a function that quantifies the similarity of two observations.

  • linear kernel
  • polynomial kernel -> leads to a much more flexible decision boundary
  • radial kernel

=> When the support vector classifier is combined with a non-linear kernel such as (9.22) p 352, the resulting classifier is known as a support vector machine
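
A short scikit-learn sketch (assuming sklearn is installed) comparing kernels on data with a non-linear class boundary; the radial (RBF) kernel should clearly beat the linear one here:

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: no separating hyperplane exists in the
# original feature space.
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf"):
    clf = SVC(kernel=kernel, degree=3, C=1.0).fit(X_train, y_train)
    print(f"{kernel:6s} kernel: test accuracy = {clf.score(X_test, y_test):.2f}")
```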

92
Q

ROC curve

A

plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) as the classification threshold is varied; the area under the ROC curve (AUC) summarizes the classifier’s overall performance

p. 148

93
Q

deviance

A

a measure of fit for models estimated by maximum likelihood (−2 times the maximized log-likelihood, up to a constant); it plays the role that RSS plays for least squares: the smaller the deviance, the better the fit

94
Q

Principal component analysis (PCA)

A

PCA is a popular approach for deriving a low-dimensional set of features from a large set of variables

refers to the process by which principal components are computed, and the subsequent use of these components in understanding the data. PCA is an unsupervised approach, since it involves only a set of features X_1, X_2, …, X_p, and no associated response Y

  • dimensionality reduction
  • pre-processing step for linear regression
  • used to extract patterns or aid interpretation

Unsupervised approach:
no target signal
more difficult to evaluate
but can be useful to understand the data!
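
A short scikit-learn sketch of PCA on the built-in iris data, standardizing first because PCA is sensitive to the scale of the variables:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize, then project the 4-dimensional iris measurements
# onto the first 2 principal components.
X = load_iris().data
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)

print("explained variance ratio:", np.round(pca.explained_variance_ratio_, 3))
print("first two score vectors:\n", np.round(scores[:2], 3))
```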

95
Q

LDA vs QDA

A

Why does it matter whether or not we assume that the K classes share a common covariance matrix? In other words, why would one prefer LDA to QDA, or vice-versa? The answer lies in the bias-variance trade-off.

LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable.