Statistical Learning Methods Flashcards
mean
the arithmetic mean: the sum of the values divided by how many values there are
-> the population mean is denoted by the Greek letter μ (mu)
median
The median of a data set is the middle number: the value halfway into the set. To find the median, the data should first be arranged in order from least to greatest. If there is an even number of items in the data set, the median is found by taking the mean (average) of the two middle numbers.
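A minimal Python sketch of this rule (the function name is illustrative):

def median(values):
    s = sorted(values)                        # arrange from least to greatest
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]                      # odd count: the middle number
    return (s[n // 2 - 1] + s[n // 2]) / 2    # even count: mean of the two middle numbers

median([5, 1, 9, 3, 7])   # -> 5
median([5, 1, 9, 3])      # -> (3 + 5) / 2 = 4.0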
standard deviation
Deviation just means how far a value lies from the mean
- is just the square root of Variance
- while Variance gives you a rough idea of spread, the standard deviation is more concrete, giving you exact distances from the mean
- The standard deviation is an especially useful measure of variability when the distribution is normal or approximately normal because the proportion of the distribution within a given number of standard deviations from the mean can be calculated. For example, 68% of the distribution is within one standard deviation of the mean and approximately 95% of the distribution is within two standard deviations of the mean. Therefore, if you had a normal distribution with a mean of 50 and a standard deviation of 10, then 68% of the distribution would be between 50 - 10 = 40 and 50 + 10 = 60. Similarly, about 95% of the distribution would be between 50 - 2 x 10 = 30 and 50 + 2 x 10 = 70.
=> a measure of the spread of the values around the mean
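A quick simulated check of the 68% figure on the example above (mean 50, standard deviation 10):

import random, statistics

random.seed(0)
data = [random.gauss(50, 10) for _ in range(10_000)]   # simulated normal data
mu = statistics.mean(data)
sigma = statistics.pstdev(data)                        # standard deviation = sqrt of variance
share = sum(mu - sigma <= x <= mu + sigma for x in data) / len(data)
print(round(share, 2))                                 # ~0.68, matching the rule above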
Sample
a selection taken from a bigger population
normal distribution
It is a continuous, bell-shaped distribution (single peak)
which is symmetric about its mean and can take on values from negative infinity to positive infinity
Each normal curve is characterized by two parameters (is completely described by):
- the mean
- the standard deviation (its symbol is the greek letter sigma)
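For reference, the density of a normal curve with mean μ and standard deviation σ is the standard formula:
f(x) = (1 / (σ√(2π))) · exp(-(x - μ)² / (2σ²))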
variance
measures how far a data set is spread out.
The technical definition is “the average of the squared differences from the mean,” but all it really does is give you a very general idea of the spread of your data.
A value of zero means that there is no variability: All the numbers in the data set are the same.
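Written out, for values x_1, …, x_n with mean x̄ (this is the population form; the sample variance divides by n - 1 instead):
Var = (1/n) · Σ (x_i - x̄)²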
population
A population is the whole: every member of a group. A sample is a part of that population.
A population is the opposite of a sample, which is a fraction or percentage of a group. Sometimes it’s possible to survey every member of a group; a classic example is the U.S. Census, where it’s the law that you have to respond. If you do manage to survey everyone, the result is itself called a census: the U.S. Census is just one example.
In most cases, it’s impractical to survey everyone. In addition, sometimes people either don’t want to respond or forget to respond, leading to incomplete censuses. Incomplete censuses become samples by definition.
p-value
probability value: the probability of obtaining the observed sample result (or one more extreme) if the null hypothesis is assumed to be true -> if very small, then H_0 should most likely be rejected
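A quick illustration with SciPy (the sample numbers are made up): a one-sample t-test of H_0: μ = 50.

from scipy import stats

sample = [52.1, 49.8, 53.5, 51.9, 50.3, 54.0]
t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print(p_value)   # a small p-value means this sample would be unlikely under H_0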
test statistic
a value computed from the sample data that is used to decide whether to reject H_0; it is compared against its sampling distribution under the null hypothesis (e.g. a t-statistic) to obtain the p-value
Gaussian distribution
= Normal distribution = bell curve
simple linear regression
predicts a quantitative response Y on the basis of a single predictor variable X, assuming an approximately linear relationship: Y ≈ β_0 + β_1 X
least squares method
estimates the coefficients β_0, β_1, …, β_p such that the RSS (residual sum of squares) is minimized, i.e. has the smallest possible value
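A minimal sketch of the closed-form least squares solution for simple linear regression (the data are made up):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
x_bar, y_bar = x.mean(), y.mean()
# beta_1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
beta1 = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()
beta0 = y_bar - beta1 * x_bar                    # the line passes through (x_bar, y_bar)
rss = ((y - (beta0 + beta1 * x)) ** 2).sum()     # minimized by these estimates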
intercept
β_0: the expected value of the response when the predictor(s) equal zero; graphically, the point where the regression line crosses the y-axis
Linear models
Linear models describe a continuous response variable as a function of one or more predictor variables. They can help you understand and predict the behavior of complex systems or analyze experimental, financial, and biological data. Linear regression is a statistical method used to create a linear model.
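For comparison with the closed-form sketch above, the same kind of fit via scikit-learn (assuming it is installed):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # predictors must be a 2-D array
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)                # beta_0 and beta_1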
residuals
e_i = y_i - ŷ_i: the difference between an observed response value and the value predicted by the model; the RSS is the sum of the squared residuals
qualitative response
a response that takes on values in one of K different classes, or categories (e.g. brand purchased, cancer diagnosis, yes/no)
quantitative response
a response that takes on numerical values (e.g. age, height, income, price)
regression
estimating the relationship between predictors and a quantitative response; problems with a quantitative response are referred to as regression problems
classification
assigning an observation to one of a number of classes, i.e. predicting a qualitative (categorical) response
prediction
using f̂ to predict the response Y for new observations X; in this setting f̂ is often treated as a black box, because accurate prediction matters more than the exact form of f̂
inference
- when inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods
- In some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest
- if we seek to develop an algorithm to predict the price of a stock, our sole requirement for the algorithm is that it predict accurately—interpretability is not a concern
prediction vs inference
prediction: we mainly want Ŷ to be accurate, so f̂ can be treated as a black box; inference: we want to understand how Y changes as the predictors change, so the form and interpretability of f̂ matter
-> in general there is a trade-off: more flexible methods can fit more shapes for f but are harder to interpret
parametric
reduces the problem of estimating function f down to one of estimating a set of parameters
overfitting (the data)
fitting too flexible a model can lead to overfitting the data, i.e. the model follows the errors, or noise, too closely
- As model flexibility increases, training MSE will decrease, but the test MSE may not. When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data.
- This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function f. When we overfit the training data, the test MSE will be very large because the supposed patterns that the method found in the training data simply don’t exist in the test data.
- Note that regardless of whether or not overfitting has occurred, we almost always expect the training MSE to be smaller than the test MSE, because most statistical learning methods either directly or indirectly seek to minimize the training MSE. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.
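A minimal sketch of this effect using polynomial fits of increasing flexibility (all data are simulated):

import numpy as np

rng = np.random.default_rng(0)
def simulate(n):
    x = np.sort(rng.uniform(0, 1, n))
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)   # true f plus noise

x_train, y_train = simulate(20)
x_test, y_test = simulate(200)
for degree in (1, 4, 15):                         # flexibility grows with the degree
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(degree, train_mse, test_mse)            # high degree: tiny training MSE, large test MSE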
parametric methods
a two-step approach:
1) make an assumption about the functional form of f (e.g. that it is linear in the predictors)
2) fit or train the model, i.e. estimate the parameters (e.g. β_0, β_1, …, β_p by least squares)
-> advantage: estimating f reduces to estimating a set of parameters; disadvantage: the chosen form will usually not match the true unknown form of f
non-parametric methods
- do not make explicit assumptions about the functional form of f
- by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f
- a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f
- we will often obtain more accurate predictions using a less flexible method; this phenomenon, which may seem counterintuitive at first glance, has to do with the potential for overfitting in highly flexible methods
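K-nearest-neighbors regression is one such non-parametric method; a minimal one-dimensional sketch (K and the data are arbitrary):

import numpy as np

def knn_predict(x0, x_train, y_train, k=3):
    nearest = np.argsort(np.abs(x_train - x0))[:k]   # indices of the k closest training points
    return y_train[nearest].mean()                   # average their responses

x_train = np.array([0.1, 0.4, 0.5, 0.8, 0.9])
y_train = np.array([1.0, 2.2, 2.4, 3.9, 4.1])
print(knn_predict(0.45, x_train, y_train))           # no functional form for f assumed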
supervised learning
We wish to fit a model that relates the response to the predictors, with the aim of accurately predicting the response for future observations (prediction) or better understanding the relationship between the response and the predictors (inference).
unsupervised learning
for every observation i = 1, …, n, we observe a vector of measurements x_i but no associated response y_i. It is not possible to fit a linear regression model, since there is no response variable to predict. In this setting, we are in some sense working blind; the situation is referred to as unsupervised because we lack a response variable that can supervise our analysis
- > two types of unsupervised learning:
1) principal components analysis
2) clustering
=> in unsupervised learning, there is no way to check our work because we don’t know the true answer—the problem is unsupervised
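A minimal clustering sketch with scikit-learn's KMeans (two simulated, unlabeled blobs):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)),                  # blob around (0, 0)
               rng.normal(5, 1, (50, 2))])                 # blob around (5, 5)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)    # no y to supervise the fit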
variables
can be characterized as either quantitative or qualitative (also known as categorical)
regression problems
problems with a quantitative response
classification problems
problems with a qualitative response
degrees of freedom
a quantity that summarizes the flexibility of a curve
- a more restricted and hence smoother curve has fewer degrees of freedom than a wiggly curve
variance of a statistical learning method
Variance refers to the amount by which f̂ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training data sets will result in a different f̂.
But ideally the estimate for f should not vary too much between training sets. However, if a method has high variance, then small changes in the training data can result in large changes in f̂.
In general, more flexible statistical methods have higher variance
As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease
bias of a statistical learning method
bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.
- e.g. linear regression assumes a linear relationship between the response and the predictors; it is unlikely that any real-life problem truly has such a simple linear relationship, and so performing linear regression will undoubtedly result in some bias in the estimate of f
- Generally, more flexible methods result in less bias.
As a general rule, as we use more flexible methods, the variance will increase and the bias will decrease
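These two sources of error meet in the standard bias-variance decomposition of the expected test MSE at a point x_0:
E[(y_0 - f̂(x_0))²] = Var(f̂(x_0)) + [Bias(f̂(x_0))]² + Var(ε)
Since the variance and the squared bias are both non-negative, the expected test MSE can never fall below Var(ε), the irreducible error.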
training error rate vs test error rate
training error rate: the average error f̂ makes on the observations used to train it; test error rate: the average error on previously unseen test observations
-> we want the method with the smallest test error; the training error can badly underestimate it, because flexible methods can overfit the training data
logistic function
p(X) = e^(β_0 + β_1 X) / (1 + e^(β_0 + β_1 X))
-> always produces an S-shaped curve with outputs between 0 and 1, so it can model the probability that the response belongs to a particular class (as in logistic regression)
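A tiny Python version (the coefficients are arbitrary placeholders):

import math

def logistic(x, beta0=0.0, beta1=1.0):
    # maps any real input into the interval (0, 1)
    return math.exp(beta0 + beta1 * x) / (1 + math.exp(beta0 + beta1 * x))

print(logistic(-5), logistic(0), logistic(5))   # ~0.007, 0.5, ~0.993: an S-shaped curve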
maximum likelihood
a general approach to estimating model parameters: choose the parameter values under which the observed data would have been most probable, i.e. the values that maximize the likelihood function
-> e.g. the coefficients of logistic regression are estimated by maximum likelihood
student distribution vs normal distribution
the Student t-distribution is bell-shaped and symmetric about zero like the normal, but has heavier tails; its shape depends on its degrees of freedom, and as they grow it approaches the standard normal
-> used in place of the normal when the population standard deviation is unknown and must be estimated from the sample (e.g. small-sample t-tests)