Statistical inference Flashcards
What is the difference between drawing values from a distribution with or without replacement?
With replacement means that after we have drawn a value we put it back, so the draws do not change the distribution of values. Sampling with replacement is appropriate when we want to simulate a large or effectively infinite population, where one draw barely changes the probabilities of the next draw.
Without replacement means that we do not put the value back, so the distribution changes with each draw. This should be used to simulate draws from a smaller, finite population.
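A minimal numpy sketch (the urn values are made up) showing both modes via the replace argument of rng.choice:

```python
import numpy as np

rng = np.random.default_rng(0)
urn = np.array([1, 2, 3, 4, 5])           # a small finite "population"

# With replacement: each draw leaves the distribution unchanged,
# so the same value can appear several times.
with_repl = rng.choice(urn, size=3, replace=True)

# Without replacement: each draw removes a value, so all draws are distinct.
without_repl = rng.choice(urn, size=3, replace=False)

print(with_repl, without_repl)
```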
What is a probability density function?
When we make stochastic observations with real values, the distribution of those values is described by a probability density function (pdf). If we were to draw samples from an urn 1000 times and create a histogram of the drawn values, the shape of the histogram would approximate the pdf of that variable. If the variable is normally distributed, the peak of the histogram lies at the mean and the spread is described by the standard deviation.
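A small sketch, assuming a normally distributed variable with made-up mean and standard deviation, comparing normalised histogram heights to the theoretical pdf from scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma = 5.0, 2.0                        # assumed mean and standard deviation
samples = rng.normal(mu, sigma, size=1000)  # 1000 stochastic observations

# Normalised histogram heights approximate the pdf evaluated at the bin centres.
heights, edges = np.histogram(samples, bins=30, density=True)
centres = (edges[:-1] + edges[1:]) / 2
pdf_values = stats.norm.pdf(centres, loc=mu, scale=sigma)

print(np.round(heights[:5], 3))
print(np.round(pdf_values[:5], 3))
```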
How can we get the probability of drawing a value x from a mixture of multiple different distributions?
Having different distributions means that we have a mixture of pdfs with different means and variances.
When we have different object types from different distributions, the distribution we end up drawing from is the individual pdfs multiplied by their weights and summed together (the weights sum to 1):
p(x) = w1 * p1(x) + w2 * p2(x) + w3 * p3(x)
- Draw one sample from a discrete distribution over the classes, with the weights as class probabilities.
- If the outcome is class i, then draw a sample from pi (a small sketch follows below).
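A minimal sketch of this two-step procedure, assuming an example mixture of three normal components with made-up weights, means and standard deviations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example mixture: three normal components with weights w1, w2, w3.
weights = np.array([0.5, 0.3, 0.2])   # must sum to 1
means   = np.array([0.0, 5.0, 10.0])
sds     = np.array([1.0, 0.5, 2.0])

def sample_mixture(n):
    # Step 1: draw a component index i with probability w_i.
    comp = rng.choice(len(weights), size=n, p=weights)
    # Step 2: draw from the chosen component's distribution p_i.
    return rng.normal(means[comp], sds[comp])

x = sample_mixture(1000)
print(x[:5])
```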
What parameters do you need to define to draw from a univariate normal distribution vs a multivariate normal distribution?
What do the results look like?
Univariate:
We need to define the mean and the standard deviation. The result is one value drawn from the distribution.
Multivariate:
A vector of means for the different variables and a covariance matrix that describes how the individual variables vary and how they vary together.
The result is a vector with as many values as the number of variables (dimensions) you have.
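A short numpy sketch (the means, standard deviation and covariance values are made up) showing both cases:

```python
import numpy as np

rng = np.random.default_rng(0)

# Univariate: one mean and one standard deviation -> one value per draw.
x = rng.normal(loc=2.0, scale=0.5)

# Multivariate: a mean vector and a covariance matrix -> a vector per draw.
mean = np.array([0.0, 3.0])
cov  = np.array([[1.0, 0.8],
                 [0.8, 2.0]])   # off-diagonals describe how the variables co-vary
xy = rng.multivariate_normal(mean, cov)

print(x)    # scalar
print(xy)   # array of length 2
```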
Why is it important to incorporate noise when we simulate stochastic observations?
When we simulate stochastic observations we should take into account that real values usually contain some noise that affects the values we draw. The noise is important to model because it can make a linear relationship appear non-linear if we have too few samples. The impact of the noise gets smaller as the number of samples increases.
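A small sketch, assuming a made-up true relationship y = 1.5x plus normal noise, showing that the linear pattern is clearer with more samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, noise_sd=2.0):
    # Assumed true relationship: y = 1.5 * x + noise.
    x = rng.uniform(0, 10, size=n)
    y = 1.5 * x + rng.normal(0, noise_sd, size=n)
    # Correlation as a rough check of how "linear" the data looks.
    return np.corrcoef(x, y)[0, 1]

print(simulate(10))    # with few samples the noise can hide the linear trend
print(simulate(1000))  # with many samples the correlation is close to its true value
```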
What is a Hidden Markov model?
Markov models are used to simulate stochastic processes.
They are memoryless, probability-based models: the next state depends only on the present state. In a hidden Markov model the states themselves are not observed directly; the model simulates the process using transition probabilities between hidden states and emission probabilities that describe which observable values each state produces.
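A toy simulation sketch, with made-up transition and emission probabilities for a 2-state, 2-symbol HMM:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy HMM: 2 hidden states, 2 possible emission symbols.
transition = np.array([[0.9, 0.1],    # P(next state | current state)
                       [0.2, 0.8]])
emission   = np.array([[0.7, 0.3],    # P(observation | hidden state)
                       [0.1, 0.9]])

state = 0
observations = []
for _ in range(10):
    # Emit an observable symbol from the current hidden state ...
    observations.append(rng.choice(2, p=emission[state]))
    # ... then move to the next hidden state (memoryless: depends only on the present state).
    state = rng.choice(2, p=transition[state])

print(observations)
```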
What is a conventional confidence interval?
Conventional confidence interval: if you were to sample many times and construct a 95% confidence interval each time, about 95% of those intervals would contain the true mean.
This is because the conventional confidence interval is frequentist. There is still a 5% chance that the true mean falls among the "extreme values" not covered by the interval, so roughly 1 in 20 constructed confidence intervals will miss the true mean.
Guaranteed coverage of 1 − alpha.
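A minimal sketch, assuming normally distributed example data, computing a conventional 95% confidence interval for the mean with scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=3, size=50)   # assumed example data

mean = sample.mean()
sem  = stats.sem(sample)                        # standard error of the mean

# 95% confidence interval based on the t distribution.
low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(low, high)
```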
What is a bootstrap 95% percentile interval? What is the benefit and setbacks of using these intervals?
These are created by resampling. If you create many bootstrap datasets, calculate the mean of each, and form a 95% percentile interval of the bootstrap means, that interval is not designed to have guaranteed coverage of the true mean. It instead tells you how uncertain you should be about the mean estimated from your sample.
The 2.5 and 97.5 percentiles of the bootstrap means are the lower and upper bounds of the percentile interval. These percentile intervals capture the middle 95% of the bootstrap distribution, leaving 2.5% in the lower tail and 2.5% in the upper tail. If the interval is very wide, we should be more uncertain about our estimated mean.
The benefit is that with bootstrap intervals you do not need to worry about the distribution of the variable, as you have to with conventional confidence intervals.
How do we get the 95% bootstrap percentile interval?
We draw bootstrap datasets from the original dataset, calculate the metric we are interested in for each bootstrap set, and save the results in a list (this could be correlations, means, fractions, standard deviations etc.). Then we calculate the 2.5 and 97.5 percentiles of the bootstrap values; those values are the lower and upper bounds of the percentile interval [a, b].
This interval tells you how certain you can be of your estimated mean, because it shows how much the estimate can vary just by random chance.
The larger the original dataset (and hence each bootstrap dataset) is, the more data we have to look at, which leads to a narrower interval and more certainty, because the sample then represents the population better.
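A minimal bootstrap sketch with made-up (non-normal) example data:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=100)   # assumed example data (not normal)

n_boot = 10_000
boot_means = np.empty(n_boot)
for i in range(n_boot):
    # Resample the original dataset with replacement, same size as the original.
    resample = rng.choice(data, size=len(data), replace=True)
    boot_means[i] = resample.mean()

# The 2.5 and 97.5 percentiles are the bounds of the 95% percentile interval.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(low, high)
```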
Why is it more efficient to create a bootstrap interval after you have collected your sample population?
We create bootstrap sets instead of repeating the data collection e.g. 1000 times, which would be expensive. Bootstrapping is also beneficial because we do not have to worry about the distribution of the variable we are looking at; a conventional confidence interval can be misleading if the variable is not normally distributed.
What is hypothesis testing?
In hypothesis testing you test how likely it is to get a value as extreme as the observed one under the null hypothesis.
How is hypothesis testing done?
You first calculate your observed value.
Then you create bootstrap sets from the null distribution and calculate the metric for each bootstrap set. You then calculate the p-value by counting how many times the bootstrap metrics were as extreme as or more extreme than the observed value and dividing that count by the number of bootstrap sets. If very few bootstrap values were that extreme (usually under 5%), the null hypothesis is unlikely and we should reject it.
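A sketch of one common variant of this procedure, where the null distribution is built by shifting the data so that an assumed null hypothesis (mean = 10, made up here) is true and then resampling:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.4, scale=2.0, size=60)   # assumed example data
null_mean = 10.0                                  # assumed null hypothesis: mean = 10

observed = data.mean() - null_mean                # observed deviation from the null

# Build a null distribution by shifting the data so the null is true,
# then resampling with replacement.
null_data = data - data.mean() + null_mean
n_boot = 10_000
boot_stats = np.array([
    rng.choice(null_data, size=len(null_data), replace=True).mean() - null_mean
    for _ in range(n_boot)
])

# Two-sided p-value: fraction of bootstrap statistics at least as extreme as the observed one.
p_value = np.mean(np.abs(boot_stats) >= abs(observed))
print(p_value)
```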
How is hypothesis testing with percentile intervals done?
There we check whether 0 (the null value) lies inside the 95% percentile interval; if it does, the null is plausible and we cannot reject it. This is usually only done when the null hypothesis states that a difference between two groups is 0.
What is the interquartile range? What is the benefit of using this over standard deviation?
IQR = Q3 - Q1. It ignores the lowest 25% and the highest 25% of the values and measures the spread of the remaining middle 50%. This is a more robust measure of spread than the standard deviation, because the standard deviation uses the entire dataset and is therefore more sensitive to outliers.
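A small sketch with made-up data containing one extreme outlier:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(0, 1, size=100), 50.0)  # assumed data with one extreme outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print(iqr)          # barely affected by the outlier
print(data.std())   # inflated by the outlier
```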
What is regression?
Regression: the problem of using a set of training examples to build a prediction model f() that produces predictions f(x), where the observed response values yn are real valued (continuous).
What is mean absolute error?
Mean absolute error = Sum of all absolute errors between true response and predicted response divided by the number of samples. So the average of the absolute errors.
What is mean squared error?
Mean squared error = Sum of each error squared / number of samples.
What is root mean squared error?
Root mean squared error = Square root of MSE.
What is the R^2 value?
How much of the variation in the response is explained by the features.
Calculated as R^2 = 1 - (sum of squared errors / sum of squared differences between each yi and the sample mean of y).
What is mean absolute percentage error?
The average of the absolute errors divided by the corresponding absolute true values, often expressed as a percentage.
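A short numpy sketch, with made-up true and predicted responses, computing the metrics from the last few cards:

```python
import numpy as np

# Assumed example vectors of true and predicted responses.
y_true = np.array([3.0, 5.0, 8.0, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

errors = y_true - y_pred
mae  = np.mean(np.abs(errors))                 # mean absolute error
mse  = np.mean(errors ** 2)                    # mean squared error
rmse = np.sqrt(mse)                            # root mean squared error
r2   = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # R^2
mape = np.mean(np.abs(errors / y_true))        # mean absolute percentage error (as a fraction)

print(mae, mse, rmse, r2, mape)
```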
What is k-fold cross validation?
K-fold cross validation is a resampling method where a model is repeatedly trained and tested on the training data, with training and testing done on different subsets of the data.
Explain how k-fold cross validation is done?
The training dataset is divided into k subsets; the model is trained on k-1 subsets and tested on the one subset that was excluded from training. This is repeated k times so that the model is trained and tested on all subsets of the training dataset, and so that testing is always done on data the model was not trained on.
We then take the average of the k performance values to get a good estimate of how the model will perform on unseen data of roughly the same size as the training set.
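A minimal sketch using scikit-learn's cross_val_score on made-up data (the ridge model and R^2 scoring are just example choices):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                  # assumed feature matrix
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)

# 5-fold cross validation: train on 4 folds, test on the held-out fold, repeat 5 times.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(scores)          # one performance value per fold
print(scores.mean())   # averaged estimate of performance on unseen data
```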
In k-fold cross validation, what are the pros and cons of using large vs small k?
If we use fewer folds, training time and computational cost are reduced, but each training set covers a smaller fraction of the dataset, so the performance estimate tends to be more pessimistic than what a model trained on all the data would achieve.
With larger K each model is trained on more data and the estimate is closer to the performance of a model trained on the full training set, but the training sets overlap more (so the fold results are more correlated) and the computational cost grows with K.
What is a hyper parameter? What is the hyperparameter of ridge regression and k-nearest neighbor regression?
A hyperparameter is a setting or configuration for a machine learning model that is set before the training process begins. Unlike the model parameters, which the algorithm learns from the training data, hyperparameters are external factors that influence how the learning process takes place.
The hyperparameter of ridge regression is the alpha penalty value, and for k-nearest neighbor regression it is the number of nearest neighbors we choose to look at (k).
What is ridge regression? What is the difference from OLS?
Ridge regression is similar to ordinary least squares fitting in the sense that we are trying to find the coefficients of x so that the residuals are minimized. The difference is that ridge regression also has an alpha value that acts as a penalty for choosing large coefficients in the produced model.
If alpha = 0, it is the same as ordinary least squares fitting, because there is no penalty and the model is free to distribute the coefficients so that one feature becomes much more important than another.
Why is ridge regression better if you have correlation between some of the features in your dataset?
If alpha is large, the model is encouraged to shrink the coefficients and distribute them more evenly across the features. This means that if we have multicollinearity, we at least reduce the risk of the correlated features appearing far more important for the response than the other features. The alpha penalty also helps reduce the risk of overfitting the model to the training data, which is a risk with large coefficients.
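A small scikit-learn sketch with two made-up, almost identical features, comparing OLS and ridge coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(0, 0.01, size=100)       # nearly identical feature -> multicollinearity
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(0, 0.1, size=100)

ols   = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)            # alpha penalises large coefficients

print(ols.coef_)    # coefficients can be large and unstable for the correlated features
print(ridge.coef_)  # coefficients are shrunk and spread more evenly
```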
Explain k-nearest neighbor regression.
It is a regression model where we predict the response value of a new observation by looking at the response values of previously observed training points.
When we get a new observation, we find its k closest training observations and predict the average of their y-values. If k=1 we use the y-value of the single closest observation; if k=2 we use the average of the y-values of the 2 closest observations.
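A minimal scikit-learn sketch with made-up training data and k = 3:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(50, 1))
y_train = np.sin(X_train).ravel() + rng.normal(0, 0.1, size=50)

# Predict a new point's response as the average y-value of its k nearest training points.
knn = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)
print(knn.predict([[4.2]]))
```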
In k-nearest neighbor regression, what happens if we set k to be equal to the number of observations we have?
If K is set equal to the number of observations, we will just take the same average of all observed y-values every time as the predicted response, regardless of the new observed x.