statistical_inference Flashcards

1
Q

What is the difference between drawing values from a distribution with or without replacement?

A

With replacement means that after we have drawn a value we put it back, so the draws do not change the distribution of values. This is appropriate when we want to simulate a big or infinite population, where one draw barely changes the probabilities of the next draw. Without replacement means that we do not put the value back, so the distribution changes with each draw. This should be used to simulate draws from a smaller finite population.
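A minimal sketch of the two ways of drawing, using Python's standard `random` module. The urn contents and the seed are made up for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable
urn = ["red", "red", "blue", "green", "green"]  # hypothetical small population

# With replacement: every draw sees the full urn, so we can draw more items than it holds
with_replacement = rng.choices(urn, k=10)

# Without replacement: each drawn item is gone, so we can draw at most len(urn) items
without_replacement = rng.sample(urn, k=len(urn))

print(with_replacement)
print(without_replacement)
```

Drawing the whole urn without replacement just returns its contents in random order, which shows how the distribution is used up draw by draw.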

2
Q

What is a probability density function?

A

When we make stochastic observations with real (continuous) values, the distribution of those values is described by a probability density function (pdf). If we were to draw samples from an urn 1000 times and create a histogram of the drawn samples, the shape of the histogram would follow the pdf of that variable. If the variable is normally distributed, the histogram peaks at the mean and its spread is determined by the standard deviation.

3
Q

How can we get the probability of drawing a value x from a mixture of multiple different distributions?

A

Having different distributions means that we have a mixture of pdfs with different means and variances. When we have different object types from different distributions, the distribution we end up drawing from is the sum of the pdfs, each multiplied by its weight (the weights sum to 1):
p(x) = w1 * p1(x) + w2 * p2(x) + w3 * p3(x)
To draw a sample from the mixture:
  1. Draw one sample from a discrete multi-class distribution whose class probabilities are the weights.
  2. If the outcome is class i, then draw a sample from pi.
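The two-step draw from a mixture can be sketched as follows. The weights and the three normal components are hypothetical values chosen only for illustration.

```python
import random

rng = random.Random(1)

# Hypothetical mixture: weights and (mean, sd) of three normal components
weights = [0.5, 0.3, 0.2]
components = [(0.0, 1.0), (5.0, 0.5), (10.0, 2.0)]

def draw_from_mixture():
    # Step 1: pick a component index with probability equal to its weight
    i = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Step 2: draw a value from the chosen component
    mean, sd = components[i]
    return i, rng.gauss(mean, sd)

draws = [draw_from_mixture() for _ in range(10_000)]
share_first = sum(1 for i, _ in draws if i == 0) / len(draws)
print(round(share_first, 2))  # should be close to the first weight, 0.5
```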

4
Q

What parameters do you need to define to draw from a univariate normal distribution vs a multivariate normal distribution? What do the results look like?

A

Univariate: We need to define the mean and the standard deviation. The result is one value from the distribution.
Multivariate: We need a vector of means (one per variable) and a covariance matrix that describes how the individual variables vary and how they vary together. The result is a vector with as many values as there are variables.
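A small sketch of both kinds of draw with the standard library only. The numbers are hypothetical, and the bivariate draw writes out the Cholesky factor of a 2x2 covariance matrix by hand; numerical libraries normally provide this step for you.

```python
import math
import random

rng = random.Random(7)

# Univariate: a mean and a standard deviation give one value per draw
value = rng.gauss(10.0, 2.0)

# Bivariate example (hypothetical numbers): mean vector and covariance matrix
mean = [0.0, 5.0]
cov = [[1.0, 0.8],
       [0.8, 2.0]]

def draw_bivariate(mean, cov):
    # Cholesky factor L of the 2x2 covariance matrix, written out by hand
    l11 = math.sqrt(cov[0][0])
    l21 = cov[1][0] / l11
    l22 = math.sqrt(cov[1][1] - l21 ** 2)
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    # Correlated pair = mean + L applied to two independent standard normals
    return [mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2]

vector = draw_bivariate(mean, cov)
print(value)   # a single number
print(vector)  # one value per variable
```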

5
Q

Why is it important to incorporate noise when we simulate stochastic observations?

A

When we simulate stochastic observations we should take into account that real values usually have some noise that affects the values we draw. The noise is important to know about because it can make a linear relationship appear not to be linear if we have too few samples. The impact of the noise gets smaller with a higher number of samples.

6
Q

What is a Hidden Markov model?

A

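Markov models are used to simulate stochastic processes. They are memoryless models: the next state depends only on the present state, not on the earlier history. In a hidden Markov model the states themselves are not observed directly. The process moves between hidden states according to state (transition) probabilities, and each state produces an observable value according to its emission probabilities.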
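As an illustration, here is a sketch that simulates a small hidden Markov model. The two weather states, the two observations and all probabilities are invented for the example.

```python
import random

rng = random.Random(3)

# Hypothetical two-state model: hidden weather states and observable activities
transition = {"sunny": {"sunny": 0.8, "rainy": 0.2},
              "rainy": {"sunny": 0.4, "rainy": 0.6}}
emission = {"sunny": {"walk": 0.7, "read": 0.3},
            "rainy": {"walk": 0.1, "read": 0.9}}

def pick(dist):
    # Draw one key of `dist` with probability equal to its value
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]

def simulate(n_steps, start="sunny"):
    states, observations = [], []
    state = start
    for _ in range(n_steps):
        states.append(state)
        observations.append(pick(emission[state]))  # emit from the current state
        state = pick(transition[state])             # next state depends only on the current one
    return states, observations

states, observations = simulate(10)
print(states)        # hidden in a real application
print(observations)  # what we would actually get to see
```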

7
Q

What is a conventional confidence interval?

A

Conventional confidence interval: If you were to sample many times and construct a 95% confidence interval each time, about 95% of those intervals would contain the true mean. This is because the conventional confidence interval is a frequentist construction: the 95% describes the procedure, not one particular interval. About 1 in 20 of the constructed intervals will miss the true mean, because the sample behind them happened to contain unusually extreme values.
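The repeated-sampling interpretation can be checked with a small simulation. This is a sketch under stated assumptions: the population (normal, mean 10, sd 2), the sample size and the normal-approximation interval (mean ± 1.96 standard errors) are all choices made for the example.

```python
import random
import statistics

rng = random.Random(2)
TRUE_MEAN, TRUE_SD, N = 10.0, 2.0, 30  # hypothetical population and sample size
Z = 1.96  # normal quantile for a 95% interval

def interval_covers_true_mean():
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.fmean(sample)
    half_width = Z * statistics.stdev(sample) / N ** 0.5
    return mean - half_width <= TRUE_MEAN <= mean + half_width

repeats = 5_000
coverage = sum(interval_covers_true_mean() for _ in range(repeats)) / repeats
print(coverage)  # close to 0.95, slightly lower because of the normal approximation
```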

8
Q

What is a bootstrap 95% percentile interval? What are the benefits and drawbacks of using these intervals?

A

These are created by resampling. If you create many bootstrap datasets, calculate the mean of each, and form a 95% percentile interval of the bootstrap means, that interval says nothing about whether the true mean is inside it, since it is not designed to have that coverage. It instead tells you how uncertain you are about the mean estimated from your sample. The 2.5 and 97.5 percentiles of the bootstrap means are the lower and upper bounds of the percentile interval. The interval therefore captures the middle 95% of the bootstrap distribution, leaving 2.5% in the lower tail and 2.5% in the upper tail. If the interval is very wide, we should be more uncertain about our estimated mean.
The benefit is that with bootstrap intervals you do not need to worry about the distribution of the variable, as you have to with conventional confidence intervals.

9
Q

How do we get the 95% bootstrap percentile interval?

A

We draw bootstrap datasets from the original dataset (with replacement), calculate the metric we are interested in for each bootstrap set, and save the values in a list (the metric could be a correlation, mean, fraction, standard deviation, etc.). Then we calculate the 2.5 and 97.5 percentiles of the bootstrap values; those are the lower and upper bounds of the percentile interval [a, b].
This interval tells you how certain you are of your estimate, because it shows how much the estimate can vary just by random chance.

The bigger the bootstrap set is the more data we have to look at and this will lead to a smaller interval that gives us more certainty because then the bootstrap represents the original dataset better.
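A minimal sketch of the procedure with the standard library. The dataset is simulated, and the function name `bootstrap_percentile_interval` and the number of bootstrap sets are assumptions made for the example.

```python
import random
import statistics

rng = random.Random(4)
data = [rng.gauss(50, 10) for _ in range(40)]  # hypothetical original dataset

def bootstrap_percentile_interval(values, metric, n_boot=2000):
    boot_metrics = []
    for _ in range(n_boot):
        # One bootstrap set: same size as the data, drawn with replacement
        resample = rng.choices(values, k=len(values))
        boot_metrics.append(metric(resample))
    # n=40 gives cut points every 2.5%, so the first and last are the
    # 2.5 and 97.5 percentiles
    cuts = statistics.quantiles(boot_metrics, n=40)
    return cuts[0], cuts[-1]

low, high = bootstrap_percentile_interval(data, statistics.fmean)
print(low, statistics.fmean(data), high)
```

Any other metric that takes a list of values (for example `statistics.stdev` or `statistics.median`) can be passed in place of `statistics.fmean`.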

10
Q

Why is it more efficient to create a bootstrap interval after you have collected your sample population?

A

We create bootstrap sets from the sample we already have instead of repeating the observations 1000 times. Bootstrapping is also beneficial because we do not have to worry about the distribution of the variable we are looking at. A conventional confidence interval can be unreliable if the variable is not normally distributed.

11
Q

What is hypothesis testing?

A

In hypothesis testing you test how likely it is to get a value at least as extreme as the observed value if the null hypothesis is true.

12
Q

How is hypothesis testing done?

A

First you calculate your observed value. Then you create bootstrap sets from the null distribution and calculate the metric for each bootstrap set. The p-value is the number of bootstrap metrics that were at least as extreme as the observed value, divided by the number of bootstrap sets. If very few were that extreme (usually under 5%), the observed value is unlikely under the null and we reject it.
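The procedure can be sketched with a made-up example: 60 heads observed in 100 coin tosses, with a fair coin as the null hypothesis. Here the null datasets are simulated directly from the null distribution, and "extreme" is measured two-sided as the distance from the expected number of heads.

```python
import random

rng = random.Random(5)
N_TOSSES, OBSERVED_HEADS = 100, 60  # hypothetical observation
NULL_P = 0.5                        # null hypothesis: the coin is fair

def heads_under_null():
    # One simulated dataset under the null: count heads in N_TOSSES fair tosses
    return sum(rng.random() < NULL_P for _ in range(N_TOSSES))

n_sets = 10_000
expected = N_TOSSES * NULL_P
observed_distance = abs(OBSERVED_HEADS - expected)

# Count simulated datasets at least as extreme as the observation (two-sided)
extreme = sum(abs(heads_under_null() - expected) >= observed_distance
              for _ in range(n_sets))
p_value = extreme / n_sets
print(p_value)
```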

13
Q

How is hypothesis testing with percentile intervals done?

A

Here we check whether 0 (the null value) is inside the 95% percentile interval. If it is, the null is plausible and we cannot reject it. This is usually only done when the null states that the difference between two groups is 0.

14
Q

What is the interquartile range? What is the benefit of using this over standard deviation?

A

IQR = Q3 - Q1. It ignores the lowest 25% and the highest 25% of the values and measures the width of the middle 50%. This makes it a more robust measure of spread than the standard deviation, which uses every value in the dataset and is therefore more sensitive to outliers.
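A short sketch of the robustness argument, using invented measurements and one added outlier. The helper `iqr` is defined here for the example and relies on `statistics.quantiles` with its default method.

```python
import statistics

values = [10, 12, 13, 14, 15, 16, 17, 18, 20]   # hypothetical measurements
with_outlier = values + [200]                    # same data plus one extreme value

def iqr(data):
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    return q3 - q1

for label, data in [("clean", values), ("with outlier", with_outlier)]:
    print(label, "IQR:", iqr(data), "SD:", round(statistics.stdev(data), 1))
```

The single outlier changes the IQR only a little, while the standard deviation grows many times over.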

15
Q

What is regression?

A

Regression: The problem of using a set of training examples to build a prediction model f() that produces predictions f(x), where the observed response values y_n are real valued (continuous).

16
Q

What is mean absolute error?

A

Mean absolute error = Sum of all absolute errors between true response and predicted response divided by the number of samples. So the average of the absolute errors.

17
Q

What is mean squared error?

A

Mean squared error = Sum of each error squared / number of samples.

18
Q

What is root mean squared error?

A

Root mean squared error = Square root of MSE.

19
Q

What is the R^2 value?

A

How much of the variation in the response is explained by the features. Calculated as 1 - (sum of squared errors / sum of squared differences between yi and the sample mean of y).

20
Q

What is mean absolute percentage error?

A

Average of the absolute value of the ratio between the error and the true value, often reported as a percentage.
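The regression metrics MAE, MSE, RMSE, R^2 and MAPE can all be computed from the same list of errors. The sketch below uses four invented true and predicted values.

```python
import math

y_true = [3.0, 5.0, 7.5, 10.0]   # hypothetical true responses
y_pred = [2.5, 5.0, 8.0, 11.0]   # hypothetical model predictions

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n
mse = sum(e ** 2 for e in errors) / n
rmse = math.sqrt(mse)
# MAPE as a fraction (multiply by 100 for percent); needs non-zero true values
mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n

y_mean = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)              # sum of squared errors
ss_tot = sum((t - y_mean) ** 2 for t in y_true)   # squared differences from the mean
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, mape, r2)
```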

21
Q

What is k-fold cross validation?

A

K-fold cross validation is a resampling method in which a model is repeatedly trained and tested on the training data, with training and testing done on different subsets of the data each time.

22
Q

Explain how k-fold cross validation is done?

A

The training dataset is divided into k subsets. The model is trained on k-1 subsets and tested on the one subset that was excluded from training. This is done k times, so that every subset is used for testing exactly once and testing is always done on data that the model was not trained on. We then take the average of the k performance values to get a good estimate of how the model will perform on unseen data.
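A sketch of how the splits can be generated and averaged. The function names and the `fit_and_score` callback are hypothetical; in practice a library routine would normally do this.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    # Shuffle the row indices once, then deal them into k folds of near-equal size
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def cross_validate(n_samples, k, fit_and_score):
    # fit_and_score is a hypothetical callback: it trains on the `train` rows
    # and returns a performance value measured on the `test` rows
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))
```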

23
Q

In k-fold cross validation, what are the pros and cons of using large vs small k?

A

With fewer folds (small K) the model is trained fewer times, which reduces the training time and computational cost. Each training set then covers a smaller share of the dataset ((K-1)/K of it), so the training sets overlap less and each model sees less data, which tends to make the performance estimate pessimistic.
With larger K each training set covers almost the whole dataset, so the estimate is closer to what a model trained on all the data would achieve. The training sets then overlap heavily, and the computational cost grows with K.

24
Q

What is a hyper parameter? What is the hyperparameter of ridge regression and k-nearest neighbor regression?

A

A hyperparameter is a setting or configuration for a machine learning model that is set before the training process begins. Unlike the model parameters, which the algorithm learns from the training data, hyperparameters are external settings that influence how the learning process takes place.
The hyperparameter of ridge regression is the penalty value alpha, and for k-nearest neighbor regression it is the number of nearest neighbors we choose to look at (k).

25
Q

What is ridge regression? What is the difference from OLS?

A

Ridge regression is similar to an ordinary least squares (OLS) fit in the sense that we are trying to find the coefficients of x that minimize the squared residuals. The difference is that ridge regression also has an alpha value that acts as a penalty for choosing large coefficients in the produced model. If alpha = 0 it is the same thing as an OLS fit, because there is no penalty and the model is free to distribute the coefficients so that one feature is much more important than another.
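To see the effect of the penalty, here is a sketch for the simplest possible case: one feature and no intercept, where the minimizer has the closed form w = sum(x*y) / (sum(x*x) + alpha). The data and the function name are made up for the example.

```python
def ridge_coefficient(xs, ys, alpha):
    # One feature, no intercept: minimises sum((y - w*x)^2) + alpha * w^2
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical feature values
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]   # hypothetical responses, roughly y = 2x

w_ols = ridge_coefficient(xs, ys, alpha=0.0)     # no penalty: ordinary least squares
w_ridge = ridge_coefficient(xs, ys, alpha=10.0)  # penalty shrinks the coefficient

print(w_ols, w_ridge)
```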

26
Q

Why is ridge regression better if you have correlation between some of the features in your dataset?

A

If alpha is large, the model is encouraged to spread the coefficients more evenly across all features. This means that if we have multicollinearity, we at least reduce the risk that the correlated features appear a lot more important for the change of the response value than the other features. Alpha also helps to reduce the risk of overfitting the model to the training data, which is a risk with large coefficients.

27
Q

Explain k-nearest neighbor regression.

A

It is a regression model where we predict the response of a new observation by looking at the response values of previously observed observations. For a new observation we find the k closest observations and assign it the average of their y-values. If k=1 we use the y-value of the single closest observation. If k=2 we use the average of the y-values of the 2 closest observations.
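A minimal sketch of the prediction rule for a single feature, using absolute distance. The data and the function name `knn_predict` are hypothetical.

```python
def knn_predict(x_new, xs, ys, k):
    # Sort the observed points by distance to x_new and average the k closest responses
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical one-dimensional feature
ys = [1.1, 1.9, 3.2, 3.9, 5.1]   # hypothetical responses

pred_k1 = knn_predict(2.2, xs, ys, k=1)          # y-value of the closest point
pred_k2 = knn_predict(2.2, xs, ys, k=2)          # average of the two closest
pred_all = knn_predict(2.2, xs, ys, k=len(xs))   # average of every observation

print(pred_k1, pred_k2, pred_all)
```

With k equal to the number of observations, the prediction no longer depends on x_new at all.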

28
Q

In k-nearest neighbor regression, what happens if we set k to be equal to the number of observations we have?

A

If k is set to the number of observations, every prediction uses all observations, so the predicted response for any new x is the same value: the average of all observed y-values.

29
Q

What is the workflow of PyCaret?

A
  1. The dataset is divided into 70% training (Dtrain) and 30% testing (Dtest).
  2. Use the function compare_models() to fit all models in a family (here regression models) to Dtrain using k-fold cross validation with default K=10 and rank performance of the models where the performance value is the average of each fold.
  3. Tune the model to find the set of hyperparameters that optimizes the performance. The function tests different hyperparameters for the model and trains and tests them again using k-fold cross validation.
  4. The performance so far is based only on the training set. Therefore we use the function predict_model() to let the model predict the responses on Dtest. The performance values from Dtest should not be significantly worse than the mean of the cross validation folds, since that would indicate that the model is overfitted to Dtrain.
  5. finalize_model() is a function that trains the model again on the entire dataset (Dtrain and Dtest). After this we let the model predict responses on completely unseen data.
30
Q

What is the drawback of using the finalize_model function in PyCaret?

A

When training the final model on the entire dataset we have more data, and that data could have led to different hyperparameters if it had been part of the tuning.

This means that the hyperparameters as well as the model family might be suboptimal now that we use the entire dataset and have more information.
Another problem is that we use up all the data we have for training, so we have no data left to treat as “unseen” to test the model's performance on new data.

31
Q

What is the drawback of using the compare_models function in PyCaret?

A

The problem here is that the function does not consider other hyperparameters than the defaults and therefore it might favor one model over another. The knn with default k might be better than ridge with default parameters, but ridge might be the better choice if we looked at another hyperparameter value. This means that the model chosen could be suboptimal.

32
Q

In experimental design, what does it mean to run a full factorial design?

A

Running a full factorial design means that all possible combinations of factor levels are tested. If you have a 2 level design, meaning that each variable can take on 2 values (minimum and maximum, coded as -1 and 1), and 5 variables, then the full factorial design has 2^5 = 32 experiments. In general the number of unique experiments is k^p, where k is the number of levels and p is the number of variables, and a full factorial design runs all of them.
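The list of experiments in a 2 level full factorial design can be enumerated with the standard library. This is only a sketch of the counting argument, not the routine a design-of-experiments package would use.

```python
from itertools import product

levels = (-1, 1)   # coded low and high level
n_factors = 5

# Every combination of levels across the five factors
full_factorial = list(product(levels, repeat=n_factors))

print(len(full_factorial))   # number of unique experiments, 2**5
print(full_factorial[:3])    # first few runs
```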

33
Q

What is a fractional factorial design? Why do we use it?

A

The full factorial design usually leads to a large number of experiments to perform. We can reduce the number of experiments with a fractional factorial design, by treating one or more of the variables as the product of the other variables. If we for example treat x5 as the product x1*x2*x3*x4, we do a full factorial (test all unique combinations) for x1-x4 and get 2^(5-1) = 16 experiments instead.
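The x5 = x1*x2*x3*x4 construction can be sketched directly with the standard library. This is an illustration of the idea rather than the output of any particular design package.

```python
from itertools import product
from math import prod

# Full factorial in x1..x4 with coded levels -1 and 1
base = list(product((-1, 1), repeat=4))

# Append x5 as the product of the other four columns
fractional = [run + (prod(run),) for run in base]

print(len(fractional))   # 2**(5-1) runs
for run in fractional[:4]:
    print(run)
```

Because the x5 column is identical to the product column by construction, the data cannot tell the two apart.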

34
Q

What is the drawback of using a fractional factorial design?

A

If x5 gets a big coefficient, we will not know whether x5 or the product of x1-x4 is the important feature, because the two are identical in the design (they are confounded).

35
Q

How do we build a fractional factorial design?

A

To build the fractional factorial we use the factorial.build_factorial(5, 2**(5-1)) function from the dexpy library. The first argument, 5, is the number of independent variables. The second argument is the number of experiments: 2 is the number of levels and 5-1 makes it the half fraction, so 2**(5-1) = 16. From this we get a table with all 5 variables as columns and the 16 experiments as rows.

36
Q

What are center points in experimental design and why do we need them?

A

We add center points to the design if the model we want to fit has quadratic elements. This is because a 2 level design cannot capture the model's deviations from a straight line, since we do not have enough levels: we need at least 3 for a quadratic model. Therefore we add center points to the features that have real values. The center points are extra experimental runs in which the features take the midpoint of their two existing levels.
This leads to a number of extra experiments that corresponds to the number of unique combinations we can add to the experiment.

37
Q

What does a plain linear model look like?

A

y = w0 + w1*x1 + w2*x2 + w3*x3 + ... + wn*xn.

38
Q

What does a full quadratic model with 2 variables look like?

A

y = w0 + w1*x1 + w2*x2 + w12*x1*x2 + w11*x1*x1 + w22*x2*x2.

39
Q

What is yˆ = w^Tz?

A

Any model we choose to fit to our data can be described by yˆ = w^Tz where:
w is a vector of the coefficients.
z is a vector/matrix of the x-variables (the experimental design).
yˆ is the predicted response from the model.

40
Q

How can we use yˆ = w^Tz to find the coefficients?

A

We can find the coefficients if we have the x values and the response values by using ordinary least squares, which minimizes the sum of the squared residuals (the sum of (y - yˆ)^2).
Searching for the minimum gives the closed form:
wOLS = (Z^TZ)^-1 Z^Ty
which is the same thing as solving Z^TZw = Z^Ty.
This means that for any model we choose to fit we can use the closed form to find the coefficients that minimize the residuals.
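As a worked illustration, the sketch below solves the normal equations Z^TZw = Z^Ty by hand for the smallest useful case, the model y = w0 + w1*x, where each row of Z is (1, x). The data and the function name are invented; for larger models a linear algebra library would be used instead.

```python
def ols_two_params(xs, ys):
    # Z has rows (1, x); these sums are the entries of Z^T Z and Z^T y
    n = len(xs)
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    # Solve the 2x2 system (Z^T Z) w = Z^T y with Cramer's rule
    det = n * sxx - sx * sx
    w0 = (sy * sxx - sx * sxy) / det
    w1 = (n * sxy - sx * sy) / det
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # generated from y = 1 + 2x without noise

w0, w1 = ols_two_params(xs, ys)
print(w0, w1)
```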

41
Q

Why do we encode the values of the features in the design matrix?

A

The values in the design will be encoded values to get them on the same scale so that they are directly comparable. When we later want to perform the actual experiment we would need to use the actual lows and highs in the design matrix.

42
Q

What is D-optimal design?

A

Instead of using the fractional factorial design to reduce the number of experiments we can use the D-optimal design.
This means that out of all candidate experiments we use the k experiments that maximize the determinant of Z^TZ, where Z is the design matrix, meaning that we optimize the design.

43
Q

In D-optimal design, what does it mean that we find the experiments that maximizes the determinant? What experiments are going to be included in the design?

A

A low determinant indicates that the matrix has small eigenvalues, and a determinant of 0 means that the matrix is singular and cannot be inverted. This means that the D-optimal design is always going to include the “corners” of the box of all possible experiments, because those are the most extreme values and maximize the determinant.

44
Q

When we want to reduce the number of experiments to perform with D-optimal design, what is the smallest number of experiments we can choose?

A

The smallest number of experiments we can perform is the number of parameters of the model. If we have a full quadratic model with 2 variables, the smallest number of experiments is 6, because the model y = w0 + w1*x1 + w2*x2 + w12*x1*x2 + w11*x1*x1 + w22*x2*x2 has 6 coefficients to estimate.

45
Q

What is the function for creating the D-optimal design?

A

dexpy.optimal.build_optimal(number of factors, number of experiments, type of model).

46
Q

How can we test if a D-optimal design finds the same coefficients as the full factorial design?

A

We can test whether a D-optimal design gives the same importance to the coefficients as the full factorial design by obtaining the responses for the experiments in the D-optimal design and then using ordinary least squares to find the coefficients, which we compare with the coefficients from the full factorial design.

47
Q

Explain the difference between the data modeling culture and the algorithmic modeling culture according to Breiman’s article.

A

The data modeling culture assumes one model of what the world looks like. We then sample from that world, create confidence intervals, etc., and test whether the model is impossible. Whether the model is a good fit is judged by tests such as residual examination and goodness-of-fit tests. Breiman argues that this is a very weak test, because the model is not necessarily correct just because it is not impossible, and the goodness-of-fit tests will only reject the model when the lack of fit is extreme. The algorithmic modeling culture instead judges a model by whether its prediction accuracy is high. Here we look at the performance from, for example, cross validation.

48
Q

Explain screening and response surface modeling in experimental design.

A

In the screening step we choose a simple screening model y = w^Tz, where z only contains linear elements, to study which variables seem to have the biggest impact on the response. We do this by performing OLS on the linear model and judging the size of the coefficients. We then use the variables selected in the screening step to fit a full quadratic response surface model y = w^Tz, where z now also contains all interaction terms and quadratic elements. We perform the experiments, collect the responses and then perform OLS again to estimate the coefficients.

49
Q

Why is it hard to judge the importance of variables using interaction terms?

A

Because the values of the interacting variables affect each other. If one of them has the value 0, the whole interaction term is 0 too, whatever the value of the other.

50
Q

Explain the maximum likelihood estimate and the maximum a posteriori estimate.

A

Maximum likelihood estimation is like saying “I don't have an opinion about the parameter before I see the data”. The MLE is the parameter value with the highest likelihood value given the data. This is calculated using the likelihood function. Bayesian parameter estimation is like saying “I do have an opinion about the parameter before I see the data”. The MAP value is the most probable parameter value given both the likelihood and the prior knowledge. This is calculated using Bayes' theorem.

51
Q

How do we calculate the likelihood function?

A

It is the product of the probabilities (or probability densities) of all the observed values, calculated for a given parameter value.

52
Q

Explain Bayes theorem

A

p(θ|D, I) = p(D|θ, I) * p(θ|I) / p(D|I)
The posterior is the likelihood * the prior / a normalization constant.

53
Q

What happens when the prior in Bayes theorem is constant?

A

Then the only element left in Bayes' theorem that affects the MAP value is the likelihood, meaning that if we want to find the MAP it is enough to maximize the likelihood, so the MAP equals the MLE. If the prior is not constant, we have to maximize likelihood * prior to find the MAP.
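A small sketch of this on a grid of parameter values. The coin-toss data and both priors are invented for the example, and the priors are left unnormalised, because a constant factor does not move the maximum.

```python
from math import prod

data = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]    # hypothetical coin tosses, 1 = heads
grid = [i / 100 for i in range(1, 100)]  # candidate parameter values 0.01 .. 0.99

def likelihood(theta):
    # Product of the probability of each observation given theta
    return prod(theta if x == 1 else 1 - theta for x in data)

def flat_prior(theta):
    return 1.0  # constant: no preference for any theta

def informative_prior(theta):
    return theta ** 4 * (1 - theta) ** 4  # unnormalised, favours values near 0.5

mle = max(grid, key=likelihood)
map_flat = max(grid, key=lambda t: likelihood(t) * flat_prior(t))
map_informative = max(grid, key=lambda t: likelihood(t) * informative_prior(t))

print(mle, map_flat, map_informative)
```

With the flat prior the MAP lands on the same grid value as the MLE, while the informative prior pulls it towards 0.5.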

54
Q

How do we find the maximum likelihood estimate?

A

We have values drawn from an assumed distribution. In a grid search over many parameter values, we calculate for each x value the probability of that value given the parameter value being tried. We then multiply these probabilities to get the likelihood value for that parameter value, and finally we choose the parameter value with the highest likelihood value.
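The grid search can be sketched as follows for the mean of a normal distribution. The observations, the grid and the assumption that the standard deviation is known are all choices made for the example.

```python
from math import prod
from statistics import NormalDist

data = [4.8, 5.1, 5.3, 4.9, 5.4]   # hypothetical observations
KNOWN_SD = 0.5                     # assumed known, so only the mean is estimated

def likelihood(mean):
    # Product of the density of each observation given the candidate mean
    dist = NormalDist(mean, KNOWN_SD)
    return prod(dist.pdf(x) for x in data)

grid = [i / 100 for i in range(400, 601)]   # candidate means 4.00 .. 6.00
mle = max(grid, key=likelihood)

print(mle)  # for a normal mean this should match the sample mean
```

For long datasets the product of many small numbers can underflow, so the sum of log-probabilities is often maximized instead.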

55
Q

If Pearson's correlation coefficient = 0, what conclusions can we draw?

A

That there is no linear dependency. There could still be a nonlinear relationship.

56
Q

You carry out a performance test of 10 new instruments and you count how many of them did not pass the test. Your finding is that all instruments passed the test! However, since you have only tested 10 instruments, a lot of uncertainty remains. Assume for example that the actual (true) probability of producing a faulty instrument is 5%. Explain in pseudocode/words (no code) how, by means of an urn model and resampling, you can calculate the probability of observing no faulty instruments if you pick 10 randomly from the factory assembly line.

A
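  1. Create an urn that matches the true error rate (5% faulty, 95% working instruments).
  2. Draw 10 objects from the urn with replacement and note whether none of them is faulty.
  3. Repeat step 2 many times.
  4. Probability = number of repetitions with no faulty instruments / number of repetitions.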
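For reference, the same urn procedure written as a runnable sketch. The urn size (100 objects) and the number of repetitions are arbitrary choices; the analytic value 0.95^10 is included only as a comparison.

```python
import random

rng = random.Random(6)
urn = ["faulty"] * 5 + ["ok"] * 95   # 5% faulty, matching the assumed true rate

n_repeats = 20_000
no_faulty = 0
for _ in range(n_repeats):
    batch = rng.choices(urn, k=10)   # draw 10 instruments with replacement
    if "faulty" not in batch:
        no_faulty += 1

probability = no_faulty / n_repeats
print(probability, 0.95 ** 10)  # simulated estimate next to the analytic value
```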