Statistical inference Flashcards

1
Q

What is the difference between drawing values from a distribution with or without replacement?

A

With replacement means that after we have drawn a value we put it back, so the draws do not change the distribution of values. Replacement is appropriate when we want to simulate a large or effectively infinite population, where one draw barely changes the probabilities of the next.

Without replacement means that we do not put the value back, so the distribution changes with each draw. This is used to simulate draws from a smaller, finite population.
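
A minimal sketch of the two sampling modes, assuming NumPy; the urn contents are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
urn = np.array([1, 1, 1, 0, 0])   # hypothetical urn: three 1s and two 0s

# With replacement: the urn is unchanged between draws (large-population behaviour)
with_repl = rng.choice(urn, size=10, replace=True)

# Without replacement: each draw removes a ball, so at most len(urn) draws are possible
without_repl = rng.choice(urn, size=len(urn), replace=False)

print(with_repl, without_repl)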

2
Q

What is a probability density function?

A

When we make stochastic observations with real values, the distribution of those values is described by a probability density function (pdf). If we were to draw 1000 samples from an urn and create a histogram of them, the shape of the histogram would follow the pdf of that variable: the peak of the histogram lies near the mean and its spread reflects the standard deviation (if the variable is normally distributed).

3
Q

How can we get the probability of drawing a value x from a mixture of multiple different distributions?

A

Having different distributions means that we have a mixture of pdfs with different means and variances.

When we have different object types from different distributions, the distribution we end up drawing from is the weighted sum of the individual pdfs:

p(x) = w1 * p1(x) + w2 * p2(x) + w3 * p3(x)

  1. Draw one sample from a discrete (categorical) distribution over the classes, with the weights as probabilities.
  2. If the outcome is class i, draw a sample from pi (see the sketch below).
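
A small sketch of this two-step sampling, assuming NumPy and a hypothetical three-component Gaussian mixture:

import numpy as np

rng = np.random.default_rng(1)
weights = np.array([0.5, 0.3, 0.2])    # w1, w2, w3 (must sum to 1)
means   = np.array([0.0, 5.0, 10.0])   # hypothetical component means
sds     = np.array([1.0, 0.5, 2.0])    # hypothetical component standard deviations

def sample_mixture(n):
    # Step 1: pick a component index with probability equal to its weight
    comps = rng.choice(len(weights), size=n, p=weights)
    # Step 2: draw from the chosen component's pdf
    return rng.normal(means[comps], sds[comps])

x = sample_mixture(1000)
print(x[:5])
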
4
Q

What parameters do you need to define to draw from a univariate normal distribution vs a multivariate normal distribution?

What do the results look like?

A

Univariate:
We need to define the mean and the standard deviation. The result is one value from the distribution.

Multivariate:
We need a vector of means, one per variable (dimension), and a covariance matrix that describes how the individual variables vary and how they covary.

The result is a vector with as many values as there are dimensions.
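
A sketch of both kinds of draw, assuming NumPy; the means and covariance below are made-up illustration values:

import numpy as np

rng = np.random.default_rng(2)

# Univariate: one mean and one standard deviation -> a single value
x_uni = rng.normal(loc=2.0, scale=0.5)

# Multivariate: a mean vector and a covariance matrix -> one value per dimension
mu    = np.array([2.0, -1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])   # diagonal = variances, off-diagonal = covariances
x_multi = rng.multivariate_normal(mu, Sigma)

print(x_uni, x_multi)   # a scalar and a vector of length 2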

5
Q

Why is it important to incorporate noise when we simulate stochastic observations?

A

When we simulate stochastic observations we should take into account that real measurements usually contain noise that affects the values we draw. The noise matters because it can make a linear relationship appear non-linear if we have too few samples; the impact of the noise shrinks as the number of samples grows.
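
A small simulation sketch, assuming NumPy and a hypothetical linear relationship y = 2x + 1:

import numpy as np

rng = np.random.default_rng(3)
n = 20                                # try 20 vs 2000 to see the effect of sample size
x = rng.uniform(0, 10, size=n)
noise = rng.normal(0, 2.0, size=n)    # noise drawn independently for each observation
y = 2 * x + 1 + noise                 # noisy observations of a linear relationship

# With few samples the noise can hide the linearity; with many samples the slope ~2 reappears
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)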

6
Q

What is a Hidden Markov model?

A

Markov models are used to simulate stochastic processes.

They are memoryless models (the next state depends only on the present state) and are based on probabilities. In a hidden Markov model the states are not observed directly; the model generates the process using state transition probabilities and emission probabilities (the probability of each observed output given the hidden state).

7
Q

What is a conventional confidence interval?

A

Conventional confidence interval: an interval constructed so that, if you were to sample many times and construct a 95% confidence interval each time, the true mean would fall inside the interval in 95% of the repetitions.

This is because the conventional confidence interval is frequentist. There is still a 5% chance that the true mean falls among the "extreme values" not covered by the interval, so about 1 in 20 constructed confidence intervals will miss the true mean.

Guaranteed coverage of 1 − alpha.

8
Q

What is a bootstrap 95% percentile interval? What are the benefits and setbacks of using these intervals?

A

These are created by resampling. If you were to create many bootstrap datasets, calculate the mean of each and form a 95% percentile interval of the bootstrap means, that interval does not claim to contain the true mean with 95% probability, since it is not designed to have that coverage. Instead it tells you how uncertain you should be about the mean estimated from your sample.

The 2.5 and 97.5 percentiles of the bootstrap means give the lower and upper bounds of the percentile interval. The interval captures the middle 95% of the bootstrap distribution, leaving 2.5% in each tail. If the interval is very wide, we should be more uncertain about our estimated mean.

The benefit is that with bootstrap intervals you do not need to worry about the distribution of the variable, as you do with conventional confidence intervals.

9
Q

How do we get the 95% bootstrap percentile interval?

A

We draw bootstrap datasets from the original dataset (sampling with replacement), calculate the metric we are interested in for each bootstrap set and save it in a list (this could be a correlation, mean, fraction, standard deviation etc.). Then we calculate the 2.5 and 97.5 percentiles of the bootstrap values; those are the lower and upper bounds of the percentile interval [a, b].

This interval tells you how certain you can be about your estimated metric, because it shows how much it can vary just by random chance.

The bigger the original dataset is, the better each bootstrap set represents it, which leads to a narrower interval and more certainty.
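
A sketch of the procedure, assuming NumPy; the data and the number of resamples are made up:

import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(10, 3, size=50)      # hypothetical original sample
n_boot = 10_000

boot_means = np.empty(n_boot)
for b in range(n_boot):
    # resample with replacement, same size as the original dataset
    resample = rng.choice(data, size=len(data), replace=True)
    boot_means[b] = resample.mean()

lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap percentile interval: [{lower:.2f}, {upper:.2f}]")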

10
Q

Why is it more efficient to create a bootstrap interval after you have collected your sample population?

A

We create bootstrap sets instead of repeating the whole data collection many times, which would be far more expensive. Bootstrapping is also beneficial because we do not have to worry about the distribution of the variable we are looking at; a conventional confidence interval can be misleading if the variable is not normally distributed.

11
Q

What is hypothesis testing?

A

In hypothesis testing you test how likely it would be to obtain a value at least as extreme as the observed one if the null hypothesis were true.

12
Q

How is hypothesis testing done?

A

You calculate your observed value of the test metric.

Then you create bootstrap sets from the null distribution and calculate the metric for each bootstrap set. You obtain the p-value by counting how many times the bootstrap metrics were as extreme as, or more extreme than, the observed value and dividing that count by the number of bootstrap sets. If you saw very few values that extreme (usually under 5%), the null hypothesis is unlikely and we reject it.
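
A sketch of such a resampling test, assuming NumPy; here the hypothetical null hypothesis is that two groups share the same mean, simulated by resampling from the pooled data:

import numpy as np

rng = np.random.default_rng(5)
group_a = rng.normal(10.0, 2.0, size=30)    # hypothetical observations
group_b = rng.normal(11.0, 2.0, size=30)

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])  # the null: both groups come from the same pool

n_boot = 10_000
null_stats = np.empty(n_boot)
for i in range(n_boot):
    a = rng.choice(pooled, size=len(group_a), replace=True)
    b = rng.choice(pooled, size=len(group_b), replace=True)
    null_stats[i] = b.mean() - a.mean()

# two-sided p-value: how often is the null statistic at least as extreme as the observed one?
p_value = np.mean(np.abs(null_stats) >= abs(observed))
print(p_value)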

13
Q

How is hypothesis testing with percentile intervals done?

A

We check whether 0 (the value under the null) lies inside the 95% percentile interval; if it does, the null is plausible and we cannot reject it. This is usually only done when the null states that the difference between two groups is 0.

14
Q

What is the interquartile range? What is the benefit of using this over standard deviation?

A

IQR = Q3 - Q1. It ignores the lowest 25% and the highest 25% of the values and measures the spread of the remaining middle 50%. It is a more robust measure of spread than the standard deviation, which is more sensitive to outliers because it uses every value in the dataset.
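
A tiny sketch, assuming NumPy and made-up data with one outlier:

import numpy as np

x = np.array([1, 2, 2, 3, 3, 4, 5, 100])
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(iqr, x.std(ddof=1))   # the outlier inflates the standard deviation far more than the IQR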

15
Q

What is regression?

A

Regression: The problem of using a set of training examples to build a prediction model
f() that produces predictions f(x) where the observed response values yn are real valued (continuous).

16
Q

What is mean absolute error?

A

Mean absolute error = Sum of all absolute errors between true response and predicted response divided by the number of samples. So the average of the absolute errors.

17
Q

What is mean squared error?

A

Mean squared error = Sum of each error squared / number of samples.

18
Q

What is root mean squared error?

A

Root mean squared error = Square root of MSE.

19
Q

What is the R^2 value?

A

How much of the variation in the response is explained by the features.

Calculated as 1 − (sum of squared errors / sum of squared differences between each yi and the sample mean of y).

20
Q

What is mean absolute percentage error?

A

The average of the ratio between the absolute error and the absolute true value, usually expressed as a percentage.
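
A sketch computing all of these error metrics with NumPy; the toy vectors are made up:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred
mae  = np.mean(np.abs(errors))                    # mean absolute error
mse  = np.mean(errors ** 2)                       # mean squared error
rmse = np.sqrt(mse)                               # root mean squared error
r2   = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
mape = np.mean(np.abs(errors / y_true))           # mean absolute percentage error (as a fraction)

print(mae, mse, rmse, r2, mape)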

21
Q

What is k-fold cross validation?

A

K-fold cross validation is a resampling method where a model is repeatedly trained and evaluated, with the training and testing done on different subsets of the training data.

22
Q

Explain how k-fold cross validation is done?

A

The training dataset is divided into k subsets (folds) and the model is trained on k−1 of them and tested on the one subset that was excluded from training. This is repeated k times, so that every subset is used for testing exactly once and the test is always done on data the model was not trained on.

We then take the average of the k performance values to get a good estimate of how the model will perform on unseen data, given a training set of this size.
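
A sketch of the procedure with scikit-learn's KFold; the data and model are placeholders:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)

model = LinearRegression()
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# one score per fold, each computed on the held-out subset
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
print(scores.mean())   # average over the k folds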

23
Q

In k-fold cross validation, what are the pros and cons of using large vs small k?

A

If we use fewer folds (smaller K), we reduce the training time and computational cost, but each training set then covers a smaller fraction of the data, so the performance estimate tends to be pessimistic compared to a model trained on the full training set.

With larger K, each training set covers almost all of the data, which gives a less biased estimate of performance, but the training sets overlap heavily (so the fold estimates are correlated and the overall estimate can have higher variance), and the computational cost grows with K.

24
Q

What is a hyper parameter? What is the hyperparameter of ridge regression and k-nearest neighbor regression?

A

A hyperparameter is a setting or configuration for a machine learning model that is set before the training process begins. Unlike the model parameters, which the algorithm learns from the training data, hyperparameters are external factors that influence how the learning process takes place.

The hyperparameter of ridge regression is the alpha penalty value, and for k-nearest neighbor regression it is the number of nearest neighbors we choose to look at (k).

25
Q

What is ridge regression? What is the difference from OLS?

A

Ridge regression is similar to ordinary least squares fitting in the sense that we are trying to find the coefficients of x so that the residuals are minimized. The difference is that ridge regression also has an alpha value, which acts as a penalty for choosing large coefficients in the produced model.

If alpha = 0 it is the same as an ordinary least squares fit, because there is no penalty and the model is free to distribute the coefficients so that one feature becomes much more important than another.

26
Q

Why is ridge regression better if you have correlation between some of the features in your dataset?

A

If alpha is large the model is encouraged to shrink the coefficients and spread them more evenly across the features. This means that if we have multicollinearity, we at least reduce the risk that the correlated features dominate the predicted response compared to the other features. The penalty also reduces the risk of overfitting the model to the training data, which is a risk with large coefficients.

27
Q

Explain k-nearest neighbor regression.

A

It is a regression model where we predict the response of a new observation by looking at the response values of previously observed data points.

For a new observation, if k = 1 we assign it the y-value of the single closest observation; if k = 2 we use the 2 closest observations and the average of their y-values, and so on.
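
A minimal sketch with scikit-learn's KNeighborsRegressor on toy 1-D data, using k = 2:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0]])   # made-up 1-D feature
y = np.array([1.0, 2.0, 3.0, 10.0])

knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)
# prediction = average y of the 2 nearest training points
print(knn.predict([[2.4]]))    # neighbours are x=2 and x=3 -> (2 + 3) / 2 = 2.5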

28
Q

In k-nearest neighbor regression, what happens if we set k to be equal to the number of observations we have?

A

If k is set equal to the number of observations, every prediction is the same: the average of all response values, regardless of where the new x lies.

29
Q

What is the workflow of PyCaret?

A
  1. The dataset is divided into 70% training data (Dtrain) and 30% test data (Dtest).
  2. Use compare_models() to fit all models in a family (here regression models) to Dtrain using k-fold cross validation (default K = 10) and rank the models, where each model's performance value is the average over the folds.
  3. Tune the chosen model to find the set of hyperparameters that optimizes performance. The function tries different hyperparameter values and trains and evaluates them again using k-fold cross validation.
  4. The performance so far is based only on the training set. Therefore we use predict_model() to let the model predict the responses on Dtest. The performance on Dtest should not be markedly lower than the cross-validation average, since that would indicate that the model is overfitted to Dtrain.
  5. finalize_model() retrains the model on the entire dataset. After this we let the model predict responses on completely unseen data (a rough code sketch follows below).
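
A rough sketch of this workflow, assuming PyCaret's regression module; function and argument names follow recent PyCaret versions and may differ in yours, and the dataset path and target column are hypothetical:

import pandas as pd
from pycaret.regression import (setup, compare_models, tune_model,
                                predict_model, finalize_model)

df = pd.read_csv("my_data.csv")                  # hypothetical dataset with a 'target' column

setup(data=df, target="target", train_size=0.7)  # 70/30 split of the data
best  = compare_models()                         # cross-validated comparison of model families
tuned = tune_model(best)                         # hyperparameter search, again with k-fold CV
predict_model(tuned)                             # evaluate on the held-out 30% (Dtest)
final = finalize_model(tuned)                    # refit on the entire dataset
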
30
Q

What is the setback of using the finalize_model function in PyCaret?

A

When training the final model on the entire dataset we have more data, which could have led to different hyperparameters if that data had been part of the tuning. This means that the hyperparameters, and even the chosen model family, might be suboptimal now that we use the entire dataset and have more information.

Another problem is that we use up all the data we have for training, leaving no data to treat as "unseen" for testing the model's performance on new data.

31
Q

What is the setback of using the compare_models function in PyCaret?

A

The problem is that the function only considers the default hyperparameters, and therefore it might favour one model over another. kNN with the default k might beat ridge with its default parameters, yet ridge might be the better choice with another hyperparameter value. This means that the chosen model could be suboptimal.

32
Q

In experimental design, what does it mean to run a full factorial design?

A

Running a full factorial design means that all possible combinations of factor levels are tested. If you have a 2-level design, meaning each variable can take 2 values (minimum and maximum, coded -1 and +1), and 5 variables, then the full factorial design has 2^5 = 32 runs.

So the number of unique experiments is (number of levels)^(number of factors), and in a full factorial design all of them are run.

33
Q

What is a fractional factorial design? Why do we use it?

A

The full factorial design usually leads to a large number of experiments to perform. We can reduce the number of experiments with a fractional factorial design by treating the highest-order variable as the product of the other variables. If we for example treat x5 as the product x1*x2*x3*x4, we run a full factorial (all unique combinations) for x1-x4 and instead get 2^(5-1) = 16 experiments.

34
Q

What is the setback of using a fractional factorial design?

A

If x5 gets a big coefficient we will not know whether it is x5 itself or the interaction x1*x2*x3*x4 that is the important effect; the two are confounded (aliased).

35
Q

How do we build a fractional factorial design?

A

To build the fractional factorial we use the build_factorial function from the dexpy library, e.g. factorial.build_factorial(5, 2**(5-1)). The first argument (5) is the number of factors and the second (2**(5-1) = 16) is the number of runs, i.e. the half fraction of the two-level design. From this we get a table with all 5 variables as columns and the 16 experiments as rows.
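
A minimal sketch of the call described above; the signature follows the card, the import path from dexpy.factorial is assumed, and only the print is added:

from dexpy.factorial import build_factorial

# half-fraction two-level design for 5 factors: 2**(5-1) = 16 runs
design = build_factorial(5, 2**(5 - 1))
print(design)   # 16 rows (experiments) x 5 columns (coded factor levels -1/+1)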

36
Q

What are centerpoints in experimental design and why do we need them?

A

We add center points to the design if the model we want to fit has quadratic terms. This is because a 2-level design cannot capture deviations from a straight line; two levels are not enough, and we need at least 3 for a quadratic model. Therefore we add center points for the features that take real values.

The center points are extra experimental runs where these features are set to the midpoint between the two existing levels.

This leads to a number of extra experiments corresponding to the number of unique center-point combinations we can add to the design.

37
Q

What does a plain linear model look like?

A

y = w0 + w1x1 + w2x2 + w3x3 + … + wnxn

38
Q

What does a full quadratic model with 2 variables look like?

A

y = w0 + w1x1 + w2x2 + w12x1x2 + w11x1x1 + w22x2x2

39
Q

What is ŷ = w^T z?

A

Any model we choose to fit to our data can be described by ŷ = w^T z, where:

w is a vector of the coefficients,

z is a vector/matrix of the x-variables / the experimental design (including any interaction and quadratic terms the model uses),

ŷ is the predicted response from the model.

40
Q

How can we use ŷ = w^T z to find the coefficients?

A

We can find the coefficients if we have the x values and the response values by using ordinary least squares, which minimizes the sum of the squared residuals (minimizes (y − ŷ)^2).

Searching for the minimum gives the closed form:
wOLS = (Z^T Z)^-1 Z^T y, which is equivalent to solving the normal equations Z^T Z w = Z^T y.

This means that for any model we choose to fit, we can use the closed form to find the coefficients that minimize the residuals.
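
A sketch of the closed form in NumPy, with a toy design matrix and response; np.linalg.solve is used rather than an explicit inverse for numerical stability:

import numpy as np

rng = np.random.default_rng(7)
n = 50
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
Z = np.column_stack([np.ones(n), x1, x2])    # columns: intercept, x1, x2
w_true = np.array([1.0, 2.0, -3.0])
y = Z @ w_true + rng.normal(0, 0.1, n)

# normal equations: Z^T Z w = Z^T y
w_ols = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(w_ols)    # close to [1, 2, -3]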

41
Q

Why do we encode the values of the features in the design matrix?

A

The values in the design matrix are encoded (e.g. -1/+1) to put all factors on the same scale so that the coefficients are directly comparable. When we later perform the actual experiments we translate back to the real low and high values of each factor.

42
Q

What is D-optimal design?

A

Instead of using the fractional factorial design to reduce the number of experiments we can use a D-optimal design.

This means that we use the k experiments that maximize the determinant of the information matrix Z^T Z of the design, meaning that we optimize the design itself.

43
Q

In D-optimal design, what does it mean that we find the experiments that maximize the determinant? What experiments are going to be included in the design?

A

A low determinant indicates that the matrix has small eigenvalues, and a determinant of 0 means that the matrix is singular and cannot be inverted. In practice the D-optimal design will therefore tend to include the "corners" of the box of all possible experiments, because those extreme settings maximize the determinant.

44
Q

When we want to reduce the number of experiments to perform with D-optimal design, what is the smallest number of experiments we can choose?

A

The smallest number of experiments we can perform is the number of parameters of the model. For a full quadratic model with 2 variables the smallest number of experiments is 6, because the model has 6 parameters: y = w0 + w1x1 + w2x2 + w12x1x2 + w11x1x1 + w22x2x2.

45
Q

What is the function for creating the D-optimal design?

A

dexpy.optimal.build_optimal (number of factors, number of experiments, type of model).

46
Q

How can we test if a D-optimal design finds the same coefficients as the full factorial design?

A

We can test whether a D-optimal design gives the same importance to the coefficients as the full factorial design by running (or simulating) the responses for the D-optimal experiments, estimating the coefficients with ordinary least squares, and comparing them with the coefficients obtained from the full factorial design.

47
Q

Explain the difference between the data modeling culture and the algorithmic modeling culture according to Breiman’s article.

A

The data modeling culture assumes one stochastic model of how the world generates the data; we then sample from that model, create confidence intervals etc., and test whether the model can be rejected. Whether the model is a good fit is judged by tests such as residual examination and goodness-of-fit tests. Breiman argues that this is a very weak test, because the model is not necessarily correct just because it cannot be rejected, and such tests only reject the model when the lack of fit is extreme.

The algorithmic modeling culture instead judges a model by its predictive accuracy, measured for example by cross-validation.

48
Q

Explain screening and response surface modeling in experimental design.

A

In the screening step we choose a simple screening model y = w^T z, where z only contains linear terms, to study which variables seem to have the biggest impact on the response; we perform OLS on the linear model and judge the size of the coefficients. The aim is to find the features that have the highest impact on the response.

We then use the variables selected in the screening step to build a full quadratic response-surface model y = w^T z, where z now contains all interaction and quadratic terms. We perform the experiments, collect the responses and run OLS again, this time on the full model, to estimate its coefficients.

49
Q

Why is it hard to judge the importance of variables using interaction terms?

A

Because the values of the interacting variables affect each other: if one of them is 0 the whole interaction term is 0 too, regardless of the other, so the contribution of an individual variable cannot be judged in isolation.

50
Q

Explain the maximum likelihood estimate and the maximum posteriori estimate.

A

Maximum likelihood estimation is like saying "I have no opinion about the parameter before I see the data": the MLE is the parameter value with the highest likelihood given the data, calculated using the likelihood function.

Bayesian parameter estimation is like saying "I do have an opinion before I see the data": the MAP estimate is the parameter value that maximizes the posterior, which combines the likelihood with the prior knowledge. This is calculated using Bayes' theorem.

51
Q

How do we calculate the likelihood function?

A

It is the product of the probabilities (or probability densities) of all the observed values, viewed as a function of the parameter.

52
Q

Explain Bayes theorem

A

p(θ|D, I) = p(D|θ, I)p(θ|I) / p(D|I)

The posterior is the likelihood * the prior / a normalization constant (the evidence p(D|I)).

53
Q

What happens when the prior in Bayes theorem is constant?

A

Then the only term left in Bayes' theorem that varies with the parameter is the likelihood, meaning that to maximize the posterior (MAP) it is enough to maximize the likelihood, so MAP = MLE.

If the prior is not constant, then we have to maximize the product likelihood * prior to find the MAP.

54
Q

How do we maximize the maximum likelihood estimation?

A

We draw values from an assumed distribution and, for each parameter value in a grid search, we calculate the probability of each observed x given that parameter. We then multiply these probabilities to get the likelihood value and simply choose the parameter value with the highest likelihood.
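
A sketch of a grid-search MLE for the mean of a normal distribution with known standard deviation, assuming NumPy and SciPy; log-likelihoods are summed instead of multiplying raw probabilities, to avoid numerical underflow:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
data = rng.normal(loc=3.0, scale=1.0, size=100)   # hypothetical observations, known sd = 1

mu_grid = np.linspace(0, 6, 601)                  # candidate parameter values
log_liks = np.array([norm.logpdf(data, loc=mu, scale=1.0).sum() for mu in mu_grid])

mle = mu_grid[np.argmax(log_liks)]
print(mle)    # close to the sample mean, data.mean()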

55
Q

If the Pearson’s correlation coefficient = 0, what conclusions can we draw?

A

That there is no linear dependency. There could still be a non-linear relationship.

56
Q

You carry out a performance test of 10 new instruments and you count how many of them did not pass the test. Your finding is that all instruments passed the test! However, since you have only tested 10 instruments, a lot of uncertainty remains.

Assume for example that the actual (true) probability of producing a faulty instrument is 5%. Explain in pseudocode/words (no code) how you, by means of an urn model and resampling, can calculate the probability of observing no faulty instruments if you pick 10 randomly from the factory assembly line.

A
  1. Create an urn that reflects the true error rate (5% faulty, 95% working instruments).
  2. Repeatedly draw 10 objects from the urn with replacement.
  3. Probability ≈ number of repetitions with no faulty instrument / number of repetitions (a simulation sketch follows below).
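
Although the card asks only for pseudocode, here is a minimal NumPy simulation of the same urn model, for reference:

import numpy as np

rng = np.random.default_rng(9)
p_faulty = 0.05
n_reps = 100_000

no_faulty = 0
for _ in range(n_reps):
    # 0 = working, 1 = faulty; draw 10 instruments with replacement from the "urn"
    draws = rng.choice([0, 1], size=10, replace=True, p=[1 - p_faulty, p_faulty])
    if draws.sum() == 0:
        no_faulty += 1

print(no_faulty / n_reps)   # should be close to 0.95**10 ≈ 0.60
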
57
Q

What is the kernel density estimate? What happens with larger bandwidth?

A

It is an estimate of the pdf of the underlying variable of a distribution and is a means of data smoothing. It estimates the pdf in a continuous way, whereas a histogram approximates it in a more discrete way.

With larger bandwidth the KDE gets smoother, the interval the estimated pdf covers gets wider, and the maximum value on the y-axis gets lower: each kernel's contribution to the peak is roughly its own maximum (which depends on the bandwidth) divided by the number of observations, so wider kernels give a lower peak.

58
Q

What are the free parameters of the kernel density estimate?

A

kernel - the distribution used for smoothing; it gives the shape of the curve placed at each observation.

bandwidth - the width of the kernel, used to smooth the data. Too high a bandwidth and the smoothing erases the patterns of the pdf.

59
Q

What is a density estimate? Give examples of two density estimators

A

A density estimator is an algorithm which seeks to model the probability
distribution that generated a dataset.

Kernel density estimate and histograms

60
Q

Why is it better to use a kernel density estimate instead of a histogram?

A

The kernel density estimate is more robust than a histogram because a histogram can give different impressions of the distribution depending on the choice of bin size and bin placement.

With a KDE we replace the histogram blocks with a smooth (e.g. Gaussian) kernel function centred on each data point.

61
Q

What happens to the pdf if we use small or large bandwidths in the kernel density function?

A

The choice of bandwidth within KDE is extremely important to finding a suitable density estimate. Too narrow a bandwidth leads to a high-variance estimate (i.e., over-fitting), where the presence or absence of a single point makes a large difference. Too wide a bandwidth leads to a high-bias estimate (i.e., under-fitting) where the structure in the data is washed out by the wide kernel.

I.e. small bandwidth gives too much specificity and you capture specific patterns of the specific samples you have and not the general pattern of the true underlying pdf. Larger bandwidth washes out the truth and you underfit the estimate.

We can tune the bandwidth using cross-validation (see the sketch below).
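
A sketch of bandwidth tuning via cross-validation, using scikit-learn's KernelDensity and GridSearchCV; the data are made up, and KernelDensity's score (the total log-likelihood) is what GridSearchCV maximizes:

import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(10)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 300)])   # bimodal toy data

grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.logspace(-1, 1, 20)},
                    cv=5)
grid.fit(x[:, None])                     # KernelDensity expects a 2-D array

print(grid.best_params_["bandwidth"])    # cross-validated bandwidth choice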

62
Q

What are violin plots? Why are they preferable over for example box plots?

A

Violin plots combine the features of box plots with kernel density estimation (KDE) plots, providing a more detailed representation of the data distribution.

In addition to summarizing the central tendency and spread like box plots, violin plots also show the full distribution of the data by plotting KDEs mirrored on either side of the central box. This allows for a better understanding of the shape and skewness of the data distribution, providing richer insights into its characteristics. Violin plots are especially informative when dealing with complex or non-standard distributions, as they offer a more nuanced visualization of the data compared to box plots.

I.e. violin plots can show subpopulations of the values while boxplots cannot.

63
Q

Explain the expression for the multivariate normal distribution.

A

x is a vector of observations.

µ is a vector of mean values, with as many values as you have dimensions.

d is the number of dimensions.

Σ (the capital sigma) is the covariance matrix.

|Σ| is its determinant.
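
A sketch evaluating the density both with SciPy and with the explicit formula, to connect the symbols above to code; the numbers are made up:

import numpy as np
from scipy.stats import multivariate_normal

mu    = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x     = np.array([0.5, 0.5])
d     = len(mu)

# explicit formula: exp(-0.5 (x-mu)^T Sigma^-1 (x-mu)) / sqrt((2*pi)^d |Sigma|)
diff = x - mu
quad = diff @ np.linalg.inv(Sigma) @ diff          # collapses to a scalar
pdf_manual = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(pdf_manual, pdf_scipy)                       # the two agree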

64
Q

Why does the right hand side of the expression for a multivariate normal distribution always collapse to a scalar?

A

Because the quadratic form in the exponent collapses to a scalar and the normalizing factor consists only of scalars:

row vector * symmetric matrix = row vector,

row vector * column vector = scalar.

65
Q

How do we get the largest value of the pdf for a multivariate normal distribution?

A

The largest value on the y-axis falls at the mean value, so we get the largest value when the observation is equal to the mean.

The exponent is then equal to 0, so the exponential factor (the numerator) equals 1.

So the largest value is 1 / the normalizing denominator.

66
Q

What is important to think about when we simulate noise for stochastic observations of real values?

A

To draw the noise independently for each variable.

67
Q

Can I report a p value of 0?

A

No; with N resamples the smallest p-value we can resolve is 1/N, so we report p < 1/N rather than 0.

68
Q

What are the benefits of using Colab over installing everything on your own machine?

A

You save space on your own machine.

Updates are handled automatically when everything runs in the cloud, etc.

69
Q

You have a dataset D = [2,3]
Explain how you get the 50% percentile interval.

A

Enumerate the possible bootstrap sets and the mean of each:
[2,3] → 2.5
[2,2] → 2.0
[3,2] → 2.5
[3,3] → 3.0

Sort the bootstrap means: 2.0, 2.5, 2.5, 3.0 (N = 4). The 50% interval runs from the 25th to the 75th percentile of these means, using the rank positions

Percentile(25): rank = (N+1)(25/100) = 1.25
Percentile(75): rank = (N+1)(75/100) = 3.75

Interpolating between the sorted means at those ranks gives approximately [2.125, 2.875].

70
Q

Assume that you have 2 variables and a full quadratic model but you can only do 6 experiments, what method should you use to reduce the number of experiments?

A

The only choice we have is a D-optimal design, because a (fractional) factorial design can only have a number of runs that is a power of 2 and uses only 2 levels, so it cannot supply the 6 runs needed to estimate the 6 parameters of a full quadratic model.

71
Q

What is the difference between passive and active machine learning?

A

Passive machine learning is when you don’t choose your experiments and active is when you do get to choose.

72
Q

What are the advantages of using ridge regression over OLS?

A

Cost = Σ(yi − ŷi)² + λΣ(βj)²

The regularization term, λΣ(βj)², penalizes large coefficients and shrinks them towards zero.

Advantages include:
- multicollinearity mitigation
- reduced risk of overfitting
- shrinking of unimportant coefficients towards zero (though, unlike lasso, ridge does not set them exactly to zero, so it does not perform true variable selection)

Disadvantages include:
- reduced interpretability of the shrunken coefficients
- the penalty parameter must be chosen, usually via cross-validation
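
A closing sketch contrasting OLS and ridge on deliberately collinear features, assuming scikit-learn; alpha = 1.0 is an arbitrary illustration value:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(11)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.01, size=n)    # nearly identical to x1 -> strong multicollinearity
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(0, 0.5, size=n)

ols   = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(ols.coef_)     # OLS coefficients can blow up and take opposite signs
print(ridge.coef_)   # ridge shares the weight more evenly between the correlated features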