Statistical inference Flashcards

1
Q

What is the difference between drawing values from a distribution with or without replacement?

A

With replacement means that after we have drawn a value we put it back, so the draws do not change the distribution of values. Replacement is appropriate when we want to simulate a large or effectively infinite population, where one draw barely changes the probabilities of the next.

Without replacement means that we do not put the value back, so the distribution changes with each draw. This is used to simulate draws from a smaller, finite population.
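
A minimal sketch of the two sampling modes, assuming NumPy; the urn contents are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
urn = np.array([1, 1, 1, 0, 0])   # hypothetical urn: three 1s and two 0s

# With replacement: the urn is unchanged between draws (large-population behaviour)
with_repl = rng.choice(urn, size=10, replace=True)

# Without replacement: each draw removes a ball, so at most len(urn) draws are possible
without_repl = rng.choice(urn, size=len(urn), replace=False)

print(with_repl, without_repl)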

2
Q

What is a probability density function?

A

When we make stochastic observations with real values, the distribution of those values is described by a probability density function (pdf). If we were to draw 1000 samples from an urn and create a histogram of them, the shape of the histogram would follow the pdf of that variable: the peak of the histogram lies near the mean and its spread reflects the standard deviation (if the variable is normally distributed).

3
Q

How can we get the probability of drawing a value x from a mixture of multiple different distributions?

A

Having different distributions means that we have a mixture of pdfs with different means and variances.

When we have different object types from different distributions, the distribution we end up drawing from is the weighted sum of the individual pdfs:

p(x) = w1 * p1(x) + w2 * p2(x) + w3 * p3(x)

  1. Draw one sample from a discrete (categorical) distribution over the classes, with the weights as probabilities.
  2. If the outcome is class i, draw a sample from pi (see the sketch below).
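
A small sketch of this two-step sampling, assuming NumPy and a hypothetical three-component Gaussian mixture:

import numpy as np

rng = np.random.default_rng(1)
weights = np.array([0.5, 0.3, 0.2])    # w1, w2, w3 (must sum to 1)
means   = np.array([0.0, 5.0, 10.0])   # hypothetical component means
sds     = np.array([1.0, 0.5, 2.0])    # hypothetical component standard deviations

def sample_mixture(n):
    # Step 1: pick a component index with probability equal to its weight
    comps = rng.choice(len(weights), size=n, p=weights)
    # Step 2: draw from the chosen component's pdf
    return rng.normal(means[comps], sds[comps])

x = sample_mixture(1000)
print(x[:5])
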
4
Q

What parameters do you need to define to draw from a univariate normal distribution vs a multivariate normal distribution?

What do the results look like?

A

Univariate:
We need to define the mean and the standard deviation. The result is one value from the distribution.

Multivariate:
We need a vector of means, one per variable (dimension), and a covariance matrix that describes how the individual variables vary and how they covary.

The result is a vector with as many values as there are dimensions.
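
A sketch of both kinds of draw, assuming NumPy; the means and covariance below are made-up illustration values:

import numpy as np

rng = np.random.default_rng(2)

# Univariate: one mean and one standard deviation -> a single value
x_uni = rng.normal(loc=2.0, scale=0.5)

# Multivariate: a mean vector and a covariance matrix -> one value per dimension
mu    = np.array([2.0, -1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])   # diagonal = variances, off-diagonal = covariances
x_multi = rng.multivariate_normal(mu, Sigma)

print(x_uni, x_multi)   # a scalar and a vector of length 2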

5
Q

Why is it important to incorporate noise when we simulate stochastic observations?

A

When we simulate stochastic observations we should take into account that real measurements usually contain noise that affects the values we draw. The noise matters because it can make a linear relationship appear non-linear if we have too few samples; the impact of the noise shrinks as the number of samples grows.
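
A small simulation sketch, assuming NumPy and a hypothetical linear relationship y = 2x + 1:

import numpy as np

rng = np.random.default_rng(3)
n = 20                                # try 20 vs 2000 to see the effect of sample size
x = rng.uniform(0, 10, size=n)
noise = rng.normal(0, 2.0, size=n)    # noise drawn independently for each observation
y = 2 * x + 1 + noise                 # noisy observations of a linear relationship

# With few samples the noise can hide the linearity; with many samples the slope ~2 reappears
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)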

6
Q

What is a Hidden Markov model?

A

Markov models are used to simulate stochastic processes.

They are memoryless models (the next state depends only on the present state) and are based on probabilities. In a hidden Markov model the states are not observed directly; the model generates the process using state transition probabilities and emission probabilities (the probability of each observed output given the hidden state).

7
Q

What is a conventional confidence interval?

A

Conventional confidence interval: an interval constructed so that, if you were to sample many times and construct a 95% confidence interval each time, the true mean would fall inside the interval in 95% of the repetitions.

This is because the conventional confidence interval is frequentist. There is still a 5% chance that the true mean falls among the "extreme values" not covered by the interval, so about 1 in 20 constructed confidence intervals will miss the true mean.

Guaranteed coverage of 1 − alpha.

8
Q

What is a bootstrap 95% percentile interval? What are the benefits and setbacks of using these intervals?

A

These are created by resampling. If you were to create many bootstrap datasets, calculate the mean of each and form a 95% percentile interval of the bootstrap means, that interval does not claim to contain the true mean with 95% probability, since it is not designed to have that coverage. Instead it tells you how uncertain you should be about the mean estimated from your sample.

The 2.5 and 97.5 percentiles of the bootstrap means give the lower and upper bounds of the percentile interval. The interval captures the middle 95% of the bootstrap distribution, leaving 2.5% in each tail. If the interval is very wide, we should be more uncertain about our estimated mean.

The benefit is that with bootstrap intervals you do not need to worry about the distribution of the variable, as you do with conventional confidence intervals.

9
Q

How do we get the 95% bootstrap percentile interval?

A

We draw bootstrap datasets from the original dataset (sampling with replacement), calculate the metric we are interested in for each bootstrap set and save it in a list (this could be a correlation, mean, fraction, standard deviation etc.). Then we calculate the 2.5 and 97.5 percentiles of the bootstrap values; those are the lower and upper bounds of the percentile interval [a, b].

This interval tells you how certain you can be about your estimated metric, because it shows how much it can vary just by random chance.

The bigger the original dataset is, the better each bootstrap set represents it, which leads to a narrower interval and more certainty.
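
A sketch of the procedure, assuming NumPy; the data and the number of resamples are made up:

import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(10, 3, size=50)      # hypothetical original sample
n_boot = 10_000

boot_means = np.empty(n_boot)
for b in range(n_boot):
    # resample with replacement, same size as the original dataset
    resample = rng.choice(data, size=len(data), replace=True)
    boot_means[b] = resample.mean()

lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap percentile interval: [{lower:.2f}, {upper:.2f}]")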

10
Q

Why is it more efficient to create a bootstrap interval after you have collected your sample population?

A

We create bootstrap sets instead of repeating the whole data collection many times, which would be far more expensive. Bootstrapping is also beneficial because we do not have to worry about the distribution of the variable we are looking at; a conventional confidence interval can be misleading if the variable is not normally distributed.

11
Q

What is hypothesis testing?

A

In hypothesis testing you test how likely it would be to obtain a value at least as extreme as the observed one if the null hypothesis were true.

12
Q

How is hypothesis testing done?

A

You calculate your observed value of the test metric.

Then you create bootstrap sets from the null distribution and calculate the metric for each bootstrap set. You obtain the p-value by counting how many times the bootstrap metrics were as extreme as, or more extreme than, the observed value and dividing that count by the number of bootstrap sets. If you saw very few values that extreme (usually under 5%), the null hypothesis is unlikely and we reject it.
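
A sketch of such a resampling test, assuming NumPy; here the hypothetical null hypothesis is that two groups share the same mean, simulated by resampling from the pooled data:

import numpy as np

rng = np.random.default_rng(5)
group_a = rng.normal(10.0, 2.0, size=30)    # hypothetical observations
group_b = rng.normal(11.0, 2.0, size=30)

observed = group_b.mean() - group_a.mean()
pooled = np.concatenate([group_a, group_b])  # the null: both groups come from the same pool

n_boot = 10_000
null_stats = np.empty(n_boot)
for i in range(n_boot):
    a = rng.choice(pooled, size=len(group_a), replace=True)
    b = rng.choice(pooled, size=len(group_b), replace=True)
    null_stats[i] = b.mean() - a.mean()

# two-sided p-value: how often is the null statistic at least as extreme as the observed one?
p_value = np.mean(np.abs(null_stats) >= abs(observed))
print(p_value)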

13
Q

How is hypothesis testing with percentile intervals done?

A

We check whether 0 (the value under the null) lies inside the 95% percentile interval; if it does, the null is plausible and we cannot reject it. This is usually only done when the null states that the difference between two groups is 0.

14
Q

What is the interquartile range? What is the benefit of using this over standard deviation?

A

IQR = Q3 - Q1. It ignores the lowest 25% and the highest 25% of the values and measures the spread of the remaining middle 50%. It is a more robust measure of spread than the standard deviation, which is more sensitive to outliers because it uses every value in the dataset.
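
A tiny sketch, assuming NumPy and made-up data with one outlier:

import numpy as np

x = np.array([1, 2, 2, 3, 3, 4, 5, 100])
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
print(iqr, x.std(ddof=1))   # the outlier inflates the standard deviation far more than the IQR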

15
Q

What is regression?

A

Regression: The problem of using a set of training examples to build a prediction model
f() that produces predictions f(x) where the observed response values yn are real valued (continuous).

16
Q

What is mean absolute error?

A

Mean absolute error = Sum of all absolute errors between true response and predicted response divided by the number of samples. So the average of the absolute errors.

17
Q

What is mean squared error?

A

Mean squared error = Sum of each error squared / number of samples.

18
Q

What is root mean squared error?

A

Root mean squared error = Square root of MSE.

19
Q

What is the R^2 value?

A

How much of the variation in the response is explained by the features.

Calculated as 1 − (sum of squared errors / sum of squared differences between each yi and the sample mean of y).

20
Q

What is mean absolute percentage error?

A

The average of the ratio between the absolute error and the absolute true value, usually expressed as a percentage.
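
A sketch computing all of these error metrics with NumPy; the toy vectors are made up:

import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_true - y_pred
mae  = np.mean(np.abs(errors))                    # mean absolute error
mse  = np.mean(errors ** 2)                       # mean squared error
rmse = np.sqrt(mse)                               # root mean squared error
r2   = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
mape = np.mean(np.abs(errors / y_true))           # mean absolute percentage error (as a fraction)

print(mae, mse, rmse, r2, mape)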

21
Q

What is k-fold cross validation?

A

K-fold cross validation is a resampling method where a model is repeatedly trained and evaluated, with the training and testing done on different subsets of the training data.

22
Q

Explain how k-fold cross validation is done?

A

The training dataset is divided into k subsets (folds) and the model is trained on k−1 of them and tested on the one subset that was excluded from training. This is repeated k times, so that every subset is used for testing exactly once and the test is always done on data the model was not trained on.

We then take the average of the k performance values to get a good estimate of how the model will perform on unseen data, given a training set of this size.
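
A sketch of the procedure with scikit-learn's KFold; the data and model are placeholders:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)

model = LinearRegression()
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# one score per fold, each computed on the held-out subset
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
print(scores.mean())   # average over the k folds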

23
Q

In k-fold cross validation, what are the pros and cons of using large vs small k?

A

If we use fewer folds (smaller K), we reduce the training time and computational cost, but each training set then covers a smaller fraction of the data, so the performance estimate tends to be pessimistic compared to a model trained on the full training set.

With larger K, each training set covers almost all of the data, which gives a less biased estimate of performance, but the training sets overlap heavily (so the fold estimates are correlated and the overall estimate can have higher variance), and the computational cost grows with K.

24
Q

What is a hyper parameter? What is the hyperparameter of ridge regression and k-nearest neighbor regression?

A

A hyperparameter is a setting or configuration for a machine learning model that is set before the training process begins. Unlike the model parameters, which the algorithm learns from the training data, hyperparameters are external factors that influence how the learning process takes place.

The hyperparameter of ridge regression is the alpha penalty value, and for k-nearest neighbor regression it is the number of nearest neighbors we choose to look at (k).

25
Q

What is ridge regression? What is the difference from OLS?

A

Ridge regression is similar to ordinary least squares fitting in the sense that we are trying to find the coefficients of x so that the residuals are minimized. The difference is that ridge regression also has an alpha value, which acts as a penalty for choosing large coefficients in the produced model.

If alpha = 0 it is the same as an ordinary least squares fit, because there is no penalty and the model is free to distribute the coefficients so that one feature becomes much more important than another.

26
Q

Why is ridge regression better if you have correlation between some of the features in your dataset?

A

If alpha is large the model is encouraged to shrink the coefficients and spread them more evenly across the features. This means that if we have multicollinearity, we at least reduce the risk that the correlated features dominate the predicted response compared to the other features. The penalty also reduces the risk of overfitting the model to the training data, which is a risk with large coefficients.

27
Q

Explain k-nearest neighbor regression.

A

It is a regression model where we predict the response of a new observation by looking at the response values of previously observed data points.

For a new observation, if k = 1 we assign it the y-value of the single closest observation; if k = 2 we use the 2 closest observations and the average of their y-values, and so on.
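
A minimal sketch with scikit-learn's KNeighborsRegressor on toy 1-D data, using k = 2:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0]])   # made-up 1-D feature
y = np.array([1.0, 2.0, 3.0, 10.0])

knn = KNeighborsRegressor(n_neighbors=2).fit(X, y)
# prediction = average y of the 2 nearest training points
print(knn.predict([[2.4]]))    # neighbours are x=2 and x=3 -> (2 + 3) / 2 = 2.5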

28
Q

In k-nearest neighbor regression, what happens if we set k to be equal to the number of observations we have?

A

If k is set equal to the number of observations, every prediction is the same: the average of all response values, regardless of where the new x lies.

29
Q

What is the workflow of PyCaret?

A
  1. The dataset is divided into 70% training data (Dtrain) and 30% test data (Dtest).
  2. Use compare_models() to fit all models in a family (here regression models) to Dtrain using k-fold cross validation (default K = 10) and rank the models, where each model's performance value is the average over the folds.
  3. Tune the chosen model to find the set of hyperparameters that optimizes performance. The function tries different hyperparameter values and trains and evaluates them again using k-fold cross validation.
  4. The performance so far is based only on the training set. Therefore we use predict_model() to let the model predict the responses on Dtest. The performance on Dtest should not be markedly lower than the cross-validation average, since that would indicate that the model is overfitted to Dtrain.
  5. finalize_model() retrains the model on the entire dataset. After this we let the model predict responses on completely unseen data (a rough code sketch follows below).
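
A rough sketch of this workflow, assuming PyCaret's regression module; function and argument names follow recent PyCaret versions and may differ in yours, and the dataset path and target column are hypothetical:

import pandas as pd
from pycaret.regression import (setup, compare_models, tune_model,
                                predict_model, finalize_model)

df = pd.read_csv("my_data.csv")                  # hypothetical dataset with a 'target' column

setup(data=df, target="target", train_size=0.7)  # 70/30 split of the data
best  = compare_models()                         # cross-validated comparison of model families
tuned = tune_model(best)                         # hyperparameter search, again with k-fold CV
predict_model(tuned)                             # evaluate on the held-out 30% (Dtest)
final = finalize_model(tuned)                    # refit on the entire dataset
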
30
Q

What is the setback of using the finalize_model function in PyCaret?

A

When training the final model on the entire dataset we have more data, which could have led to different hyperparameters if that data had been part of the tuning. This means that the hyperparameters, and even the chosen model family, might be suboptimal now that we use the entire dataset and have more information.

Another problem is that we use up all the data we have for training, leaving no data to treat as "unseen" for testing the model's performance on new data.

31
Q

What is the setback of using the compare_models function in PyCaret?

A

The problem is that the function only considers the default hyperparameters, and therefore it might favour one model over another. kNN with the default k might beat ridge with its default parameters, yet ridge might be the better choice with another hyperparameter value. This means that the chosen model could be suboptimal.

32
Q

In experimental design, what does it mean to run a full factorial design?

A

Running a full factorial design means that all possible combinations of factor levels are tested. If you have a 2-level design, meaning each variable can take 2 values (minimum and maximum, coded -1 and +1), and 5 variables, then the full factorial design has 2^5 = 32 runs.

So the number of unique experiments is (number of levels)^(number of factors), and in a full factorial design all of them are run.

33
Q

What is a fractional factorial design? Why do we use it?

A

The full factorial design usually leads to a large number of experiments to perform. We can reduce the number of experiments with a fractional factorial design by treating the highest-order variable as the product of the other variables. If we for example treat x5 as the product x1*x2*x3*x4, we run a full factorial (all unique combinations) for x1-x4 and instead get 2^(5-1) = 16 experiments.

34
Q

What is the setback of using a fractional factorial design?

A

If x5 gets a big coefficient we will not know whether it is x5 itself or the interaction x1*x2*x3*x4 that is the important effect; the two are confounded (aliased).

35
Q

How do we build a fractional factorial design?

A

To build the fractional factorial we use the build_factorial function from the dexpy library, e.g. factorial.build_factorial(5, 2**(5-1)). The first argument (5) is the number of factors and the second (2**(5-1) = 16) is the number of runs, i.e. the half fraction of the two-level design. From this we get a table with all 5 variables as columns and the 16 experiments as rows.
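
A minimal sketch of the call described above; the signature follows the card, the import path from dexpy.factorial is assumed, and only the print is added:

from dexpy.factorial import build_factorial

# half-fraction two-level design for 5 factors: 2**(5-1) = 16 runs
design = build_factorial(5, 2**(5 - 1))
print(design)   # 16 rows (experiments) x 5 columns (coded factor levels -1/+1)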

36
Q

What are centerpoints in experimental design and why do we need them?

A

We add center points to the design if the model we want to fit has quadratic terms. This is because a 2-level design cannot capture deviations from a straight line; two levels are not enough, and we need at least 3 for a quadratic model. Therefore we add center points for the features that take real values.

The center points are extra experimental runs where these features are set to the midpoint between the two existing levels.

This leads to a number of extra experiments corresponding to the number of unique center-point combinations we can add to the design.

37
Q

What does a plain linear model look like?

A

y = w0 + w1x1 + w2x2 + w3x3 + … + wnxn

38
Q

What does a full quadratic model with 2 variables look like?

A

y = w0 + w1x1 + w2x2 + w12x1x2 + w11x1x1 + w22x2x2

39
Q

What is ŷ = w^T z?

A

Any model we choose to fit to our data can be described by ŷ = w^T z, where:

w is a vector of the coefficients,

z is a vector/matrix of the x-variables / the experimental design (including any interaction and quadratic terms the model uses),

ŷ is the predicted response from the model.

40
Q

How can we use ŷ = w^T z to find the coefficients?

A

We can find the coefficients if we have the x values and the response values by using ordinary least squares, which minimizes the sum of the squared residuals (minimizes (y − ŷ)^2).

Searching for the minimum gives the closed form:
wOLS = (Z^T Z)^-1 Z^T y, which is equivalent to solving the normal equations Z^T Z w = Z^T y.

This means that for any model we choose to fit, we can use the closed form to find the coefficients that minimize the residuals.
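
A sketch of the closed form in NumPy, with a toy design matrix and response; np.linalg.solve is used rather than an explicit inverse for numerical stability:

import numpy as np

rng = np.random.default_rng(7)
n = 50
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
Z = np.column_stack([np.ones(n), x1, x2])    # columns: intercept, x1, x2
w_true = np.array([1.0, 2.0, -3.0])
y = Z @ w_true + rng.normal(0, 0.1, n)

# normal equations: Z^T Z w = Z^T y
w_ols = np.linalg.solve(Z.T @ Z, Z.T @ y)
print(w_ols)    # close to [1, 2, -3]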

41
Q

Why do we encode the values of the features in the design matrix?

A

The values in the design matrix are encoded (e.g. -1/+1) to put all factors on the same scale so that the coefficients are directly comparable. When we later perform the actual experiments we translate back to the real low and high values of each factor.

42
Q

What is D-optimal design?

A

Instead of using the fractional factorial design to reduce the number of experiments we can use a D-optimal design.

This means that we use the k experiments that maximize the determinant of the information matrix Z^T Z of the design, meaning that we optimize the design itself.

43
Q

In D-optimal design, what does it mean that we find the experiments that maximize the determinant? What experiments are going to be included in the design?

A

A low determinant indicates that the matrix has small eigenvalues, and a determinant of 0 means that the matrix is singular and cannot be inverted. In practice the D-optimal design will therefore tend to include the "corners" of the box of all possible experiments, because those extreme settings maximize the determinant.

44
Q

When we want to reduce the number of experiments to perform with D-optimal design, what is the smallest number of experiments we can choose?

A

The smallest number of experiments we can perform is the number of parameters of the model. For a full quadratic model with 2 variables the smallest number of experiments is 6, because the model has 6 parameters: y = w0 + w1x1 + w2x2 + w12x1x2 + w11x1x1 + w22x2x2.

45
Q

What is the function for creating the D-optimal design?

A

dexpy.optimal.build_optimal (number of factors, number of experiments, type of model).

46
Q

How can we test if a D-optimal design finds the same coefficients as the full factorial design?

A

We can test whether a D-optimal design gives the same importance to the coefficients as the full factorial design by running (or simulating) the responses for the D-optimal experiments, estimating the coefficients with ordinary least squares, and comparing them with the coefficients obtained from the full factorial design.

47
Q

Explain the difference between the data modeling culture and the algorithmic modeling culture according to Breiman’s article.

A

The data modeling culture assumes one stochastic model of how the world generates the data; we then sample from that model, create confidence intervals etc., and test whether the model can be rejected. Whether the model is a good fit is judged by tests such as residual examination and goodness-of-fit tests. Breiman argues that this is a very weak test, because the model is not necessarily correct just because it cannot be rejected, and such tests only reject the model when the lack of fit is extreme.

The algorithmic modeling culture instead judges a model by its predictive accuracy, measured for example by cross-validation.

48
Q

Explain screening and response surface modeling in experimental design.

A

In the screening step we choose a simple screening model y = w^T z, where z only contains linear terms, to study which variables seem to have the biggest impact on the response; we perform OLS on the linear model and judge the size of the coefficients. The aim is to find the features that have the highest impact on the response.

We then use the variables selected in the screening step to build a full quadratic response-surface model y = w^T z, where z now contains all interaction and quadratic terms. We perform the experiments, collect the responses and run OLS again, this time on the full model, to estimate its coefficients.

49
Q

Why is it hard to judge the importance of variables using interaction terms?

A

Because the values of the interacting variables affect each other: if one of them is 0 the whole interaction term is 0 too, regardless of the other, so the contribution of an individual variable cannot be judged in isolation.

50
Q

Explain the maximum likelihood estimate and the maximum posteriori estimate.

A

Maximum likelihood estimation is like saying "I have no opinion about the parameter before I see the data": the MLE is the parameter value with the highest likelihood given the data, calculated using the likelihood function.

Bayesian parameter estimation is like saying "I do have an opinion before I see the data": the MAP estimate is the parameter value that maximizes the posterior, which combines the likelihood with the prior knowledge. This is calculated using Bayes' theorem.

51
Q

How do we calculate the likelihood function?

A

It is the product of the probabilities (or probability densities) of all the observed values, viewed as a function of the parameter.

52
Q

Explain Bayes theorem

A

p(θ|D, I) = p(D|θ, I)p(θ|I) / p(D|I)

The posterior is the likelihood * the prior / a normalization constant (the evidence p(D|I)).

53
Q

What happens when the prior in Bayes theorem is constant?

A

Then the only term left in Bayes' theorem that varies with the parameter is the likelihood, meaning that to maximize the posterior (MAP) it is enough to maximize the likelihood, so MAP = MLE.

If the prior is not constant, then we have to maximize the product likelihood * prior to find the MAP.

54
Q

How do we maximize the maximum likelihood estimation?

A

We draw values from an assumed distribution and, for each parameter value in a grid search, we calculate the probability of each observed x given that parameter. We then multiply these probabilities to get the likelihood value and simply choose the parameter value with the highest likelihood.
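
A sketch of a grid-search MLE for the mean of a normal distribution with known standard deviation, assuming NumPy and SciPy; log-likelihoods are summed instead of multiplying raw probabilities, to avoid numerical underflow:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
data = rng.normal(loc=3.0, scale=1.0, size=100)   # hypothetical observations, known sd = 1

mu_grid = np.linspace(0, 6, 601)                  # candidate parameter values
log_liks = np.array([norm.logpdf(data, loc=mu, scale=1.0).sum() for mu in mu_grid])

mle = mu_grid[np.argmax(log_liks)]
print(mle)    # close to the sample mean, data.mean()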

55
Q

If the Pearson’s correlation coefficient = 0, what conclusions can we draw?

A

That there is no linear dependency. There could still be a non-linear relationship.

56
Q

You carry out a performance test of 10 new instruments and you count how many of them did not pass the test. Your finding is that all instruments passed the test! However, since you have only tested 10 instruments, a lot of uncertainty remains.

Assume for example that the actual (true) probability of producing a faulty instrument is 5%. Explain in pseudocode/words (no code) how you, by means of an urn model and resampling, can calculate the probability of observing no faulty instruments if you pick 10 randomly from the factory assembly line.

A
  1. Create an urn that reflects the true error rate (5% faulty, 95% working instruments).
  2. Repeatedly draw 10 objects from the urn with replacement.
  3. Probability ≈ number of repetitions with no faulty instrument / number of repetitions (a simulation sketch follows below).
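
Although the card asks only for pseudocode, here is a minimal NumPy simulation of the same urn model, for reference:

import numpy as np

rng = np.random.default_rng(9)
p_faulty = 0.05
n_reps = 100_000

no_faulty = 0
for _ in range(n_reps):
    # 0 = working, 1 = faulty; draw 10 instruments with replacement from the "urn"
    draws = rng.choice([0, 1], size=10, replace=True, p=[1 - p_faulty, p_faulty])
    if draws.sum() == 0:
        no_faulty += 1

print(no_faulty / n_reps)   # should be close to 0.95**10 ≈ 0.60
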
57
Q

What is the kernel density estimate? What happens with larger bandwidth?

A

It is an estimate of the pdf of the underlying variable of a distribution and is a means of data smoothing. It estimates the pdf in a continuous way, whereas a histogram approximates it in a more discrete way.

With larger bandwidth the KDE gets smoother, the interval the estimated pdf covers gets wider, and the maximum value on the y-axis gets lower: each kernel's contribution to the peak is roughly its own maximum (which depends on the bandwidth) divided by the number of observations, so wider kernels give a lower peak.

58
Q

What are the free parameters of the kernel density estimate?

A

kernel - the distribution used for smoothing; it gives the shape of the curve placed at each observation.

bandwidth - the width of the kernel, used to smooth the data. Too high a bandwidth and the smoothing erases the patterns of the pdf.

59
Q

What is a density estimate? Give examples of two density estimators

A

A density estimator is an algorithm which seeks to model the probability
distribution that generated a dataset.

Kernel density estimate and histograms

60
Q

Why is it better to use a kernel density estimate instead of a histogram?

A

The kernel density estimate is more robust than a histogram because a histogram can give different impressions of the distribution depending on the choice of bin size and bin placement.

With a KDE we replace the histogram blocks with a smooth (e.g. Gaussian) kernel function centred on each data point.

61
Q

What happens to the pdf if we use small or large bandwidths in the kernel density function?

A

The choice of bandwidth within KDE is extremely important to finding a suitable density estimate. Too narrow a bandwidth leads to a high-variance estimate (i.e., over-fitting), where the presence or absence of a single point makes a large difference. Too wide a bandwidth leads to a high-bias estimate (i.e., under-fitting) where the structure in the data is washed out by the wide kernel.

I.e. small bandwidth gives too much specificity and you capture specific patterns of the specific samples you have and not the general pattern of the true underlying pdf. Larger bandwidth washes out the truth and you underfit the estimate.

We can tune the bandwidth using cross-validation (see the sketch below).
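
A sketch of bandwidth tuning via cross-validation, using scikit-learn's KernelDensity and GridSearchCV; the data are made up, and KernelDensity's score (the total log-likelihood) is what GridSearchCV maximizes:

import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(10)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(3, 1.0, 300)])   # bimodal toy data

grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.logspace(-1, 1, 20)},
                    cv=5)
grid.fit(x[:, None])                     # KernelDensity expects a 2-D array

print(grid.best_params_["bandwidth"])    # cross-validated bandwidth choice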

62
Q

What are violin plots? Why are they preferable over for example box plots?

A

Violin plots combine the features of box plots with kernel density estimation (KDE) plots, providing a more detailed representation of the data distribution.

In addition to summarizing the central tendency and spread like box plots, violin plots also show the full distribution of the data by plotting KDEs mirrored on either side of the central box. This allows for a better understanding of the shape and skewness of the data distribution, providing richer insights into its characteristics. Violin plots are especially informative when dealing with complex or non-standard distributions, as they offer a more nuanced visualization of the data compared to box plots.

I.e. violin plots can show subpopulations of the values while boxplots cannot.

63
Q

Explain the expression for the multivariate normal distribution.

A

x is a vector of observations.

µ is a vector of mean values, with as many values as you have dimensions.

d is the number of dimensions.

Σ (the capital sigma) is the covariance matrix.

|Σ| is its determinant.
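
A sketch evaluating the density both with SciPy and with the explicit formula, to connect the symbols above to code; the numbers are made up:

import numpy as np
from scipy.stats import multivariate_normal

mu    = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x     = np.array([0.5, 0.5])
d     = len(mu)

# explicit formula: exp(-0.5 (x-mu)^T Sigma^-1 (x-mu)) / sqrt((2*pi)^d |Sigma|)
diff = x - mu
quad = diff @ np.linalg.inv(Sigma) @ diff          # collapses to a scalar
pdf_manual = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(pdf_manual, pdf_scipy)                       # the two agree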

64
Q

Why does the right hand side of the expression for a multivariate normal distribution always collapse to a scalar?

A

Because the quadratic form in the exponent collapses to a scalar and the normalizing factor consists only of scalars:

row vector * symmetric matrix = row vector,

row vector * column vector = scalar.

65
Q

How do we get the largest value of the pdf for a multivariate normal distribution?

A

The largest value on the y-axis falls at the mean value, so we get the largest value when the observation is equal to the mean.

The exponent is then equal to 0, so the exponential factor (the numerator) equals 1.

So the largest value is 1 / the normalizing denominator.

66
Q

What is important to think about when we simulate noise for stochastic observations of real values?

A

To draw the noise independently for each variable.

67
Q

Can I report a p value of 0?

A

No; with N resamples the smallest p-value we can resolve is 1/N, so we report p < 1/N rather than 0.

68
Q

What are the benefits of using Colab over installing everything on your own machine?

A

You save space on your own machine.

Updates are handled automatically when everything runs in the cloud, etc.

69
Q

You have a dataset D = [2,3]
Explain how you get the 50% percentile interval.

A

Enumerate the possible bootstrap sets and the mean of each:
[2,3] → 2.5
[2,2] → 2.0
[3,2] → 2.5
[3,3] → 3.0

Sort the bootstrap means: 2.0, 2.5, 2.5, 3.0 (N = 4). The 50% interval runs from the 25th to the 75th percentile of these means, using the rank positions

Percentile(25): rank = (N+1)(25/100) = 1.25
Percentile(75): rank = (N+1)(75/100) = 3.75

Interpolating between the sorted means at those ranks gives approximately [2.125, 2.875].

70
Q

Assume that you have 2 variables and a full quadratic model but you can only do 6 experiments, what method should you use to reduce the number of experiments?

A

The only choice we have is a D-optimal design, because a (fractional) factorial design can only have a number of runs that is a power of 2 and uses only 2 levels, so it cannot supply the 6 runs needed to estimate the 6 parameters of a full quadratic model.

71
Q

What is the difference between passive and active machine learning?

A

Passive machine learning is when you don’t choose your experiments and active is when you do get to choose.

72
Q

What are the advantages of using ridge regression over OLS?

A

Cost = Σ(yi − ŷi)² + λΣ(βj)²

The regularization term, λΣ(βj)², penalizes large coefficients and shrinks them towards zero.

Advantages include:
- multicollinearity mitigation
- reduced risk of overfitting
- shrinking of unimportant coefficients towards zero (though, unlike lasso, ridge does not set them exactly to zero, so it does not perform true variable selection)

Disadvantages include:
- reduced interpretability of the shrunken coefficients
- the penalty parameter must be chosen, usually via cross-validation
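
A closing sketch contrasting OLS and ridge on deliberately collinear features, assuming scikit-learn; alpha = 1.0 is an arbitrary illustration value:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(11)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(0, 0.01, size=n)    # nearly identical to x1 -> strong multicollinearity
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(0, 0.5, size=n)

ols   = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(ols.coef_)     # OLS coefficients can blow up and take opposite signs
print(ridge.coef_)   # ridge shares the weight more evenly between the correlated features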