Statistical inference Flashcards

Question

What is ridge regression? What is the difference from OLS?

Answer 1

Ridge regression is similar to an ordinary least squares fitting in the sense that we are trying to find the coefficients of x so that the residuals are minimized. The difference is that with ridge regression we also have an alfa value that is described as a penalty for choosing large coefficients in the produced model. If alfa = 0 then it is the same thing as an ordinary least squares fitting because there is no penalty and the model is free to distribute the coefficients so that one feature is much more important than another.

Answer 2

If alfa is large then the model will be encouraged to distribute the coefficients equally across all features. This means that if we have multicollinearity, at least we reduce the risk of the correlated features being a lot more important for the change of the response value than the other features. The alfa also helps to reduce the risk for overfitting the model to the training data which is a risk with large coefficients.

Answer 3

It is a regression model where we predict the values of the response of a new observation by looking at the response values of the previously observed observations. When we add a new observation, we assign to it the y-value of the closest observation. If k=1 we use the closest observation. If k=2 then we use the 2 closest observations and the average of their y-values.

Answer 4

If K is set to the same value as the number of observations we will just take the same average each time for the predicted response of the new observed x.

Answer 5

1. The dataset is divided into 70% training (Dtrain) and 30% testing (Dtest).2. Use the function compare_models() to fit all models in a family (here regression models) to Dtrain using k-fold cross validation with default K=10 and rank performance of the models where the performance value is the average of each fold. 3. Tune the model to find the optimal set of hyper parameters that optimize the performance. The function tests different hyperparameters for the model and trains and tests them again using k-fold cross validation. 4. The performance is only based on the training set. Therefore we use the function predict_model to let the model predict the responses on Dtest. When we look at the performance values from Dtest they should not be significantly lower than the mean of all cross validations since this would indicate that the model is overly fitted to Dtrain.5. Finalize_model() is a function that trains and evaluates the model again on the entire dataset. After this we let the model predict responses on completely unseen data.

Answer 6

When training the final model on the entire dataset we have more data that could have meant different hyperparameters if that data was a part of the tuning. This means that the hyperparameters as well as the model family might be suboptimal now that we use the entire dataset and have more information.Another problem is that we use up all data we have for training and we have no data left to treat as “unseen” to test the models performance on new data.

Answer 7

The problem here is that the function does not consider other hyperparameters than the defaults and therefore it might favor one model over another. The knn with default k might be better than ridge with default parameters, but ridge might be the better choice if we looked at another hyperparameter value. This means that the model chosen could be suboptimal.

Answer 8

Running a full factorial design means that all possible combinations of factor levels are tested. If you have a 2 level design meaning that you have 2 values that the variable can take on (maximum and minimum, -1 and 1) and 5 variables, then the full factorial design is 2^5=32 . So the number of unique experiments to run is k^p and when you run the full factorial design, all unique experiments are tested.

Answer 9

The full factorial design usually leads to a large number of experiments to perform. We can reduce the number of experiments by doing a fractional factorial design by treating the higher order variables as the product of the other variables. If we for example treat x5 as the product of x1*x2*x3*x4, we do a full factorial (test all unique combinations) for x1-x4 and we will instead get 2^5-1=16 experiments.

Answer 10

We will not know if x5 or the product of x1-x4 is the important feature if x5 got a big coefficient.

Answer 11

To build the fractional factorial we use the factorial.build_factorial(5, 2**(5-1)) function from the dexpy library. The first 5 indicates that we have 5 independent variables, 2 indicates the level and 5-1 is the fractional factorial. From this we will get a table with all 5 variables as columns and the 16 experiments as rows.

Answer 12

We add center points to the design if the model we want to fit has quadratic elements. This is because a 2 level design cannot capture the models deviations from a straight line since we do not have enough levels - we need at least 3 for a quadratic model. Therefore we add center points to the features that have real values. The center points added are extra experimental runs with values for the features that are the midpoints of the two existing levels.This will lead to an extra number of experiments that corresponds to the number of unique combinations we can add to the experiment.

Answer 13

y = w0 + w1x1 + w2x2 + w3x3….. + wnxn.

Answer 14

y = w0 + w1x1 + w2x2 + w12x1x2 + w1x1x1 + w2x2x2.

Answer 15

Any model we choose to fit to our data can be described by yˆ = w^Tz where:w is a vector of the coefficients.z is a vector/matrix of the x-variables/the experimental designy^ is the predicted response from the model.

Answer 16

We can find the coefficients if we have the x values and the response values by using ordinary least squares that minimizes the sum of the squared residuals (minimizes (y - y^)^2).Searching for the minimum gives the closed form:wOLS = (Z^TZ) ^-1 = Z^Ty which is the same thing as Z^TZ = Z^Tw.This means that for any model we choose to fit we can use the closed form to find the coefficients that minimizes the residuals.

Answer 17

The values in the design will be encoded values to get them on the same scale so that they are directly comparable. When we later want to perform the actual experiment we would need to use the actual lows and highs in the design matrix.

Answer 18

Instead of using the fractional factorial design to reduce the number of experiments we can use the D-optimal design.This means that we use the k number of experiments that maximizes the determinant of the design matrix, meaning that we optimize the design.

Answer 19

A low determinant indicates that we have low values in the matrix and a determinant = 0 means that the matrix is singular and cannot be inverted. This means that the D optimal design is always going to include the “corners” of the squared box of all possible experiments because those values are the highest and will maximize the determinant.

Answer 20

The smallest number of experiments we can perform is the number of parameters of the model. If we have a full quadratic model with 2 variables then the smallest number of experiments will be 6 because we need at least w0 + w1x1 + w2x2 + w12x1x2 + w11x1x1 + w22x2x2.

Answer 21

dexpy.optimal.build_optimal (number of factors, number of experiments, type of model).

Answer 22

We can test if a D-optimal design gives the same importance for the coefficients as the full factorial design by using the D-optimal design to predict the response and then use ordinary least squares to find the coefficients.

Statistical inference Flashcards

(46 cards)