Model Selection Flashcards
What is a fixed factor?
An explanatory variable whose levels are themselves meaningful.
We wish to draw inferences about the effect of each particular level of the explanatory variable on the response variable.
The factor is exactly repeatable at all of its levels.
What is a random factor?
An explanatory variable whose levels are not themselves meaningful, e.g. individual fish in a population.
The levels are not exactly repeatable.
What do linear models assume?
Main factors impact the outcome in a predictable way and all other variation is due to error.
This assumes independence of errors (errors are distributed independently throughout the data set).
When is independence of errors violated?
1) If you have repeated measurements from different biological subjects, the effect of random differences between those subjects will not be distributed independently throughout the data set.
2) If the experimental design is nested, random differences at higher levels of nesting will not be distributed independently throughout the data set.
What is independence of errors?
Errors are distributed independently throughout the data set
Why is taking into account of nesting important?
If a biological individual is nested within a group, random variation specific to that individual may skew the observations from that group unless that randomness is accounted for in the model.
What are mixed models?
Models that allow us to include both random and fixed explanatory variables.
Why are mixed models useful?
They allow us to fit models that accurately account for the different sources of variation in the data set.
What is used to determine the importance of different factors in mixed models?
Likelihood ratio test
What is the likelihood of a model?
The probability of observing our data given the model.
Comparing likelihoods between models tells us whether the models differ in how well they fit the data.
What is the better likelihood score, 15 or 20?
20; higher likelihood scores indicate a better fit.
How would you know whether a removed explanatory variable has an effect, i.e. whether it is important?
If the p-value of the likelihood ratio test (comparing the original model with the model with the variable removed) is <= 0.05, then the explanatory variable you removed is important.
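The likelihood ratio test above can be sketched in a few lines of Python. This is a minimal illustration, assuming you already have the maximised log-likelihoods of the full and reduced (nested) models; the test statistic is compared against a chi-squared distribution whose degrees of freedom equal the number of parameters removed (here one). The log-likelihood values are hypothetical.

```python
import math

def lrt_pvalue(llf_full, llf_reduced, df_diff=1):
    """Likelihood ratio test between a full model and a nested model
    with an explanatory variable removed.
    llf_* are the maximised log-likelihoods of the two models."""
    stat = 2.0 * (llf_full - llf_reduced)  # test statistic, ~ chi-squared(df_diff)
    if df_diff == 1:
        # survival function of chi-squared(1), via the complementary error function
        return math.erfc(math.sqrt(stat / 2.0))
    raise NotImplementedError("general df would need an incomplete gamma function")

# Hypothetical log-likelihoods: removing the variable barely worsens the fit
p = lrt_pvalue(llf_full=-120.3, llf_reduced=-121.1)
print(round(p, 3))  # p > 0.05: the removed variable is not important
```

In practice a mixed-model library computes this comparison for you; the sketch only shows where the p-value comes from.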
What is a random intercepts model?
A model that assumes random differences between subjects (or groups) are captured by subject-specific intercepts, while the slope is constant for all subjects.
What is a random slopes and intercepts model?
A model where random effects from person to person are captured by person-specific gradients (slopes) as well as intercepts.
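A minimal simulation sketch of the two model structures, with made-up fixed effects (b0, b1) and variance values: in the random intercepts model only the intercept shifts between subjects, while in the random slopes and intercepts model the gradient shifts too.

```python
import random

random.seed(42)
b0, b1 = 2.0, 0.5  # hypothetical fixed effects shared by every subject

def simulate(sd_intercept, sd_slope, n_subjects=3, n_obs=4):
    """Generate (subject, x, y) rows. Each subject gets its own random
    intercept shift u and slope shift v, drawn from the stated spreads."""
    rows = []
    for subject in range(n_subjects):
        u = random.gauss(0, sd_intercept)  # random difference in intercept
        v = random.gauss(0, sd_slope)      # random difference in slope
        for x in range(n_obs):
            y = (b0 + u) + (b1 + v) * x + random.gauss(0, 0.1)  # residual error
            rows.append((subject, x, y))
    return rows

# Random intercepts model: slopes do not vary between subjects (sd_slope = 0)
ri = simulate(sd_intercept=1.0, sd_slope=0.0)
# Random slopes and intercepts model: both gradients and intercepts vary
rsi = simulate(sd_intercept=1.0, sd_slope=0.3)
```

Fitting either model to real data would be done with a mixed-model library; the sketch only shows what each model assumes about the data.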
What is feature selection?
Selection of the most relevant explanatory factors (attributes)
What are some advantages of feature selection?
1) shorter training times for algorithms
2) over fitting is less likely
3) simpler models are easier to implement
4) can be used to sift through data sets with 1000s of attributes, e.g. microarray data
What are the methods of feature selection?
1) Filtering methods
2) wrapper methods
3) embedded methods
What are filtering methods of feature selection?
Selecting the most interesting features by conducting a hypothesis test on each feature.
For each feature the null hypothesis “this feature does not explain a significant portion of the variation in the response variable” is tested.
Do you need to use a correction when filtering by hypothesis testing? Which corrections can you apply?
Yes:
Bonferroni correction
False discovery rate
What does the Bonferroni correction control?
Controls the probability of making at least one Type 1 error.
Makes sure the probability of making any Type 1 error across all tests is <= alpha.
(Stringent.)
What does the false discovery rate control?
Controls the overall proportion of Type 1 errors made.
You decide a predefined threshold of acceptable Type 1 errors (q); p-values below the rank-adjusted thresholds derived from q indicate significance.
(Less stringent.)
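Both corrections can be sketched in a few lines of Python. The p-values below are made up to show that the false discovery rate procedure (here the Benjamini-Hochberg version) is less stringent than Bonferroni.

```python
def bonferroni(pvals, alpha=0.05):
    """Bonferroni: each p-value must beat alpha divided by the number of tests,
    so the chance of even one Type 1 error stays <= alpha."""
    m = len(pvals)
    return {i for i, p in enumerate(pvals) if p <= alpha / m}

def benjamini_hochberg(pvals, q=0.05):
    """False discovery rate (Benjamini-Hochberg): sort the p-values, find the
    largest rank k with p_(k) <= (k / m) * q, and accept the first k features."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # feature indices by p-value
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * q:
            k = rank
    return set(order[:k])

pvals = [0.001, 0.012, 0.014, 0.2, 0.3]  # one made-up p-value per feature
print(sorted(bonferroni(pvals)))          # -> [0]        (stringent)
print(sorted(benjamini_hochberg(pvals)))  # -> [0, 1, 2]  (less stringent)
```

With five tests the Bonferroni threshold is 0.05 / 5 = 0.01, so only the first feature passes, while Benjamini-Hochberg accepts the first three.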
What are the strengths and weaknesses of filtering method?
Strengths:
1) computationally easy
2) fast to run
Weaknesses:
1) takes no account of interactions between explanatory variables
2) will select correlated features (which explain the same thing, so only one is needed)
3) confounding variation that isn't explained away may cause important features to be overlooked
What are wrapper methods for feature selection?
Methods that consider more than one feature at once.
Different subsets of features are tested to determine the collection of features which produce the best model of the data.
What are the different wrapper selection methods?
1) Stepwise regression
2) Recursive feature elimination
What is stepwise regression?
Stepwise increasing (forward) or decreasing (backward) the number of parameters in the model, comparing BIC/AIC at each step.
What are BIC and AIC?
Information criteria used to determine whether a model is being improved, by seeking the model that strikes the right balance between fitting and over-fitting the data.
Useful for wrapper methods, as you can see whether the model improves when features are added or removed.
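A minimal sketch of such a comparison, using the standard formulas AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of parameters, n the number of observations, and ln(L) the maximised log-likelihood; the values plugged in below are hypothetical.

```python
import math

def aic(llf, k):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * k - 2 * llf

def bic(llf, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln(L). Lower is better;
    it penalises extra parameters more heavily than AIC once n > 7."""
    return k * math.log(n) - 2 * llf

# Hypothetical forward step: adding a feature raises the log-likelihood
# from -100.0 to -99.2 at the cost of one extra parameter (n = 50 observations).
before = bic(llf=-100.0, k=3, n=50)
after = bic(llf=-99.2, k=4, n=50)
print(after < before)  # False: the small gain in fit does not justify the feature
```

Here the extra parameter costs ln(50), about 3.9 BIC points, but the improved fit only buys back 1.6, so the stepwise procedure would reject the added feature.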
What are the strengths and weaknesses of wrapper methods?
Strengths: 1) The explanatory power of some features might only be revealed once other features are accounted for.
2) Possible to allow for interactions
Weaknesses: 1) Higher computational time
2) More risk of over fitting
Is a BIC of 20 better than 10?
No
Lower BIC/AIC values indicate better models
What are Embedded methods of feature selection?
Feature selection is embedded in model construction process.
The algorithm itself decides which features are more important than others.