Multilevel Modelling Flashcards
What is multilevel data?
Data that has clustering at different hierarchical levels e.g. different pupils in different schools which are based in different areas.
What are 3 types of multilevel data?
Geographic: – e.g data of country will have data from different regions
Relational- relationships between different factors e.g., Schools, classrooms, pupils
Countries, regions, towns, families
Company, office, team, individuals
TemporalLongitudinal: Different time points
Where do we find multilevel data?
- Naturally occurring e.g schools
- Due to sample design e.g. clustered, multi-stage)
- Repeated measures
What is one way of differentiating between data that is multilevel and data that is not?
If can identify different hierarchies then that is a good example of multilevel data
If data is a single level with subgroups e.g. office with different ethnicities, this would not be considered multi level data
What are two reasons for computing multilevel analysis?
1.Clustering as a problem - Sometimes we have data that is clustered and must take this into account
- Clustering as a substantive interest- Answer multilevel hypotheses
Why can clustering be a problem?
- Clustered data violate the assumptions of simple random sampling.
- Traditional multiple regression techniques treat the units of analysis (e.g. individuals) as independent observations.
- If we fit hierarchical data with simple OLS models, (i.e. ignoring any clustering) the standard errors of our regression coefficient will be underestimated, leading to an overstatement of statistical significance.
Why can clustering be a problem?
- Clustered data violate the assumptions of simple random sampling.
- Traditional multiple regression techniques treat the units of analysis (e.g. individuals) as independent observations.
- If we fit hierarchical data with simple Ordinary least-squares (OLS) models, (i.e. ignoring any clustering) the standard errors of our regression coefficient will be underestimated, leading to an overstatement of statistical significance.
Why can clustering be a substantive interest?
We are often interested in estimating the amount of variation between groups, and the extent to which it can be explained by group-level explanatory variables.
E.g., Is there between-school variability in students’ academic progress?
Do health outcomes vary across areas?
Are between-area variations in health explained by differences in access to health services?
Is the amount of variation between areas different for rural and urban areas?
What are alternative data analysis strategies if clustering is not of substantive interest?
Fixed effects regression
Design-based modelling
Adjusted standard errors - Used most often when need to adjust for potential clustering
According to Richard McElreath what should multilevel regression be the default approach?
- Improved estimates for repeat sampling
- Improved estimates for imbalance in sampling
- Estimates of variation
- Avoid averaging, retain variation
What software can be used for multilevel analysis?
Stata
R
Julia
Python
Stan
brms
What terms are classified under multilevel modelling?
Multilevel model
Random effects model
Mixed model
Random coefficient model
Random parameter models
Hierarchical model
Nested models
Split-plot designs
Subject specific models
Variance component models
What is variance a measure of and how is it defined?
Variance is a measure of how spread out your data is.
Defined as the average of the squared differences from the mean.
What can we think of a residual as?
As a measure of error, or how far off your predicted value was from the actual value.
In a single level Ordinary least-squares regression what is the Bo represented by?
Bo - coefficient
What does a least squares regression select?
What is the value called?
The line with the lowest total sum of squared prediction errors.
Sum of Squares of Error, or SSE.
What does the notation yi = B0 +B1xi + Ei mean?
Presents single level model.
I represents number of individuals in sample
What does the notation yij +B0 + B1xij+ Eij mean?
Two level notation, outcome y for individual(s) in cluster j(ranges from 1 to however many clusters we have. If there was a three level model we would include convention k to represent this.
With clustered data, the residuals within each cluster are likely to be correlated. What does this mean?
Individuals within a cluster are likely to be more similar to one another than they are to individuals from another clusters
Explain what the formula yij = B0 + uj + Eij mean?
Uj represents distance from overall intercept B0 to group intercept
Eij represents distance from group level mean to observed data point at individual level
What is meant by partitioning variance?
Having two sources of variance - between groups variance and within group variance
What does we assume about epsilons?
They are normally distributed with a mean of 0 and variance of sigma squared e
What does the variance partition coefficient measure?
Proportion of total variance
is due to differences between groups
What is the formulae for variance partition coefficient?
Between group variance divided by total variance(between group variance plus within group variance)
What is the variance partition coefficient sometimes but not always equivalent to?
Intraclass correlation coefficient
What are two caveats of fitting a two level model and what is a possible solution?
- The random intercepts model introduced additional complexity.
- Did it result in a significant improvement in model fit? - We can test this with a likelihood-ratio test
What does a likelihood ratio test examine?
Group effects
What is the formulae for a likelihood ratio test?
D = -2 Log L2 (Fit of multilevel model) - -2 Log L1(Fit of single level model)
The deviance statistic (D) is compared against a X2 distribution with D.F. degrees of freedom.
What is the degrees of freedom?
The difference in the number of parameters in the multilevel vs. individual level model).