5. Aug 29th Flashcards
Today we’re talking about assumptions
Anytime you analyze data with statistics, you make some sort of assumptions (or the technique you use intrinsically makes assumptions).
- EX: Assume that your data is representative of a larger population
- – If it’s NOT representative, your results won’t reflect the truth about that larger population
General assumptions of the linear model (5)
Key to regression AND the first half of the class covering general linear models
1) Your Y data is continuous
- – If it is categorical (mortality “lived/died”) then regression won’t work
- – You don’t really need to TEST this: you’ll just know
2) Your error is normally distributed
- – Some people say your Y data itself needs to be normally distributed to use the linear model: INCORRECT
- – yi = β0 + β1xi + εi, where the error εi ~ Normal(0, σ²)
- – Ex: comparing size of males to females
- —– End up with 2 modes (bimodal)
- —– But the ERROR is normally distributed
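A quick simulated sketch of the male/female example (all numbers made up, not data from class): Y itself comes out bimodal, but the error around the model is a single normal distribution.

```python
# Sketch (assumed coefficients): Y can be bimodal while the error is normal.
import random
import statistics

random.seed(1)
b0, b1 = 10.0, 8.0                                  # hypothetical intercept and sex effect
x = [random.choice([0, 1]) for _ in range(1000)]    # 0 = female, 1 = male
eps = [random.gauss(0, 1) for _ in range(1000)]     # normally distributed error
y = [b0 + b1 * xi + e for xi, e in zip(x, eps)]

# Y itself has two modes (one per sex)...
females = [yi for xi, yi in zip(x, y) if xi == 0]
males = [yi for xi, yi in zip(x, y) if xi == 1]
print(round(statistics.mean(females)), round(statistics.mean(males)))  # ~10, ~18

# ...but the residuals around the true line are one bell centered on zero.
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(abs(statistics.mean(resid)) < 0.2)  # True
```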
3) You will have a linear relationship between X & Y
- “This one’s a pet peeve of mine. An assumption that is often ignored in analyses.”
- In ecology, nothing is REALLY linear
- Breaking this assumption can really muck up your results
- – STORY: As a post-doc in John Chase’s lab, he bought kiddie pools from Wal Mart, left them in a field with water, went away for a month, and came back to see what had moved in. He wrote a paper: X = predator density, Y = prey density. The predator was mosquito larvae (the prey wasn’t noted). The results were clearly non-linear, but he DIDN’T CARE that his linear model didn’t match the shape; he just wanted to know IF there was a relationship.
- He ALWAYS plots his data before running lm() and guesses at the shape (linear or non-linear)
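A minimal hand-rolled version of the slope lm() would report, on made-up curved data, to show why plotting first matters: the straight-line fit still finds a positive slope even though the true shape is a curve.

```python
# Sketch (made-up numbers): fitting a straight line to clearly curved data.
xs = list(range(1, 11))
ys = [0.5 * xi ** 2 for xi in xs]   # hypothetical curved (quadratic) response

# ordinary least-squares slope and intercept, computed by hand
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / \
        sum((xi - xbar) ** 2 for xi in xs)
intercept = ybar - slope * xbar

print(round(slope, 2))  # → 5.5: a positive trend is detected, the curvature is missed
```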
4) Homoscedasticity
- Homo = same, scedasticity = dispersion/variance
- Means a CONSTANT variance, a constant standard deviation
- The standard deviation around your line doesn’t change
- – Vs. Heteroscedasticity
- —- Higher values of variance at higher values of x and y
- —- Most common form in ecology: low variation at low values of x and y, higher variation at higher values of x and y
- This is an assumption that doesn’t matter that much
- – You’ll just get higher (more conservative) p-values
- – There are techniques to handle heteroscedasticity (e.g., weighted regression), but they’re beyond this course
- You see this pattern in population abundance data
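A simulated sketch of the common ecological pattern (all parameters assumed): the error spread grows with x, so the variance around the line is not constant.

```python
# Sketch (simulated, assumed numbers): heteroscedasticity where spread grows with x.
import random
import statistics

random.seed(2)
xs = [i / 10 for i in range(1, 101)]
# error sd is proportional to x, so the variance is NOT constant
ys = [3.0 * xi + random.gauss(0, 0.5 * xi) for xi in xs]

# residual spread in the low-x half vs. the high-x half of the data
resid = [yi - 3.0 * xi for xi, yi in zip(xs, ys)]
print(statistics.stdev(resid[:50]) < statistics.stdev(resid[50:]))  # True
```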
5) No autocorrelation
AKA All of your samples are independent
- Example of how this assumption might be violated:
- Ex: sampling from a river where a factory is dumping pollutants upstream
— How does the concentration of pollution change with distance from the factory?
— You take samples every 5 meters
—— (how pollution concentration changes as a function of distance downstream from the factory)
— How much can the pollution possibly change in 5 meters? Not much.
— We’d get a lot of auto-correlation: the measurement from any one sample will be highly correlated with the previous measurement
—— If you’re taking repeated measurements over time, you are likely to get auto-correlation
—— The amount will be a function of how close together those spaces are temporally or spatially
In non-autocorrelated data, the points may be normally distributed, but each point is independent of the previous one
— Autocorrelated data is where each point is a function of the previous point
But ultimately, it doesn’t have much of an effect on slope or p-value.
We get into dealing with that in mixed effects models
Pseudoreplication != autocorrelation
A saying (axiom) that goes around regarding GLMs
“ANOVA, regression, t-tests, are all ROBUST to violations of assumptions.”
- – Your error doesn’t have to be PERFECTLY normal
- – Y data doesn’t have to be PERFECTLY continuous
- – AKA Even if your assumption is violated, the results likely won’t be affected (too much).
What really happens when you violate your assumptions?
- Slopes are still fairly accurate (unbiased estimates of truth)
- P-values and confidence intervals will be conservative (larger than they should be/would be if assumptions not violated)
- – ANOTHER AXIOM: The more assumptions a statistical test makes, the more powerful it is (the smaller p-values it tends to give).
- —– As long as the assumptions are met.
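A simulated sketch of the “slopes are still fairly accurate” claim (assumed setup): fit a line many times to data with heteroscedastic error and check that the average fitted slope still lands on the truth.

```python
# Sketch (simulated, assumed parameters): slope estimates stay roughly
# unbiased even when the constant-variance assumption is violated.
import random

random.seed(4)
true_slope = 2.0
slopes = []
for _ in range(500):
    xs = [i / 10 for i in range(1, 51)]
    # error sd grows with x -> heteroscedastic errors
    ys = [true_slope * xi + random.gauss(0, xi) for xi in xs]
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    sxx = sum((xi - xbar) ** 2 for xi in xs)
    slopes.append(sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / sxx)

print(abs(sum(slopes) / len(slopes) - true_slope) < 0.1)  # True: roughly unbiased
```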