9. Fri. Sept 14th Flashcards
Quiz
- Why are there so many different post hoc tests?
- – They all attempt to balance Type I and Type II error, and none of them is perfect
- With a continuous variable x, the meaning of the beta is slope (the amount of change in y for each 1-unit change in x). What is the meaning of the beta when the x variable is categorical?
- – The difference between that group and the reference group
- What is the ultimate purpose of conducting post-hoc tests?
- – To adjust the pair-wise p-values to maintain the experiment-wise error rate
- If a relationship is known to be linear, how does Cottingham suggest you should distribute your treatments in an experiment?
- – Many treatment levels with few replicates
- If a relationship is known to be linear, how does Steury suggest you should distribute your treatments in an experiment?
- – All replicates in two treatment levels at the extremes of natural variation (because that maximizes R-squared, which maximizes power)
Mini review of what we’ve covered so far
- We basically reviewed all of stats 7000
- Need to know:
- – There’s a general linear model: y = β0 + β1x + ε
- —- This model can capture data with either a continuous or a categorical x (y stays continuous)
- —- In R you can use the same function lm()
- —— The only thing that’s really different is the interpretation of beta
- —- If x is continuous, beta is the slope; if x is categorical, it’s the difference between the groups. You might technically have more than one beta there, but each is just a single category (dummy variable) within the x variable
- The 5 Assumptions
1.
Special case when x could be either categorical or continuous
- His argument is always treat it as continuous unless there’s significant evidence of non-linearity
Now we expand on this model for the rest of the class
We saw with the ANOVA that we don’t have to have just one x.
- We can put as many x’s in here as we want
- So the TRUE linear model looks like y = β0 + β1x1 + β2x2 + … + βkxk + ε
- – The x’s can be any combination of categorical and continuous variables
- These tests with fancy names are just slight variations on the linear model, with various combinations of x’s
- – And WHY you would want to do those fancy things. Because you don’t HAVE to
In theory you CAN have an unlimited number of x’s, but in practice there are computational limits.
- These next models have to be fit ITERATIVELY
- – The fitting works by minimizing the sum of squared error (which minimizes the standard deviation of the residuals)
Think of each x as a dimension of the surface being fit to y
- You are practically limited to somewhere between 8-10 x’s
- – And that is somewhat determined by your sample size
- —– You don’t want 8 samples and 8 x’s
- – You need a MINIMUM of 10 samples PER x variable
- For the betas to be accurate on average, that’s what you need
After 12 to 14 x’s, your computer will start to smoke and explode. They just can’t do it.
There’s more computational power in your phone than NASA had to send men to the moon.
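The iterative fitting mentioned above can be sketched as gradient descent on the squared error. A minimal illustration in Python (the data, truth, and learning rate are invented for the example, not from the lecture):

```python
import random

# Fake data from a known linear truth: y = 2 + 3x + noise
random.seed(0)
x = [i / 10 for i in range(50)]
y = [2 + 3 * xi + random.gauss(0, 0.5) for xi in x]

# Iteratively minimize the mean squared error by gradient descent
b0, b1, lr, n = 0.0, 0.0, 0.05, len(x)
for _ in range(5000):
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    b0 += lr * 2 * sum(resid) / n                              # -d(MSE)/d(b0)
    b1 += lr * 2 * sum(r * xi for r, xi in zip(resid, x)) / n  # -d(MSE)/d(b1)

print(round(b0, 2), round(b1, 2))  # estimates near the true 2 and 3
```

With one x this has a closed-form solution; the point is that the same minimize-the-squared-error loop generalizes to many x’s, which is why each added x raises the computational cost.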
Multi-Variable Model
A model with multiple x’s
- Some people call them multi-variate models WHICH IS INCORRECT
- – Those are something very specific that we talk about after the mid term
You do NOT have to use a multivariable model
Ex: You have a categorical x and a continuous x
- You can run two separate general linear models
- NOTHING WRONG WITH THAT
BUT
There are a number of advantages to running multivariable models AS OPPOSED to analyzing all your x’s separately
6 advantages of multivariable models
- They are more elegant
- The best experiments are the ones that don’t need any statistics at all. (You just throw up a bar chart)
- But IT IS EASIER to get published if you do them
- – It SHOULDN’T be that way, but people are impressed by high-power statistics
- —- And that leads to unknowledgeable folks using high-powered statistics improperly
“If I can’t understand what you did, that’s a problem.”
- And he’s a good statistician
- Some professors just let it slide to not look dumb
- Because of “swamping” (his term)
- When the effect of one x variable is masked (or swamped out) by the effect of another x-variable
- Goes back to the idea that one of the things that influences your p-value is noise/error in the system
- – Makes it harder to determine significant effect
- So if there’s an X that exists that ISN’T in your model, it’s going to cause problems
- Collinearity
Talk about this on Monday and Wednesday
- When two or more x-variables are correlated with each other
- It’s a HUGE problem in MANY sciences
- Unfortunately, most ecologists do not understand how to deal with it
— Ask them “I have collinearity, what do I do?” “Just take one out.” Which is often the worst thing you can do
— Autocorrelation: your two different SAMPLES are related to each other
— Collinearity: when x VARIABLES are related (totally different)
- You may have interactions
- Interactions are THE COOLEST thing in ecology
- – We’re gonna read a paper that’s in Science ONLY BECAUSE it has an interaction, which is coveted, and the people in the article didn’t even realize it
- We’ll spend a whole week on that
- It is when the effect of one x-variable (the Beta) depends on the VALUE of another x-variable
- – Ex: the difference between males and females (the effect of age on size depends upon which sex you’re talking about)
- We may want to include random blocking variables
- Go back to swamping, kind of
- Happens when you measure something repeatedly (spatially or temporally) and we need to include that variable in the model because it explains some of the noise.
- About 2 weeks
- Generally, doing this (including random blocking variables) is a good thing
- We may need to account for pseudo-replication
- We’ll spend a few days talking about it
- – Big problem, easy to do without knowing it
Example: An ANCOVA
The simplest multivariable model to understand
- Analysis of Covariance
Other definitions you’ll need to know in the literature:
- Covariate: a continuous x
- Factor: a categorical x
Typically has one continuous x and one categorical x
- y is usually still continuous, with normally distributed error
BIG EXAMPLE
Continuous X: Age
Categorical X: Sex
Y: Size
We might have something that looks like [scatterplot of size vs. age]
- Different colors for male/female dots
Size = β0 + β1(Age) + β2(Sex) + ε
Dummy code: Sex = 1 for male, 0 for female
WE DO NOT HAVE TO RUN THIS MULTIVARIABLE MODEL
- We could analyze the effects separately (of age and sex)
- We’ll still get a good estimate
- The beta is still fine on average
- But sigma is large: it has to capture all the noise the missing x would have explained
But, if we run the multivariable model, we essentially get 2 equations
- 1 for females
Size Equation for Females
Size = β0 + β1(Age) + ε
β0 = mean size of females at age 0; β1 = slope of the age/size relationship
Size Equation for Males
Size = (β0 + β2) + β1(Age) + ε
- β2 is the difference between males and females: it shifts the intercept
- The slope doesn’t change. The only thing that changes is our intercept
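The two implied equations above can be sketched numerically. The beta values here are invented for illustration, not from the lecture’s data:

```python
# Hypothetical coefficients: Size = b0 + b1*Age + b2*Sex + error
b0, b1, b2 = 10.0, 0.5, 3.0   # invented numbers for illustration

def predicted_size(age, sex):  # sex dummy-coded: 1 = male, 0 = female
    return b0 + b1 * age + b2 * sex

# The model implies one line per sex with the SAME slope:
#   females: b0 + b1*Age        males: (b0 + b2) + b1*Age
for age in (0, 5, 10):
    female = predicted_size(age, 0)
    male = predicted_size(age, 1)
    # the gap between the lines is b2 at every age (parallel lines)
    print(age, female, male, male - female)
```

The dummy code is what makes this work: multiplying β2 by 0 drops it from the female equation and multiplying by 1 adds it, whole, to the male intercept.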
IMPORTANT
When you run the multivariable model, the mean of the betas hasn’t changed from what we’d get running uni-variable models
- still how much size changes for each unit change in age
What would the mean of the sex beta be if:
- uni-variable (GLM): the difference between males and females
- multivariable: STILL just the difference
The mean of the beta hasn’t changed AT ALL
What has changed is SIGMA
- Gives us a MUCH smaller error
- So p-values will go down and confidence intervals will narrow
That’s the essence of SWAMPING
- If we ran the sex-only model, we might not actually catch the effect of sex because there’s SO much error in the system
- – Because there’s so much overlap in the data
- – If we hadn’t added AGE to the model, a CRAZY amount of noise would go unexplained
What if these lines aren’t parallel?
- I don’t want to answer that for at least a week
- – Because then you have an interaction, and it complicates things so much
- —- We want to always ASSUME these lines are parallel unless we have significant evidence that they are not
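The non-parallel case is the interaction mentioned above: an extra Age×Sex term lets the slope itself differ by sex. A small sketch with invented coefficients (this previews the topic, not the lecture’s treatment of it):

```python
# Interaction model: Size = b0 + b1*Age + b2*Sex + b3*(Age*Sex) + error
b0, b1, b2, b3 = 10.0, 0.5, 3.0, 0.2   # invented for illustration

def predicted_size(age, sex):  # sex dummy-coded: 1 = male, 0 = female
    return b0 + b1 * age + b2 * sex + b3 * age * sex

# Female slope is b1; male slope is b1 + b3, so the lines diverge
female_slope = predicted_size(1, 0) - predicted_size(0, 0)
male_slope = predicted_size(1, 1) - predicted_size(0, 1)
print(female_slope, male_slope)
```

When β3 = 0, this collapses back to the parallel-lines ANCOVA, which is why parallel lines are the default assumption.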
Let’s do this in R
Data in the syllabus
- Look at it in Excel first
Col 1: Age
Col 2: Sex
Col 3: SexN
Col 4: Error
Col 5: Size (truth)
And the error (sigma) is 1
In R
datum=read.csv(file.choose())
head(datum)
plot(Size~Age,data=datum)
results=lm(Size~Age, data=datum)
summary(results)
What he goes over in the summary(results), if we don’t care about the sex:
- Estimate of truth
- Standard error
- Confidence interval (≈ 2 × standard error)
- P-value
- The KEY: R-squared: 98% of variation is driven by age
THEN he does size as a function of sex, SAME data
results=lm(Size~Sex,data=datum)
summary(results)
Estimate is good, but significance?
Confidence interval
P-value
R-squared
We KNOW there’s a variation in the data depending on sex BECAUSE WE MADE THE DATA
- Why didn’t we catch it?
- There’s so much noise
So let’s do it as ANCOVA
results=lm(Size~Sex+Age,data=datum)
Confidence interval (went from 9 to .5!!!)
P-value
Standard error
ALL we did was add age into the model
- A prototypical example of swamping
- All that noise caused by age has been explained
- So consequently now we CAN detect the effect of sex
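You can convince yourself of this swamping result outside R by simulating data like the example and fitting both models by ordinary least squares. Everything below is invented for illustration (the truth values are not the lecture’s dataset); the tiny OLS solver uses the normal equations with Gaussian elimination:

```python
import random

def ols(X, y):
    """Least-squares coefficients via the normal equations X'X b = X'y."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for i in range(k):                      # Gaussian elimination w/ pivoting
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        c[i], c[p] = c[p], c[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * b for a, b in zip(A[r], A[i])]
            c[r] -= f * c[i]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b

def resid_sd(X, y, b):
    r = [yi - sum(bj * xj for bj, xj in zip(b, row)) for row, yi in zip(X, y)]
    return (sum(e * e for e in r) / len(r)) ** 0.5

# Simulate: Size = 10 + 0.5*Age + 2*Sex + N(0, 1)   (invented truth)
random.seed(1)
age = [(i // 2) % 20 for i in range(200)]
sex = [i % 2 for i in range(200)]           # dummy code: 1 male, 0 female
size = [10 + 0.5 * a + 2 * s + random.gauss(0, 1) for a, s in zip(age, sex)]

sex_only = ols([[1, s] for s in sex], size)                    # Size ~ Sex
ancova = ols([[1, s, a] for s, a in zip(sex, age)], size)      # Size ~ Sex + Age

sd_sex_only = resid_sd([[1, s] for s in sex], size, sex_only)
sd_ancova = resid_sd([[1, s, a] for s, a in zip(sex, age)], size, ancova)
print(sd_sex_only, sd_ancova)  # ANCOVA residual error is far smaller
```

Both models estimate roughly the same sex effect; what changes is the residual error, and with it the standard error and p-value — the swamping story in miniature.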
Where would you see this?
As a scientist, I think there might be this one factor that people hadn’t studied yet
You can model your way out of some of that noise
- If there is some data that you don’t CARE about because other people have tracked it, STILL COLLECT THAT DATA
- – Because if you don’t, it drives your results into CRAZY, insignificant directions
Something else kind of important, tangential to the base topic but important to understand
Anyone use SAS before?
- When you run a model like this in SAS, it always gives you 2 results
- – Type III sum of squares
- – Type I sum of squares
It’s important EVEN IF you never use SAS
- You need to understand the difference between Type I sums of squares and Type III sums of squares
Important take away
In Type I (sequential) sums of squares, ORDER MATTERS: each x is credited with the variation it explains given only the x’s entered before it. Type III (marginal) sums of squares don’t depend on order.
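The order dependence can be demonstrated directly: compute sequential sums of squares by crediting each x with the drop in residual sum of squares as it enters the model, and show the credit changes when correlated x’s swap order. A self-contained sketch with invented data (this mimics the Type I idea, not SAS’s exact output):

```python
import random

def resid(x, y):
    """Residuals of a simple regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [yi - (my + slope * (xi - mx)) for xi, yi in zip(x, y)]

def sse(e):
    return sum(v * v for v in e)

def sequential_ss(y, first, second):
    """Type I (sequential) SS: credit each x with the drop in residual SS."""
    n = len(y)
    my = sum(y) / n
    sst = sum((yi - my) ** 2 for yi in y)
    e1 = resid(first, y)                  # residuals after the first x enters
    e2 = resid(resid(first, second), e1)  # then the second x (Frisch-Waugh)
    return sst - sse(e1), sse(e1) - sse(e2)

# Two CORRELATED x's (invented data): x2 overlaps heavily with x1
random.seed(2)
x1 = [random.gauss(0, 1) for _ in range(100)]
x2 = [xi + random.gauss(0, 0.5) for xi in x1]
y = [2 * a + 1 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

ss_a = sequential_ss(y, x1, x2)  # x1 entered first
ss_b = sequential_ss(y, x2, x1)  # x2 entered first
print(ss_a, ss_b)  # the SS credited to each x depends on the order
```

Whichever x enters first soaks up all the variation the two share; the total explained is the same either way, but the split between the x’s is not — which is exactly why the Type I vs Type III distinction matters with correlated predictors.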