9. Fri. Sept 14th Flashcards
Quiz
- Why are there so many different post hoc tests?
- – They all attempt to balance Type I and Type II error, and none of them is perfect
- With a continuous variable x, the meaning of the beta is slope (the amount of change in y for each 1-unit change in x). What is the meaning of the beta when the x variable is categorical?
- – The difference between that group and the reference group
- What is the ultimate purpose of conducting post-hoc tests?
- – To adjust the pair-wise p-values to maintain the experiment-wise error rate
- If a relationship is known to be linear, how does Cottingham suggest you should distribute your treatments in an experiment?
- – Many treatment levels with few replicates
- If a relationship is known to be linear, how does Steury suggest you should distribute your treatments in an experiment?
- – All replicates in two treatment levels at the extremes of natural variation (because that maximizes R-squared, which maximizes power)
Mini review of what we’ve covered so far
- We basically reviewed all of stats 7000
- Need to know:
- – There’s a general linear model: y = β0 + β1x + ε
- —- This model can capture data with either a continuous or a categorical x (y stays continuous)
- —- In R you can use the same function lm()
- —— The only thing that’s really different is the interpretation of beta
- —- If x is continuous, beta is the slope; if x is categorical, it’s the difference between the groups. You might technically have more than one beta there, but each is just a single category (dummy variable) within the x variable
- The 5 Assumptions
1.
Special case when x could be either categorical or continuous
- His argument is always treat it as continuous unless there’s significant evidence of non-linearity
Now we expand on this model for the rest of the class
We saw with the ANOVA that we don’t have to have just one x.
- We can put as many x’s in here as we want
- So the TRUE linear model looks like y = β0 + β1x1 + β2x2 + … + βkxk + ε
- – The x’s can be any combination of categorical and continuous variables
- These tests with fancy names are just slight variations on the linear model, with various combinations of x’s
- – And WHY you would want to do those fancy things. Because you don’t HAVE to
In theory you CAN have an unlimited number of x’s, but in practice there are computational limits.
- These next models have to be fit ITERATIVELY
- – The fitting works by minimizing the sum of squared error (which minimizes the standard deviation of the residuals)
Think of each x as a dimension of the surface being fit to y
- You are practically limited to somewhere between 8-10 x’s
- – And that is somewhat determined by your sample size
- —– You don’t want 8 samples and 8 x’s
- – You need a MINIMUM of 10 samples PER x variable
- For the betas to be accurate on average, that’s what you need
After 12 to 14 x’s, your computer will start to smoke and explode. They just can’t do it.
There’s more computational power in your phone than NASA had to send men to the moon.
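The iterative fitting mentioned above can be sketched as gradient descent on the squared error. A minimal illustration in Python (the data, truth, and learning rate are invented for the example, not from the lecture):

```python
import random

# Fake data from a known linear truth: y = 2 + 3x + noise
random.seed(0)
x = [i / 10 for i in range(50)]
y = [2 + 3 * xi + random.gauss(0, 0.5) for xi in x]

# Iteratively minimize the mean squared error by gradient descent
b0, b1, lr, n = 0.0, 0.0, 0.05, len(x)
for _ in range(5000):
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    b0 += lr * 2 * sum(resid) / n                              # -d(MSE)/d(b0)
    b1 += lr * 2 * sum(r * xi for r, xi in zip(resid, x)) / n  # -d(MSE)/d(b1)

print(round(b0, 2), round(b1, 2))  # estimates near the true 2 and 3
```

With one x this has a closed-form solution; the point is that the same minimize-the-squared-error loop generalizes to many x’s, which is why each added x raises the computational cost.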
Multi-Variable Model
A model with multiple x’s
- Some people call them multi-variate models WHICH IS INCORRECT
- – Those are something very specific that we talk about after the mid term
You do NOT have to use a multivariable model
Ex: You have a categorical x and a continuous x
- You can run two separate general linear models
- NOTHING WRONG WITH THAT
BUT
There are a number of advantages to running multivariable models AS OPPOSED to analyzing all your x’s separately
6 advantages of multivariable models
- They are more elegant
- The best experiments are the ones that don’t need any statistics at all. (You just throw up a bar chart)
- But IT IS EASIER to get published if you do them
- – It SHOULDN’T be that way, but people are impressed by high-power statistics
- —- And that leads to unknowledgeable folks using high-powered statistics improperly
“If I can’t understand what you did, that’s a problem.”
- And he’s a good statistician
- Some professors just let it slide to not look dumb
- Because of “swamping” (his term)
- When the effect of one x variable is masked (or swamped out) by the effect of another x-variable
- Goes back to the idea that one of the things that influences your p-value is noise/error in the system
- – Makes it harder to determine significant effect
- So if there’s an X that exists that ISN’T in your model, it’s going to cause problems
- Collinearity
Talk about this on Monday and Wednesday
- When two or more x-variables are correlated with each other
- It’s a HUGE problem in MANY sciences
- Unfortunately, most ecologists do not understand how to deal with it
— Ask them “I have collinearity, what do I do?” “Just take one out.” Which is often the worst thing you can do
— Autocorrelation: your two different SAMPLES are related to each other
— Collinearity: when x VARIABLES are related (totally different)
- You may have interactions
- Interactions are THE COOLEST thing in ecology
- – We’re gonna read a paper that’s in Science ONLY BECAUSE it has an interaction, which is coveted, and the people in the article didn’t even realize it
- We’ll spend a whole week on that
- It is when the effect of one x-variable (the Beta) depends on the VALUE of another x-variable
- – Ex: the difference between males and females (the effect of age on size depends upon which sex you’re talking about)
- We may want to include random blocking variables
- Go back to swamping, kind of
- Happens when you measure something repeatedly (spatially or temporally) and we need to include that variable in the model because it explains some of the noise.
- About 2 weeks
- Generally, doing this (including random blocking variables) is a good thing
- We may need to account for pseudo-replication
- We’ll spend a few days talking about it
- – Big problem, easy to do without knowing it
Example: An ANCOVA
The simplest multivariable model to understand
- Analysis of Covariance
Other definitions you’ll need to know in the literature:
- Covariate: a continuous x
- Factor: a categorical x
Typically has one continuous x and one categorical x
- y is usually still continuous, with normally distributed error
BIG EXAMPLE
Continuous X: Age
Categorical X: Sex
Y: Size
We might have something that looks like [scatterplot of size vs. age]
- Different colors for male/female dots
Size = β0 + β1(Age) + β2(Sex) + ε
Dummy code: Sex = 1 for male, 0 for female
WE DO NOT HAVE TO RUN THIS MULTIVARIABLE MODEL
- We could analyze the effects separately (of age and sex)
- We’ll still get a good estimate
- The beta is still fine on average
- But sigma is large: it has to capture all the noise the missing x would have explained
But, if we run the multivariable model, we essentially get 2 equations
- 1 for females
Size Equation for Females
Size = β0 + β1(Age) + ε
β0 = mean size of females at age 0; β1 = slope of the age/size relationship
Size Equation for Males
Size = (β0 + β2) + β1(Age) + ε
- β2 is the difference between males and females: it shifts the intercept
- The slope doesn’t change. The only thing that changes is our intercept
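The two implied equations above can be sketched numerically. The beta values here are invented for illustration, not from the lecture’s data:

```python
# Hypothetical coefficients: Size = b0 + b1*Age + b2*Sex + error
b0, b1, b2 = 10.0, 0.5, 3.0   # invented numbers for illustration

def predicted_size(age, sex):  # sex dummy-coded: 1 = male, 0 = female
    return b0 + b1 * age + b2 * sex

# The model implies one line per sex with the SAME slope:
#   females: b0 + b1*Age        males: (b0 + b2) + b1*Age
for age in (0, 5, 10):
    female = predicted_size(age, 0)
    male = predicted_size(age, 1)
    # the gap between the lines is b2 at every age (parallel lines)
    print(age, female, male, male - female)
```

The dummy code is what makes this work: multiplying β2 by 0 drops it from the female equation and multiplying by 1 adds it, whole, to the male intercept.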
IMPORTANT
When you run the multivariable model, the mean of the betas hasn’t changed from what we’d get running uni-variable models
- still how much size changes for each unit change in age
What would the mean of the sex beta be if:
- uni-variable (GLM): the difference between males and females
- multivariable: STILL just the difference
The mean of the beta hasn’t changed AT ALL
What has changed is SIGMA
- Gives us a MUCH smaller error
- So p-values will go down and confidence intervals will narrow
That’s the essence of SWAMPING
- If we ran the sex-only model, we might not actually catch the effect of sex because there’s SO much error in the system
- – Because there’s so much overlap in the data
- – If we hadn’t added AGE to the model, a CRAZY amount of noise would go unexplained
What if these lines aren’t parallel?
- I don’t want to answer that for at least a week
- – Because then you have an interaction, and it complicates things so much
- —- We want to always ASSUME these lines are parallel unless we have significant evidence that they are not
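The non-parallel case is the interaction mentioned above: an extra Age×Sex term lets the slope itself differ by sex. A small sketch with invented coefficients (this previews the topic, not the lecture’s treatment of it):

```python
# Interaction model: Size = b0 + b1*Age + b2*Sex + b3*(Age*Sex) + error
b0, b1, b2, b3 = 10.0, 0.5, 3.0, 0.2   # invented for illustration

def predicted_size(age, sex):  # sex dummy-coded: 1 = male, 0 = female
    return b0 + b1 * age + b2 * sex + b3 * age * sex

# Female slope is b1; male slope is b1 + b3, so the lines diverge
female_slope = predicted_size(1, 0) - predicted_size(0, 0)
male_slope = predicted_size(1, 1) - predicted_size(0, 1)
print(female_slope, male_slope)
```

When β3 = 0, this collapses back to the parallel-lines ANCOVA, which is why parallel lines are the default assumption.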
Let’s do this in R
Data in the syllabus
- Look at it in Excel first
Col 1: Age
Col 2: Sex
Col 3: SexN
Col 4: Error
Col 5: Size (truth)
And the error (sigma) is 1
In R
datum=read.csv(file.choose())
head(datum)
plot(Size~Age,data=datum)
results=lm(Size~Age, data=datum)
summary(results)
What he goes over in the summary(results), if we don’t care about the sex:
- Estimate of truth
- Standard error
- Confidence interval (≈ 2 × standard error)
- P-value
- The KEY: R-squared: 98% of variation is driven by age
THEN he does size as a function of sex, SAME data
results=lm(Size~Sex,data=datum)
summary(results)
Estimate is good, but significance?
Confidence interval
P-value
R-squared
We KNOW there’s a variation in the data depending on sex BECAUSE WE MADE THE DATA
- Why didn’t we catch it?
- There’s so much noise
So let’s do it as ANCOVA
results=lm(Size~Sex+Age,data=datum)
Confidence interval (went from 9 to .5!!!)
P-value
Standard error
ALL we did was add age into the model
- A prototypical example of swamping
- All that noise caused by age has been explained
- So consequently now we CAN detect the effect of sex
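You can convince yourself of this swamping result outside R by simulating data like the example and fitting both models by ordinary least squares. Everything below is invented for illustration (the truth values are not the lecture’s dataset); the tiny OLS solver uses the normal equations with Gaussian elimination:

```python
import random

def ols(X, y):
    """Least-squares coefficients via the normal equations X'X b = X'y."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    c = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for i in range(k):                      # Gaussian elimination w/ pivoting
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        c[i], c[p] = c[p], c[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * b for a, b in zip(A[r], A[i])]
            c[r] -= f * c[i]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (c[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    return b

def resid_sd(X, y, b):
    r = [yi - sum(bj * xj for bj, xj in zip(b, row)) for row, yi in zip(X, y)]
    return (sum(e * e for e in r) / len(r)) ** 0.5

# Simulate: Size = 10 + 0.5*Age + 2*Sex + N(0, 1)   (invented truth)
random.seed(1)
age = [(i // 2) % 20 for i in range(200)]
sex = [i % 2 for i in range(200)]           # dummy code: 1 male, 0 female
size = [10 + 0.5 * a + 2 * s + random.gauss(0, 1) for a, s in zip(age, sex)]

sex_only = ols([[1, s] for s in sex], size)                    # Size ~ Sex
ancova = ols([[1, s, a] for s, a in zip(sex, age)], size)      # Size ~ Sex + Age

sd_sex_only = resid_sd([[1, s] for s in sex], size, sex_only)
sd_ancova = resid_sd([[1, s, a] for s, a in zip(sex, age)], size, ancova)
print(sd_sex_only, sd_ancova)  # ANCOVA residual error is far smaller
```

Both models estimate roughly the same sex effect; what changes is the residual error, and with it the standard error and p-value — the swamping story in miniature.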
Where would you see this?
As a scientist, I think there might be this one factor that people hadn’t studied yet
You can model your way out of some of that noise
- If there is some data that you don’t CARE about because other people have tracked it, STILL COLLECT THAT DATA
- – Because if you don’t, it drives your results into CRAZY, insignificant directions
Something else kind of important, tangential to the base topic but important to understand
Anyone use SAS before?
- When you run a model like this in SAS, it always gives you 2 results
- – Type III sum of squares
- – Type I sum of squares
It’s important EVEN IF you never use SAS
- You need to understand the difference between Type I sums of squares and Type III sums of squares
Important take away
In Type I (sequential) sums of squares, ORDER MATTERS: each x is credited with the variation it explains given only the x’s entered before it. Type III (marginal) sums of squares don’t depend on order.
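The order dependence can be demonstrated directly: compute sequential sums of squares by crediting each x with the drop in residual sum of squares as it enters the model, and show the credit changes when correlated x’s swap order. A self-contained sketch with invented data (this mimics the Type I idea, not SAS’s exact output):

```python
import random

def resid(x, y):
    """Residuals of a simple regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [yi - (my + slope * (xi - mx)) for xi, yi in zip(x, y)]

def sse(e):
    return sum(v * v for v in e)

def sequential_ss(y, first, second):
    """Type I (sequential) SS: credit each x with the drop in residual SS."""
    n = len(y)
    my = sum(y) / n
    sst = sum((yi - my) ** 2 for yi in y)
    e1 = resid(first, y)                  # residuals after the first x enters
    e2 = resid(resid(first, second), e1)  # then the second x (Frisch-Waugh)
    return sst - sse(e1), sse(e1) - sse(e2)

# Two CORRELATED x's (invented data): x2 overlaps heavily with x1
random.seed(2)
x1 = [random.gauss(0, 1) for _ in range(100)]
x2 = [xi + random.gauss(0, 0.5) for xi in x1]
y = [2 * a + 1 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

ss_a = sequential_ss(y, x1, x2)  # x1 entered first
ss_b = sequential_ss(y, x2, x1)  # x2 entered first
print(ss_a, ss_b)  # the SS credited to each x depends on the order
```

Whichever x enters first soaks up all the variation the two share; the total explained is the same either way, but the split between the x’s is not — which is exactly why the Type I vs Type III distinction matters with correlated predictors.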