7. Sept 7 Flashcards
Quiz
Difference between a confidence interval and a prediction interval?
- The confidence interval gives the range in which the average value of y is likely to fall, whereas the prediction interval gives the range in which an individual value of y is likely to fall
- Averages vs individual data
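A minimal sketch of the two intervals in R, using made-up data (the variables `x` and `y`, the sample size, and the noise level are all assumptions for illustration). `predict()` on an `lm` fit takes `interval="confidence"` or `interval="prediction"`:

```r
# Simulated data (hypothetical): y depends linearly on x plus normal noise
set.seed(1)
x <- runif(50, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 1)
fit <- lm(y ~ x)

new_x <- data.frame(x = 5)

# Range for the AVERAGE y at x = 5
conf_int <- predict(fit, newdata = new_x, interval = "confidence")
# Range for a single NEW individual at x = 5
pred_int <- predict(fit, newdata = new_x, interval = "prediction")

conf_int
pred_int
```

Both intervals are centred on the same fitted value, but the prediction interval is wider because it also has to cover individual variation.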
Regression assumption violated
- Non-normality (skewed up)
Assumption violated
- Autocorrelation (residuals run up and down)
Assumption violated
- Homoscedasticity (the spread of the residuals is not constant)
Which term describes making predictions about a response y outside the observed range of x values used in your analysis?
- Extrapolation
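A small sketch of what extrapolation looks like in R, assuming made-up data where x was only observed between 0 and 10. R will happily return a prediction at x = 25, but nothing in the data tells us the line still holds out there:

```r
# Simulated data (hypothetical): x only observed between 0 and 10
set.seed(2)
x <- runif(30, min = 0, max = 10)
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

range(x)    # the observed range of x

# x = 25 is well outside the observed range, so this is extrapolation
x_new <- 25
y_hat <- predict(fit, newdata = data.frame(x = x_new))
y_hat
```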
Example in categorical
You want to know if males and females of particular species are of the same size.
- Males turn out to be 1kg
- Females turn out to be 0.8kg
- – Those are averages, so there’s individual variation
Binary X (two categories) and normally distributed Y
- Normally done with a t-test
You can ALSO analyze this with a regression
- Yi = B0 + B1*Xi + Ei, where Ei ~ N(0, s)
How do you use that equation if your X variable is sex (male/female), a WORD?
- Use “dummy coding”
- Process of assigning 0’s and 1’s to categorical variables in order to convert them to math
- AKA converting categorical variables to numbers using 0s and 1s
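Dummy coding by hand can be sketched in a couple of lines of R. The six sex labels here are invented for illustration, and coding males as 1 and females as 0 is a choice (the opposite coding works just as well):

```r
# Hypothetical sex labels for six animals
sex <- c("Male", "Female", "Female", "Male", "Female", "Male")

# Dummy code: 1 for males, 0 for females
male <- ifelse(sex == "Male", 1, 0)

data.frame(sex, male)
```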
What if I wanna know the average size for females?
- Y = B0 + Error (with females coded as X = 0, the B1X term drops out)
For males
- Y = B0 + B1X + Error, and males are coded as X = 1, so Y = B0 + B1 + Error
What is the best estimate of the difference in size between the males and females?
- B1 tells us the difference between groups?
- – But wait, it’s the slope. How can a slope be helpful?
B1 is both the slope AND the difference between groups
- How can it be both?
- Slope is rise over run: the run is one (X goes from 0 to 1), and the rise is the difference between the groups
EVEN THOUGH this is a t-test, it still works within the regression
How this works in R
Creating the data in Excel
Have to create dummy-coded variable
- 1 for all the males
- 0 for the females
I get the same value whether I use the regression equation, or I use the averages of those groups
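A quick check of this in R, using simulated data. The group means of 0.8 kg and 1 kg come from the example above; the sample size of 20 per group and the standard deviation of 0.1 are assumptions:

```r
set.seed(42)
# Hypothetical data: 20 females (mean 0.8 kg) and 20 males (mean 1 kg)
male <- rep(c(0, 1), each = 20)                  # dummy-coded X
mass <- 0.8 + 0.2 * male + rnorm(40, sd = 0.1)

fit <- lm(mass ~ male)
b0 <- unname(coef(fit)[1])   # intercept
b1 <- unname(coef(fit)[2])   # slope

mean_f <- mean(mass[male == 0])
mean_m <- mean(mass[male == 1])

c(b0 = b0, female_mean = mean_f)             # these two match
c(b1 = b1, difference = mean_m - mean_f)     # and so do these
```

The intercept is the female average, and the slope is the male average minus the female average.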
plot(Mass~Sex,data=datum)
- gives us box and whiskers
- Anytime you give R a categorical X (stored as a factor), it gives box and whiskers
T Test in R
results=t.test(Mass~Sex, data=datum, var.equal=TRUE)
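- Weird thing: summary() doesn’t give anything useful for a t-test
- So just call results
- var.equal=TRUE assumes the two groups have equal variance (homoscedasticity); var.equal=FALSE (the default) drops that assumption and runs Welch’s t-test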
Test as LM
Exact same results as the t-test (the version with var.equal=TRUE)
results3=lm(Mass~Sex,data=datum)
We didn’t even have to dummy-code X ourselves (R did it for us)
Why did it choose females as the reference group (instead of males)? Females come first alphabetically
There’s also a relevel() function you can call within lm() that lets you choose which group is the reference
Key Point
- How to run a t-test in R (and that you can get the same results with lm())
- It works with this dummy-coding process
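A sketch comparing the two approaches side by side. The data frame below is a simulated stand-in for `datum` (the real file is not shown in these notes), and the level names "Females" and "Males" are assumed:

```r
set.seed(7)
# Hypothetical stand-in for the datum data frame used in the notes
datum <- data.frame(
  Sex  = factor(rep(c("Females", "Males"), each = 15)),
  Mass = c(rnorm(15, mean = 0.8, sd = 0.1), rnorm(15, mean = 1, sd = 0.1))
)

tt  <- t.test(Mass ~ Sex, data = datum, var.equal = TRUE)
fit <- lm(Mass ~ Sex, data = datum)
slope_row <- summary(fit)$coefficients["SexMales", ]

# p-value from the t-test and from the slope of the regression
p_ttest <- tt$p.value
p_lm    <- unname(slope_row["Pr(>|t|)"])
c(t_test = p_ttest, lm = p_lm)

# The t statistics match in size; the sign flips because t.test()
# takes Females minus Males while the slope is Males minus Females
t_ttest <- unname(tt$statistic)
t_lm    <- unname(slope_row["t value"])
c(t_test = t_ttest, lm = t_lm)
```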
Making more complicated
Categorical X with 3 categories instead of two
- Males, Females, Hermaphrodites
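- We need TWO new dummy-coded variables
Male column gets 1 for males, 0 for hermaphrodites, 0 for females
Female column is the equivalent (1 for females, 0 otherwise)
Hermaphrodite column is the equivalent (1 for hermaphrodites, 0 otherwise)
Only need N-1 dummy-coded variables for N categories (the category left out is the reference)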
Yi = B0 + B1*Males + B2*Hermaphrodites + Ei
- B0 is the average y when all X’s are zero (in this case, size of females)
- B1 here is difference between males and females (or more generally, the reference group)
- B2 is difference between hermaphrodites and females
- BUT it doesn’t give us the difference between males and hermaphrodites (we’d have to change the reference)
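To see the dummy-coded columns R builds behind the scenes, `model.matrix()` can be called on a factor. The six labels below are invented for illustration:

```r
# Hypothetical three-level factor
sex <- factor(c("Females", "Males", "Hermaphrodites",
                "Males", "Females", "Hermaphrodites"))

# model.matrix() shows the columns lm() would use for Mass ~ sex
X <- model.matrix(~ sex)
X
```

There are three categories but only two dummy columns next to the intercept. Females come first alphabetically, so they are the reference and get 0 in both columns.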
Making it in Excel
Column 1
Contains males, females, & hermaphrodites
Column 2
Mean mass for each
Column 3
Error
Column 4
Mean mass plus error
Columns 5, 6, 7
Dummy-coded variables
For all 3 sexes separately
Column 8
Mass calculated with dummy coding
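The same spreadsheet can be sketched in R. The column names, the 10 animals per group, the error standard deviation of 0.1, and especially the 0.9 kg mean for hermaphrodites are all made-up values for illustration (the notes only give the male and female means):

```r
set.seed(3)
n <- 10   # animals per group (hypothetical)

# Column 1: the sex labels
sheet <- data.frame(Sex = rep(c("Females", "Males", "Hermaphrodites"), each = n))

# Column 2: group mean (0.9 kg for hermaphrodites is a made-up value)
group_means <- c(Females = 0.8, Males = 1.0, Hermaphrodites = 0.9)
sheet$MeanMass <- unname(group_means[as.character(sheet$Sex)])

# Column 3: random error; Column 4: mean mass plus error
sheet$Error <- rnorm(nrow(sheet), mean = 0, sd = 0.1)
sheet$Mass  <- sheet$MeanMass + sheet$Error

# Columns 5-7: one dummy-coded variable per sex
sheet$Females <- as.numeric(sheet$Sex == "Females")
sheet$Males   <- as.numeric(sheet$Sex == "Males")
sheet$Herm    <- as.numeric(sheet$Sex == "Hermaphrodites")

# Column 8: mass rebuilt from the regression equation (females = reference)
b0 <- 0.8          # female mean
b1 <- 1.0 - 0.8    # males minus females
b2 <- 0.9 - 0.8    # hermaphrodites minus females
sheet$MassDummy <- b0 + b1 * sheet$Males + b2 * sheet$Herm + sheet$Error

head(sheet)
```

Column 4 and Column 8 come out the same, which is the point: the dummy-coded equation is just another way of writing "group mean plus error".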
How to analyze this data
Can’t use t-test (only for 2 groups)
We run an ANOVA
results=aov(Mass~Sex,data=datum)
All this does is tell you that at least 2 groups are different from each other
- Doesn’t tell us which, how…
NOTE: You can get the ANOVA information by calling summary() on the results
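A sketch of the ANOVA on simulated three-group data. The data frame is a stand-in for `datum`, and the hermaphrodite mean of 0.9 kg is an assumption. The F value from `aov()` is the same one printed at the bottom of the `lm()` summary:

```r
set.seed(11)
# Hypothetical stand-in for datum with three groups of 12
datum <- data.frame(
  Sex  = factor(rep(c("Females", "Males", "Hermaphrodites"), each = 12)),
  Mass = c(rnorm(12, 0.8, 0.1), rnorm(12, 1.0, 0.1), rnorm(12, 0.9, 0.1))
)

results <- aov(Mass ~ Sex, data = datum)
summary(results)   # one overall F test: is any group different?

results2 <- lm(Mass ~ Sex, data = datum)

# Pull the F value out of each fit to compare them
f_aov <- summary(results)[[1]][["F value"]][1]
f_lm  <- unname(summary(results2)$fstatistic["value"])
c(aov = f_aov, lm = f_lm)
```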
Change the reference
We need to change the reference to compare males to hermaphrodites (in this example)
results=lm(Mass~Females + Herm, data=datum)
- Leaving the Males column out of the model makes males the reference group
There is a coding way to change the reference
results=lm(Mass~Herm+Females+Males,data=datum)
- That doesn’t give a reference point, so R automatically drops the last variable (its coefficient shows up as NA) and uses it as the reference
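A sketch of what happens when all three dummy columns go into the model. The data are simulated and the column names `Females`, `Males`, and `Herm` are assumed to match the spreadsheet:

```r
set.seed(5)
# Hypothetical data with hand-made dummy columns
sex <- rep(c("Females", "Males", "Hermaphrodites"), each = 10)
datum <- data.frame(
  Mass    = rnorm(30, mean = 0.9, sd = 0.1),
  Females = as.numeric(sex == "Females"),
  Males   = as.numeric(sex == "Males"),
  Herm    = as.numeric(sex == "Hermaphrodites")
)

# The three dummies plus the intercept are redundant, so lm() cannot
# estimate the last one (Males) and reports NA for it
results <- lm(Mass ~ Herm + Females + Males, data = datum)
coef(results)

# Males therefore act as the reference: the intercept is the male average
male_mean <- mean(datum$Mass[datum$Males == 1])
```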
Relevel function
results=lm(Mass~relevel(Sex, ref="Males"),data=datum)
- Sex has to be a factor for relevel() to work, and ref has to match one of its levels exactly
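A sketch of `relevel()` on simulated data (a stand-in for `datum`, with an assumed hermaphrodite mean of 0.9 kg). With males as the reference, the hermaphrodite coefficient is the male vs. hermaphrodite comparison the default fit could not give us:

```r
set.seed(13)
# Hypothetical stand-in for datum
datum <- data.frame(
  Sex  = factor(rep(c("Females", "Males", "Hermaphrodites"), each = 12)),
  Mass = c(rnorm(12, 0.8, 0.1), rnorm(12, 1.0, 0.1), rnorm(12, 0.9, 0.1))
)

# Default: Females are the reference (first alphabetically)
fit_default <- lm(Mass ~ Sex, data = datum)
names(coef(fit_default))

# Males as the reference instead
fit_males <- lm(Mass ~ relevel(Sex, ref = "Males"), data = datum)
coef(fit_males)

# Third coefficient = hermaphrodite average minus male average
herm_vs_male <- unname(coef(fit_males)[3])
group_means  <- tapply(datum$Mass, datum$Sex, mean)
```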
Main takeaways
- Ways to run t-test and ANOVAs in R
- You should be able to do it easily
- Understand how the dummy coding works (even though R does it for you)
- – You REALLY need to know how it works
These t-tests are not technically valid. They are more liable to make a Type I error.
- Because we did TWO tests, and every extra test is another chance at a false positive
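A rough back-of-the-envelope calculation of how much the Type I error rate grows, assuming alpha = 0.05 per test and treating the two tests as independent (they are not quite, since they share a reference group, so this is only an approximation):

```r
alpha <- 0.05   # Type I error rate for a single test
k     <- 2      # number of tests we ran

# Chance of at least one false positive across k independent tests
fwer <- 1 - (1 - alpha)^k
fwer
```

So two tests at 0.05 each give roughly a 0.0975 chance of at least one false positive, almost double the rate we thought we were using.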