3. Aug 24th Flashcards
Can the null hypothesis be true?
Yes, in a controlled manipulative study.
What are the many types of regression tests all examples of?
The general linear model
- The most important topic of this entire class
What is the simplest form of the general linear model?
Simple linear regression
- Continuous x, continuous y
What are the two purposes of regression?
- Fit a line to data
- Test if the slope of that line is significant
- – Significant if the p-value < 0.05 (see the R sketch below)
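A minimal R sketch of both purposes, assuming a data frame datum with columns X and Y (hypothetical names):
results = lm(Y ~ X, data = datum)   # purpose 1: fit a line to the data
summary(results)                    # purpose 2: the coefficient table reports the slope's p-value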
What are the three possible causes of a large p-value?
If the p-value > 0.05, we can't tell which of these is the cause:
A) The sample size is too small
B) The effect size is too small
C) There is too much noise
The traditional equation for a line
y = mx + b
b is the y-intercept
m is the slope (change in y / change in x)
The stats equation for a line
General
Y = \beta_0 + \beta_1 X
\beta_0 is a constant (the y-intercept)
\beta_1 is the regression coefficient (the slope)
Specific
\hat{y} = \beta_0 + \beta_1 X
\hat{y} = the predicted value of the dependent variable
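A quick worked example with made-up numbers: if \beta_0 = 2 and \beta_1 = 0.5, then at X = 10 the prediction is
\hat{y} = 2 + 0.5 \times 10 = 7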
What two elements make up the relationship between every x and y (i.e. the line)?
Our equation PLUS epsilon: Y = \beta_0 + \beta_1 X + \varepsilon
\varepsilon = error (normally distributed)
What two requirements must be met to be a “best fit line”?
- Average error = 0
- – (the signed distances from each point to the line, added up)/n = 0
- The sum of squared errors (SSE) is minimized (see the R check below)
- – Squaring gets rid of negatives
- – Complicated models are fit iteratively
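A small R check of both requirements on a fitted model (sketch; datum, X, and Y are hypothetical names):
results = lm(Y ~ X, data = datum)
mean(residuals(results))    # average error: effectively 0 (up to rounding)
sum(residuals(results)^2)   # the SSE that lm() has minimized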
What is the more technical name for “total variance in y”?
Total Sum of Squares
Total Sum of Squares (SST): the summed squared distance from each data point to the null-hypothesis line (slope = 0, i.e. the mean of y)
Represents ALL the information we have
SST = SSR + SSE
SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2
Y_i = any individual observed value of the dependent variable
\bar{Y} = the average/mean of y
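As a sketch, SST can be computed straight from this definition in R (assuming the dependent variable is datum$Y, a hypothetical name):
SST = sum((datum$Y - mean(datum$Y))^2)   # squared distance from each observation to the mean of y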
What 2 things does the total sum of squares partition variation into?
https://365datascience.com/sum-squares/
- SSR: sum of squares due to regression
— The variation in y due to variation in x
— SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2
— \hat{Y}_i = your predicted value of y
— \bar{Y} = the average/mean of y (the mean of the dependent variable)
- SSE: sum of squares due to error
— Noise
— SSE = \sum_{i=1}^{n} e_i^2
— e_i = the difference between the observed and predicted value (Y_i - \hat{Y}_i)
(Both pieces are verified in the R sketch below.)
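A sketch verifying the partition in R, assuming a fitted model results = lm(Y ~ X, data = datum) (hypothetical names):
SSR = sum((fitted(results) - mean(datum$Y))^2)   # variation in y explained by x
SSE = sum(residuals(results)^2)                  # leftover noise
SSR + SSE                                        # should equal SST from above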
What type of p-value will you get if sum of square error (SSE) is larger than sum of squares regression (SSR)?
A large p-value (greater than 0.05)
What type of p-value will you get if sum of squares regression (SSR) is greater than sum of squares error (SSE)?
A smaller p-value (smaller than 0.05)
Also means that movement in y is mostly due to x
Important takeaways of regression
1) Best fit lines mean:
- – a) Average error = 0
- – b) The sum of squared errors (SSE) is minimized
2) P-values are calculated by partitioning variation in y into:
- – Sum of squares regression (SSR)
- – Sum of squares error (SSE)
What does regression display?
Correlation
NOT causation
What DO you have to do to determine causation?
A MANIPULATIVE experiment.
Observational studies will not prove causation.
This is the VERY reason that debates about global warming even exist.
— We don’t have 3 Earths to manipulate. We cannot do manipulative studies.
How can you test to be absolutely sure your R data results are correct?
Make the data yourself.
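A sketch of that idea in R: simulate data with a known intercept, slope, and normal error, then check that lm() recovers them (all numbers are made up):
X = 1:100
Y = 3 + 0.5*X + rnorm(100, mean = 0, sd = 2)   # true intercept 3, true slope 0.5, normal error
fake = data.frame(X, Y)
results = lm(Y ~ X, data = fake)
summary(results)   # the estimated coefficients should land close to 3 and 0.5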
What does the professor not believe in?
“I don’t believe in randomness. Randomness is just other measurable things we haven’t measured.”
How to load data into R
1) datum=read.csv(file.choose())
2) head(datum)
How to plot in R
plot(Y~X,data=datum)
How to run almost any regression in R
lm(Y~X, data=datum) (the same formula and data arguments as plot)
What does he recommend you save your results as?
results
results=lm(Biomass~Rainfall, data=datum)
summary(results)
What single function gives you most of the data you’re looking for in your results?
summary()
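Putting the whole workflow together (sketch; the Biomass/Rainfall column names come from the example above):
datum = read.csv(file.choose())                  # 1) load the data
head(datum)                                      # check that it read in correctly
plot(Biomass ~ Rainfall, data = datum)           # 2) plot y against x
results = lm(Biomass ~ Rainfall, data = datum)   # 3) fit the regression
summary(results)                                 # 4) slope, p-value, R^2
abline(results)                                  # draw the best fit line on the plot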
What is R^2?
The proportion of variation in y explained by x
If:
R^2 = 1, all points are on the regression line. Perfect fit.
R^2 = 0, the regression line is flat (no slope); x explains none of the variation in y
R^2 always goes up when you add more x (predictor) variables (see the check below)
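As a check, R^2 can be computed by hand from the sums of squares (sketch, reusing SSE, SST, and results from the partition sketch above):
1 - SSE/SST                  # proportion of variation in y explained by x
summary(results)$r.squared   # the R^2 that summary() reports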