6. Sept. 5 Flashcards
Quiz
- Stats equation for line
Y = β0 + β1X + ε, where ε ~ N(0, σ)
Y equals beta zero plus beta 1 times x plus error that’s normally distributed with a mean of zero and standard deviation of sigma
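A quick way to see this model in action (a sketch, not from the lecture; the parameter values and datumSim object are made up) is to simulate data from it and fit a line in R:
set.seed(1)                                  # made-up example values
X = runif(50, 0, 10)                         # predictor values
Y = 2 + 0.5*X + rnorm(50, mean=0, sd=1)      # Y = B0 + B1*X + error ~ N(0, sigma)
datumSim = data.frame(X=X, Y=Y)
lm(Y~X, data=datumSim)                       # estimates should come out near 2 and 0.5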
Confidence interval correct definition
95% of such intervals contain the true value
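One way to convince yourself of this definition (a sketch with made-up values, not from class) is to simulate many datasets from a known model and count how often the 95% interval for the slope actually covers the true slope:
set.seed(1)
trueSlope = 0.5                              # made-up "true" value
covered = replicate(1000, {
  X = runif(30, 0, 10)
  Y = 2 + trueSlope*X + rnorm(30, 0, 1)
  ci = confint(lm(Y~X))["X", ]               # 95% confidence interval for the slope
  ci[1] <= trueSlope & trueSlope <= ci[2]    # does this interval contain the true value?
})
mean(covered)                                # should be close to 0.95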
What are the 3 characteristics of the data and underlying relationship that influence the significance (p-value) of a regression analysis?
- Sample size, slope (effect size), and noise
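A sketch of how each of the three matters (pFromSim is a made-up helper, not from class): the slope's p-value drops as sample size or effect size goes up and as noise goes down.
pFromSim = function(n, slope, sigma) {       # n = sample size, slope = effect size, sigma = noise
  X = runif(n, 0, 10)
  Y = 2 + slope*X + rnorm(n, 0, sigma)
  summary(lm(Y~X))$coefficients["X", "Pr(>|t|)"]   # p-value for the slope
}
set.seed(1)
pFromSim(n=10,  slope=0.5, sigma=5)          # small sample, lots of noise: p usually large
pFromSim(n=200, slope=0.5, sigma=1)          # big sample, less noise: p tiny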
Definition of R^2?
Proportion of variation in Y explained by variation in X
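In R, once the model has been fit (results=lm(Y~X,data=datum), as later in these notes), R^2 comes from summary():
summary(results)              # R-squared is reported near the bottom of the output
summary(results)$r.squared    # or pull out just the R^2 value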
5 Assumptions of General Linear Model
- Y is continuous
- Normal distribution of error
- Linear relationship
- Homoscedasticity (constant variance)
- No autocorrelation (lack of interdependence)
How to look for violations of assumptions?
Plot it
In R
datum = read.csv(file.choose())          # datum is the non-normal dataset
head(datum)
datumNonlin = read.csv(file.choose())    # non-linear dataset
head(datumNonlin)
datumHetero = read.csv(file.choose())    # heteroscedastic dataset
head(datumHetero)
datumAuto = read.csv(file.choose())      # autocorrelated dataset
head(datumAuto)
plot(Y~X,datum)
Residuals Plot
Useful for each one of these violations
Have to run the analysis
results=lm(Y~X,data=datum)
plot(residuals(results) ~datum$X)
- Pulling from two different objects (residuals(results) and datum), so you have to use the dollar sign in datum$X to stipulate which object X comes from
— He’s taken the regression line and “flattened it” so it is now the x axis (horizontal)
— You see it’s non-normally distributed because it’s not equally above and below 0
Nonlinear
plot(Y~X, data=datumNonlin)
For non-linear
resultsNonlin=lm(Y~X,data=datumNonlin)
abline(resultsNonlin)
plot(residuals(resultsNonlin)~datumNonlin$X)
Heteroscedasticity
plot(Y~X,data=datumHetero)
resultsHetero=lm(Y~X,data=datumHetero)
abline(resultsHetero)
plot(residuals(resultsHetero)~datumHetero$X)
Autocorrelation
plot(Y~X,data=datumAuto)
resultsAuto=lm(Y~X,data=datumAuto)
abline(resultsAuto)
plot(residuals(resultsAuto)~datumAuto$X)
Histogram of residuals
A way of looking at normality of residuals in a GLOBAL sense.
There are 2 ways of thinking about how NORMAL the data really are.
- Global - how normal they are around the entire line
- Local - how normal they are relative to neighboring points
Important for cases like:
If we fit a straight line through the non-linear data
- The residuals come out non-normally distributed
Also, if you appear to have TWO violations, you usually just have one BIG one, and the other is not so bad once you fix the first.
Histogram in R
hist(residuals(results))
Best for looking at normality of the residuals (in a global sense)
Bell curve kurtosis
How "pinched" the curve is. Most of the mass in a tiny sliver in the middle.
ACF
Auto-correlation function
Both a graph AND a statistical test
datumNorm=read.csv(file.choose())
plot(Y~X,data=datumNorm)
resultsNorm=lm(Y~X,data=datumNorm)
abline(resultsNorm)
acf(residuals(resultsNorm)[order(datumNorm$X)])
- Still need things to be in X order, but we can't technically use a tilde here because acf() doesn't take a formula
- Have to put in the brackets
- Data has to be in order spatially or temporally, and that happens with the order command
- Not necessary if your data were already in that order
- It will always show that a point is perfectly correlated with itself (at lag 0 the correlation is 1.0)
You're looking for correlation at the first few lags (1-4). Past lag 10 you really don't care much.
acf(residuals(resultsAuto) [order(datumAuto$X)])
Nonlinear ACF
acf(residuals(resultsNonlin) [order(datumNonlin$X)])
It LOOKS like it's autocorrelated, but the points are only related LOCALLY: because the relationship is curved, neighboring points cluster together, so nearby residuals end up similar.
SO, ACF is only good for autocorrelation.
- BUT remember it will give a strong signature if you have non-linearity.
SHIFT GEARS from assumptions to predictions
One of the real values of statistics is you can use them to make PREDICTIONS.
- You can easily use the GLM to make predictions
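A minimal sketch of how that works in R with predict(), using the results model fit earlier; newX is a made-up set of X values to predict at:
newX = data.frame(X = c(1, 5, 10))                       # made-up X values to predict at
predict(results, newdata=newX)                           # point predictions from the fitted line
predict(results, newdata=newX, interval="confidence")    # uncertainty in the mean prediction
predict(results, newdata=newX, interval="prediction")    # uncertainty for a new observation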