Discovering Statistics Flashcards
what does SPINE of statistics stand for
Standard error, Parameters, Interval estimates, Null hypothesis significance testing, Estimation
general linear model
outcome = b0 + b1(predictor) + error
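A minimal sketch of fitting this model in R, using the built-in mtcars data as a stand-in for real data:

model_lm <- lm(mpg ~ wt, data = mtcars)   # outcome = mpg, predictor = wt
summary(model_lm)                         # (Intercept) is b0; the wt row is b1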
chi-squared test
chisq.test(data$variable1, data$variable2, correct = FALSE)
for categorical or count data
spearman correlation
data %>% correlation::correlation(., method = "spearman")
continuous data
what do the parts of GLM stand for
- b0 = estimated value of the outcome when the predictor = 0 (the intercept)
- b1 = difference in means when the linear model compares two categorical groups
- bn = parameter estimate for predictor n: direction/strength of the effect, or a difference in means
least squares estimation
- when there are no predictors, we predict the outcome from the intercept alone
- outcome = b0 + e
- b0 will be the mean value of the outcome in this scenario
- given data, estimate the mean
- rearrange the equation -> error = outcome - b0
- square the errors and plot their sum against each candidate estimate
- keep trying different estimates of the mean
- the lowest point of the graph is the least squared error; the estimate there is the mean (see the sketch below)
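A small sketch of this idea with invented toy data: compute the sum of squared errors for a range of candidate values of b0 and find where it bottoms out.

scores <- c(4, 6, 7, 7, 9)                        # toy data
candidates <- seq(min(scores), max(scores), 0.1)  # candidate estimates of b0
sse <- sapply(candidates, function(b0) sum((scores - b0)^2))
plot(candidates, sse, type = "l",
     xlab = "candidate b0", ylab = "sum of squared errors")
candidates[which.min(sse)]                        # approximately mean(scores)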
standard error
- frequency distribution -> plot sample means against their frequency (the sampling distribution)
- the SD of the sampling distribution is called the standard error; a narrower sampling distribution means a smaller SE (see the sketch below)
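A simulation sketch of the sampling distribution (population and sample sizes are invented for illustration):

set.seed(42)
pop <- rnorm(100000, mean = 100, sd = 15)               # pretend population
sample_means <- replicate(1000, mean(sample(pop, 50)))  # 1000 sample means
sd(sample_means)                                        # empirical standard error
15 / sqrt(50)                                           # analytic SE: sd/sqrt(n)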
central limit theorem
as sample size grows, the sampling distribution of the mean becomes normal, with the majority of sample means clustered around the population mean
normal distribution
±1.96 SD of the mean contains 95% of the data
confidence intervals
express estimates as intervals designed to contain the population value
across repeated samples, 95% of such intervals contain the true population parameter (see the sketch below)
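A sketch of a 95% CI for a mean using the 1.96 rule (toy sample; for a fitted model, confint() does this directly):

x <- rnorm(50, mean = 10, sd = 2)     # toy sample
se <- sd(x) / sqrt(length(x))         # standard error of the mean
mean(x) + c(-1.96, 1.96) * se         # lower and upper bounds
qnorm(0.975)                          # where 1.96 comes from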
interpreting parameter estimates
raw effect size is the b (beta) estimate
standardised effect size comes from fitting the model to z-scored data (effects expressed in standard deviation units)
long run probability: parameters represent effects
relationships between variables
differences in means
long run probability: parameters reflect hypotheses
h0: b = 0, or b1 = b2
h1: b ≠ 0, or b1 ≠ b2
long run probability: test statistic
t = b / SE(b)
we can work out how likely a value this large is if the null is true
value of t on x axis and probability on y
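A sketch of the computation; b, its standard error, and the sample size are made-up numbers:

b <- 0.45                 # made-up parameter estimate
se_b <- 0.15              # made-up standard error
t_val <- b / se_b         # t = b / SE(b)
2 * pt(abs(t_val), df = 50 - 2, lower.tail = FALSE)   # two-tailed p for N = 50, p = 2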
type 1 error
reject null when it is true
believe in effects that don't exist
type 2 error
accept the null when it is false
statistical power
probability of a test avoiding a type 2 error (power = 1 - beta)
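A sketch using the pwr package (an assumption; the notes don't name a power tool): power of an independent t-test with 30 per group and a medium effect size.

library(pwr)
pwr.t.test(n = 30, d = 0.5, sig.level = 0.05, type = "two.sample")$power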
problems with null hypothesis testing
- does not tell us the importance of an effect
- provides little evidence about the null hypothesis
- encourages all-or-nothing thinking
- based on long run probability
problem with long run probability
p is the relative frequency of the observed test statistic relative to all test statistics from an infinite number of identical experiments with the exact same a priori sample size
for any single experiment, the type 1 error rate is either 0 or 1 (the null is either true or it isn't)
comparing sums of squares
- sums of squares represent total error
- only compare the totals when they are based on the same number of scores
illusory truth effect
repetition increases perceived truthfulness
equally true for plausible and implausible statements
SSt
-total variability between the mean and the observed scores
-SSt = SSm + SSr
-each SS has an associated df
dfT = N - p (p = number of parameters, N = pieces of independent information)
SSr
- total residual/error variability
- error in model
- to get SSr we estimate the outcome using two parameters (b0 and b1)
- dfR = N - p, so p is 2
SSm
- total model variability
- improvement due to model
- the estimated model is a rotation of the null model
- the null and estimated models are distinguished by b1
- dfM = dfT - dfR
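A sketch tying the three sums of squares together, again using the built-in mtcars data as a stand-in:

model_lm <- lm(mpg ~ wt, data = mtcars)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variability
ssr <- sum(residuals(model_lm)^2)               # residual variability
ssm <- sst - ssr                                # improvement due to the model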
mean squared error
- the sum/total amount of squared error depends on the amount of information used to compute it
- can't compare sums based on different amounts of information
- MSr = SSr/dfR (average residual error)
- MSm = SSm/dfM (average model variability)
F statistic
- testing fit
- a significant fit represents a significant effect of the experimental manipulation
- if the model results in better prediction than the mean, then MSm > MSr
- F = MSm / MSr
- car::Anova(model_lm)
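A sketch computing F by hand and checking it against the ANOVA table (mtcars as stand-in; one predictor, so dfM = 1 and dfR = N - 2):

model_lm <- lm(mpg ~ wt, data = mtcars)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
ssr <- sum(residuals(model_lm)^2)
ms_m <- (sst - ssr) / 1               # MSm = SSm/dfM
ms_r <- ssr / (nrow(mtcars) - 2)      # MSr = SSr/dfR
ms_m / ms_r                           # F statistic
anova(model_lm)                       # same F in the base-R ANOVA table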
testing the model
- R^2: proportion of variance accounted for by the model
- the squared Pearson correlation between observed and predicted scores
- R^2 = SSm/SSt
- adjusted R^2: estimate of R^2 in the population
broom::glance(data_lm)
how to enter predictors
- hierarchical (experimenter decides the order of entry)
- forced entry (all predictors entered simultaneously)
- stepwise (only for exploratory analysis; predictors selected by their semi-partial correlation with the outcome)
influential case
- outliers distort the linear model and the estimates of the b values
- detect them with: graphs, standardised residuals, Cook's distance, DFBeta statistics
- ggplot2::autoplot(data_lm, which = 4, …) + theme_minimal() plots Cook's distance for each case (the lm methods for autoplot come from ggfortify); numeric versions below
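The same diagnostics are available numerically in base R (mtcars as stand-in):

model_lm <- lm(mpg ~ wt, data = mtcars)
rstandard(model_lm)         # standardised residuals
cooks.distance(model_lm)    # values > 1 are conventionally worth a look
dfbeta(model_lm)            # change in each b when a case is deleted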
robust estimation
- use a robust model when outliers can't simply be removed
- lm_rob <- robust::lmRob(outcome ~ predictor, data = data)
- summary(lm_rob)
key assumptions of linear model
linearity (the relationship between predictor and outcome is linear) and additivity (the combined effect of predictors is the sum of their individual effects)
spherical errors (population errors are homoscedastic and independent)
normality of errors
errors vs residuals
- model errors are the differences between the predicted and observed values of the outcome variable in the POPULATION model
- residuals are the differences between the predicted and observed values of the outcome in the SAMPLE model
spherical errors
- errors should be independent
- the population error in prediction for one case should not be related to the error in prediction for another case
- errors should be homoscedastic
- if violated, standard errors (and so CIs and p-values) are inaccurate
homoscedasticity of errors
variance of the population errors should be consistent across different values of the predictor variable(s)
violation of assumption
b’s unbiased but not optimal
standard error incorrect
robust procedures
bootstrap -> standard errors derived empirically using a resampling technique; designed for small samples; gives robust b's, p-values, and CIs
heteroscedasticity-consistent standard errors -> use HC3 or HC4 methods (sketch below)
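A sketch of HC3 heteroscedasticity-consistent standard errors with the sandwich and lmtest packages (an assumption: the notes may use different tooling; mtcars as stand-in):

library(sandwich)
library(lmtest)
model_lm <- lm(mpg ~ wt, data = mtcars)
coeftest(model_lm, vcov. = vcovHC(model_lm, type = "HC3"))   # HC3 robust SEs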
dummy coding
- code the control group 0 and the other group 1
- b for the dummy variable is the difference between the means of the two conditions
- mean of condition 1 = b0 + b1(0) = b0
- mean of condition 2 - mean of condition 1 = b1
- dummy-coded comparisons are not independent, because every group is compared against the same baseline
contrast coding model
- outcome = b0 + b1(contrast 1) + b2(contrast 2)
- b0 is the mean of the control group
- b1 is the difference between the mean of the group coded by contrast 1 and b0
- b2 is the difference between the mean of the group coded by contrast 2 and b0
planned contrasts
variability explained by the model (SSm) is due to participants being assigned to different groups
this variability represents the experimental manipulation
what to consider when choosing contrasts
- independent, to control the type 1 error rate: if a group is singled out in one contrast, it shouldn't be used again
- each contrast should compare only 2 chunks of variation
- k - 1: you end up with one fewer contrast than the number of groups
- the first contrast should compare the control to all experimental groups
rules of coding planned contrasts
1. groups coded with positive weights are compared to groups coded with negative weights
2. the sum of the weights should equal 0
3. if a group is not used in a contrast, code it 0
4. the initial weight assigned to a group equals the number of groups in the opposite chunk
5. final weight = initial weight / number of groups with non-zero weights (see the sketch below)
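A sketch applying these rules to three groups, a control plus two experimental groups (data are invented):

my_data <- data.frame(
  outcome = c(5, 6, 7, 9, 10, 11, 8, 9, 10),
  group   = factor(rep(c("control", "exp1", "exp2"), each = 3))
)
contrast1 <- c(-2, 1, 1) / 3   # rules 1-4, then rule 5 (3 non-zero weights)
contrast2 <- c(0, -1, 1) / 2   # control not used here, so coded 0 (rule 3)
contrasts(my_data$group) <- cbind(contrast1, contrast2)
summary(lm(outcome ~ group, data = my_data))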