Analysing data Flashcards
Greek symbols
population mean: µ
sample mean: x̄
population mean estimate: μ̂
SD: σ
normal distribution
- bell curved
- peak is its mean
- mean median mode same value
- centring: changing the mean shifts the curve left/right
- SD determines the steepness of the curve
- scaling: changing the SD stretches or squashes the curve
- 68.2% of data within ±1 SD of the mean
- 95.4% of data within ±2 SD of the mean
- 99.7% of data within ±3 SD of the mean (checked in the sketch below)
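a minimal sketch in R checking these proportions with pnorm(), using the standard normal (mean 0, SD 1):
pnorm(1) - pnorm(-1)   # ~0.682, within ±1 SD
pnorm(2) - pnorm(-2)   # ~0.954, within ±2 SD
pnorm(3) - pnorm(-3)   # ~0.997, within ±3 SD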
critical values
if the SD is known, we can calculate the critical value for any proportion of normally distributed data, as in the sketch below
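a minimal sketch in R using qnorm(), the normal quantile function (the mean-100/SD-15 scale is a hypothetical example):
qnorm(0.975)                       # ~1.96, cuts off the central 95% of the standard normal
qnorm(0.975, mean = 100, sd = 15)  # ~129.4, same idea on a hypothetical scale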
sampling from distributions
- collecting data on a variable involves randomly sampling from a distribution
- the underlying distribution is assumed to be normal
- some variables may come from other distributions: log-normal, Poisson, binomial
- sample statistics differ from population parameters
- the sampling distribution is centred around the population mean
standard error
the standard deviation of the sampling distribution
can be estimated from any sample
SE = SD / √N
gauges the accuracy of a parameter estimate in a sample
the smaller the SE, the more likely the parameter estimate is close to the population parameter (see the sketch below)
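a minimal sketch in R (the vector x is hypothetical data):
x <- c(4.1, 5.3, 6.2, 4.8, 5.9, 5.1)  # hypothetical sample
se <- sd(x) / sqrt(length(x))         # SE = SD / √N
se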
central limit theorem
- the sampling distribution of the mean is approximately normal, no matter the shape of the population distribution
- as N gets larger, the sampling distribution of the sample mean tends towards a normal distribution
- mean = µ, SD = σ/√N (demonstrated in the sketch below)
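a minimal sketch of the CLT in R, sampling from a skewed (exponential) distribution whose population mean and SD are both 1:
set.seed(1)
sample_means <- replicate(10000, mean(rexp(50, rate = 1)))
mean(sample_means)  # close to the population mean, 1
sd(sample_means)    # close to 1 / sqrt(50), i.e. SD/√N
hist(sample_means)  # roughly bell-shaped despite the skewed population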
point estimates
- single numbers that are best guesses about the corresponding population parameters
- central tendency, measures of spread
- relationships between variables can be expressed using point estimates
what does SE of mean express?
- uncertainty about the relationship between the sample mean and the population mean
- the sample mean is the best estimate of the population mean; the same holds for all point estimates
interval estimates
- communicate uncertainty around a point estimate
- indicate how confident we can be that the estimate is representative of the population parameter
confidence interval (CI)
- use the SE and the sampling distribution to calculate a CI with a given coverage
- 95% CI: 95% of intervals constructed this way around sample estimates will contain the value of the population parameter
- 95% of the sampling distribution lies within ±1.96 SE, so the 95% CI around the estimated population mean is mean ± 1.96 × SE
t-distribution
- used when the true sampling distribution is unknown and the SE must be estimated from the sample
- symmetrical and centred around 0
- shape changes based on degrees of freedom
- 'fat-tailed' when df = 1; approaches the normal distribution as df → ∞
- as df increases, tails get thinner
- critical value changes based on df
- df = N − k, where k is the number of estimated parameters; for a single mean, df = N − 1
what you need to calculate confidence intervals
estimated mean
sample SD
N
critical value from the t-distribution with df = N − 1
- the 95% CI around the estimated population mean is mean ± critical value × SE (see the sketch below)
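a minimal sketch in R putting these pieces together (x is hypothetical data):
x <- c(4.1, 5.3, 6.2, 4.8, 5.9, 5.1)     # hypothetical sample
se <- sd(x) / sqrt(length(x))            # standard error
t_crit <- qt(0.975, df = length(x) - 1)  # critical value from the t-distribution
mean(x) + c(-1, 1) * t_crit * se         # lower and upper bounds of the 95% CI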
CIs are useful:
- the width of the interval tells us how much we expect the mean of a different sample of the same size to vary from the one we got
- there is an x% chance that any given x% CI contains the true population mean
- can be calculated for any point estimate
hypothesis
- statement about something in terms of differences or relationships between things/people/groups
- must be testable
- about a single thing
levels of hypotheses
- conceptual: expressed in normal language on level of concepts/constructs
- operational: restates conceptual hypothesis in terms of how constructs are measured in given study
- statistical: translates operational hypothesis into language of mathematics
operationalisation
- process of defining variables in terms of how they are measured
- e.g. intelligence as the total score on Raven's Progressive Matrices
Statistical hypothesis
- operational hypothesis in terms of language of maths
- deals with specific values of population parameters
- mean of population can be hypothesised to be of given value
- can hypothesise a difference in means between two populations
problems with samples used to test hypotheses
may not be representative of the population
the larger the sample the better, as fluctuations become less important as N increases
means converge to the true value of the population mean as N increases
CIs get narrower as N increases (width shrinks in proportion to 1/√N)
null hypothesis
states there is no difference
used to test for statistical significance
distribution of the test statistic under H0
even if the true population difference Δ is zero, D can be non-zero in a sample
e.g. assume A is normally distributed in the population with µ = 0 and σ = 1; the expected value of D under H0 is 0, but more often than not D will not equal 0 in a sample (see the sketch below)
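a minimal sketch in R, taking D to be the difference in means between two hypothetical samples of size 30 drawn from the same standard normal population:
set.seed(1)
D <- replicate(10000, mean(rnorm(30)) - mean(rnorm(30)))
mean(D)  # close to 0, the expected value of D under H0
hist(D)  # yet any single D is almost never exactly 0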
what is a p-value
- the probability of getting a test statistic at least as extreme as the one observed if the null hypothesis is true, i.e. how likely the data are if there is no difference/effect in the population
- if the p-value is less than the chosen significance level, we call the result statistically significant
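a minimal sketch in R of a two-sided p-value from a hypothetical t statistic:
t_val <- 2.1; df <- 20        # hypothetical values
2 * pt(-abs(t_val), df = df)  # ~0.049, probability of a result at least this extreme under H0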
retain or reject null
reject the null hypothesis when we judge our result to be unlikely under H0
retain H0 if we judge the result to be likely under it
continuous data
- matter of degree eg how much
- score or measurement
- makes sense to have mean value
categorical data
- matter of membership eg which group?
- group or label
- membership is binary
for each statistical analysis we need:
data
test statistic
distribution of test statistic
probability of the value of the test statistic under the null hypothesis
correlation
- quantifies degree and direction of numeric relationship
- used with two or more continuous variables, or when one is categorical
- use Pearson's correlation coefficient
- only describe variables as 'correlated' when reporting r as evidence
what code in R is used to get Pearson's correlation
library(dplyr)  # provides %>% and select()
data %>% select(variable1, variable2) %>% cor(method = "pearson")
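to also get a p-value and a CI for r, cor.test() can be used (column names hypothetical):
cor.test(data$variable1, data$variable2, method = "pearson")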
what can you suggest when confidence intervals overlap
the estimates may share the same population value
chi squared test
- quantifies the relationship between two or more categorical variables
- compare observed counts with those expected under the null and calculate χ² to quantify the difference
- only use χ² if the expected count is greater than 5 in each cell
- the bigger the χ² value, the bigger the difference between our data and what we would expect under the null
important note about chi squared
χ² only tests the null hypothesis; a significant result is evidence against the null being true, not evidence for a specific alternative
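a minimal sketch in R with chisq.test() on a hypothetical 2×2 table of counts:
counts <- matrix(c(30, 20, 15, 35), nrow = 2)  # hypothetical group × outcome counts
chisq.test(counts)           # reports X-squared, df, and the p-value
chisq.test(counts)$expected  # check every expected count is greater than 5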
using t distribution
- t is the difference in sample means compared to the standard error of the difference in means: t = (x̄1 − x̄2) / SE
- the larger the t, the bigger the difference between the sample means relative to the error (see the sketch below)
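a minimal sketch in R with t.test() on two hypothetical groups:
a <- c(5.1, 4.8, 6.0, 5.5, 4.9, 5.7)  # hypothetical scores, group a
b <- c(4.2, 4.6, 5.0, 4.4, 4.9, 4.1)  # hypothetical scores, group b
t.test(a, b)  # reports t, df, p-value, and the 95% CI of the difference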
t and r
- the p-value from r comes from the t distribution
- t can be converted into r: r = t / √(t² + df) (sketched below)
- t quantifies the difference in means between two groups
- r quantifies the degree and direction of the relationship between two variables
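a minimal sketch of the conversion in R (t and df values hypothetical):
t_val <- 2.5; df <- 28           # hypothetical t statistic and degrees of freedom
r <- t_val / sqrt(t_val^2 + df)  # r = t / √(t² + df)
r                                # ~0.43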
what is a predictor
variable that may have relationship with outcome
what is an outcome
variable we want to explain
outcome = model + error
linear model
- creates linear model between outcome variable and predictor variable in dataset
- look at lm() %>% summary()
- R^2 is the proportion of variance in the outcome explained by the predictor(s)
- adjusted R^2 estimates how much variance would be explained if the same model were applied to the population
- R^2 and adjusted R^2 should be similar and large
R code for a linear model
model <- lm(outcome ~ predictor, data = data)
summary(model)  # coefficients, R^2, adjusted R^2, F statistic
equation of linear model
outcome = b0 + b1 × predictor1 + error
F statistic
F = (what the model can explain) / (what it cannot explain)
- the ratio of variance explained relative to variance unexplained
- a ratio > 1 means the model explains more than it leaves unexplained
- the associated p-value gives how likely an F statistic as large as the one observed is if the null is true
how to compare linear models
- compare R^2 and the change in R^2
- compare the F statistic and its associated p-value
- look at standardised versions of b1 (see the sketch below)
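a minimal sketch in R comparing two nested models (variable names hypothetical):
m1 <- lm(outcome ~ predictor1, data = data)
m2 <- lm(outcome ~ predictor1 + predictor2, data = data)
anova(m1, m2)  # F test for whether the added predictor explains extra variance
summary(m2)$adj.r.squared - summary(m1)$adj.r.squared  # change in adjusted R^2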