Analysing data Flashcards

1
Q

Greek symbols

A

population mean: µ
sample mean: x̄
population mean estimate: μ̂
SD: σ

2
Q

normal distribution

A
  • bell-shaped and symmetrical
  • peak is at its mean
  • mean, median and mode have the same value
  • centring: changing the mean shifts the curve left/right
  • SD determines the spread (steepness) of the curve
  • scaling: changing the SD makes the curve wider and flatter or narrower and steeper
  • 68.3% of data within +/- 1 SD of mean
  • 95.4% of data within +/- 2 SD of mean
  • 99.7% of data within +/- 3 SD of mean
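The coverage figures can be checked in R. This is a minimal sketch using base R's pnorm(); the variable names are illustrative only.

```r
# Proportion of a normal distribution lying within +/- k SD of the mean.
# pnorm(z) is the cumulative probability below z on the standard normal.
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)

# Show as percentages, one value per k
round(coverage * 100, 1)
```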
3
Q

critical values

A

if the SD (and mean) is known, we can calculate the critical value (cut-off) for any proportion of normally distributed data
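A rough sketch of how such critical values could be obtained in R with qnorm(). The proportions, and the mean of 100 and SD of 15, are made-up values for illustration.

```r
# z values that cut off the central 90%, 95% and 99% of a standard normal.
central <- c(0.90, 0.95, 0.99)
z_crit <- qnorm(1 - (1 - central) / 2)

# For a variable with a known mean and SD, rescale the z value.
# mu and sigma are hypothetical numbers, not taken from the notes.
mu <- 100
sigma <- 15
lower_95 <- mu - z_crit[2] * sigma
upper_95 <- mu + z_crit[2] * sigma
```

The middle value of z_crit is the familiar 1.96 used for 95% intervals.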

4
Q

sampling from distributions

A
  • collecting data on a variable amounts to randomly sampling from its distribution
  • underlying distribution is often assumed to be normal
  • some variables may come from other distributions, e.g. log-normal, Poisson or binomial distributions
  • sample statistics differ from population parameters (and from one sample to the next)
  • sampling distribution of the mean is centred on the population mean
5
Q

standard error

A

standard deviation of the sampling distribution
can be estimated from any single sample
SE = SD / √N
used to gauge the accuracy of a parameter estimate from a sample
the smaller the SE, the more likely the parameter estimate is close to the population parameter
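The formula can be applied directly in R. A minimal sketch, assuming a made-up sample of ten scores:

```r
# Made-up sample of 10 scores, for illustration only.
scores <- c(12, 15, 9, 14, 11, 13, 16, 10, 12, 14)

n  <- length(scores)
se <- sd(scores) / sqrt(n)   # SE = SD / sqrt(N)
se
```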

6
Q

central limit theorem

A
  • for large enough N, the sampling distribution of the mean is approximately normal, no matter the shape of the population distribution
  • as N gets larger, the sampling distribution of the sample mean tends towards a normal distribution
  • its mean = µ and its SD = σ / √N (the standard error)
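One way to see this is a small simulation. The sketch below assumes a skewed population (exponential with rate 1, so mean 1 and SD 1); the sample size, number of repetitions and seed are arbitrary choices.

```r
set.seed(1)     # arbitrary seed so the simulation is reproducible

n    <- 50      # size of each sample (illustrative)
reps <- 10000   # number of samples drawn

# Draw many samples from a strongly skewed population and keep each mean.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean(sample_means)   # should be close to the population mean, 1
sd(sample_means)     # should be close to sigma / sqrt(n)
1 / sqrt(n)

# hist(sample_means) would show a roughly bell-shaped distribution
```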
7
Q

point estimates

A
  • single numbers that are best guesses about corresponding population parameters
  • e.g. measures of central tendency and measures of spread
  • relationships between variables can also be expressed using point estimates
8
Q

what does SE of mean express?

A
  • uncertainty about relationship between sample and population mean
  • sample mean is best estimate of population mean, true for all point estimates
9
Q

interval estimates

A
  • communicate uncertainty around point estimate
  • indicate how confident we can be that the estimate is representative of the population parameter

10
Q

confidence interval (CI)

A
  • uses the SE and the sampling distribution to calculate an interval with a certain coverage
  • 95% CI: 95% of intervals constructed this way around sample estimates will contain the value of the population parameter
  • 95% of the sampling distribution lies within +/- 1.96 SE, so the 95% CI for the population mean is mean +/- 1.96 × SE
11
Q

t-distribution

A
  • used when the sampling distribution is not known exactly, i.e. the population SD has to be estimated from the sample
  • symmetrical and centred around 0
  • shape changes based on degrees of freedom
  • ‘fat tailed’ when df=1; identical to normal dist. when df=infinite
  • as df increases, tails get thinner
  • critical value changes based on df
  • df = N - 1 when estimating a single mean (in general, df = N minus the number of estimated parameters)
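How the critical value changes with df can be sketched with qt(). The df values below are arbitrary examples.

```r
# Two-tailed 95% critical values of the t-distribution for several df.
dfs    <- c(1, 5, 30, 100, Inf)
t_crit <- qt(0.975, df = dfs)

# Critical values shrink towards the normal value as df grows.
data.frame(df = dfs, t_crit = round(t_crit, 3))
```

With df = Inf the value matches qnorm(0.975), in line with the t-distribution becoming identical to the normal distribution.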
12
Q

what you need to calculate confidence intervals

A

estimated mean
sample SD
N
critical value from t-distribution with df = N - 1

- 95% CI around estimated pop. mean is mean +/- critical value × SE, where SE = SD / √N
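Putting the ingredients together, a 95% CI could be computed as in this sketch. The data are made up, and t.test() is used only as a cross-check.

```r
# Made-up sample, for illustration only.
times <- c(5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 5.3, 6.2)

n      <- length(times)
m      <- mean(times)                  # estimated mean
se     <- sd(times) / sqrt(n)          # sample SD / sqrt(N)
t_crit <- qt(0.975, df = n - 1)        # two-tailed 95% critical value

ci <- c(lower = m - t_crit * se, upper = m + t_crit * se)
ci

# t.test() reports the same interval.
t.test(times)$conf.int
```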

13
Q

CIs are useful:

A
  • width of the interval tells us how much we expect the mean of a different sample of the same size to vary from the one we got
  • x% of x% CIs calculated from repeated samples will contain the true population mean
  • can be calculated for any point estimate
14
Q

hypothesis

A
  • statement about something in terms of differences or relationships between things/people/groups
  • must be testable
  • about a single thing
15
Q

levels of hypotheses

A
  • conceptual: expressed in normal language on level of concepts/constructs
  • operational: restates conceptual hypothesis in terms of how constructs are measured in given study
  • statistical: translates operational hypothesis into language of mathematics
16
Q

operationalisation

A
  • process of defining variables in terms of how they are measured
  • e.g. intelligence as total score on Raven's Progressive Matrices
17
Q

Statistical hypothesis

A
  • operational hypothesis in terms of language of maths
  • deals with specific values of population parameters
  • mean of population can be hypothesised to be of given value
  • can hypothesise a difference in means between two populations
18
Q

problems with samples that test hypothesis

A

samples may not be representative of the population
the larger the sample the better, as random fluctuations become less important as N increases
sample means converge to the true value of the population mean as N increases
CIs get narrower as N increases (width shrinks in proportion to 1/√N)

19
Q

null hypothesis

A

states there is no difference

used to test for statistical significance

20
Q

distribution of test statistic under H0

A

even if the true difference in the population (delta) is zero, D can be non-zero in a sample
Assume A is normally distributed in the population with µ = 0 and σ = 1: the expected value of D under H0 is 0, but more often than not D will not equal 0 in a sample

21
Q

what is a p-value

A
  • the probability of getting test statistic at least as extreme as one observed if null hypothesis is true, how likely data is if there is no difference/effect in population
  • if p-value is less than chosen significance level, call result statistically significant
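As a sketch, a two-tailed p-value for a t statistic could be obtained with pt(). The observed t, the degrees of freedom and the significance level below are hypothetical.

```r
# Hypothetical test statistic and degrees of freedom.
t_obs    <- 2.3
deg_free <- 24

# Two-tailed p-value: probability of a t at least as extreme as t_obs
# (in either direction) if the null hypothesis is true.
p_value <- 2 * pt(abs(t_obs), df = deg_free, lower.tail = FALSE)

alpha <- 0.05        # chosen significance level
p_value < alpha      # TRUE means the result is called statistically significant
```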
22
Q

retain or reject null

A

reject the null hypothesis when we judge our result to be unlikely under H0
retain H0 if we judge the result to be likely under it

23
Q

continuous data

A
  • matter of degree eg how much
  • score or measurement
  • makes sense to have mean value
24
Q

categorical data

A
  • matter of membership eg which group?
  • group or label
  • membership is binary
25
Q

for each statistical analysis we need:

A
  • data
  • test statistic
  • distribution of test statistic
  • probability of value of test statistic under null hypothesis
26
Q

correlation

A
  • quantifies degree and direction of a numeric relationship
  • used with two or more continuous variables, or if one is categorical
  • use the Pearson correlation coefficient
  • only use the word "correlated" when reporting r as evidence
27
Q

what code in R is used to get Pearson's correlation

A

library(dplyr)
data %>% select(variable1, variable2) %>% cor(method = "pearson")
28
Q

what can you suggest when confidence intervals overlap

A

they may have the same population value
29
Q

chi squared test

A
  • quantifies relationship between two or more categorical variables
  • compare observed counts with what we might expect under the null and calculate X^2 to quantify the difference
  • only use X^2 if the expected count is greater than 5 in each cell
  • the bigger the X^2 value, the bigger the difference between our data and what we expect
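The calculation can be sketched by hand in R and checked against chisq.test(). The 2 x 2 table of counts and its labels are made up for illustration.

```r
# Made-up 2 x 2 table of counts, for illustration only.
observed <- matrix(c(30, 20,
                     15, 35),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("a", "b"),
                                   answer = c("yes", "no")))

# Expected counts under the null: row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

x2       <- sum((observed - expected)^2 / expected)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value  <- pchisq(x2, df = deg_free, lower.tail = FALSE)

# chisq.test() without the continuity correction gives the same X^2.
chisq.test(observed, correct = FALSE)$statistic
```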
30
Q

important note about chi squared

A

only tests how likely the data are if the null hypothesis is true; a significant result is not direct evidence for the alternative
31
Q

using t distribution

A
  • t is the difference in sample means compared to the standard error of the difference in means
  • the larger the t, the bigger the difference between sample means compared to error
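A sketch of this calculation for two independent groups. The scores are made up, and the standard error uses the Welch form (no equal-variance assumption), which is what t.test() does by default.

```r
# Made-up scores for two independent groups.
group_a <- c(23, 27, 31, 25, 29, 30, 26, 28)
group_b <- c(20, 24, 22, 27, 21, 25, 23, 19)

diff_means <- mean(group_a) - mean(group_b)

# Standard error of the difference in means (Welch version).
se_diff <- sqrt(var(group_a) / length(group_a) +
                var(group_b) / length(group_b))

t_stat <- diff_means / se_diff
t_stat

t.test(group_a, group_b)$statistic   # same value
```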
32
Q

t and r

A
  • p-value for r comes from the t-distribution
  • can convert t into r
  • t quantifies difference in means between two groups
  • r quantifies degree and direction of relationship between two variables
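The link between t and r can be checked with cor.test(), which reports both. The conversion used in this sketch is r = t / sqrt(t^2 + df) with df = N - 2; the paired observations are made up.

```r
# Made-up paired observations.
x <- c(2, 4, 5, 7, 8, 10, 11, 13)
y <- c(1, 3, 6, 6, 9, 9, 12, 14)

res      <- cor.test(x, y, method = "pearson")
r        <- unname(res$estimate)
t_val    <- unname(res$statistic)
deg_free <- unname(res$parameter)    # N - 2

# Convert in both directions.
r_from_t <- t_val / sqrt(t_val^2 + deg_free)
t_from_r <- r * sqrt(deg_free) / sqrt(1 - r^2)
```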
33
Q

what is a predictor

A

variable that may have a relationship with the outcome
34
Q

what is an outcome

A
  • variable we want to explain
  • outcome = model + error
35
Q

linear model

A
  • creates a linear model between the outcome variable and predictor variable in the dataset
  • look at lm() %>% summary()
  • R^2 is the proportion of variance in the outcome that is explained by the predictor
  • adjusted R^2 estimates what R^2 would be if the same model were applied to the population
  • ideally R^2 and adjusted R^2 are similar and large
36
Q

R code for linear model

A

lm(outcome ~ predictor, data = data)
37
Q

equation of linear model

A

outcome = b0 + b1 × predictor1 + error
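The equation can be traced through a fitted model. This sketch uses R's built-in mtcars data, with mpg and wt standing in for "outcome" and "predictor" purely for illustration.

```r
# mtcars ships with R; mpg and wt stand in for outcome and predictor.
fit <- lm(mpg ~ wt, data = mtcars)

b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["wt"]]

# Model part of the equation: predicted outcome = b0 + b1 * predictor
predicted <- b0 + b1 * mtcars$wt

# Error part: what is left over after the model's prediction
error <- mtcars$mpg - predicted

# outcome = model + error holds for the fitted data
all.equal(mtcars$mpg, predicted + error)
```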
38
Q

F statistic

A
  • F = (what model can explain) / (what model can't explain)
  • ratio of variance explained relative to variance unexplained
  • ratio > 1 means model can explain more than it can't
  • associated p-value: how likely we are to find an F statistic at least as large as the one observed if the null is true
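A sketch of the ratio computed by hand and compared with what summary() reports. It assumes the built-in mtcars data and a single predictor, chosen only for illustration.

```r
fit <- lm(mpg ~ wt, data = mtcars)   # built-in data, illustration only

ss_total    <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
ss_residual <- sum(residuals(fit)^2)      # what the model can't explain
ss_model    <- ss_total - ss_residual     # what the model can explain

df_model    <- 1                          # one predictor
df_residual <- df.residual(fit)           # N - 2 here

# Each sum of squares is divided by its df before taking the ratio.
f_stat  <- (ss_model / df_model) / (ss_residual / df_residual)
p_value <- pf(f_stat, df_model, df_residual, lower.tail = FALSE)
r2      <- ss_model / ss_total

summary(fit)$fstatistic   # same F, with its degrees of freedom
```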
39
Q

how to compare linear models

A
  • compare R^2 and change in R^2
  • compare F statistic and its associated p-value
  • look at standardised versions of b1
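One possible way to do these comparisons in R is sketched below. Using anova() for the F test and refitting on z-scored variables for standardised b values are assumptions about the approach, not something stated in the notes; the mtcars variables are illustrative.

```r
# Two nested models on built-in data (illustration only).
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)

# Change in R^2 when the second predictor is added
r2_change <- summary(m2)$r.squared - summary(m1)$r.squared

# F test of whether the larger model explains significantly more variance
comparison <- anova(m1, m2)
comparison

# Standardised b values: refit the model on z-scored variables
m2_std <- lm(scale(mpg) ~ scale(wt) + scale(hp), data = mtcars)
coef(m2_std)
```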