Data analysis Flashcards

1
Q

What does N1 = n/(1-d) calculate?

A

Equation to work out the necessary sample size, accounting for the expected drop-out rate

N1 = adjusted sample size
n = required sample size
d = drop-out rate (as a proportion, e.g. 0.2 for 20%)
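
A minimal sketch of the formula above, assuming d is given as a proportion and that the result is rounded up to a whole participant. The function name and the example numbers are hypothetical, not from the card.

```python
import math

def adjusted_sample_size(n: int, d: float) -> int:
    """Return N1 = n / (1 - d), rounded up to a whole participant."""
    if not 0 <= d < 1:
        raise ValueError("drop-out rate d must be a proportion in [0, 1)")
    return math.ceil(n / (1 - d))

# Hypothetical study: 100 participants required, 15% expected to drop out.
n1 = adjusted_sample_size(100, 0.15)
print(n1)  # 100 / 0.85 is about 117.6, so 118 must be recruited
```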

2
Q

What are descriptive statistics?

A
  • summarise data
  • can show averages and spread
  • can show associations and correlations
  • e.g. tables, graphs and numbers
3
Q

What are inferential statistics?

A
  • make inferences about the population from the sample
  • show how likely it is that results are due to chance
  • can provide strength of evidence
  • e.g. estimates and hypothesis testing
4
Q

What are the two different types of statistics?

A

Descriptive and inferential

5
Q

What is a variable and what are the two main categories of variable?

A

A characteristic that can be measured and that can assume different values

  • categorical (refers to non quantifiable characteristic)
  • numeric (quantifiable characteristic- values are numbers)
6
Q

What are the different types of categorical variable?

A

Nominal- describes names, labels or categories that can’t be ordered e.g. colours, gender or yes/no (binary)

Ordinal- clear ordering of categories e.g. level of education, social class or a scale of disagree to agree

7
Q

What are the different types of numeric variables?

A

Discrete- countable, can only take distinct separate values e.g. number of cars, number of children or age in whole years

Continuous- measured on a continuum, can take any value within a range (limited only by the degree of accuracy) e.g. height, time, distance

8
Q

What’s ratio data?

A

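Numeric variable with a meaningful (true) zero point, where zero means none of the quantity. With an arbitrary zero (interval data) values below zero are possible and ratios aren’t meaningful

E.g. distance, height and temperature in kelvin (not Fahrenheit or Celsius, as their zero points are arbitrary and values can be negative)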

9
Q

What are independent and dependent variables?

A

Independent variable- the cause (intervention), value is independent of other variables in the study

Dependent variable- the effect (outcome), value changes depending on value of the independent variable (and maybe confounding/extraneous variables)

10
Q

How do variables relate to observations?

A
  • if you collected three pieces of info about each member of a group of participants, each piece of info (across all participants) would be a variable, e.g. if you asked everyone’s favourite colour, favourite colour would be a variable
  • an observation would be all three pieces of info about one individual participant
11
Q

What are pros and cons of bar charts?

A
  • used to compare groups
  • can be used to track changes over time
  • a single bar chart can’t show how a variable compares to anything else
  • can have bar charts showing multiple variables
12
Q

What can stacked bar charts be used for?

A
  • to find the relative decomposition of each primary bar based on a second variable
13
Q

What are the key features of bar charts?

A
  • use categorical data (counts)
  • the height of each bar is proportional to the value it represents
  • equal space between bars
  • X-axis could be anything
14
Q

What do histograms show and what are their key features?

A
  • shows frequency distribution
  • data grouped into continuous data ranges
  • each range corresponds to a bar
  • no space between bars (continuous)
  • X-axis should represent continuous numerical data
15
Q

What can line graphs do?

A
  • track changes over time
  • track multiple variables
16
Q

What can scatter plots show?

A
  • trends/correlation (relationship between variables) (and strength of the relationship)
  • can examine outliers
  • can draw line of best fit
  • each point is one observation

Positive correlation- variables increase or decrease together

Negative correlation- as one variable increases the other decreases

No correlation- no clear relationship between variables

17
Q

How can scatter plots show the strength of the relationship of two variables?

A
  • how close the points are to the line of best fit shows the strength of the relationship
  • can measure the strength of the relationship using the correlation coefficient r
18
Q

How does r show the strength of a relationship between two variables?

A

r ranges from -1 to 1

A value of 0 shows no correlation

A value of -1 shows perfect negative correlation

A value of 1 shows perfect positive correlation
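
A short sketch of how r could be computed, assuming r here means Pearson's correlation coefficient (the card does not name it). The function name and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (co-variation).
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(ss_x * ss_y)

hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # increases with hours
falling = [10, 8, 6, 4, 2]   # decreases as hours increase

print(pearson_r(hours, rising))   # perfect positive correlation, r close to 1
print(pearson_r(hours, falling))  # perfect negative correlation, r close to -1
```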

19
Q

What’s the relationship between correlation and causation?

A

Correlation does not equal causation.

With correlation, the direction of influence isn’t known (which variable is dependent and which is independent), nor whether the relationship is caused by a confounding variable.

To be causation, the direction must be known (which variable is causing the other) and the relationship must not be caused by a confounding variable (or coincidence).

20
Q

What’s a confounding variable?

A

A third variable (often unmeasured) that influences both the independent and dependent variables under investigation

21
Q

What are the two main categories of descriptive statistics?

A

Measures of central tendency- averages e.g. mean, median and mode

Measures of dispersion- spread e.g. range, interquartile range, standard deviation and variance

22
Q

What are the different measures of central tendency and why do you need different ones?

A
  • mean is the sum of all the values divided by the number of values
  • median is the middle value in an ordered data set and useful when there are outliers or data is skewed
  • mode is the value that occurs the most in the data set

Need all three as each is more applicable in different situations, e.g. median over mean if there are lots of outliers or the data is skewed, otherwise the average won’t be representative of the middle of the data set.
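
To illustrate the point about outliers, here is a small sketch using Python's standard `statistics` module. The income figures are made up for the example.

```python
import statistics

# Hypothetical incomes (in thousands); the last value is an outlier.
incomes = [20, 22, 22, 25, 27, 30, 200]

avg = statistics.mean(incomes)       # pulled upwards by the outlier
middle = statistics.median(incomes)  # middle value of the ordered data
common = statistics.mode(incomes)    # most frequent value

print(avg, middle, common)
# The mean (about 49.4) is far above the median (25), which hints at
# outliers or skew, so the median better represents a typical value here.
```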

23
Q

What might a very different median and mean show?

A

Might mean that the data has lots of outliers or is skewed.

24
Q

What would you see with averages of perfectly symmetrically distributed data?

A

Mean, median and mode would be the same

25
Q

What is a normal distribution?

A

A frequency curve often formed naturally by continuous variables e.g height.

A histogram of normally distributed data would show a bell shaped curve

26
Q

What are the different types of skewed distribution and how do they affect the averages?

A
  • negative skew, long left tail on histogram mean < median < mode
  • positive skew, long right tail on histogram mean > median > mode
27
Q

What’s range?

A
  • measure of dispersion (spread)
  • max value of data set - min value
  • small value means data values are closer together
  • heavily impacted by outliers
28
Q

What are the different measures of dispersion (spread)?

A
  • range
  • interquartile range
  • standard deviation
  • variance
29
Q

What’s interquartile range?

A
  • divide data into quartiles (4 parts of rank- ordered data set)
  • IQR = Q3- Q1

Q1= middle of first half of data
Q2 = median
Q3 = middle of second half of data

  • less impacted by data outliers
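
A sketch of the quartile method described on this card (Q1 and Q3 as the middles of the lower and upper halves of the ordered data). Other quartile conventions exist and can give slightly different values; the function name and data are hypothetical.

```python
from statistics import median

def quartiles_and_iqr(values):
    """Return (Q1, Q3, IQR) using the 'median of each half' method."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two values")
    half = len(data) // 2
    lower_half = data[:half]    # excludes the median when n is odd
    upper_half = data[-half:]
    q1 = median(lower_half)
    q3 = median(upper_half)
    return q1, q3, q3 - q1

typical = [1, 3, 5, 7, 9, 11, 13, 15]
with_outlier = [1, 3, 5, 7, 9, 11, 13, 150]

print(quartiles_and_iqr(typical))
print(quartiles_and_iqr(with_outlier))
# The range jumps from 14 to 149 with the outlier, but the IQR is unchanged.
```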
30
Q

What does a small interquartile range show?

A
  • means data is less spread out
  • however, it is affected by sample size- a smaller sample is more likely to have a smaller range
31
Q

How can you graphically show IQR?

A
  • box and whisker plot
  • also shows and can compare median
  • can also show skew and outliers
  • useful for comparing
32
Q

What’s standard deviation?

A
  • most meaningful for (roughly) normally distributed data
  • measures how far, on average, observed values are from the mean
  • quantifies variability within a sample
  • descriptive statistic
33
Q

What does high and low standard deviation show?

A

High = indicates data is more spread out

Low = shows data is clustered around the mean

34
Q

How can SD be used as a unit of measurement?

A
  • for normally distributed data, about 68% of values will be within one standard deviation (above and below) of the mean
  • about 95% will be within 2
  • about 99.7% will be within 3
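
These percentages can be checked against a standard normal distribution. A minimal sketch using `statistics.NormalDist` from the standard library; the helper name is hypothetical.

```python
from statistics import NormalDist

def share_within(k: float) -> float:
    """Proportion of a normal distribution within k SDs of the mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

The printed shares are roughly 68%, 95% and 99.7%; the rule only applies when the data are approximately normal.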
35
Q

What’s the standard error of the mean?

A
  • inferential statistic
  • assumes sample means are (roughly) normally distributed
  • can only be estimated unless the true population parameter is known (very unlikely)
  • also called standard error
  • shows how much the sample mean would vary if you were to repeat the study with a new sample (variability across multiple samples)
  • indicates how different population mean is likely to be from sample mean (how well the sample data represents the population)
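
The card does not give a formula; the usual estimate is SE = SD / sqrt(n), where SD is the sample standard deviation and n the sample size. A small sketch with made-up measurements:

```python
import math
import statistics

def standard_error(sample):
    """Estimated standard error of the mean: sample SD / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

# Hypothetical measurements from one sample.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
se = standard_error(sample)
print(round(se, 3))
```

Because n is in the denominator, a larger sample with the same spread gives a smaller SE.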
36
Q

What does high and low SE show?

A

High = sample means are widely spread, sample may not closely represent population

Low = sample means are closely distributed around the population mean, sample is representative of population

37
Q

What can be used to describe normally distributed and skewed data?

A

Normally distributed- mean, SD, SE, range (if SD and SE are reported it usually suggests the data were treated as normally distributed)

Skewed- median and IQR

38
Q

What can descriptive statistics be used for?

A
  • describe data, its shape and spread
39
Q

What can inferential statistics be used for?

A
  • make inferences from sample data to tell us something about the population
  • tell us how likely results are due to chance
  • can provide strength of evidence
40
Q

What’s incidence?

A

Number of new cases of a disease/injury/condition during a given time period

Incidence rate = number of new cases (during the period) / person-years at risk (during the same period)

Person-years = total amount of time each person in the population is at risk of the disease during the period, summed over everyone
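
A worked sketch of the incidence calculation, assuming person-years are summed across participants. The follow-up times and case count are invented for illustration.

```python
def incidence_rate(new_cases: int, person_years: float) -> float:
    """New cases divided by total person-years at risk."""
    return new_cases / person_years

# Hypothetical cohort: years each person was at risk during the study.
follow_up_years = [5, 5, 3, 2, 5]
person_years = sum(follow_up_years)   # 20 person-years in total
rate = incidence_rate(2, person_years)

print(rate)          # cases per person-year
print(rate * 1000)   # often reported per 1000 person-years
```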

41
Q

What’s prevalence?

A

Number of all cases of a disease/injury/condition at a given time

Number of cases in pop at one time/ total population at the same point in time
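
The prevalence formula above as a minimal sketch; the numbers are hypothetical.

```python
def prevalence(existing_cases: int, population: int) -> float:
    """All cases at one point in time divided by the population at that time."""
    return existing_cases / population

# Hypothetical survey: 50 people have the condition out of 1000.
p = prevalence(50, 1000)
print(f"{p:.1%}")
```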

42
Q

What are risk and odds?

A
  • both describe how likely an outcome is; comparing them between groups (RR, OR) measures the association between exposure and outcome

Risk - probability an outcome will occur

Odds - ratio between probability an outcome will occur and probability that it will not occur

43
Q

What’s relative risk and what do the values mean?

A

Relative risk (RR) = ratio of risk of event in treatment group vs risk of event in control group

RR > 1 = event is more likely with treatment

RR = 1 = no difference of event occurring with or without treatment

RR < 1 = event is less likely with treatment
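
A small sketch of the relative risk calculation from event counts. The function names and trial numbers are hypothetical.

```python
def risk(events: int, total: int) -> float:
    """Probability of the event in one group."""
    return events / total

def relative_risk(events_treatment, total_treatment, events_control, total_control):
    """Risk in the treatment group divided by risk in the control group."""
    return risk(events_treatment, total_treatment) / risk(events_control, total_control)

# Hypothetical trial: 10/100 events with treatment, 20/100 with control.
rr = relative_risk(10, 100, 20, 100)
print(rr)  # below 1, so the event is less likely with treatment
```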

44
Q

What’s absolute risk?

A

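Absolute risk (AR) is the probability of the event in one group: number of events in the group / number of people in the group. It is not the same as the risk ratio (relative risk), which compares the AR of two groups.

The increased or decreased risk with treatment is the difference between the AR in the treatment group and the AR in the control group, i.e. the cases extra to, or fewer than, those occurring naturally.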

45
Q

What’s the odds ratio (OR)?

A

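Ratio of the odds of an event in the treatment group to the odds of the event in the control group.
Uses the odds of the event happening, not the risk of it happening.

OR = odds of event in treatment group / odds of event in control group

OR = (a/b) / (c/d) = ad/bc

where a = treatment group with event, b = treatment group without event, c = control group with event, d = control group without event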
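
A sketch of the odds ratio from a 2x2 table, assuming the common labelling a = treatment with event, b = treatment without event, c = control with event, d = control without event. The counts are hypothetical.

```python
def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds of the event in the treatment group divided by odds in the control group."""
    odds_treatment = a / b   # events vs non-events with treatment
    odds_control = c / d     # events vs non-events with control
    return odds_treatment / odds_control

# Hypothetical trial: 10 events / 90 non-events with treatment,
# 20 events / 80 non-events with control.
or_value = odds_ratio(10, 90, 20, 80)
print(round(or_value, 3))  # below 1, so lower odds of the event with treatment
```

Note that the same counts give a relative risk of 0.5 but an odds ratio of about 0.44, so the two measures are not interchangeable.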

46
Q

What do the odds ratio values actually mean?

A

OR > 1. Odds of event among treatment group are greater than odds of event among control group. Treatment might be causing the event- increases odds of event happening.
(An exposure with OR > 1 might be a risk factor for the outcome.)

OR = 1. Odds of event same among treatment and control so treatment doesn’t increase odds of event

OR < 1. Odds of event among treatment lower than among control. Treatment reducing odds of event happening. (If an exposure could be a protective/preventative factor)

47
Q

How does OR relate to increased odds?

A

Increase/decrease in odds of an outcome is the difference between the OR and 1, e.g. OR = 1.5 means 50% higher odds.
Similar to how the relative increase/decrease in risk is the difference between the RR and 1

48
Q

Relationship between samples and probability.

A

If several samples are taken from the same population they will vary from each other and from the population (sampling error)

Sample probability is unlikely to be exactly the same as the true/population probability.

The larger the sample, the closer the sample probability will be to the true/population probability.

49
Q

Relation of chance to sample/population

A

Make inferences about population from sample but always possibility that effect is due to chance. (Why correlation doesn’t equal causation)

50
Q

What’s hypothesis testing, p values and alpha?

A

Hypothesis testing is when a hypothesis is created (the alternative hypothesis, H1), e.g. treatment A is better than treatment B, along with a null hypothesis (H0), e.g. there is no difference between the treatments. Then an experiment is conducted to collect data relevant to the hypothesis. A decision about the hypothesis can then be made based on how likely the data are if the null hypothesis is true.

The p value is the probability of getting results at least as extreme as those observed if the null hypothesis is true. It gives information about whether to reject the null hypothesis and whether the results could just be due to chance.

Alpha is an arbitrary cut-off that the p value must fall below to reject the null hypothesis (can be any number, often 0.05 but can be 0.01 or 0.001 etc.)

51
Q

How do the p value and alpha relate to rejecting the null hypothesis?

A

If the p value is below alpha then can reject the null hypothesis and accept the alternative hypothesis. Shows it’s unlikely the results are just due to chance.

If the p value is greater than alpha then the null hypothesis is not rejected (this is not proof that it is true) and it shows the results could plausibly be due to chance.
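
To make the comparison with alpha concrete, here is a sketch using an exact two-sided binomial test. This test is chosen only for illustration and is not named on the cards; H0 is that a coin is fair, and the data (16 heads in 20 flips) are invented.

```python
from math import comb

def two_sided_binomial_p(successes: int, trials: int) -> float:
    """Exact two-sided p value under H0: probability of success is 0.5."""
    # Use the more extreme side; the null distribution is symmetric.
    k = max(successes, trials - successes)
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * upper_tail)

alpha = 0.05                          # conventional cut-off, chosen in advance
p_value = two_sided_binomial_p(16, 20)

print(round(p_value, 4))
if p_value < alpha:
    print("reject H0: a result this extreme is unlikely if the coin is fair")
else:
    print("do not reject H0: the result could plausibly be due to chance")
```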

52
Q

Key points to do with p-values and papers.

A

P-values shouldn’t be used to arbitrarily divide results into significant and non-significant (they show more or less significance)

Papers should provide a precise p value not just how it relates to the arbitrary alpha value.

The smaller the p value the stronger the evidence to reject the null hypothesis.

53
Q

What do tables showing measures of correlation show?

A

Variables and r values

54
Q

What are p-values?

A
  • probability of results occurring if null hypothesis is true
  • a measure of the strength of evidence against the null hypothesis (for the alternative hypothesis)
  • a measure of how likely it is results are due to chance (how significant they are)
  • shows likelihood that result could have occurred under the null hypothesis. How likely you are to obtain the result if the null hypothesis is true.
55
Q

What’s a confidence interval?

A

A range of values that we are fairly sure contains the true value.

(Can only be fairly sure not certain unless true value is known, which it normally isn’t/impossible to calculate)

Usually more useful than a p-value

Calculated from the spread of results (standard deviation / standard error)

Uses a critical value Z for the chosen confidence level (e.g. Z = 1.96 for a 95% CI)
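
A sketch of a confidence interval for a mean, computed as mean ± Z × SE with Z taken from the normal distribution. This normal approximation is an assumption made for the example (small samples are usually handled with the t distribution instead), and the data are made up.

```python
import math
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, level=0.95):
    """Approximate CI for the mean: mean ± Z * (SD / sqrt(n))."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - z * se, mean + z * se

# Hypothetical measurements.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
low, high = mean_confidence_interval(sample)
print(round(low, 2), round(high, 2))
```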

56
Q

What’s a true value?

A

Not the value estimated from a sample but the value you would get if you could sample/use the entire population (often can’t)

57
Q

What does a small CI mean and how does it relate to sample size?

A

The narrower the CI, the more precise the estimate of the mean.
The bigger the sample size, the narrower the CI.

58
Q

What is CI in figures?

A
  • error bars
  • sometimes called confidence limits
  • good way of visualising uncertainty in points plotted on a graph
59
Q

What do overlapping error bars mean?

A

Means true value could overlap so there may not actually be a difference or the difference could even be opposite to suggested.

60
Q

What is attributable risk?

A

The excess risk of an event occurring due to a risk factor
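
Attributable risk is commonly calculated as the risk in the exposed group minus the risk in the unexposed group. A minimal sketch under that assumption, with invented counts:

```python
def attributable_risk(events_exposed, total_exposed, events_unexposed, total_unexposed):
    """Excess risk in the exposed group compared with the unexposed group."""
    risk_exposed = events_exposed / total_exposed
    risk_unexposed = events_unexposed / total_unexposed
    return risk_exposed - risk_unexposed

# Hypothetical cohort: 30/100 events with the risk factor, 10/100 without.
excess = attributable_risk(30, 100, 10, 100)
print(round(excess, 2))  # 20 extra cases per 100 people attributable to the risk factor
```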