Data analysis Flashcards
What does N1 = n/(1-d) calculate?
Equation to work out necessary sample size accounting for drop out rate
N1 = adjusted sample size
n= required sample size
d= drop out rate
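The equation above can be sketched in Python. This is a minimal illustration only: the function name `adjusted_sample_size` and the example numbers are hypothetical, and rounding up to a whole participant is an assumption (the card does not say how to round).

```python
import math


def adjusted_sample_size(n: int, d: float) -> int:
    """Return N1 = n / (1 - d), rounded up to a whole participant.

    n is the required sample size and d is the expected drop-out
    rate as a proportion (0.15 means 15%). Rounding up is an assumption.
    """
    if not 0 <= d < 1:
        raise ValueError("drop-out rate d must be at least 0 and below 1")
    return math.ceil(n / (1 - d))


# Hypothetical example: 100 participants needed, 15% expected to drop out.
print(adjusted_sample_size(100, 0.15))  # 118
```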
What are descriptive statistics?
- summarise data
- can show averages and spread
- can show associations and correlations
- e.g tables, graphs and numbers
What are inferential statistics?
- make inferences about population from the sample
- show how likely it is that results are due to chance
- can provide strength of evidence
- e.g estimates and hypothesis testing
What are the two different types of statistics?
Descriptive and inferential
What is a variable and what are the two main categories of variable?
A characteristic that can be measured and that can assume different values
- categorical (refers to non quantifiable characteristic)
- numeric (quantifiable characteristic- values are numbers)
What are the different types of categorical variable?
Nominal- describes names, labels or categories that can’t be ordered e.g colours, gender or yes/no (binary)
Ordinal- clear ordering of categories e.g level of education, social class or a scale of disagree to agree
What are the different types of numeric variables?
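Discrete- countable, takes separate distinct values (usually whole numbers) e.g number of cars, number of children or age in completed years
Continuous- measured on a continuum, can take any value within a range (limited only by the degree of accuracy of the measurement) e.g height, time, distance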
What’s ratio data?
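Numeric variable with a meaningful (true) zero point, where zero means none of the quantity is present, so ratios make sense (e.g 10 m is twice 5 m). An arbitrary zero is one that doesn’t mean absence of the quantity, which is why values below it are possible
E.g distance, height and temperature in kelvin (not Fahrenheit or Celsius, as their zero points are arbitrary and negative values are possible)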
What are independent and dependent variables?
Independent variable- the cause (intervention), value is independent of other variables in the study
Dependent variable- the effect (outcome), value changes depending on value of the independent variable (and maybe confounding/extraneous variables)
How do variables relate to observations?
- a variable is one piece of information collected from every participant, e.g if you asked everyone’s favourite colour, favourite colour would be a variable
- an observation is all the information collected about one individual participant, e.g if you collected three pieces of info from each participant, the observation would be all three for that participant
What are pros and cons of bar charts?
- used to compare groups
- can be used to track changes over time
- simple (single-series) bar charts can’t show how a variable compares with any other variable
- can have bar charts showing multiple variables
What can stacked bar charts be used for?
- to find the relative decomposition of each primary bar based on a second variable
What are the key features of bar charts?
- use categorical data (counts)
- the height (or length) of each bar is proportional to the value it represents
- equal space between bars
- X-axis shows the categories, which can be any grouping
What do histograms show and their key features?
- shows frequency distribution
- data grouped into continuous data ranges
- each range corresponds to a bar
- no space between bars (continuous)
- X-axis should represent continuous numerical data
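As a rough sketch of how data is grouped into continuous ranges for a histogram, the Python below counts values in equal-width ranges. The function name `histogram_counts`, the range width and the heights are all hypothetical, and equal-width ranges are an assumption.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each equal-width range.

    Range i covers start + i*width up to (but not including)
    start + (i+1)*width, so the ranges touch with no gaps between them.
    Values outside all ranges are ignored.
    """
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Hypothetical heights in cm, grouped into 150-160, 160-170 and 170-180.
heights = [150.5, 152.0, 158.3, 161.2, 164.9, 165.0, 171.4, 178.8]
print(histogram_counts(heights, start=150, width=10, n_bins=3))  # [3, 3, 2]
```

Each count would be the height of one bar.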
What can line graphs do?
- track changes over time
- track multiple variables
What can scatter plots show?
- trends/correlation (relationship between variables) (and strength of the relationship)
- can examine outliers
- can draw line of best fit
- each point is one observation
Positive correlation- variables increase or decrease together
Negative correlation- as one variable increases the other decreases
No correlation- no clear relationship between variables
How can scatter plots show the strength of the relationship of two variables?
- how close the points are to the line of best fit shows the strength of the relationship
- can measure the strength of the relationship using the correlation coefficient, r
How does r show the strength of a relationship between two variables?
r ranges from -1 to 1
A value of 0 shows no (linear) correlation
A value of -1 shows perfect negative correlation
A value of 1 shows perfect positive correlation
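A minimal sketch of how r could be calculated for paired observations, assuming r here means Pearson's correlation coefficient (the cards do not name it). The function name `pearson_r` and the data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for paired values."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together, and how each varies on its own.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    if variation_x == 0 or variation_y == 0:
        raise ValueError("r is undefined when one variable never changes")
    return co_variation / math.sqrt(variation_x * variation_y)


# Hypothetical data: y doubles x, so the points lie exactly on a rising line.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
# Reversing y puts the points exactly on a falling line.
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0
```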
What’s the relationship between correlation and causation?
Correlation does not equal causation.
With correlation, it’s not known which variable is influencing the other (which is independent and which is dependent) or whether the relationship is caused by a confounding variable.
To show causation, the direction of the effect must be known (which variable is causing the change in the other) and the relationship must not be explained by a confounding variable (or coincidence).
What’s a confounding variable?
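A variable, often unmeasured, that influences both the independent and dependent variables under investigation and so can create or distort an apparent relationship between them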
What are the two main categories of descriptive statistics?
Measures of central tendency- averages e.g mean median and mode
Measures of dispersion- spread e.g range/interquartile range, standard deviation and variance
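The measures of dispersion named above could be computed with Python's standard `statistics` module, as in this sketch. The function name `describe_spread` and the scores are hypothetical. Treating the data as a whole population (`pvariance`, `pstdev`) is an assumption; for a sample, `variance` and `stdev` would be used instead. Quartile conventions also differ between tools, so the interquartile range may not match other software exactly.

```python
import statistics


def describe_spread(data):
    """Return the range, interquartile range, variance and standard deviation."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _median, q3 = statistics.quantiles(data, n=4)
    return {
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        "variance": statistics.pvariance(data),  # population variance
        "std_dev": statistics.pstdev(data),      # square root of the variance
    }


# Hypothetical scores: range is 7, variance is 4, standard deviation is 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
spread = describe_spread(scores)
print(spread)
```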
What are the different measures of central tendency and why do you need different ones?
- mean is the sum of all the values divided by the number of values
- median is the middle value in an ordered data set and useful when there are outliers or data is skewed
- mode is the value that occurs the most in the data set
Need all three as each is more applicable in different situations, e.g use the median over the mean if there are lots of outliers or the data is skewed, otherwise the average won’t be representative of the middle of the data set.
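A short sketch of the three averages using Python's standard `statistics` module. The function name `averages` and the waiting times are hypothetical; the data includes one deliberate outlier to show how it pulls the mean away from the median.

```python
import statistics


def averages(data):
    """Return the (mean, median, mode) of a list of numbers."""
    return statistics.mean(data), statistics.median(data), statistics.mode(data)


# Hypothetical waiting times in minutes; 200 is an outlier.
waiting_times = [20, 22, 22, 25, 26, 200]
mean, median, mode = averages(waiting_times)
print(mean)    # 52.5, dragged upwards by the outlier
print(median)  # 23.5, still in the middle of the typical values
print(mode)    # 22, the most common value
```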
What might a very different median and mean show?
Might mean that the data has lots of outliers or is skewed.
What would you see with averages of perfectly symmetrically distributed data?
Mean and median would be the same, and the mode would match them too if the data has a single peak (unimodal)
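A quick check of this using Python's standard `statistics` module, on a small made-up data set that is symmetric around a single peak at 3 (the values are hypothetical):

```python
import statistics

# Hypothetical data, symmetric around one peak at 3.
symmetric_data = [1, 2, 3, 3, 3, 4, 5]

mean = statistics.mean(symmetric_data)
median = statistics.median(symmetric_data)
mode = statistics.mode(symmetric_data)

# All three averages land on the centre of the distribution.
print(mean == median == mode)  # True
```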