Summarising Data Flashcards
Why should data be summarised?
- Data quality monitoring
- Data checking + cleaning - check for invalid/missing entries
- Baseline data in a study - describe characteristics of participants in study e.g. 1st table in many research articles - to set study + results in context
- Before doing a complex analysis - so it makes sense
What is quantitative data and the 2 types?
- data which can be measured numerically
- continuous or discrete
What is continuous data and give e.g.s?
- data lie on a continuum
- can take any value between 2 limits
- e.g. weight, height
What is a limitation of continuous data?
accuracy of data depends on accuracy of method of measurement so that some continuous data may be recorded as integers although that is an approx to true value
What is discrete data and e.g?
- data do not lie on a continuum
- can only take certain values, usually counts (integers)
- no. of children in a family
Why is weight a continuous variable and what is the limitation?
- it is measured using weighing scales
- lies on a continuum
- limitation is the accuracy of the scales
Why is the number of previous pregnancies in a pregnant woman discrete data?
- it is counted
- only whole numbers are possible
What is ordinal data and which type of data is always ordinal?
- the data values can be arranged in numerical order from the smallest to the largest
- Quantitative data are always ordinal
What are e.g.s of ordinal data?
- Questionnaire scale data - often counts, e.g. when adding the no. of +ve responses to a set of questions to get a total score.
- Categorical data may also have an inherent order, such as stage of disease.
What is an e.g. where continuous data can look discrete?
- because of the way they are measured and/or reported
- e.g. gestational age of babies is often reported in whole weeks, e.g. 38 weeks, so it appears to be discrete
- It is however continuous - could be reported to a greater degree of accuracy, e.g. as a decimal, such as 38.5 weeks
What are all continuous measurements limited by?
- the accuracy of the instrument used to measure them
- many quantities are reported in whole numbers for convenience, such as age and height
What is categorical data?
data where individuals fall into a number of separate categories or classes
Give e.g.s of categorical data
- gender: male or female = 2 classes
- disease status: alive or dead = 2 classes
- stage of cancer: I, II, III or IV = 4 classes
- marital status: married, single, divorced, widowed or legally separated = 5 classes
Give e.g.s of when categorical data can be ordinal?
- Different categories of categorical data may be assigned a number for coding purposes
- if there are several categories there may be an implied ordering, such as with stage of cancer where stage I is the least advanced and stage IV the most advanced.
What is dichotomous data and give e.g.?
- only 2 classes
- all individuals fall into one or other of the classes
- aka binary data
Is it possible to categorise continuous data?
- possible to re-classify continuous data into groups, for ease of reporting.
- e.g. it is common to report birthweight in bands, giving the numbers of babies who fall into each birthweight band.
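A minimal sketch of banding, assuming hypothetical birthweights in grams and illustrative cut-offs (neither the values nor the band limits come from these cards):

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical birthweights in grams (illustrative values only).
birthweights = [2450, 3100, 3620, 2980, 4150, 3340, 1890, 3505]

# Assumed band boundaries: <2500, 2500-2999, 3000-3499, 3500-3999, >=4000.
cutoffs = [2500, 3000, 3500, 4000]
labels = ["<2500", "2500-2999", "3000-3499", "3500-3999", ">=4000"]

def band(weight):
    """Return the label of the band that a single weight falls into."""
    return labels[bisect_right(cutoffs, weight)]

# Number of babies in each birthweight band.
band_counts = Counter(band(w) for w in birthweights)
for label in labels:
    print(label, band_counts[label])
```

The original weights stay untouched, so they remain available for the analysis itself.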
What are the consequences of dichotomising?
- lots of info + statistical power lost in the analysis.
- nature of any relationships may be masked, e.g. a curved relationship may appear weaker once the data are categorized
- if relationship was U-shaped, categorization may totally obscure it
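A small illustration of the U-shaped case, using made-up exposure values and an artificial, perfectly U-shaped outcome (both are assumptions for the sketch):

```python
from statistics import mean

# Hypothetical exposure values and an artificial U-shaped outcome.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [(xi - 4.5) ** 2 for xi in x]

# Dichotomise the exposure at its midpoint: "low" (x <= 4) vs "high" (x > 4).
low = [yi for xi, yi in zip(x, y) if xi <= 4]
high = [yi for xi, yi in zip(x, y) if xi > 4]

# The two groups have identical mean outcomes, so the comparison of
# low vs high shows no relationship even though y clearly depends on x.
print(mean(low), mean(high))
```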
Why is it better if continuous data are re-classified into several groups?
- effect on statistical power is less than when dichotomizing.
- Grouping causes no problem if re-classification done simply to present summary statistics but the original data are used in the analysis
- Sometimes can be useful when examining a non-linear relationship. The analysis may be more straightforward and more meaningful
How can continuous data be summarised?
- a measure of the centre of the data distribution
- measure of the variability of the data.
What are measures of centre of data?
- Mean
- Median
What are measures of variability of data?
- Standard deviation (variance)
- Range (minimum, maximum)
- Interquartile range
What is the mean and how is it calc?
- simple average of all the data:
- sum of all values divided by the total number of values aka the arithmetic mean.
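As a quick worked example, with hypothetical weights in kg:

```python
# Hypothetical sample of weights in kg (illustrative values only).
weights = [61.0, 72.5, 58.0, 80.5, 68.0]

# Arithmetic mean: sum of all values divided by the number of values.
arithmetic_mean = sum(weights) / len(weights)
print(arithmetic_mean)
```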
What is the median and what is it when there is an even/odd no. of values in the sample?
- the middle value when the data are arranged in ascending order of size. (n + 1)/2 = pos of median
- odd no. of values in the sample: median will be the value with the same number of values both bigger than it and smaller than it.
- even number of values: there will be two middle values and the median will be the mean of the two.
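The odd and even cases can be sketched as follows; the function name and the sample values are illustrative, and a non-empty sample is assumed:

```python
def median(values):
    """Median via the (n + 1)/2 position in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2              # 1-based position of the median
    if n % 2 == 1:
        # Odd n: the position is a whole number, so take that value.
        return ordered[int(position) - 1]
    # Even n: the position falls halfway between the two middle values,
    # so take their mean.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(median([7, 3, 9, 1, 5]))       # odd sample: middle of 1, 3, 5, 7, 9
print(median([7, 3, 9, 1, 5, 11]))   # even sample: mean of 5 and 7
```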
What is standard deviation and what does it indicate?
- a measure of the average difference between the mean and each data value.
- indicates how dispersed the data are
How is SD calc?
- square root of the variance.
How is variance calc?
summing the squared differences between the overall mean and each value and then dividing by the number of values minus one.
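A step-by-step sketch of the sample variance and SD, using a small hypothetical data set:

```python
from math import sqrt

# Hypothetical sample (illustrative values only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n
# Sum of squared differences between each value and the overall mean.
sum_sq = sum((x - mean) ** 2 for x in data)
variance = sum_sq / (n - 1)   # divide by the number of values minus one
sd = sqrt(variance)           # SD is the square root of the variance

print(variance, sd)
```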
What is the advantage of the standard deviation over the variance?
it is in the same units as the original data and so is easier to interpret
What happens to the equation when the whole population variance is calculated?
- a different denominator is used; we divide by n (but this almost never happens)
- Since we virtually always have a sample, the SD is obtained by dividing by n-1 because it can be shown to give a more accurate estimate of the population standard deviation
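To see the effect of the two denominators, Python's standard `statistics` module offers both versions; the data below are hypothetical:

```python
from statistics import pstdev, stdev

# Hypothetical data (illustrative values only).
data = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = pstdev(data)  # treats data as the whole population: divides by n
sample_sd = stdev(data)       # treats data as a sample: divides by n - 1

print(population_sd, sample_sd)
```

The n - 1 version is always slightly larger, and the gap shrinks as the sample size grows.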
What is the range?
- diff between smallest + largest value
- usually expressed as min + max (sometimes actually diff between 2 values shown but is not as good)
What is interquartile range?
- range of values that includes middle 50% values
- bounded by upper + lower quartile
How are lower + upper quartile calc?
- Lower: rank the data as for the median, then take the value below which the bottom 25% of data sit; pos = (n + 1)/4
- Upper: rank the data as for the median, then take the value above which the top 25% of data sit; pos = 3(n + 1)/4
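A sketch of the position formulas. The cards do not say what to do when a position is not a whole number; linear interpolation between the two neighbouring values is assumed here, and the function names are illustrative:

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in sorted data.

    Fractional positions are resolved by linear interpolation between the
    two neighbouring values (an assumption, not stated on the cards).
    """
    whole = int(position)
    fraction = position - whole
    if whole < 1:
        return ordered[0]
    if whole >= len(ordered):
        return ordered[-1]
    below, above = ordered[whole - 1], ordered[whole]
    return below + fraction * (above - below)

def quartiles(values):
    """Return (lower quartile, upper quartile) using the (n + 1) positions."""
    ordered = sorted(values)
    n = len(ordered)
    lower = value_at_position(ordered, (n + 1) / 4)
    upper = value_at_position(ordered, 3 * (n + 1) / 4)
    return lower, upper

data = [1, 3, 5, 7, 9, 11, 13]   # n = 7, so the positions are 2 and 6
lq, uq = quartiles(data)
iqr = uq - lq                    # interquartile range
print(lq, uq, iqr)
```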
Which summary measure of centre of dis should be used for continuous data with symmetric distribution?
Arithmetic mean
Which summary measure of centre of dis should be used for continuous data with +vely skewed distribution?
geometric/harmonic mean (but these don’t allow for 0 values)
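A sketch of the geometric mean computed via logarithms, on hypothetical positively skewed values. Since the log of 0 is undefined, this route cannot handle zero values:

```python
from math import exp, log

# Hypothetical positively skewed values (illustrative only).
values = [1, 10, 100]

arithmetic_mean = sum(values) / len(values)
# Geometric mean: average the logs, then transform back.
geometric_mean = exp(sum(log(v) for v in values) / len(values))

# The arithmetic mean is pulled up by the large value; the geometric mean
# sits closer to the bulk of the data.
print(arithmetic_mean, geometric_mean)
```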
Which summary measure of centre of dis should be used for continuous data with skewed distribution?
Median
Which summary measure of centre of dis should be used for discrete data?
- Median unless range of data is large enough to make calc of mean sensible
- e.g. no. of children in a family is discrete; the mean is sometimes calc, e.g. 2.4 children, but may be diff to interpret
Which summary measure of spread of dis should be used for continuous data?
- SD
- range often useful if there’s room to present it
Which summary measure of spread of dis should be used for continuous data with skew (unsymm)?
IQR
How can unordered/nominal data be summarised?
- using freq in each category together with either overall prop/%
- complete set of freq is freq dis
How can ordered/ordinal data be summarised?
- using freq + % but can also calc cumulative freq + % which is useful to show % below certain cut-off
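A short sketch of cumulative freq + %, using made-up counts for stage of cancer:

```python
from itertools import accumulate

# Hypothetical frequencies for an ordered categorical variable.
stages = ["I", "II", "III", "IV"]
frequencies = [40, 30, 20, 10]

total = sum(frequencies)
cumulative = list(accumulate(frequencies))          # running totals
percentages = [100 * f / total for f in frequencies]
cumulative_percentages = [100 * c / total for c in cumulative]

for row in zip(stages, frequencies, percentages, cumulative, cumulative_percentages):
    print(*row)
```

Reading off the cumulative % for stage II gives the % of individuals at stage II or below.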
What is a histogram and its features?
- diagram which shows dis of data by plotting data in rectangles (bins) corresponding to categories along x axis
- rectangles have heights/areas prop to freq (no.) in the categories
- y axis is freq per interval
How are the bins interpreted in a histogram?
- if widths of bins are same, height of each rectangle prop to its freq but if not then area ind freq
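With unequal bin widths, the height to plot is the freq divided by the bin width, so that area equals freq. A sketch with hypothetical age bands:

```python
# Hypothetical age bands (years) with unequal widths, and their frequencies.
bins = [(0, 10), (10, 20), (20, 50)]
frequencies = [20, 30, 30]

# Height of each rectangle: freq per unit of x (freq divided by bin width).
heights = [f / (upper - lower) for (lower, upper), f in zip(bins, frequencies)]

# Area (height x width) recovers the original freq for every bin.
areas = [h * (upper - lower) for (lower, upper), h in zip(bins, heights)]

print(heights)
print(areas)
```

The last two bins hold the same freq, but the wider one is drawn a third as tall.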
What does a histogram show?
- shape of dis
- range
- middle
What does a box & whisker plot show?
- median: horizontal line in box
- UQ: top edge of box
- LQ: lower edge of box
- max: top of whisker
- min: bottom of whisker
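The five values a box & whisker plot draws can be computed as below. The data are hypothetical, and `statistics.quantiles` with `method="exclusive"` is used on the assumption that the quartiles are defined by (n + 1)-based positions:

```python
from statistics import quantiles

# Hypothetical data (illustrative values only).
data = [1, 3, 5, 7, 9, 11, 13]

# Three cut points splitting the data into four equal parts.
lq, median, uq = quantiles(data, n=4, method="exclusive")

summary = {
    "min": min(data),    # bottom of whisker
    "LQ": lq,            # lower edge of box
    "median": median,    # line inside box
    "UQ": uq,            # top edge of box
    "max": max(data),    # top of whisker
}
print(summary)
```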
What can the shape of dis show about the data?
- central values
- extreme values
- where bulk of data lie
What is +vely skewed data?
- tail on RHS longer than tail on left
- large no. of indiv have low data values and few indiv have v.high values which stretches right tail
- e.g. alcohol intake in pregnancy, chol, weight, blood pressure
What is -vely skewed data?
- tail on LHS longer than tail on right
- e.g. gestational age - pre-term births stretch the left tail, while the right tail is short because of the clinical practice of induction beyond 40 weeks + the limiting size of mother/foetus
How are bar charts presented?
- each category given its own bar along x axis
- height of each prop to freq of observations
What are advs of bar charts?
- show freq /% in each category
- may be quicker to absorb than a table
What do pie charts show?
- dis of indiv in diff categories of variable where every indiv belongs to only 1 category
- each category given an area/slice of graph
- area of each slice prop to freq of observations within that category + calc by div whole pie (360 degrees) into slices
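A sketch of how the 360 degrees are divided, using made-up counts for marital status:

```python
# Hypothetical frequencies per category (illustrative values only).
counts = {
    "married": 50,
    "single": 30,
    "divorced": 10,
    "widowed": 6,
    "legally separated": 4,
}

total = sum(counts.values())
# Each slice's angle is its share of the total, applied to 360 degrees.
angles = {category: 360 * freq / total for category, freq in counts.items()}

for category, angle in angles.items():
    print(category, angle)
```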
What is adv + disadv of pie charts?
- comparison of prop in diff pop groups
- hard to judge size of angle so can't judge figs/prop
How is data shown on normal dis graph and its use?
- 95% data lies within mean +/- 2xSD
- approx 68% data lies within mean +/- 1xSD
- normal dis used for normal ranges where you expect normal, healthy values to lie
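A sketch of a normal range as mean +/- 2xSD. The mean and SD are made-up values for some measurement in healthy individuals, and `statistics.NormalDist` is used only to check how much of a normal dis the range covers:

```python
from statistics import NormalDist

# Hypothetical mean and SD (illustrative values only).
mean, sd = 5.0, 0.5

# Normal range: mean +/- 2 x SD.
lower, upper = mean - 2 * sd, mean + 2 * sd

# Proportion of a normal dis with this mean and SD lying inside the range.
dist = NormalDist(mean, sd)
coverage = dist.cdf(upper) - dist.cdf(lower)

print(lower, upper, round(coverage, 3))
```

The coverage comes out at roughly 95%, which is why values outside the range are treated as unusual.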