Summarising Data Flashcards by Annabel Mensah-Djan

Why should data be summarised?

Data quality monitoring
• Data checking + cleaning - check for invalid/missing entries
• Baseline data in a study - describe characteristics of participants in study e.g. 1st table in many research articles - to set study + results in context
• Before doing a complex analysis - so it makes sense

What is quantitative data and the 2 types?

data which can be measured numerically

- continuous or discrete

What is continuous data and give e.g.s?

data lie on a continuum
can take any value between 2 limits
e.g. weight, height

What is a limitation of continuous data?

accuracy of the data depends on the accuracy of the method of measurement, so some continuous data may be recorded as integers even though that is only an approx to the true value

What is discrete data and e.g?

data do not lie on a continuum
can only take certain values, usually counts (integers)
e.g. no. of children in a family

Why is weight a continuous variable and what is the limitation?

it is measured using weighing scales
lies on a continuum
limitation is the accuracy of the scales

Why is the number of previous pregnancies in a pregnant woman discrete data?

it is counted

- only whole numbers are possible

What is ordinal data and which type of data is always ordinal?

- the data values can be arranged in numerical order from the smallest to the largest
- quantitative data are always ordinal

What are e.g.s of ordinal data?

Questionnaire scale data - often counts, e.g. when adding the no. of +ve responses to a set of questions to get a total score.
Categorical data may also have an inherent order, such as stage of disease.

What is an e.g. where continuous data can look discrete?

continuous data can look discrete because of the way they are measured and/or reported.
e.g. gestational age of babies is often reported in whole weeks, e.g. 38 weeks, so it appears to be discrete.
It is however continuous - it could be reported to a greater degree of accuracy, e.g. as a decimal, such as 38.5 weeks

What are all continuous measurements limited by?

the accuracy of the instrument used to measure them
many quantities, such as age and height, are reported in whole numbers for convenience

What is categorical data?

data where individuals fall into a number of separate categories or classes

Give e.g.s of categorical data

gender: male or female = 2 classes
disease status: alive or dead = 2 classes
stage of cancer: I, II, III or IV = 4 classes
marital status: married, single, divorced, widowed or legally separated = 5 classes

Give e.g.s of when categorical data can be ordinal?

Different categories of categorical data may be assigned a number for coding purposes
if there are several categories there may be an implied ordering, such as with stage of cancer where stage I is the least advanced and stage IV the most advanced.

What is dichotomous data and give e.g.?

only 2 classes
all individuals fall into one or other of the classes
aka binary data.

Is it possible to categorise continuous data?

possible to re-classify continuous data into groups, for ease of reporting.
e.g. it is common to report birthweight in bands, giving the numbers of babies who fall into each birthweight band.
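
The banding described above can be sketched in Python as follows. This is only an illustration: the birthweights and the band limits are hypothetical values chosen for the example, not taken from any study.

```python
from collections import Counter

# Hypothetical birthweights in grams (illustrative values only).
birthweights = [2450, 3100, 3620, 2980, 4150, 3330, 1890, 3510]

# Assumed band limits: (lower limit inclusive, upper limit exclusive, label).
bands = [
    (0, 2500, "<2500 g"),
    (2500, 3000, "2500-2999 g"),
    (3000, 3500, "3000-3499 g"),
    (3500, 4000, "3500-3999 g"),
    (4000, float("inf"), ">=4000 g"),
]


def band_for(weight):
    """Return the label of the band that contains this weight."""
    for lower, upper, label in bands:
        if lower <= weight < upper:
            return label
    raise ValueError(f"no band for weight {weight}")


# Number of babies falling into each birthweight band.
counts = Counter(band_for(w) for w in birthweights)
for _, _, label in bands:
    print(f"{label:>12}: {counts[label]}")
```

The original weights are kept in `birthweights`, so they remain available for analysis; only the reporting uses the bands.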

What are the consequences of dichotomising?

lots of info + statistical power lost in the analysis.
nature of any relationships may be masked, e.g. if the relationship was curved, it may appear weaker once the data are categorized
if the relationship was U-shaped, categorization may totally obscure it
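
A minimal sketch of how dichotomising can hide a U-shaped relationship. The data are made up for illustration (y is simply x squared), and the cut-off at 0 is an assumption.

```python
from statistics import mean

# Made-up data with a perfectly U-shaped relationship: y = x squared.
x = [-3, -2, -1, 1, 2, 3]
y = [xi ** 2 for xi in x]

# Dichotomise x at an assumed cut-off of 0.
y_low_group = [yi for xi, yi in zip(x, y) if xi < 0]
y_high_group = [yi for xi, yi in zip(x, y) if xi >= 0]

mean_low = mean(y_low_group)
mean_high = mean(y_high_group)

# Both groups have the same mean y, so the two-group comparison
# suggests no relationship even though y depends entirely on x.
print(mean_low, mean_high)
```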

Why is it better if continuous data are re-classified into several groups?

effect on statistical power is less than when dichotomizing.
Grouping causes no problem if re-classification done simply to present summary statistics but the original data are used in the analysis
Sometimes can be useful when examining a non-linear relationship. The analysis may be more straightforward and more meaningful

How can continuous data be summarised?

a measure of the centre of the data distribution

- measure of the variability of the data.

What are measures of centre of data?

Mean

* Median

What are measures of variability of data?

Standard deviation (variance)
Range (minimum, maximum)
Interquartile range

What is the mean and how is it calc?

simple average of all the data:

- sum of all values divided by the total number of values aka the arithmetic mean.

What is the median and what is it when there is an even/odd no. of values in the sample?

the middle value when the data are arranged in ascending order of size; pos of median in the ordered data = (n + 1)/2
odd no. of values in the sample: median will be the value with the same number of values both bigger than it and smaller than it.
even number of values: there will be two middle values and the median will be the mean of the two.
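
The mean and median rules above can be written out as a short Python sketch. The sample values are hypothetical; the function names are just illustrative.

```python
def mean(values):
    """Arithmetic mean: sum of all values divided by the number of values."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered data, at 1-based position (n + 1)/2."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    middle = n // 2  # 0-based index of the upper middle value
    if n % 2 == 1:
        # Odd n: position (n + 1)/2 is a whole number, so one middle value.
        return ordered[middle]
    # Even n: position (n + 1)/2 falls between two values; take their mean.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_sample = [7, 1, 5, 3, 9]   # hypothetical data, n = 5
even_sample = [7, 1, 5, 3]     # hypothetical data, n = 4

print(mean(odd_sample), median(odd_sample))
print(mean(even_sample), median(even_sample))
```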

What is standard deviation and what does it indicate?

a measure of the average difference between the mean and each data value.
indicates how dispersed the data are

How is SD calc?

- square root of the variance.

How is variance calc?

summing the squared differences between the overall mean and each value and then dividing by the number of values minus one.

What is the advantage of the standard deviation over the variance?

it is in the same units as the original data and so is easier to interpret

What happens to the equation when the whole population variance is calculated?

- a different denominator is used: we divide by n (but this almost never happens)
- since we virtually always have a sample, the SD is obtained by dividing by n - 1, because this can be shown to give a better (less biased) estimate of the population value
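
A small Python sketch of the variance and SD calculation, showing both denominators (n - 1 for a sample, n for a whole population). The data values and the `sample` flag are assumptions made for the example.

```python
import math


def variance(values, sample=True):
    """Sum of squared differences from the mean, divided by n - 1 (sample) or n (population)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    centre = sum(values) / n
    sum_of_squares = sum((v - centre) ** 2 for v in values)
    return sum_of_squares / (n - 1 if sample else n)


def standard_deviation(values, sample=True):
    """Square root of the variance, so it is in the same units as the data."""
    return math.sqrt(variance(values, sample=sample))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5

print(variance(data), standard_deviation(data))                              # divides by n - 1
print(variance(data, sample=False), standard_deviation(data, sample=False))  # divides by n
```

Dividing by n - 1 always gives a slightly larger result than dividing by n, and the difference shrinks as n grows.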

What is the range?

- diff between smallest + largest value
- usually expressed as min + max (sometimes the actual diff between the 2 values is shown, but this is not as good)

What is interquartile range?

- range of values that includes the middle 50% of values
- bounded by upper + lower quartile

How are lower + upper quartile calc?

- Lower: rank the data as for the median, then take the value below which the bottom 25% of data sit; pos = (n + 1)/4
- Upper: rank the data as for the median, then take the value above which the top 25% of data sit; pos = 3(n + 1)/4
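
A sketch of the quartile positions in Python. The positions (n + 1)/4 and 3(n + 1)/4 come from the card above; linear interpolation between neighbouring values when a position is not a whole number is an assumption, as other conventions exist. The data are hypothetical.

```python
def value_at_position(ordered, position):
    """Value at a 1-based (possibly fractional) position in ordered data.

    Fractional positions are handled by linear interpolation (an assumption).
    """
    whole = int(position)
    fraction = position - whole
    lower_index = whole - 1                              # convert to 0-based index
    upper_index = min(lower_index + 1, len(ordered) - 1)
    return ordered[lower_index] + fraction * (ordered[upper_index] - ordered[lower_index])


def quartiles(values):
    """Return (lower quartile, upper quartile) using the (n + 1)/4 rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three values")
    lower = value_at_position(ordered, (n + 1) / 4)
    upper = value_at_position(ordered, 3 * (n + 1) / 4)
    return lower, upper


data = [13, 1, 9, 5, 3, 11, 7]   # hypothetical values, n = 7
lower_q, upper_q = quartiles(data)
iqr = upper_q - lower_q          # interquartile range
print(lower_q, upper_q, iqr)
```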

Which summary measure of centre should be used for continuous data with a symmetric distribution?

Arithmetic mean

Which summary measure of centre should be used for continuous data with a +vely skewed distribution?

geometric/harmonic mean (but these don't allow for 0 values)
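
A minimal sketch of the geometric mean, computed as the exponential of the mean of the logs. The data are hypothetical. Because the log of 0 is undefined, the function rejects zero (and negative) values, which is the restriction noted above.

```python
import math


def geometric_mean(values):
    """Geometric mean via logs; only defined here for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs all values > 0")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)


skewed = [1, 10, 100]  # hypothetical positively skewed values

arithmetic = sum(skewed) / len(skewed)
geometric = geometric_mean(skewed)

# The arithmetic mean is pulled towards the large value; the geometric mean is not.
print(arithmetic, geometric)
```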

Which summary measure of centre should be used for continuous data with a skewed distribution?

Median

Which summary measure of centre should be used for discrete data?

- Median, unless range of data is large enough to make calc of mean sensible
- e.g. no. of children in a family is discrete; sometimes a mean is calc, e.g. 2.4 children, but this may be difficult to interpret

Which summary measure of spread should be used for continuous data?

- SD
- range often useful if there's room to present it

Which summary measure of spread should be used for continuous data with a skewed (asymmetric) distribution?

IQR

How can unordered/nominal data be summarised?

- using freq in each category together with either the overall prop or %
- complete set of freq is the freq distribution

How can ordered/ordinal data be summarised?

- using freq + %
- can also calc cumulative freq + %, which is useful to show the % below a certain cut-off
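
A sketch of a frequency table with cumulative frequencies and percentages for ordered categories. The stage-of-cancer values are invented for illustration.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical stage of cancer for 10 patients.
stages = ["I", "II", "I", "III", "II", "I", "IV", "II", "I", "III"]
order = ["I", "II", "III", "IV"]  # categories listed in their natural order

counts = Counter(stages)
freqs = [counts[stage] for stage in order]
total = sum(freqs)

percents = [100 * f / total for f in freqs]
cum_freqs = list(accumulate(freqs))               # running total of frequencies
cum_percents = [100 * c / total for c in cum_freqs]

for stage, f, p, cf, cp in zip(order, freqs, percents, cum_freqs, cum_percents):
    print(f"{stage:>3}  freq={f}  %={p:.0f}  cum freq={cf}  cum %={cp:.0f}")
```

The cumulative % for a category gives the % of individuals at or below that category, which only makes sense because the categories are ordered.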

What is a histogram and its features?

- diagram which shows distribution of data by plotting data in rectangles (bins) corresponding to categories along x axis
- rectangles have heights/areas prop to freq (no.) in the categories
- y axis is freq per interval

How are the bins interpreted in a histogram?

- if widths of bins are the same, height of each rectangle is prop to its freq
- if widths differ, the area of each rectangle indicates its freq
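
For unequal bin widths, one way to make area represent frequency is to plot frequency per unit width as the height. A small sketch with hypothetical bin limits and frequencies:

```python
# Hypothetical bin limits (unequal widths) and the frequency in each bin.
bin_edges = [0, 10, 20, 40, 80]
frequencies = [5, 12, 16, 8]

# Width of each bin from consecutive limits.
widths = [upper - lower for lower, upper in zip(bin_edges, bin_edges[1:])]

# Height = frequency per unit width, so that height x width gives back the frequency.
heights = [f / w for f, w in zip(frequencies, widths)]
areas = [h * w for h, w in zip(heights, widths)]

for lower, upper, f, h in zip(bin_edges, bin_edges[1:], frequencies, heights):
    print(f"{lower}-{upper}: freq={f}, height={h}")
```

Here the 20-40 bin has the largest frequency but not the tallest rectangle, because its frequency is spread over a wider interval.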

What does a histogram show?

- shape of distribution
- range
- middle

What does a box & whisker plot show?

- median: horizontal line in box
- UQ: top edge of box
- LQ: lower edge of box
- max: top of whisker
- min: bottom of whisker

What can the shape of the distribution show about the data?

- central values
- extreme values
- where bulk of data lie

What is +vely skewed data?

- tail on RHS longer than tail on left
- large no. of indiv have low data values and few indiv have v. high values, which stretches right tail
- e.g. alcohol intake in pregnancy, cholesterol, weight, blood pressure

What is -vely skewed data?

- tail on LHS longer than tail on right
- e.g. gestational age: pre-term births stretch the left tail, while many values sit on the right, where the tail is limited by the clinical practice of induction beyond 40 weeks + the limiting size of mother/foetus

How are bar charts presented?

- each category given its own bar along x axis
- height of each bar prop to freq of observations

What are advs of bar charts?

- show freq/% in each category
- may be quicker to absorb than a table

What do pie charts show?

- distribution of indiv across diff categories of a variable, where every indiv belongs to only 1 category
- each category given an area/slice of graph
- area of each slice prop to freq of observations within that category, calc by dividing whole pie (360 degrees) into slices
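
The slice calculation can be sketched as below: each category gets a share of the 360 degrees in proportion to its frequency. The marital status counts are hypothetical.

```python
# Hypothetical frequencies for a categorical variable.
frequencies = {"married": 45, "single": 30, "divorced": 15, "widowed": 10}

total = sum(frequencies.values())

# Angle of each slice = 360 degrees x (category freq / total freq).
angles = {category: 360 * f / total for category, f in frequencies.items()}

for category, angle in angles.items():
    print(f"{category:>9}: {angle:.0f} degrees")
```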

What are the adv + disadv of pie charts?

- adv: comparison of prop in diff pop groups
- disadv: hard to judge size of angle, so can't judge figs/prop

How are data distributed under a normal distribution, and what is it used for?

- approx 95% of data lie within mean +/- 2 x SD
- approx 68% of data lie within mean +/- 1 x SD
- normal distribution used for normal ranges, where you expect normal, healthy values to lie
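
The proportion of a normal distribution lying within a given number of SDs of the mean can be checked with Python's standard library. This sketch uses the standard normal (mean 0, SD 1); the function name is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def proportion_within(k):
    """Proportion of a normal distribution lying within mean +/- k x SD."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_1sd = proportion_within(1)
within_2sd = proportion_within(2)
within_196sd = proportion_within(1.96)

print(f"within 1 SD:    {within_1sd:.1%}")
print(f"within 2 SD:    {within_2sd:.1%}")
print(f"within 1.96 SD: {within_196sd:.1%}")
```

The exact multiplier for 95% is 1.96, which is usually rounded to 2.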