Summarising Data Flashcards

1
Q

Why should data be summarised?

A
  • Data quality monitoring
    • Data checking + cleaning - check for invalid/missing entries
    • Baseline data in a study - describe characteristics of participants in study e.g. 1st table in many research articles - to set study + results in context
    • Before doing a complex analysis - so it makes sense
2
Q

What is quantitative data and the 2 types?

A
  • data which can be measured numerically
  • 2 types: continuous or discrete

3
Q

What is continuous data and give e.g.s?

A
  • data lie on a continuum
  • can take any value between 2 limits
  • e.g. weight, height
4
Q

What is a limitation of continuous data?

A

accuracy of data depends on accuracy of method of measurement, so some continuous data may be recorded as integers even though that is only an approximation to the true value

5
Q

What is discrete data and e.g?

A
  • data do not lie on a continuum
  • can only take certain values, usually counts (integers)
  • e.g. no. of children in a family
6
Q

Why is weight a continuous variable and what is the limitation?

A
  • it is measured using weighing scales
  • lies on a continuum
  • limitation is the accuracy of the scales
7
Q

Why is the number of previous pregnancies in a pregnant woman discrete data?

A
  • it is counted
  • only whole numbers are possible

8
Q

What is ordinal data and which type of data is always ordinal?

A

  • the data values can be arranged in numerical order from the smallest to the largest
  • quantitative data are always ordinal

9
Q

What are e.g.s of ordinal data?

A
  • Questionnaire scale data - often counts, e.g. when adding the no. of +ve responses to a set of questions to get a total score.
  • Categorical data may also have an inherent order, such as stage of disease.
10
Q

What is an e.g. where continuous data can look discrete?

A
  • because of the way they are measured and/or reported
  • e.g. gestational age of babies is often reported in whole weeks, such as 38 weeks, so it appears to be discrete
  • It is however continuous - could be reported to a greater degree of accuracy, e.g. as a decimal, such as 38.5 weeks
11
Q

What are all continuous measurements limited by?

A
  • the accuracy of the instrument used to measure them
  • many quantities are reported in whole numbers for convenience, such as age and height
12
Q

What is categorical data?

A

data where individuals fall into a number of separate categories or classes

13
Q

Give e.g.s of categorical data

A
  • gender: male or female = 2 classes
  • disease status: alive or dead = 2 classes
  • stage of cancer: I, II, III or IV = 4 classes
  • marital status: married, single, divorced, widowed or legally separated = 5 classes
14
Q

Give e.g.s of when categorical data can be ordinal?

A
  • Different categories of categorical data may be assigned a number for coding purposes
  • if there are several categories there may be an implied ordering, such as with stage of cancer where stage I is the least advanced and stage IV the most advanced.
15
Q

What is dichotomous data and give e.g.?

A
  • only 2 classes
  • all individuals fall into one or other of the classes
  • also known as binary data
  • e.g. disease status: alive or dead
16
Q

Is it possible to categorise continuous data?

A
  • possible to re-classify continuous data into groups, for ease of reporting.
  • e.g. it is common to report birthweight in bands, giving the numbers of babies who fall into each birthweight band.
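
The banding described above can be sketched in a few lines of Python. The band edges and the birthweights here are made up for illustration; they are assumptions, not values from the cards, and a real report would choose its own cut-offs.

```python
from collections import Counter

# Hypothetical band edges in grams (lower bound inclusive, upper bound exclusive).
BANDS = [
    (0, 1500, "<1500 g"),
    (1500, 2500, "1500-2499 g"),
    (2500, 4000, "2500-3999 g"),
    (4000, float("inf"), ">=4000 g"),
]


def band_for(weight_g):
    """Return the label of the band that contains weight_g."""
    for lower, upper, label in BANDS:
        if lower <= weight_g < upper:
            return label
    raise ValueError(f"weight out of range: {weight_g}")


# Made-up birthweights in grams, for illustration only.
birthweights_g = [1320, 2450, 2500, 3100, 3380, 3650, 3999, 4000, 4210]

# Count how many babies fall into each band.
band_counts = Counter(band_for(w) for w in birthweights_g)

for _, _, label in BANDS:
    print(f"{label}: {band_counts[label]} babies")
```

The original continuous values are kept in `birthweights_g`, so they remain available for any later analysis; only the reporting uses the bands.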
17
Q

What are the consequences of dichotomising?

A
  • lots of info + statistical power lost in the analysis.
  • nature of any relationships may be masked, e.g. if relationship was curved, it may appear weaker once the data are categorized
  • if relationship was U-shaped, categorization may totally obscure it
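
A small synthetic example may make the U-shaped case concrete. The data below are invented (outcome = squared distance of the exposure from 50) and the cut-off of 50 is an assumption chosen for illustration; the point is only that the two group means can come out identical even though the outcome varies a lot with the exposure.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Synthetic U-shaped relationship: outcome is lowest when exposure is near 50.
exposure = list(range(5, 100, 10))            # 5, 15, ..., 95
outcome = [(x - 50) ** 2 for x in exposure]

# Dichotomise the exposure at a hypothetical cut-off of 50.
cutoff = 50
low_group = [y for x, y in zip(exposure, outcome) if x < cutoff]
high_group = [y for x, y in zip(exposure, outcome) if x >= cutoff]

low_mean = mean(low_group)
high_mean = mean(high_group)

print("mean outcome, low exposure group: ", low_mean)
print("mean outcome, high exposure group:", high_mean)
print("outcome actually ranges from", min(outcome), "to", max(outcome))
```

Comparing only the two group means would suggest no relationship at all, which is the masking effect the card describes.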
18
Q

Why is it better if continuous data are re-classified into several groups?

A
  • effect on statistical power is less than when dichotomizing.
  • Grouping causes no problem if re-classification done simply to present summary statistics but the original data are used in the analysis
  • Sometimes can be useful when examining a non-linear relationship. The analysis may be more straightforward and more meaningful
19
Q

How can continuous data be summarised?

A
  • a measure of the centre of the data distribution
  • a measure of the variability of the data

20
Q

What are measures of centre of data?

A
  • Mean
  • Median

21
Q

What are measures of variability of data?

A
  • Standard deviation (variance)
  • Range (minimum, maximum)
  • Interquartile range
22
Q

What is the mean and how is it calc?

A
  • simple average of all the data
  • sum of all values divided by the total number of values, aka the arithmetic mean

23
Q

What is the median and what is it when there is an even/odd no. of values in the sample?

A
  • the middle value when the data are arranged in ascending order of size. (n + 1)/2 = pos of median
  • odd no. of values in the sample: median will be the value with the same number of values both bigger than it and smaller than it.
  • even number of values: there will be two middle values and the median will be the mean of the two.
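
As a rough sketch of the mean and median rules above, the following Python uses the (n + 1)/2 position for the median. The sample values are made up for illustration.

```python
def mean(values):
    """Arithmetic mean: sum of all values divided by the number of values."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered data, located at position (n + 1)/2."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: (n + 1)/2 is a whole number, so a single middle value exists.
        position = (n + 1) // 2          # 1-based position
        return ordered[position - 1]
    # Even n: (n + 1)/2 falls between two middle values, so average them.
    lower_middle = ordered[n // 2 - 1]
    upper_middle = ordered[n // 2]
    return (lower_middle + upper_middle) / 2


odd_sample = [3, 1, 4, 1, 5]        # made-up data, n = 5
even_sample = [3, 1, 4, 1, 5, 9]    # made-up data, n = 6

print(mean(odd_sample), median(odd_sample))
print(mean(even_sample), median(even_sample))
```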
24
Q

What is standard deviation and what does it indicate?

A
  • a measure of the average difference between the mean and each data value.
  • indicates how dispersed the data are
25
Q

How is SD calc?

A
  • square root of the variance.
26
Q

How is variance calc?

A

summing the squared differences between the overall mean and each value and then dividing by the number of values minus one.

27
Q

What is the advantage of the standard deviation over the variance?

A

it is in the same units as the original data and so is easier to interpret

28
Q

What happens to the equation when the whole population variance is calculated?

A
  • a different denominator is used; we divide by n (but this almost never happens)
  • Since we virtually always have a sample, the variance is obtained by dividing by n-1 (and the SD is its square root) because it can be shown to give a more accurate estimate of the population variance, and hence of the population standard deviation
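
The difference between the two denominators can be sketched as follows. The data are made up, and the function names are illustrative; Python's standard `statistics` module offers `variance`/`stdev` (n - 1) and `pvariance`/`pstdev` (n) for the same calculations.

```python
import math


def sum_of_squared_differences(values):
    """Sum of squared differences between each value and the overall mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)


def sample_variance(values):
    """Variance for a sample: divide by n - 1."""
    return sum_of_squared_differences(values) / (len(values) - 1)


def population_variance(values):
    """Variance for a whole population: divide by n."""
    return sum_of_squared_differences(values) / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up values, mean = 5

sample_sd = math.sqrt(sample_variance(data))
population_sd = math.sqrt(population_variance(data))

print("sample variance:    ", sample_variance(data), " SD:", sample_sd)
print("population variance:", population_variance(data), " SD:", population_sd)
```

The SD is in the same units as the data, whereas the variance is in squared units, which is why the SD is usually the one reported.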
29
Q

What is the range?

A
  • diff between smallest + largest value
  • usually reported as the min and max values (sometimes the single diff between the 2 is shown instead, but this is less informative)

30
Q

What is interquartile range?

A
  • range of values that includes the middle 50% of values
  • bounded by the upper + lower quartiles

31
Q

How are lower + upper quartile calc?

A
  • Lower: rank the data as for the median, then take the value below which the bottom 25% of data sit; pos = (n + 1)/4
  • Upper: rank the data as for the median, then take the value above which the top 25% of data sit; pos = 3(n + 1)/4
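
A minimal sketch of the quartile positions above. The cards do not say what to do when a position is not a whole number; linear interpolation between the two neighbouring values is assumed here, which is one common convention. The data are made up.

```python
def value_at_position(ordered, position):
    """Value at a 1-based position in ordered data.

    Assumption: fractional positions are handled by linear interpolation
    between the two neighbouring values.
    """
    whole = int(position)
    fraction = position - whole
    low = ordered[whole - 1]
    high = ordered[min(whole, len(ordered) - 1)]
    return low + fraction * (high - low)


def quartiles(values):
    """Lower and upper quartiles at positions (n + 1)/4 and 3(n + 1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least 3 values for these positions to exist")
    lower = value_at_position(ordered, (n + 1) / 4)
    upper = value_at_position(ordered, 3 * (n + 1) / 4)
    return lower, upper


# Made-up gestational ages in weeks, n = 11, so the positions are 3 and 9.
gestational_age_weeks = [34, 36, 37, 38, 38, 39, 39, 40, 40, 41, 42]
lq, uq = quartiles(gestational_age_weeks)
print("LQ:", lq, "UQ:", uq, "IQR:", uq - lq)

# n = 8 gives fractional positions 2.25 and 6.75, so interpolation is used.
print(quartiles([2, 4, 4, 4, 5, 5, 7, 9]))
```

With this convention the results agree with `statistics.quantiles(values, n=4)` in the Python standard library, whose default "exclusive" method uses the same positions.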
32
Q

Which summary measure of centre of dis should be used for continuous data with symmetric distribution?

A

Arithmetic mean

33
Q

Which summary measure of centre of dis should be used for continuous data with +vely skewed distribution?

A

geometric/harmonic mean (but these don’t allow for 0 values)
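
A hedged sketch of the geometric and harmonic means, showing why zero values are a problem: the geometric mean takes logs and the harmonic mean takes reciprocals, neither of which is defined at 0. The data are made up, and rejecting negative values as well is a simplifying assumption of this sketch.

```python
import math


def geometric_mean(values):
    """Exponential of the mean of the logged values."""
    if any(v <= 0 for v in values):
        # log(0) is undefined; negatives are also rejected in this sketch.
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """Number of values divided by the sum of their reciprocals."""
    if any(v <= 0 for v in values):
        # 1/0 is undefined; negatives are also rejected in this sketch.
        raise ValueError("harmonic mean needs strictly positive values")
    return len(values) / sum(1 / v for v in values)


skewed = [1, 10, 100]   # made-up, positively skewed values

print("arithmetic mean:", sum(skewed) / len(skewed))
print("geometric mean: ", geometric_mean(skewed))
print("harmonic mean:  ", harmonic_mean(skewed))
```

For these values the arithmetic mean is pulled towards the large value, while the geometric and harmonic means sit closer to the bulk of the data.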

34
Q

Which summary measure of centre of dis should be used for continuous data with skewed distribution?

A

Median

35
Q

Which summary measure of centre of dis should be used for discrete data?

A
  • Median unless range of data is large enough to make calc of mean sensible
  • e.g. no. of children in a family is discrete, while sometimes mean is calc e.g. 2.4 children - may be diff to interpret
36
Q

Which summary measure of spread of dis should be used for continuous data?

A
  • SD
  • range often useful if there’s room to present it

37
Q

Which summary measure of spread of dis should be used for continuous data with skew (unsymm)?

A

IQR

38
Q

How can unordered/nominal data be summarised?

A
  • using freq in each category together with its prop or % of the overall total
  • complete set of freq is freq dis
39
Q

How can ordered/ordinal data be summarised?

A
  • using freq + % but can also calc cumulative freq + % which is useful to show % below certain cut-off
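
The frequency, %, cumulative frequency and cumulative % columns can be sketched like this. The counts per cancer stage are made up for illustration, and the stage order is supplied explicitly because cumulative figures only make sense for ordered categories.

```python
from collections import Counter

# Ordered categories, least to most advanced.
STAGE_ORDER = ["I", "II", "III", "IV"]

# Made-up data: one stage label per patient.
stages = ["I"] * 8 + ["II"] * 6 + ["III"] * 4 + ["IV"] * 2

counts = Counter(stages)
total = len(stages)

# Each row: (stage, freq, %, cumulative freq, cumulative %).
table = []
running_total = 0
for stage in STAGE_ORDER:
    freq = counts[stage]
    running_total += freq
    table.append(
        (stage, freq, 100 * freq / total, running_total, 100 * running_total / total)
    )

for stage, freq, pct, cum_freq, cum_pct in table:
    print(f"stage {stage}: n={freq} ({pct:.0f}%), cumulative n={cum_freq} ({cum_pct:.0f}%)")
```

Reading the cumulative % column gives the share at or below a cut-off directly, e.g. the row for stage II shows the % of patients at stage II or lower.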
40
Q

What is a histogram and its features?

A
  • diagram which shows dis of data by plotting data in rectangles (bins) corresponding to categories along x axis
  • rectangles have heights/areas prop to freq (no.) in the categories
  • y axis is freq per interval
41
Q

How are the bins interpreted in a histogram?

A
  • if widths of bins are the same, height of each rectangle is prop to its freq, but if not then the area indicates the freq
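
For unequal bin widths, one way to keep area proportional to frequency is to plot frequency per unit of width (often called frequency density) as the bar height. The bins below are made up for illustration.

```python
def bar_heights(bins):
    """Frequency per unit width for each (lower, upper, frequency) bin."""
    return [freq / (upper - lower) for lower, upper, freq in bins]


# Made-up bins with unequal widths: (lower edge, upper edge, frequency).
bins = [(0, 10, 5), (10, 20, 10), (20, 50, 15)]

heights = bar_heights(bins)

# Area of each rectangle = height x width, which recovers the frequency.
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, bins)]

for (lower, upper, freq), h, a in zip(bins, heights, areas):
    print(f"{lower}-{upper}: freq={freq}, height={h}, area={a}")
```

The widest bin has the largest frequency but the same height as the first bin, which is why height alone would be misleading here.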
42
Q

What does a histogram show?

A
  • shape of dis
  • range
  • middle
43
Q

What does a box & whisker plot show?

A
  • median: horizontal line in box
  • UQ: top edge of box
  • LQ: lower edge of box
  • max: top of whisker
  • min: bottom of whisker
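
The five values a box & whisker plot displays can be computed as below. This sketch follows the card's convention that the whiskers run to the min and max, and it uses `statistics.quantiles` from the Python standard library (default "exclusive" method) for the quartiles; other software may use slightly different quartile rules. The data are made up.

```python
import statistics


def five_number_summary(values):
    """Return (min, lower quartile, median, upper quartile, max)."""
    lq, median, uq = statistics.quantiles(values, n=4)
    return min(values), lq, median, uq, max(values)


# Made-up gestational ages in weeks.
gestational_age_weeks = [34, 36, 37, 38, 38, 39, 39, 40, 40, 41, 42]

summary = five_number_summary(gestational_age_weeks)
labels = ["min", "LQ", "median", "UQ", "max"]
for label, value in zip(labels, summary):
    print(f"{label}: {value}")
```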
44
Q

What can the shape of dis show about the data?

A
  • central values
  • extreme values
  • where bulk of data lie
45
Q

What is +vely skewed data?

A
  • tail on RHS longer than tail on left
  • large no. of indiv have low data values and few indiv have v. high values, which stretches right tail
  • e.g. alcohol intake in pregnancy, chol, weight, blood pressure
46
Q

What is -vely skewed data?

A
  • tail on LHS longer than tail on right
  • e.g. gestational age - pre-term births stretch the left tail, while many values sit on the right because the clinical practice of induction beyond 40 weeks + the limiting size of mother/foetus cap the upper end
47
Q

How are bar charts presented?

A
  • each category given its own bar along x axis
  • height of each bar prop to freq of observations

48
Q

What are advs of bar charts?

A
  • show freq/% in each category
  • may be quicker to absorb than a table

49
Q

What do pie charts show?

A
  • dis of indiv in diff categories of variable where every indiv belongs to only 1 category
  • each category given an area/slice of graph
  • area of each slice prop to freq of observations within that category + calc by div whole pie (360 degrees) into slices
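
The slice calculation above amounts to sharing out 360 degrees in proportion to each category's frequency. A minimal sketch, using made-up counts for marital status:

```python
def slice_angles(frequencies):
    """Angle in degrees for each category: 360 x freq / total."""
    total = sum(frequencies.values())
    return {category: 360 * freq / total for category, freq in frequencies.items()}


# Made-up frequencies; every individual belongs to exactly one category.
marital_status = {"married": 50, "single": 30, "divorced": 10, "widowed": 10}

angles = slice_angles(marital_status)
for category, angle in angles.items():
    print(f"{category}: {angle} degrees")
```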
50
Q

What is adv + disadv of pie charts?

A
  • adv: allows comparison of prop in diff pop groups
  • disadv: hard to judge size of angle so can't judge figs/prop accurately

51
Q

How is data shown on normal dis graph and its use?

A
  • approx. 95% of data lie within mean +/- 2xSD
  • approx. 68% of data lie within mean +/- 1xSD
  • normal dis used for normal ranges where you expect normal, healthy values to lie
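
The proportions for a normal distribution can be checked numerically with the error function from Python's `math` module. The mean and SD used for the normal range are hypothetical values chosen for illustration.

```python
import math


def proportion_within(k):
    """Proportion of a normal distribution lying within mean +/- k x SD."""
    return math.erf(k / math.sqrt(2))


def normal_range(mean, sd, k=2):
    """Interval mean +/- k x SD, often quoted as a normal range."""
    return (mean - k * sd, mean + k * sd)


print(f"within 1 SD:    {proportion_within(1):.4f}")
print(f"within 2 SD:    {proportion_within(2):.4f}")
print(f"within 1.96 SD: {proportion_within(1.96):.4f}")

# Hypothetical measurement with mean 120 and SD 10.
print("normal range:", normal_range(120, 10))
```

Running this shows roughly 68% of values within 1 SD and roughly 95% within 2 SD; the exact 95% figure corresponds to about 1.96 SD, which is usually rounded to 2.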