Summarising data Flashcards
What are the two types of variables?
Categorical
Numerical
What is a categorical variable?
Give 3 subtypes:
Non-numerical: each value is associated with a category. Subtypes:
- Ordered categorical (ordinal) e.g. social class = assigned a number
- Unordered categorical (nominal) e.g. blood group = named group
- Dichotomous/binary e.g. gender = two categories only
Give two subtypes of numerical variable:
- Continuous e.g. height = infinite no. of distinct values
- Discrete/counts e.g. number of siblings = only specific values are possible
What type of variable is severity (low, moderate or severe) of dental erosion?
Ordinal
What type of study is ALSPAC (Avon Longitudinal Study of Parents and Children)?
Prospective cohort study
Recruited pregnant women living in Avon with a due date between April 1991 and December 1992
How can one categorical variable be shown?
Bar chart
Pie chart
Frequency table
How can one continuous variable be presented?
Histogram
Bar chart or pie chart (only once the values have been grouped into categories)
How can a categorical outcome and categorical exposure be presented?
Contingency (2 way) table
Outcome = columns
Exposure = rows
Each cell usually shows count and % within exposure = gives some idea of relationship between outcome and exposure
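A minimal sketch of such a table, assuming made-up data and hypothetical labels ("exposed", "unexposed", "yes", "no"). It puts exposure in the rows and outcome in the columns, and each cell holds the count and the % within exposure:

```python
from collections import Counter

# Hypothetical (exposure, outcome) pairs, made up for illustration only.
records = (
    [("exposed", "yes")] * 30
    + [("exposed", "no")] * 70
    + [("unexposed", "yes")] * 10
    + [("unexposed", "no")] * 90
)

counts = Counter(records)
exposures = ["exposed", "unexposed"]   # rows
outcomes = ["yes", "no"]               # columns


def row_percentages(counts, exposures, outcomes):
    """Return {exposure: {outcome: (count, % within exposure)}}."""
    table = {}
    for exposure in exposures:
        row_total = sum(counts[(exposure, outcome)] for outcome in outcomes)
        table[exposure] = {
            outcome: (
                counts[(exposure, outcome)],
                100 * counts[(exposure, outcome)] / row_total,
            )
            for outcome in outcomes
        }
    return table


table = row_percentages(counts, exposures, outcomes)
for exposure in exposures:
    cells = ", ".join(
        f"{outcome}: {n} ({pct:.0f}%)"
        for outcome, (n, pct) in table[exposure].items()
    )
    print(f"{exposure:<10} {cells}")
```

Comparing the percentages down a column (here, outcome "yes" in exposed vs unexposed) gives the first idea of whether outcome and exposure are related.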
How can a numerical outcome and categorical exposure be presented?
Box and whisker plot (n.b. different people use the whiskers to represent different measures)
Can compare distributions in >2 groups
How can a numerical outcome and numerical exposure be presented?
Scatter plot (may include a fitted regression line)
What is the most appropriate graph for displaying adult height according to social class?
Box and whisker plot
(adult height = outcome = continuous; social class = exposure = categorical)
What are the 3 different measures of central tendency?
Mean
Median (more useful if there are extreme/outlying values or data is not symmetrically distributed)
Mode (depends on the precision of the data -> if the data are sufficiently precise that every reading can be distinguished from the others, a mode won't exist)
N.B. these vary!
What are the 4 measures of variability (extent of spread around the centre)?
Range (depends solely on the two extreme values = may be unrepresentative of the whole set)
Interquartile range (often more useful when the median is used instead of the mean)
Standard deviation (data must be approximately symmetrically distributed for it to be meaningful)
Variance (SD²)
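A minimal sketch of these summary measures using Python's standard `statistics` module. The sample is hypothetical and deliberately includes one extreme value, so the mean and median can be compared:

```python
import statistics

# Hypothetical, positively skewed sample (one extreme value: 40).
data = [2, 3, 3, 5, 7, 10, 40]

# Central tendency
mean = statistics.mean(data)        # pulled upwards by the extreme value
median = statistics.median(data)    # middle value of the ranked data
mode = statistics.mode(data)        # most frequent value

# Variability
value_range = max(data) - min(data)             # uses only the two extremes
q1, q2, q3 = statistics.quantiles(data, n=4)    # quartiles, default "exclusive" method
iqr = q3 - q1                                   # spread of the middle half
sd = statistics.stdev(data)                     # sample standard deviation
variance = statistics.variance(data)            # sample variance = SD squared

print("mean", mean, "median", median, "mode", mode)
print("range", value_range, "IQR", iqr)
print("SD", round(sd, 2), "variance", round(variance, 2))
```

Note that quartile conventions differ between software packages, so the IQR from another tool may not match exactly.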
What are the 3 types of distributions?
Normal (symmetrical)
Positively skewed (long tail to right)
Negatively skewed (long tail to left)

What do we do to positively skewed data to convert it to approximate normality?
Log transformation (must remember to back-transform any means and SDs to the original units for comparison)
n.b. other transformations may be required for negatively skewed data
What is the geometric mean?
The exponential of the mean calculated from the log-transformed data
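A minimal sketch of the calculation, assuming natural logs and a small hypothetical set of positively skewed values:

```python
import math
import statistics

# Hypothetical positively skewed measurements (illustrative values only).
values = [1, 10, 100]

logged = [math.log(v) for v in values]     # natural log transformation
mean_of_logs = statistics.mean(logged)     # mean on the log scale
geometric_mean = math.exp(mean_of_logs)    # back-transform to the original units

arithmetic_mean = statistics.mean(values)  # for comparison

print(round(geometric_mean, 6), arithmetic_mean)
```

The geometric mean sits well below the arithmetic mean here because it is less influenced by the single large value. The standard library's `statistics.geometric_mean` (Python 3.8+) should give the same result.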
What is a normal distribution?
95% of observations enclosed within mean +/- 1.96 SD
Mean & median = identical
SD determines shape (small = tall and narrow, large = shorter and fatter)
What is a reference range?
A further measure of variability = amount of variation between individual observations of data = used in clinic to determine whether a patient is clinically normal or not
e.g. 95% reference range = mean +/- 1.96 SD
Can also have 90 & 99% reference ranges etc.
Defined using properties of normal distribution
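A minimal sketch of how reference ranges follow from the normal distribution. The function name `reference_range` and the mean and SD used are hypothetical. The multiplier (about 1.96 for 95%) is taken from the standard normal distribution:

```python
from statistics import NormalDist


def reference_range(mean, sd, coverage=0.95):
    """Return (lower, upper) limits enclosing `coverage` of a normal distribution."""
    # z is the standard normal quantile leaving (1 - coverage) / 2 in each tail,
    # roughly 1.96 for 95% coverage.
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    return mean - z * sd, mean + z * sd


# Hypothetical mean and SD, chosen only for illustration.
for coverage in (0.90, 0.95, 0.99):
    lower, upper = reference_range(mean=100.0, sd=10.0, coverage=coverage)
    print(f"{coverage:.0%} reference range: {lower:.1f} to {upper:.1f}")
```

The range widens as the coverage increases, and it is only meaningful if the data are approximately normally distributed.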
The curve of a normal distribution with a large standard deviation is ________ than one with a small standard deviation
Shorter/wider
What are the 3 measures of outcome occurrence?
(other types of summary measures)
Prevalence
Incidence
Incidence rate
What do prevalence and incidence calculations exclude?
Cure
Death
Emigration of ill people
What is the link between prevalence and incidence?
Prevalence = incidence × average outcome duration
Only holds if prevalence < 0.1 and prevalence and incidence are constant
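A worked example of this relationship, as a sketch with hypothetical figures. The function name `approximate_prevalence` is an assumption for illustration:

```python
def approximate_prevalence(incidence_rate, mean_duration):
    """Prevalence ~ incidence x average duration (low-prevalence, steady-state approximation)."""
    return incidence_rate * mean_duration


# Hypothetical figures: 5 new cases per 1,000 person-years, lasting 4 years on average.
prevalence = approximate_prevalence(incidence_rate=0.005, mean_duration=4)
print(f"{prevalence:.3f}")

# Same incidence, but the outcome now lasts 8 years on average.
longer = approximate_prevalence(incidence_rate=0.005, mean_duration=8)
print(f"{longer:.3f}")
```

With incidence held constant, doubling the average duration doubles the approximate prevalence.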
What tells us how many new cases have occurred in a particular time period?
Incidence
How do we determine an association between two continuous variables?
Examine data graphically (scatter plot = initial feeling of relationship)
Statistical quantification of linear association = correlation (closely related to linear regression)
Pearson’s correlation coefficient quantifies the linear association between two variables in terms of what?
Direction and strength
What range of values can Pearson's correlation coefficient take?
+1 = variables tend to have higher or lower values together (closer association = closer to +1)
0 = variables not linearly associated
-1 = high values of one variable tend to be associated with low values of the other (closer association = closer to -1)
What does a perfect correlation look like?
All points lie exactly on a straight line (r = +1 if the line rises, r = -1 if it falls)
What is an absent correlation and what does it look like?
No linear association between the variables (there may still be a non-linear relationship, e.g. quadratic)
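A minimal sketch of Pearson's correlation coefficient, written out by hand on hypothetical data. It shows a perfect positive correlation, a perfect negative correlation, and an absent linear correlation where the relationship is actually quadratic:

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)


xs = [-2, -1, 0, 1, 2]   # hypothetical values

r_positive = pearson_r(xs, [2 * x + 1 for x in xs])   # points on a rising line
r_negative = pearson_r(xs, [-3 * x for x in xs])      # points on a falling line
r_quadratic = pearson_r(xs, [x ** 2 for x in xs])     # perfect curve, no linear trend

print(r_positive, r_negative, r_quadratic)
```

The quadratic case is why a coefficient near 0 only rules out a linear association, and why the scatter plot should be checked first.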

What type of variable is Sex?
and how is it best graphically presented?
Categorical (binary)
Bar chart or pie chart
What type of variable is Age?
and how is it best graphically presented?
Numerical (continuous)
Histogram
What type of variable is Ethnicity?
and how is it best graphically presented?
Categorical (nominal)
Bar chart/pie chart
What type of variable is Height?
and how is it best graphically presented?
Numerical (continuous)
Histogram
What type of variable is Social class group?
and how is it best graphically presented?
Categorical (ordinal)
Bar chart/pie chart
What type of variable is number of fillings?
and how is it best graphically presented?
Numerical (discrete)
Bar chart
What type of variable is Fat mass?
and how is it best graphically presented?
Numerical (continuous)
Histogram
If asked to describe what each of the summary statistics tells us about two variables make sure to:
Write out the actual values from the table and explain what each summary statistic is e.g. the median (middle value of the ranked data) is 150.7 cm
How can you tell from summary statistics if data is normally distributed or not?
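If mean and median are very similar = distribution is approximately symmetrical, which is consistent with a normal distribution (a histogram should still be checked)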
Which gives a better representation of the average for positively skewed data? Arithmetic mean or geometric mean?
Geometric -> not overly influenced by the very large values in a skewed distribution
What does age standardisation mean?
And when is it appropriate?
Adjusting the rates to minimise the effects of differences in age composition when comparing across different populations
Appropriate when comparing a statistic across populations with differing age distributions (otherwise results could be misleading)
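A minimal sketch of one common approach, direct standardisation, which is an assumption here since the card does not name a method. All figures, age groups and the standard population are hypothetical. Both populations have identical age-specific rates but different age structures:

```python
def weighted_rate(age_specific_rates, population):
    """Overall rate when the given age-specific rates are applied to a population."""
    cases = sum(age_specific_rates[group] * population[group] for group in population)
    return cases / sum(population.values())


# Hypothetical age-specific rates, identical in both populations.
rates_a = {"young": 0.01, "old": 0.10}
rates_b = {"young": 0.01, "old": 0.10}

population_a = {"young": 900, "old": 100}   # mostly young
population_b = {"young": 500, "old": 500}   # older age structure
standard = {"young": 700, "old": 300}       # assumed standard population

# Crude rates use each population's own age structure.
crude_a = weighted_rate(rates_a, population_a)
crude_b = weighted_rate(rates_b, population_b)

# Directly standardised rates apply each population's age-specific rates
# to the same standard age structure.
standardised_a = weighted_rate(rates_a, standard)
standardised_b = weighted_rate(rates_b, standard)

print(f"crude:        A={crude_a:.3f}  B={crude_b:.3f}")
print(f"standardised: A={standardised_a:.3f}  B={standardised_b:.3f}")
```

The crude rates differ only because population B is older. Once both are applied to the same standard age structure, the rates agree.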
What two reasons may cause an increase in prevalence of disease in a population while the incidence remains fairly constant?
- Average duration of disease increased due to improvements in treatment to prolong life
- Average duration of disease increases due to improvements in diagnosis i.e. earlier diagnosis (prolong the period people know they have the disease)
What type of variable is number of siblings?
Numerical (discrete) = a count; you cannot have 0.4 of a sibling = not a continuum
How could you display numerical exposure and outcome data?
Scatter plot
Which of the following is NOT a measure of variability in the population?
- Interquartile range
- Standard error
- Standard deviation
Standard error = measures the precision of an estimate (e.g. a sample mean) and is used to make inferences beyond the people we have measured
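A short sketch of the distinction, assuming the usual formula for the standard error of a sample mean (SD divided by the square root of the sample size). The SD and sample sizes are hypothetical:

```python
import math


def standard_error_of_mean(sd, n):
    """Standard error of the sample mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)


sd = 4.0   # hypothetical SD of individual observations
for n in (16, 64, 256):
    # The SD describes spread between individuals and does not depend on n;
    # the standard error shrinks as the sample gets larger.
    print(n, sd, standard_error_of_mean(sd, n))
```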
The location and variability of a normally distributed variable are usually summarised by which of the following:
- Median & IQR
- Mean & SD
- Mode & range
Mean & SD
n.b. mode, median and mean are equivalent in a normal distribution
Which of the following is a false statement about reference ranges?
- They can be interpreted as likely values for an individual in the population
- 90, 95 and 99% reference ranges can all be calculated
- They are a measure of location
They are a measure of location
= false: reference ranges are a measure of variability between individual observations
In a population of adults, if the number of teeth remaining has mean 30 and SD 4, could these data be normally distributed?
No -> cannot have more than 32 teeth!
If it were a normal distribution we would assume that over 32 teeth could exist
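A quick check of this reasoning, as a sketch using the mean and SD from the question and `statistics.NormalDist` from the standard library:

```python
from statistics import NormalDist

teeth = NormalDist(mu=30, sigma=4)    # mean and SD taken from the question
share_above_32 = 1 - teeth.cdf(32)    # proportion a normal curve puts above 32 teeth

print(f"{share_above_32:.1%}")
```

A normal curve with these parameters would put roughly three in ten adults above the maximum possible 32 teeth, so the data must be skewed rather than normal.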
For a particular outcome, what is the definition of prevalence?
Proportion with the outcome at a particular point in time
Should pairs of continuous variables always be examined graphically before analysis to check for non-linear associations?
Yes