2 - Summarising Data Flashcards
What does each row & each column represent?
Each row = an OBSERVATION (or record) & represents 1 person
Each column = a VARIABLE (e.g race, gender, DOB)
What are the 4 types of variables ?
- A nominal-scale variable
- An ordinal-scale variable
- An interval-scale variable
- A ratio-scale variable
A nominal-scale variable
- Values are categories w/out numerical ranking e.g country of residence
- Nominal variables w/ only 2 categories are v common: alive/dead, ill/well, vax/unvax, smoked/didn’t smoke
- A nominal variable w/ 2 mutually exclusive categories = DICHOTOMOUS VARIABLE
An ordinal-scale variable
- Has values that can be ranked but aren’t necessarily evenly spaced e.g stage of cancer
An interval-scale variable
- Measured on a scale of equally spaced units, but w/out a true 0 point e.g DOB
A ratio-scale variable
- Interval variable w/ a true 0 pt e.g height in cm, systolic bp in mmHg, duration of illness in days
What kind of variables are nominal- & ordinal-scale variables ?
QUALITATIVE or CATEGORICAL
What kind of variables are interval- & ratio-scale variables?
QUANTITATIVE or CONTINUOUS
Frequency distributions are represented in a histogram, with 3 main features. What are they?
- Central location (peak of distribution)
- Spread (how widely dispersed it is on both sides of peak)
- Shape (whether it is approx symmetrical or skewed)
What are the 3 measures of central location?
- Mean
- Median
- Mode
What is spread & what are the 2 measures?
Aka variation or dispersion
- Range
- Standard deviation
What are the 2 possible shapes of a frequency distribution?
skewed vs symmetrical
What does skewness refer to? What does +vely or -vely skewed mean?
skewness refers to the TAIL, not the hump → so a distribution skewed to L has a long L tail
If skewed to R → +vely skewed
If skewed to L → -vely skewed
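A rough sketch of the "tail, not the hump" idea, using made-up values. It relies on a common rule of thumb (not stated in these cards) that a long R tail drags the mean above the median; the data and variable names are illustrative assumptions only.

```python
from statistics import mean, median

# Hypothetical values: most are small, a few large ones form a long R tail
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 10, 18]

skew_mean = mean(right_skewed)      # dragged towards the long tail
skew_median = median(right_skewed)  # stays near the hump

# Rule of thumb only: mean > median suggests +ve (R) skew,
# mean < median suggests -ve (L) skew
direction = "+ve (R)" if skew_mean > skew_median else "-ve (L) or none"
print(skew_mean, skew_median, direction)
```

A histogram is still the more reliable way to judge shape; comparing mean and median is only a quick check.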
What is the normal or Gaussian distribution?
Classic bell-shaped curve
What is the median?
Middle value of a set of data that's been put into rank order; the value that divides the data into 2 halves
50th percentile (of the distribution)
What is the mean?
Aka average
Best descriptive measure for data that are normally distributed
What is used instead of MEAN for data values which are skewed or have outliers?
MEDIAN
How does one select to use mean, median or mode?
- Characteristics of data – eg normally distributed or skewed & with/without outliers
- Reason for calculating the measure – eg descriptive or analytical purposes
Mean = measure of choice when data are normally distributed
Median = measure of choice for data not normally distributed
When data is not normally distributed, mean is not preferred. True or false?
True
Mean uses all the data & is sensitive to outliers
Mode & median → unaffected by outliers
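A small sketch of how one outlier affects each measure, using Python's standard `statistics` module. The durations are invented for illustration. Note that the median can shift slightly when a value is added, but far less than the mean.

```python
from statistics import mean, median, mode

# Hypothetical durations of illness in days
days = [3, 4, 4, 5, 6]
days_with_outlier = days + [60]   # one extreme value added

mean_before, mean_after = mean(days), mean(days_with_outlier)
median_before, median_after = median(days), median(days_with_outlier)
mode_before, mode_after = mode(days), mode(days_with_outlier)

# The mean jumps; the median barely moves; the mode does not change
print("mean:  ", mean_before, "->", round(mean_after, 2))
print("median:", median_before, "->", median_after)
print("mode:  ", mode_before, "->", mode_after)
```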
What are the 3 measures of spread?
- Range
- IQR
- SD
What are percentiles?
Divide the distribution of the data into 100 equal parts
Pth percentile (P goes from 0 to 100) = value that has P % of values falling at or below it → 90th percentile has 90% of values falling at or below it
What are quartiles?
Grouping data into 4 equal parts (quartiles)
Each quartile = 25% of the data
Cut-off for the 1st quartile is the 25th percentile
Cut-off for the 2nd quartile is the 50th percentile
(etc etc)
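A minimal sketch of percentile and quartile cut-offs with `statistics.quantiles` (Python 3.8+). The data are hypothetical, and `method="inclusive"` is one of several interpolation conventions; other software may give slightly different cut-offs for the same data.

```python
from statistics import quantiles

values = list(range(1, 10))  # hypothetical data: 1, 2, ..., 9

# n=4 returns the 3 cut-offs between the 4 quartiles
# (25th, 50th and 75th percentiles)
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

# n=100 returns the 99 percentile cut-offs; index 89 is the 90th percentile
p90 = quantiles(values, n=100, method="inclusive")[89]

print(q1, q2, q3, p90)
```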
What is the IQR (interquartile range)?
Measure of spread used most commonly w/ the median
Represents the central portion of the distribution, from the 25th percentile to 75th percentile
The IQR is generally used in conjunction with what?
median → together, useful to characterize central location & spread of any freq distributions → but esp skewed (asym) ones
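A sketch of reporting median with IQR for a skewed distribution. The durations are invented, and the quartile cut-offs again assume the `inclusive` method of `statistics.quantiles`.

```python
from statistics import median, quantiles

# Hypothetical, +vely skewed data (e.g. duration of illness in days)
durations = [1, 2, 2, 3, 3, 4, 5, 8, 20]

lower_q, middle_q, upper_q = quantiles(durations, n=4, method="inclusive")
iqr = upper_q - lower_q                        # 25th to 75th percentile
data_range = max(durations) - min(durations)   # driven by the extreme value

print("median:", median(durations))
print("IQR:", lower_q, "to", upper_q, "(width", iqr, ")")
print("range:", data_range)
```

The single value of 20 stretches the range but leaves the IQR untouched, which is why the IQR pairs well with the median for skewed data.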
What is a box plot?
Graphical representation of the location, spread & skewness of groups of numerical data through their quartiles
Uses of IQR
If distrib is non-symmetric – use range & IQR (so median goes together w/ range & IQR)
What is standard deviation (SD)?
Variability in a set of data
Commonly used w/ mean
When is SD used?
Only when data is normally distributed (i.e data falls into bell-shaped curve)
For normally distributed data:
- Mean = recommended measure of central location
- SD = recommended measure of spread
What is the standard error (se) of the mean?
Variability we may expect in means of repeated samples taken from the same population
Divide SD by square root of n
How is se calculated?
Divide SD by square root of n
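The calculation can be sketched as below. The sample values are hypothetical, and using the sample SD (`statistics.stdev`, with an n - 1 denominator) is an assumption, since the cards do not say which SD formula is meant.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical sample of 8 measurements
sample = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(sample)
sample_mean = mean(sample)
sd = stdev(sample)   # sample SD (n - 1 denominator): an assumption
se = sd / sqrt(n)    # SE of the mean = SD / sqrt(n)

print(sample_mean, round(sd, 3), round(se, 3))
```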
What is standard error/se of mean used for?
Calculation of confidence intervals (confidence limits) around the mean
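A hedged sketch of how the SE feeds into a CI. It assumes the common normal-approximation formula, mean ± 1.96 × SE, for a 95% CI; the multiplier 1.96, the helper name `ci95` and the numbers are illustrative and not taken from these cards.

```python
from math import sqrt

def ci95(sample_mean, sd, n):
    """Approximate 95% CI for a mean: mean ± 1.96 × SE (normal approximation)."""
    se = sd / sqrt(n)
    margin = 1.96 * se
    return sample_mean - margin, sample_mean + margin

# Same hypothetical mean (120) and SD (10), different sample sizes
small_study = ci95(120, 10, 25)    # SE = 2.0
large_study = ci95(120, 10, 400)   # SE = 0.5

small_width = small_study[1] - small_study[0]
large_width = large_study[1] - large_study[0]
print(small_study, large_study)
```

Because n sits under the square root in the SE, a larger sample gives a smaller SE and so a narrower interval around the same mean.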
What is “inference”?
Using the results of a study (a sample) to make generalizations about the larger population, as epidemiologists do when conducting studies
What does a narrow vs wide confidence interval (CI) mean?
Narrow CI → high precision
Wide CI → low precision
Narrower the interval, the more precise the estimate
Big studies → NARROW confidence intervals (more confident ab data obtained)
Small studies → WIDE confidence intervals
What are confidence intervals (CIs) used for?
calculated for means but ALSO for:
- proportions, rates, risk ratios, odds ratios (& other measures where purpose = draw inferences from a study to the population)