Week 9 Kuracloud: Measuring and Summarising Data Flashcards
Statistics
=
(Kirkwood & Sterne. Essential Medical Statistics, 2nd ed., 2010)
= “the science of collecting, summarising, presenting and interpreting data, and of using them to estimate the magnitude of associations and test hypotheses”
Descriptive Statistics
= describes features of data sample
“summarising, presenting and interpreting data”
Inferential Statistics
= infer findings of sample to target population
“estimate the magnitude of associations and test hypotheses”
Data
=
= “a set of values of subjects with respect to qualitative or quantitative variables”
Raw Data
=
= observations
Data set
=
= collection of information regarding a group of people or other items
Variables
=, 2
= characteristics that you can measure or observe and may take any one of a specified set of values
- Numerical (quantitative) (or interval/ratio data)
- Categorical (qualitative)
Categorical Variables
2,1
- ordered/ordinal = rank in categories in an order
- unordered/nominal = place observations in named, unordered groups
- dichotomous/binary
Numerical Variables
2
- continuous = on a continuous scale, can take any value in range
- discrete = takes only distinct, countable values (usually whole numbers)
Derived variable
=,
= new variable created from existing variable
e.g. variable measured as numerical –> converted to categorical
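A minimal sketch of deriving a categorical variable from a numerical one. The cut-offs, the group labels and the function name `age_group` are hypothetical, chosen only for illustration.

```python
def age_group(age_years):
    """Derive a categorical age group from a numerical age (hypothetical cut-offs)."""
    if age_years < 18:
        return "child"
    elif age_years < 65:
        return "adult"
    return "older adult"

ages = [4, 17, 18, 42, 64, 65, 80]       # made-up numerical variable
groups = [age_group(a) for a in ages]    # derived (ordinal categorical) variable
print(groups)
```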
Spreadsheets of datasets
3
- Columns: each represents 1 variable (first usually identifier)
- Rows: each represents data for 1 person (record)
- Cells: value of 1 variable for 1 person = observation
Outcome variable
=, (3)
= focus of attention, we try to explain its variation
(dependent variable/response variable/y-variable)
Exposure Variable
=, (3)
= influences variation of outcome variable
(independent variable/predictor variable/x-variable)
Operationalising Variables
=,
= deciding which category designates individual as having an outcome/exposed
dictates interpretation of results
Nominal (unordered categorical) variable measurement
2
- frequencies (no. observations in each category)
- proportions (relative frequencies)
Ordinal (ordered categorical) measurement
2
- frequencies
- proportions
- sometimes means and medians
Numerical (interval/ratio) measurement
3
- mean
- median
- standard deviation
Nominal (unordered categorical) graphical representation
3
- pie chart
- column/bar graph
- stacked column/bar graph
Ordinal (ordered categorical) graphical representation
1
- column/bar graph
Numerical (interval/ratio) graphical representation
4
- bar graph (data grouped)
- histogram (data grouped)
- box and whisker plot (summary statistics)
- line graph (over time)
Relative frequencies
=, 3
= proportion/percentage of total number
presented in:
- table
- bar graph
- pie chart
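A small sketch of how frequencies and relative frequencies could be tabulated for a categorical variable. The blood-type observations are made up for illustration.

```python
from collections import Counter

# Made-up observations of a nominal variable
blood_types = ["O", "A", "A", "B", "O", "O", "AB", "A", "O", "B"]

freq = Counter(blood_types)                                 # frequency per category
total = len(blood_types)
rel_freq = {cat: n / total for cat, n in freq.items()}      # proportion of total

for cat in sorted(freq):
    print(f"{cat:>2}: n={freq[cat]}  proportion={rel_freq[cat]:.2f}  ({rel_freq[cat]:.0%})")
```

The proportions always sum to 1 (100%), which is a quick check on the table.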
Epidemiological prevalence or cumulative incidence
2
Presentation: proportion/percentage
Type: dichotomous categorical variables
Frequency distribution
=, 2, 2
= distribution of values of a numerical variable
- first step in analysing numerical data
- displayed in a histogram
- for discrete: individual frequencies displayed
- for continuous: frequencies of formed groups/ranges
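A rough sketch of building a frequency distribution for a continuous variable by forming groups. The heights, the bin width of 5 cm and the helper `bin_start` are assumptions for illustration, and the text "histogram" is only a stand-in for a real plot.

```python
from collections import Counter

# Made-up continuous measurements (heights in cm)
heights_cm = [158.2, 161.5, 163.0, 167.4, 168.9, 170.1, 171.3, 174.8, 176.0, 182.5]
bin_width = 5  # assumed group width

def bin_start(value, width):
    """Lower edge of the group that a value falls into."""
    return int(value // width) * width

counts = Counter(bin_start(h, bin_width) for h in heights_cm)

# Text histogram: one '#' per observation in each group
for start in sorted(counts):
    print(f"{start}-<{start + bin_width}: {'#' * counts[start]}")
```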
Histogram vs Bar graph
histogram has no gaps between bars because the data are continuous
Histograms show us:
5
- spread
- skew
- mode
- gaps
- unusual values
Histogram Shapes
- positively skewed
- symmetrical
- negatively skewed
Positively Skewed
=,
= asymmetrical distribution in which “upper tail is longer than lower tail” (higher frequency at left/lower values)
^\__
mean > median
Symmetrical
=,
= symmetrical distribution around centre, bell curve, normal distribution, Gaussian distribution
_/^_
mean, median, mode almost equal
Negatively Skewed
=,
= asymmetrical distribution in which “lower tail is longer than upper tail” (higher frequency at higher/right values)
/^
mean < median
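The mean/median relationships for skewed data can be checked on a toy example. Both data sets below are made up; the second is simply a mirror image of the first.

```python
import statistics

# Made-up data with a long upper tail (positively skewed)
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 18]
# Mirror image of the same data: long lower tail (negatively skewed)
left_skewed = [20 - x for x in right_skewed]

for name, data in [("positive skew", right_skewed), ("negative skew", left_skewed)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{name}: mean={mean}, median={median}")
```

The extreme values in the long tail pull the mean towards them, while the median barely moves.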
Measures of Central Tendency
3
- mean
- median
- mode
Measures of Variability
3
- range
- interquartile range/IQR (difference between 1st and 3rd quartiles)
- standard deviation
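A brief sketch of the three measures of variability on made-up data. Note that software packages use slightly different quartile definitions; this uses the default method of Python's `statistics.quantiles`, so the IQR may differ a little from a hand calculation.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up observations

data_range = max(values) - min(values)          # range
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartiles (q2 is the median)
iqr = q3 - q1                                   # interquartile range
sd = statistics.stdev(values)                   # sample standard deviation

print(f"range={data_range}, IQR={iqr}, SD={sd:.2f}")
```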
Standard deviation (SD)
= measure of spread about mean
calculation:
1. differences of each observation from mean taken (deviations)
2. Deviations are squared
3. Add squared deviations together
4. divide by no. observations - 1 (= variance = SD squared)
5. Square root
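The five calculation steps can be followed line by line in a short sketch. The observations are made up; the result is compared with `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up observations
n = len(values)
mean = sum(values) / n

deviations = [x - mean for x in values]   # step 1: deviation of each observation from the mean
squared = [d ** 2 for d in deviations]    # step 2: square the deviations
sum_sq = sum(squared)                     # step 3: add the squared deviations together
variance = sum_sq / (n - 1)               # step 4: divide by n - 1 (variance = SD squared)
sd = math.sqrt(variance)                  # step 5: square root

print(f"mean={mean}, variance={variance:.3f}, SD={sd:.3f}")
print("matches statistics.stdev:", math.isclose(sd, statistics.stdev(values)))
```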
Theoretical Frequency Distribution/Standard Normal Distribution properties
(or PDF = probability density function)
8
- symmetrical about mean (bell curve)
- mean = 0, SD = 1
- tall and narrow for small SD, short and wide for large SD
- 68% lie within 1 SD of mean
- 95% lie within 2 (more precisely 1.96) SDs of mean
- 99.7% lie within 3 SDs of mean
- use mean and SD to find proportion lying between any two values
- probability of any specific value is 0
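The proportions in this list can be checked numerically with `statistics.NormalDist` from the Python standard library. The height distribution at the end (mean 170 cm, SD 10 cm) is a made-up example of finding the proportion between two values.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)   # mean = 0, SD = 1

def proportion_within(k_sd):
    """Proportion of a normal distribution lying within k_sd SDs of the mean."""
    return standard_normal.cdf(k_sd) - standard_normal.cdf(-k_sd)

for k in (1, 1.96, 2, 3):
    print(f"within {k} SD of mean: {proportion_within(k):.4f}")

# Proportion between any two values, from the mean and SD alone (hypothetical numbers)
heights = NormalDist(mu=170, sigma=10)
between_160_and_180 = heights.cdf(180) - heights.cdf(160)
print(f"between 160 and 180 cm: {between_160_and_180:.4f}")
```

The proportion within 2 SDs is slightly above 95%, and the proportion within 3 SDs is closer to 99.7% than 99%.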
95% reference range/central reference range
=
= range of expected normal values in a population, values that enclose 95% of the population (1.96, or approximately 2, SDs either side of mean)
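A minimal sketch of a 95% reference range, assuming the variable is normally distributed. The mean and SD (a systolic blood pressure of 120 and 10 mmHg) are made-up numbers for illustration.

```python
from statistics import NormalDist

mean_sbp = 120.0   # hypothetical mean (mmHg)
sd_sbp = 10.0      # hypothetical SD (mmHg)

# Multiplier that leaves 2.5% in each tail of a normal distribution (about 1.96)
z = NormalDist().inv_cdf(0.975)

lower = mean_sbp - z * sd_sbp
upper = mean_sbp + z * sd_sbp
print(f"z = {z:.2f}; 95% reference range: {lower:.1f} to {upper:.1f} mmHg")
```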
Assumption of Normality
=, 2
= assuming values of a continuous variable are normally distributed before calculations
Distribution may be skewed if:
1. Mean and median are very different
2. SD is very large relative to the mean, so the 95% reference range includes impossible values (e.g. negative values for a variable that cannot be negative)
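Both warning signs can be checked quickly on a sample. The data below (length of hospital stay in days) are made up, and the factor 1.96 assumes a normal distribution, which is exactly the assumption being questioned.

```python
import statistics

# Made-up lengths of hospital stay in days (cannot be negative)
stay_days = [1, 1, 2, 2, 3, 3, 4, 5, 9, 30]

mean = statistics.mean(stay_days)
median = statistics.median(stay_days)
sd = statistics.stdev(stay_days)

lower = mean - 1.96 * sd   # lower limit of the 95% reference range
upper = mean + 1.96 * sd

print(f"mean={mean}, median={median}, SD={sd:.2f}")
print(f"95% reference range: {lower:.1f} to {upper:.1f}")
print("mean and median differ:", mean != median)
print("range includes impossible values:", lower < 0)
```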
Aggregated Data
=
= units of observation are combined groups, not individuals (data are not at individual level)
Univariate analysis
=
= describes single variable
Bivariate analysis
=,
= relationship between 2 variables
- exposure –> outcome, test hypothesis
When both variables categorical:
4
display relationship by cross-tabulating in a contingency table
- rows: exposure
- columns: outcomes ("no outcome" column can be omitted if percentages are shown)
used to calculate odds ratios
Categorical Measures of association
3
- odds ratio = strength of association between variables (yes/no –> odds of outcome in exposed/odds of outcome in unexposed)
- risk ratio (only in longitudinal)
- prevalence ratio (good for cross-sectional)
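A short sketch of these measures computed from a 2x2 contingency table. The counts are made up. The prevalence ratio uses the same arithmetic as the risk ratio; the difference lies in the study design the counts come from.

```python
# Made-up 2x2 contingency table (rows: exposure, columns: outcome)
exposed_outcome, exposed_no_outcome = 30, 70
unexposed_outcome, unexposed_no_outcome = 10, 90

# Odds = number with outcome / number without outcome, in each exposure group
odds_exposed = exposed_outcome / exposed_no_outcome
odds_unexposed = unexposed_outcome / unexposed_no_outcome
odds_ratio = odds_exposed / odds_unexposed

# Risk (or prevalence) = number with outcome / total, in each exposure group
risk_exposed = exposed_outcome / (exposed_outcome + exposed_no_outcome)
risk_unexposed = unexposed_outcome / (unexposed_outcome + unexposed_no_outcome)
risk_ratio = risk_exposed / risk_unexposed

print(f"odds ratio = {odds_ratio:.2f}")
print(f"risk ratio (or prevalence ratio) = {risk_ratio:.2f}")
```

A ratio of 1 would mean no association between exposure and outcome.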
When both variables numerical
Scatterplot
- x-axis: exposure
- y-axis: outcome
Numerical Measures of Association
===,4
r = correlation coefficient = strength of linear association between two continuous variables = number of SDs that outcome changes for a 1 SD change in exposure
- always between -1 and 1
- r < 0: inverse correlation
- r = 0: no linear association
- r > 0: positive correlation
- r = 1: perfect positive correlation, straight line
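A minimal sketch of the correlation coefficient, written out by hand so the calculation is visible. The function name `pearson_r` and the data are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

exposure = [1, 2, 3, 4, 5]   # made-up x-variable
outcome = [2, 4, 5, 4, 5]    # made-up y-variable
r = pearson_r(exposure, outcome)
print(f"r = {r:.3f}")

# Points lying exactly on a straight line give r = 1 (rising) or r = -1 (falling)
print(pearson_r([1, 2, 3], [2, 4, 6]), pearson_r([1, 2, 3], [6, 4, 2]))
```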