Summarising and displaying data: Flashcards
—- are scales with an underlying defined
unit.
example:
– A count (number of children)
– An accepted unit
* Years
* Metres
* Euros
these scales can be —- or —-
numeric scales
continuous or discrete
true or false:
- Many things cannot have a defined unit,
such as depression, satisfaction, pain
-We recognise that people can be satisfied, or in pain, to a
greater or lesser extent
-The problem is measuring these concepts without a defined
unit
true
—– Used to measure relative quantity
ordinal scales
age measured in years and time measured in days are examples of
defined units
– Severity of pain: mild, moderate, severe
– Alcohol consumption: none, low, high
– Quality of life score: 0, 1, 2, …, 10
are examples of:
ordinal scales (check slide 12)
Numeric and ordinal scales are labels that tell us —-; the more basic question is —, for which —- is the basis of measurement
how much
what
classification
Labelling schemes that classify people or things or events are —-
examples are:
nominal measurement scales
– Disease classification schemes, e.g. ICD-10 (International Classification
of Diseases)
– Eye color: Blue, green, brown, hazel, gray
– Types of activity: sitting, walking, cycling, swimming, other
nominal measurement scales tell us — of thing something is, and are based on an ——
what kind
agreed classification
Some scales have only two labels; these are called —-
dichotomous scales
– Eye color: Blue, green, brown, hazel, gray
– Types of activity: sitting, walking, cycling, swimming, other
are examples of
nominal measurement scales, as are blood group types
– Disease status: Presence or absence of disease
– Lab test result: Positive or negative
– Mortality : alive or dead status
– Exam result: Pass or fail
are examples of
dichotomous scales (the simplest sort of scale)
types of variables summary:
1- —- variables
– Defined units, tell us how much in an absolute sense
– Can be continuous or discrete
2- Categorical variables
*—– scales
– Tell us how much, but in a relative rather than absolute sense
*—– scales
– Classify. Tell us what rather than how much
– Called —- scale when only two values
numeric
ordinal
nominal
dichotomous
Knowing the measurement scale of data informs us as to how we should —- and — it
display and summarise it
Summaries are — than the original because of what they leave out
* So any summary is a —- of the original
things can go wrong by:
1- We present aspects of the data that lead to the wrong conclusion
2- We leave out some important aspect of the data, leading to the reader drawing the wrong conclusion
- In practice, data analysts will examine the data in —– ways to make sure to avoid these pitfalls when reporting on them
smaller
simplification
different
The most basic summary statistic is a —-
frequency, given as a count or a percentage (check the stacked histogram graph); frequencies can be listed in a frequency table
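A frequency table can be sketched in a few lines. This is only an illustration: the eye-colour observations below are made up, and the name `frequency_table` is an assumption, not something from the slides.

```python
from collections import Counter

# Hypothetical eye-colour observations (illustrative data, not from the slides)
eye_colours = ["blue", "brown", "brown", "green", "blue", "brown", "hazel", "brown"]

counts = Counter(eye_colours)
total = len(eye_colours)

# Frequency table: category -> (count, percentage of all observations)
frequency_table = {
    colour: (n, 100 * n / total) for colour, n in counts.most_common()
}

for colour, (n, pct) in frequency_table.items():
    print(f"{colour:<6} {n:>3} {pct:5.1f}%")
```

Counts give the precise numbers; the percentages make categories comparable across groups of different sizes.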
rule of thumb:
—- for precise information
—- for patterns and understanding
numbers
graphs
A simple graph displaying
frequencies of categories is —-
– —- is preferable but often they are
presented —-
bar graph
horizontal
vertical
When the data are measured on a
continuous scale but we have
relatively small amounts of data, we
can display the data as —
dots, also known as a dot plot; this can be used, for example, for heights of women and men from a small study
– For men or women with the
same height, the dots are shown
beside each other
With —- amounts of data, we don’t need to rely on the summaries, we can simply show all the data in a plot
* But with —- datasets, the dots become too numerous and we rely more and more on summaries
small
larger
death in intensive care unit:
Patients had their risk of death calculated using —– scores
* These scores combine — to produce an —- of the —- of death
* The study also looked at length of stay
* These two variables are length of stay and APACHE-II scores
- the dots shown will be the —–
APACHE-II
risk indicators
overall prediction
chance
predicted risk of
death (APACHE-II scores)
(check slides 27 and 28)
Summarising the risk scores using % cut-offs :
- These summaries don’t show us — the data, but they give us a good idea of —
- they show —-
- and give some idea of how scores – around that
all
key markers
middle point/halfway
vary
A —– is a value that cuts off a specified percentage of the data
percentile, also called a quantile (check graph 29)
—– is the half-way point of the data values.
– Strictly speaking, half of the values lie —— the median
– The — percentile!
median
at or below
50th
(check slide 31)
—- is the average; it indicates approximately where the data are located on the number line.
It is calculated as:
mean
“Sum up the individual values then
divide by the number of them”
The mean can be misleading, though (check bar graph 35)
—-: a “tail” of exceptionally long stay times pushes the mean up
—– statistics are summary statistics that maintain their properties even if the underlying distributional assumptions are incorrect (not to be confused with a “robust summary” of a study report, which is a detailed summary of its objectives, methods, results, and conclusions)
outliers
(check slide 36 for more info)
robust summary
The mean is sensitive to —-,
while the median is a —- measure on which they have little effect
outliers
robust summary (the median is a robust statistic because it has a breakdown point of 50%)
– Omitting the four highest values moves the median from 3.55 to 3.50 days; that’s a change
of about an hour (which is very small in comparison with the effect on the mean)
* This explains the differences we see when we look at medians instead of means
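A minimal sketch of this robustness, using made-up lengths of stay (not the ICU study data): dropping the longest stays shifts the mean a great deal and the median hardly at all. The variable names are illustrative only.

```python
from statistics import mean, median

# Hypothetical lengths of stay in days, already sorted (not the study data)
stays = [1, 2, 2, 3, 3, 4, 4, 5, 40, 82]

# Same data with the two longest stays left out
trimmed = stays[:-2]

mean_all, mean_trimmed = mean(stays), mean(trimmed)
median_all, median_trimmed = median(stays), median(trimmed)

print(f"mean:   {mean_all} -> {mean_trimmed}")
print(f"median: {median_all} -> {median_trimmed}")
```

In this toy data the mean falls from 14.6 to 3 days once the tail is removed, while the median only moves from 3.5 to 3.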
The range gives us an idea of —, though it is not always a good measure; it needs 2 pieces of information, which are: — and —
these two values are the ones most likely to be —- cases or —-
- the range is not —; it will be affected by —
variability
biggest and smallest values
atypical cases and errors
robust
outliers
(The range of length of stay is 82 days, but it’s only 38 days if we ignore the
longest-staying patient, and 19 days if we ignore the three longest-staying
patients)
- A quarter of all patients scored 17 or less, and three quarters scored 66 or less
- So the middle 50% of patients scored between 17 and 66
– That’s a range of 49 (66 – 17)
- This is called the —–, which will be — to outliers because they occur at the — and not in the —-
interquartile range (abbreviated as IQR)
robust
extremes
middle
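The quartiles, the IQR and the range can be computed as below. This is a sketch under stated assumptions: the scores are invented so that the quartiles land on 17 and 66, and `statistics.quantiles` with `method="inclusive"` is only one of several conventions for interpolating percentiles, so other software may give slightly different cut-offs.

```python
from statistics import quantiles

# Hypothetical risk scores (illustrative, not the study data)
scores = [5, 10, 17, 20, 35, 50, 66, 70, 90]

# n=4 returns the three quartile cut points: 25th, 50th and 75th percentiles
q1, med, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1
value_range = max(scores) - min(scores)

# Replace the largest score with an extreme value and recompute
with_outlier = scores[:-1] + [500]
q1_o, _, q3_o = quantiles(with_outlier, n=4, method="inclusive")
iqr_outlier = q3_o - q1_o
range_outlier = max(with_outlier) - min(with_outlier)

print(f"IQR: {iqr} -> {iqr_outlier}; range: {value_range} -> {range_outlier}")
```

The extreme value sits at the edge of the data, so it changes the range but leaves the quartiles, and hence the IQR, where they were.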
—– is the average of the squared differences from the mean, a measure of how far a set of numbers is —–
– No-one apart from professional statisticians understands it fully
- Example: fasting blood sugar was checked for 10 employees
– The results are : 5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10 mmol/l
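Mean = 7.75
Variance = [(5.5 − 7.75)^2 + (6 − 7.75)^2 + (6.5 − 7.75)^2 + … + (10 − 7.75)^2] / 10
= 20.625 / 10
= 2.063
(dividing by n − 1 = 9, the sample variance, would give 20.625 / 9 = 2.292 instead)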
variance
spread out
- The square root of the variance is the —-
– It is in the same units as the —
- A — SD indicates that the data points tend to be very close to the mean (and to
each other)
- A —- SD indicates that the data points are very spread out from the mean and from each other.
- blood sugar example:
standard deviation (SD)
original values
small
large
SD = square root of 2.063 = 1.44
(check slides 46, 37)
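The blood sugar figures can be checked numerically. In this sketch, `pvariance` and `pstdev` divide by n (the "average of the squared differences"), while `variance` divides by n − 1; which divisor the slides intend is an assumption here, so both are shown.

```python
from statistics import mean, pstdev, pvariance, variance

# Fasting blood sugar for the 10 employees, in mmol/l
blood_sugar = [5.5, 6, 6.5, 7, 7.5, 8, 8.5, 9, 9.5, 10]

avg = mean(blood_sugar)
var_n = pvariance(blood_sugar)         # sum of squared differences / n
var_n_minus_1 = variance(blood_sugar)  # sum of squared differences / (n - 1)
sd_n = pstdev(blood_sugar)             # square root of var_n

print(f"mean = {avg}")
print(f"variance (divide by n)     = {var_n:.3f}")
print(f"variance (divide by n - 1) = {var_n_minus_1:.3f}")
print(f"SD (from divide-by-n)      = {sd_n:.2f}")
```

Dividing by n reproduces the 2.063 and 1.44 quoted in the cards; dividing by n − 1 gives a variance of about 2.292.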
Box plots are useful for —- and present — key summary statistics for each group, which are:
- shown in a —-
- they display the —- by building a box around the —- and — percentiles
comparing groups
5
The minimum,
25th percentile,
50th percentile (median),
75th percentile and maximum
simple visual display
interquartile range
25th and 75th
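The five statistics behind each box can be sketched as follows. The heights are invented, `five_number_summary` is a hypothetical helper name, and the "inclusive" percentile method is an assumption (plotting libraries may interpolate quartiles slightly differently).

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, 25th percentile, median, 75th percentile, maximum)."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

# Hypothetical heights in cm for two groups (illustrative data only)
groups = {
    "women": [155, 158, 160, 162, 165, 168, 170],
    "men": [165, 170, 172, 175, 178, 180, 188],
}

# One summary per group, which is what side-by-side box plots display
summaries = {name: five_number_summary(h) for name, h in groups.items()}

for name, summary in summaries.items():
    print(name, summary)
```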
biomedical example :
-Mass spectrometry experiments where proteins are —- in —- samples from patients
- Prior to identifying biomarkers of
interest:
1– Boxplots for each sample can be used to identify —- with sample preparation or with calibration of the mass spectrometer
– Based on this, samples may then be —- or—
– Note the whiskers, extending to min & max
quantified
biological
problems
excluded
re-aligned (normalization)
An —– is a data point which is abnormally distant from the rest of the data
- we can modify a — to show outliers:
– using a —- that is based on the IQR, we change the length of the whiskers
– Individual points —- the whiskers are shown as outliers
- We can then further investigate the nature of the outliers:
– Often they are valid observations: reporting —– is recommended
outlier
box plot
detection rule
outside
robust summary statistics
(check slides 51, 52, 53)
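A sketch of one IQR-based detection rule. The cards do not say which rule the slides use; the 1.5 × IQR fences below are a widely used convention and are an assumption here, as are the data and variable names.

```python
from statistics import quantiles

# Hypothetical lengths of stay in days (illustrative, not the study data)
stays = [1, 2, 2, 3, 3, 4, 4, 5, 40, 82]

q1, _, q3 = quantiles(stays, n=4, method="inclusive")
iqr = q3 - q1

# Assumed rule: points more than 1.5 * IQR beyond the quartiles are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in stays if x < lower_fence or x > upper_fence]
inside = [x for x in stays if lower_fence <= x <= upper_fence]

# Whiskers now stop at the most extreme points that are not outliers
whisker_low, whisker_high = min(inside), max(inside)

print(f"fences: {lower_fence} to {upper_fence}")
print(f"whiskers: {whisker_low} to {whisker_high}; outliers: {outliers}")
```

With this rule the whiskers end at 1 and 5 days, and the two long stays are drawn as individual points rather than stretching the upper whisker.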
- In the examples for Length of stay in ICU and BMI, there appeared to be an excess of high values
– An excess of low or high values is called —-
– These may be visualised as —-
- A special case of data without skewness is the —-
skewness
dotplots, boxplots and histograms
normal distribution
true or false:
The importance of the normal distribution is that it fits
many natural phenomena
* Many things we measure are
approximately normal
– e.g. blood pressure & height
* But nothing is truly normal (it is a
mathematical concept)
true