Statistics Flashcards
Why is data analysis important?
Need to understand what question you want to address.
Need to understand what data you need to answer a question.
Understand what data you have and how reliable it is.
Don’t be fooled by the questions and/or the data.
How should you be careful with data?
Be careful how you formulate the question.
e.g. Do patients have a different heart rate before and after taking beta blockers?
Does one strain of bacteria have a smaller/greater growth rate than another?
Is there a relationship between heart rate and blood pressure in elderly patients?
What is a variable?
A variable is something that takes on different values that can be measured or counted.
e.g. sex, height.
What are the types of variable?
Categorical and numerical.
What are categorical variables?
Qualitative:
Binary - 2 categories e.g. yes/no
Nominal - unordered, more than 2 categories e.g. blood group
Ordinal - ordered categories, order matters e.g. disease stage
What are numerical variables?
Quantitative:
Discrete - integer values, typically counts e.g. number of bacterial colonies
Continuous - uninterrupted values, any value within a range e.g. weight or height.
What are descriptive statistics?
Describe and summarise data:
Mean
Median
Mode
What is the mean?
The average.
Add the values of a set of observations together and divide by the number of observations.
If the data are skewed, the mean is pulled towards the tail.
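A minimal sketch of the calculation described above, using Python's standard statistics module. The heart rate values are hypothetical and only serve as an illustration.

```python
from statistics import mean

# Hypothetical resting heart rates (beats per minute)
heart_rates = [62, 70, 74, 68, 76]

# Add the observations together and divide by the number of observations
manual_mean = sum(heart_rates) / len(heart_rates)

# The library function should agree with the manual calculation
library_mean = mean(heart_rates)

print(manual_mean, library_mean)
```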
What is the median?
The exact middle value.
Sort the observations into ascending order; if there is an odd number of observations, take the middle value.
If there is an even number, take the middle two values and average them.
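The steps above can be sketched as a small function. The name middle_value and the example numbers are made up for illustration; statistics.median is the standard library equivalent.

```python
from statistics import median

def middle_value(values):
    """Median by hand: sort, then pick or average the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: the single middle value
        return ordered[mid]
    # Even number of observations: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = [7, 3, 9, 1, 5]        # sorted: 1, 3, 5, 7, 9
even_example = [7, 3, 9, 1, 5, 11]   # sorted: 1, 3, 5, 7, 9, 11

print(middle_value(odd_example))   # 5
print(middle_value(even_example))  # 6.0
```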
When should mean or median be used?
Median is robust to extreme values, whereas the mean shifts towards them.
Mean is best for symmetric distributions without outliers (extreme values).
Median is useful for skewed distributions or data with outliers.
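A quick illustration of the difference, assuming a hypothetical set of heart rates with one extreme value added. The numbers are invented for the example.

```python
from statistics import mean, median

# Hypothetical heart rates, roughly symmetric
values = [62, 68, 70, 74, 76]

# The same data with one extreme value (outlier) added
with_outlier = values + [190]

# Without the outlier, mean and median agree
print(mean(values), median(values))              # 70 70

# With the outlier, the mean jumps but the median barely moves
print(mean(with_outlier), median(with_outlier))  # 90 72.0
```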
What is the mode?
The value or values in the data that occur most frequently.
There can be multiple modes, e.g. a distribution with two modes is bimodal.
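A small sketch using statistics.multimode (Python 3.8 or later), which returns every value tied for most frequent. The blood group data are hypothetical.

```python
from statistics import multimode

# Hypothetical blood groups for six patients
blood_groups = ["A", "O", "B", "O", "A", "AB"]

# "A" and "O" both occur twice, so this sample has two modes
modes = multimode(blood_groups)
print(modes)  # ['A', 'O']
```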
How is variability measured?
The spread of data can be assessed by:
Variance
Standard deviation
Range
Interquartile range (IQR)
What is variance?
Variance measures how far each observation deviates from the mean.
The larger the deviations, the greater the variance.
Can’t use the average of the deviations itself, as positive and negative deviations will cancel out.
Overcome by squaring the deviations; the variance is the mean of the squared deviations.
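A worked sketch of the calculation, with made-up numbers. It follows the card and divides by the number of observations, which is the population variance (statistics.pvariance). The sample variance (statistics.variance) divides by n - 1 instead and would give a slightly larger value.

```python
from statistics import pvariance

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations
mean_value = sum(values) / len(values)

# Deviations of each observation from the mean
deviations = [v - mean_value for v in values]

# Positive and negative deviations cancel, so their sum is zero
print(sum(deviations))  # 0.0

# Square the deviations, then take the mean of the squares
squared = [d ** 2 for d in deviations]
variance = sum(squared) / len(values)
print(variance)  # 4.0
```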
What is standard deviation?
The square root of variance
Same units as raw data, easier to interpret than variance
Represents roughly the typical distance of observations from the mean.
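A short sketch with hypothetical values, taking the square root of the population variance. statistics.pstdev gives the same result directly; statistics.stdev would be the sample version.

```python
import math
from statistics import pstdev, pvariance

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

# Standard deviation is the square root of the variance
sd_from_variance = math.sqrt(pvariance(values))

# The library function computes it in one step
sd_direct = pstdev(values)

print(sd_from_variance, sd_direct)  # 2.0 2.0
```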
What is the range?
The difference between the largest and smallest observations.
Shows the full spread of the data, but is sensitive to extreme values.
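A minimal sketch with hypothetical values. The second list adds one extreme observation to show how strongly a single value can change the range.

```python
values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

# Range: largest observation minus smallest observation
data_range = max(values) - min(values)
print(data_range)  # 7

# One extreme value changes the range dramatically
with_extreme = values + [40]
range_with_extreme = max(with_extreme) - min(with_extreme)
print(range_with_extreme)  # 38
```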
What is a percentile?
A value on a scale of one hundred that indicates the percent of a distribution that is equal to or below it.
Sort the observations into ascending order.
A percentile is the percentage of the data up to that value.
The 50th percentile is the median.
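A sketch of the definition above: the percentage of observations equal to or below a given value. The function name percent_at_or_below and the scores are hypothetical. Statistical packages use several slightly different percentile conventions, so results on small samples can differ.

```python
def percent_at_or_below(values, x):
    """Percentage of observations that are equal to or below x."""
    ordered = sorted(values)  # ascending order, as on the card
    count = sum(1 for v in ordered if v <= x)
    return 100 * count / len(ordered)

# Hypothetical test scores
scores = [70, 10, 40, 100, 20, 90, 50, 30, 80, 60]

print(percent_at_or_below(scores, 50))  # 50.0
print(percent_at_or_below(scores, 90))  # 90.0
```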
What is the interquartile range?
IQR is the difference between the 75th percentile (3rd quartile) and the 25th percentile (1st quartile).
Includes the middle 50% of data.
Visualisation: box-and-whisker plot (boxplot).
If the data are very spread out, the whiskers will be far apart.
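A minimal sketch using statistics.quantiles (Python 3.8 or later) with its default method. The data are hypothetical, and other software may use a different quartile convention and give slightly different quartiles for the same data.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # hypothetical observations

# n=4 splits the data into quarters and returns the three cut points
q1, q2, q3 = quantiles(data, n=4)

# IQR: 3rd quartile minus 1st quartile, spanning the middle 50% of the data
iqr = q3 - q1
print(q1, q3, iqr)  # 3.0 9.0 6.0
```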
Why is visualising data important?
Provide summary pictures, spot patterns, trends and anomalies in data before formal statistical analyses.
What are the requirements for visualising data?
Axes need labels and units of measurement where relevant
Include title and legend to indicate relevant elements of the plot.
Make the scale of your axes relevant to the data you are trying to portray.
If you are comparing multiple plots, make sure axes scales are the same.
What is a histogram?
Plots the distribution of a numeric variable.
Counts how often values fall within each interval (bin) across the range of the data.
A normal distribution is symmetric.
Some distributions are not symmetric; they are skewed.
e.g. negative skew - long tail towards lower values (left tail).
Positive skew - long tail towards higher values (right tail).
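The counting behind a histogram can be sketched without a plotting library. The heights and the bin width of 10 cm are assumptions chosen for illustration; the printed rows of # characters act as a rough text histogram.

```python
from collections import Counter

# Hypothetical heights in centimetres
heights_cm = [151, 158, 162, 165, 167, 168, 171, 173, 174, 179, 182, 195]
bin_width = 10  # assumed bin width

# Assign each value to the lower edge of its bin, then count per bin
bin_counts = Counter((h // bin_width) * bin_width for h in heights_cm)

# Print one row per bin, one '#' per observation
for start in sorted(bin_counts):
    print(f"{start}-{start + bin_width - 1}: {'#' * bin_counts[start]}")
```

In this made-up sample the single value in the 190-199 bin sits well above the rest, the kind of tail a positive skew would show.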
What are boxplots?
Used for continuous data
Easy to compare groups
Box is IQR
Whiskers indicate the maximum and minimum values, or the most extreme values that are not outliers.
Median is the line inside the box
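A sketch of the numbers a boxplot draws, using hypothetical data. It assumes the common convention that values more than 1.5 x IQR beyond the quartiles are treated as outliers; the cards do not specify a rule, and plotting tools may differ.

```python
from statistics import quantiles

data = [2, 4, 4, 4, 5, 5, 7, 9, 20]  # hypothetical observations

# Box: 1st quartile to 3rd quartile, with the median marked inside
q1, med, q3 = quantiles(data, n=4)
iqr = q3 - q1

# Assumed outlier rule: beyond 1.5 x IQR from the box edges
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in data if v < lower_fence or v > upper_fence]

# Whiskers end at the most extreme values that are not outliers
inside = [v for v in data if lower_fence <= v <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, med, q3)                # 4.0 5.0 8.0
print(whisker_low, whisker_high)  # 2 9
print(outliers)                   # [20]
```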
What are scatter diagrams?
Use when both variables are continuous (can take any value within a range, not a finite set of values)
Used to illustrate a relationship between two variables
When one variable increases, does the other increase, decrease or stay the same?
Use to explore the data before formal statistical analysis.