Stats Flashcards
Explain the two categories of data
Categoric:
Nominal, binary and ordinal
Numeric:
Continuous and discrete
Give 3 ways of describing categoric data
Example scenario:
Total participants 707, those using the drug and have MI = 31, those using the drug and no MI = 310
Those using placebo and have MI = 61, those using placebo and no MI = 305
Risk for MI with drug = 31/341 = 0.091 (9.1%, multiply by 100 to get the percentage)
Risk for MI with placebo = 61/366 = 0.167 (16.7%)
Odds for MI with drugs = 31/310 = 0.1
Odds for MI with placebo = 61/305 = 0.2
(Absolute) risk difference = 0.167 - 0.091 = 0.076 (7.6%), this means the risk with the placebo is 7.6 percentage points higher than with the drug
Usually just called the risk difference, but "absolute" is added when we do not worry about the minus sign
(Relative) risk ratio = 0.167/0.091 = 1.835 (183.5%) (placebo on top so it becomes the focus group), this means the risk is 83.5% higher with the placebo than with the drug
0.091/0.167 = 0.545 (54.5%) (drug on top so it becomes the focus group), this means the risk is 45.5% lower with the drug than with the placebo
Usually referred to as the risk ratio (RR), also called the relative risk
Odds ratio = 0.2/0.1 = 2 (with the placebo as the focus group), this means the odds have doubled, so there is a 100% increase in the odds of having an MI on the placebo compared to drug A
Odds ratio = 0.1/0.2 = 0.5 (with drug A as the focus group), this means the odds have halved, so there is a 50% decrease in the odds of an MI on drug A compared to the placebo
How to find the risk ratio for the other focus group, e.g. in the drug A and placebo scenario
If the risk ratio with the placebo as the focus group is = 1.835
To find the risk ratio with drug A as the focus group, we do 1/1.835 which gives 0.545.
Basically finding the reciprocal
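A short Python sketch of the calculations in the example scenario, using the counts given on the card. The variable names are illustrative only. Because it divides the unrounded risks, the risk ratio comes out at about 1.833 rather than the 1.835 obtained from the rounded values.

```python
# Counts from the example scenario (drug vs placebo, MI vs no MI).
drug_mi, drug_no_mi = 31, 310
placebo_mi, placebo_no_mi = 61, 305

# Risk = events / everyone in the group.
risk_drug = drug_mi / (drug_mi + drug_no_mi)
risk_placebo = placebo_mi / (placebo_mi + placebo_no_mi)

# Odds = events / non-events.
odds_drug = drug_mi / drug_no_mi
odds_placebo = placebo_mi / placebo_no_mi

# Comparisons, with the placebo as the focus group.
risk_difference = risk_placebo - risk_drug
risk_ratio_placebo = risk_placebo / risk_drug
odds_ratio_placebo = odds_placebo / odds_drug

# Swapping the focus group is the same as taking the reciprocal.
risk_ratio_drug = 1 / risk_ratio_placebo
odds_ratio_drug = 1 / odds_ratio_placebo

print(f"risk (drug)          = {risk_drug:.3f}")
print(f"risk (placebo)       = {risk_placebo:.3f}")
print(f"risk difference      = {risk_difference:.3f}")
print(f"risk ratio (placebo) = {risk_ratio_placebo:.3f}")
print(f"risk ratio (drug)    = {risk_ratio_drug:.3f}")
print(f"odds ratio (placebo) = {odds_ratio_placebo:.3f}")
print(f"odds ratio (drug)    = {odds_ratio_drug:.3f}")
```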
Give two ways to measure and present categorical data
Pie charts
Bar charts
Give three ways to measure and present numerical (quantitative) data
Dot plots
Histograms
Box and whisker plots (box plots)
Give a way to present an association between two continuous variables
Scatter plots
Give some characteristics of a histogram
It can be used to show normal distribution (also called Gaussian distribution)
Can show skewed data which is when data is not symmetrical
Negatively skewed data = has long low left tail and peaks at high values on the right
Positively skewed data = has long low right tail and peaks at low values on the left
Give some characteristics of a box plot
The box contains the middle 50% of the data
The line in the box plot shows the median value
Outliers (which are values more than 1.5 box lengths beyond the upper or lower edge of the box) are plotted as dots outside of the whiskers
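A minimal Python sketch of the box plot ingredients: the quartiles, the box length and the 1.5 box length rule for outliers. The data values are made up, and the "inclusive" quartile method is an assumption; different software uses slightly different quartile conventions.

```python
import statistics

data = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # made-up values for illustration

# Lower quartile, median and upper quartile (the edges and line of the box).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
box_length = q3 - q1  # the box holds the middle 50% of the data

# Values more than 1.5 box lengths beyond the box edges count as outliers.
lower_fence = q1 - 1.5 * box_length
upper_fence = q3 + 1.5 * box_length
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1, median, Q3:", q1, median, q3)
print("box length:", box_length)
print("outliers:", outliers)
```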
Give some characteristics of scatter plots
The independent variable is on the X axis and is usually what the experimenter changes
The dependent variable is on the Y axis and is usually the response to what the experimenter changes
Give three ways that measure the spread of data
Range
Inter-quartile range
Standard deviation
What is the variance
Variance = standard deviation ^2 (squared)
Standard deviation = square root of variance
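A quick Python check of the link between the two, on made-up data. It assumes the sample versions (dividing by n - 1), which the card does not specify; the relationship is the same for the population versions.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

variance = statistics.variance(values)  # sample variance (divides by n - 1)
sd = statistics.stdev(values)           # sample standard deviation

print(f"variance           = {variance:.3f}")
print(f"standard deviation = {sd:.3f}")

# The two statements on the card, checked numerically.
print("sd squared equals variance:", math.isclose(sd ** 2, variance))
print("sqrt(variance) equals sd:  ", math.isclose(math.sqrt(variance), sd))
```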
What does a large or small standard deviation show
A small standard deviation shows that any random value picked is likely to be close to the mean so small spread of data
A large standard deviation shows that any random value picked is likely to be further from the mean so large spread of data
What is the best method to use if there is a symmetric distribution of data
Mean and the standard deviation
What is the best method to use if the distribution of data is non-symmetric
Median and the interquartile range
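A small Python illustration of why the median suits non-symmetric data, using two made-up data sets that differ only in one extreme value. The mean is pulled towards the extreme value while the median stays put.

```python
import statistics

symmetric = [4, 5, 6, 7, 8]   # made-up, symmetric around 6
skewed = [4, 5, 6, 7, 80]     # same data with one extreme value

for name, values in (("symmetric", symmetric), ("skewed", skewed)):
    mean = statistics.mean(values)
    median = statistics.median(values)
    print(f"{name}: mean = {mean}, median = {median}")
```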
Methods to summarise categoric variables
Proportion, percentage, risk and odds
Methods to summarise numerical (quantitative) data
Mean, median, range, interquartile range and standard deviation
Methods to quantify differences between two categorical variables
Absolute risk difference
Relative risk ratio
Odds ratio
Methods to quantify the relationship between two numeric variables
Pearson's correlation coefficient (r)
r must be between -1 and +1
+1 shows a positive linear correlation
0 shows no linear correlation
-1 shows a negative linear correlation
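A minimal Python sketch of how r can be calculated by hand, assuming the usual formula (sum of the products of deviations divided by the square root of the product of the summed squared deviations). The function name and data are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much x and y each vary on their own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))        # perfect positive line
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))        # perfect negative line
print(pearson_r([1, 2, 3, 4, 5], [1, 2, 3, 2, 1]))  # no linear trend
```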
Methods to calculate the difference between one categoric variable and one numeric variable
If both groups give symmetrical graphs (distribution of data), use the difference in means (mean - mean)
If one of the groups gives a non-symmetrical graph (distribution of data), use the difference in medians (median - median)
Give the percentage of a normal distribution that lies within 1, 2 and 3 standard deviations of the mean
1 standard deviation = 68%
2 standard deviations = 95.4%
3 standard deviations = 99.7%
How many standard deviations give 95% of the distribution in a graph
1.96 standard deviations either side of the mean give 95% of the distribution of the graph
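These percentages can be checked in Python with the standard library's NormalDist class, assuming a normal distribution. The helper name share_within is made up for illustration. The share within 3 standard deviations comes out at about 99.7%.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def share_within(k):
    """Share of a normal distribution within k SDs either side of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3, 1.96):
    print(f"within {k} SD of the mean: {share_within(k):.1%}")

# Going the other way: the number of SDs that leaves 2.5% in each tail.
print(f"multiplier for 95%: {std_normal.inv_cdf(0.975):.2f}")
```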
If you cannot use a mean for a certain set of data, would you still be able to use standard deviation for that data
No, the standard deviation would be affected by the same issue of being skewed by outliers
If possible, it is best to use the mean and standard deviation because they include all values in the data, so they are more powerful
Who is Sir Francis Galton and what are his contributions to statistics
1822-1911
Standard deviation, correlation, concepts of regression, medians and ranking
First weather map
How to cut a cake
Attractiveness of cities
What is standard error
It is an estimate of how precisely the sample mean represents the population mean
How is the standard error calculated
Standard error = standard deviation/ the square root of the sample size
When can the standard error not be used
When the standard deviation and the mean cannot be used due to how skewed the data is
What are the rules to use the standard error
The data has to be normally distributed
The sample size has to be large enough (more than 20 individuals)
Show how to calculate the confidence interval from the standard error
If the mean = 18,477, sample size (n) = 12 and standard deviation (SD) = 3,732
Standard error = standard deviation / the square root of the sample size
Standard error = 3,732 / √12 = 1077.3
To get a 95% confidence interval, use a multiplier of 1.96 (the number of standard deviations that covers 95% of a normal distribution)
To get a 99% confidence interval, use a multiplier of 2.58
Mean - (1.96 x standard error) = 18,477 - (1.96 x 1077.3) = 16,365
Mean + (1.96 x standard error) = 18,477 + (1.96 x 1077.3) = 20,589
95% confident that the true value of the mean lies between 16,365 and 20,589
If we wanted to get the 99% confidence, we would do mean - (2.58 x standard error) and mean + (2.58 x standard error)
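The same worked example as a short Python sketch, using the mean, standard deviation and sample size from the card. The function name confidence_interval is made up for illustration.

```python
import math

mean, sd, n = 18477, 3732, 12  # values from the worked example

# Standard error = standard deviation / square root of the sample size.
standard_error = sd / math.sqrt(n)

def confidence_interval(mean, standard_error, multiplier):
    """Return (lower, upper) as mean -/+ multiplier x standard error."""
    margin = multiplier * standard_error
    return mean - margin, mean + margin

low95, high95 = confidence_interval(mean, standard_error, 1.96)
low99, high99 = confidence_interval(mean, standard_error, 2.58)

print(f"standard error = {standard_error:.1f}")
print(f"95% CI: {low95:.0f} to {high95:.0f}")
print(f"99% CI: {low99:.0f} to {high99:.0f}")
```

The 99% interval is wider than the 95% interval because being more confident means allowing a larger range.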
What does a small standard error mean for a sample
Greater precision, so the results from the sample are more likely to be representative of the population
What does a small standard deviation mean
The values are less spread so there is less variability in the sample
What is correlation used to explore
- how two numeric continuous variables are related
- the strength of an association
Give the regression equation and what each part of it means
y = a + bx
y is the dependent variable (also called the outcome or response), the one we measure, e.g. blood pressure reading, pain score, hours of sleep
x is the independent variable (also called the predictor or explanatory variable), e.g. age, deprivation level and family history of illness
a is the y-intercept (also called the constant)
b is the coefficient - the change in y when we increase x by 1 unit
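A minimal Python sketch of fitting the intercept and coefficient to data. It assumes ordinary least squares, which the card does not name but is the usual way to fit a straight line. The function name and the age and blood pressure values are made up, chosen to lie exactly on a line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares estimates of the intercept a and coefficient b in y = a + bx."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # b: how much y changes, on average, when x increases by 1 unit.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # a: the value of y where the fitted line crosses x = 0.
    a = mean_y - b * mean_x
    return a, b

ages = [20, 30, 40, 50, 60]                 # x, the independent variable (made up)
blood_pressure = [110, 115, 120, 125, 130]  # y, the dependent variable (made up)

a, b = fit_line(ages, blood_pressure)
print(f"y = {a:.1f} + {b:.2f}x")

# Using the fitted line to predict y for a new x.
predicted = a + b * 45
print(f"predicted blood pressure at age 45: {predicted:.1f}")
```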
Give the name of the type of regression where the outcome is a single continuous variable e.g sleep time
Linear regression
Give the name of a regression which has a binary outcome e.g pass or fail
Logistic regression