Statistics Flashcards
Normal distribution
Describes the probability of getting a certain value in a population.
Symmetric around the mean and the median.
Doesn’t change with a change in sample size
What portion of the population is represented by mean +/- 1 standard deviation
Mean +/- 1 standard deviation = about 68% of the population
What portion of the population is represented by mean +/- 2 standard deviations
Mean +/- 2 standard deviations = about 95% of the population
What portion of the population is represented by mean +/- 3 standard deviations
Mean +/- 3 standard deviations = about 99.7% of the population
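These three figures can be checked directly. A minimal Python sketch, assuming a perfectly normal population; the helper name fraction_within is made up for illustration:

```python
import math

def fraction_within(k: float) -> float:
    """Fraction of a normal population within k standard deviations of the mean."""
    # For a standard normal variable Z, P(|Z| <= k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"mean +/- {k} SD covers {fraction_within(k):.1%} of the population")
```

The 68%, 95% and 99.7% on the cards are rounded versions of these values.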
What can you construct using normal distribution
Reference interval
Standard error
Not a measure of variability in the population.
It is the standard deviation of the sampling distribution. Measures the precision of the estimate (i.e. how reliable our sample mean is).
Inversely proportional to the square root of the sample size, so it shrinks as the sample gets larger.
Standard error - quantifies the variation in means from multiple sets of measurements.
Standard deviation - quantifies the variation within a set of measurements.
Define variance and standard deviation
Range, standard deviation and variance all measure the spread or variability of a data set, i.e. dispersion.
Range = biggest - smallest number.
Variance and standard deviation are closely related: variance is the SD squared.
Variance = average of the squared differences from the mean. Because the differences are squared, outliers have a large influence on it.
Standard deviation = square root of variance.
A measure of how spread out numbers are. Standard deviation is in the same units as the data, so it reads more like a typical distance from the mean.
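The definitions above can be worked through on a small data set. This is a sketch with made-up numbers, using the population formulas (divide by n) to match "average of the squared differences"; for a sample, statistics.variance and statistics.stdev divide by n - 1 instead.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements, mean = 5

data_range = max(data) - min(data)      # biggest - smallest
variance = statistics.pvariance(data)   # mean of squared differences from the mean
sd = statistics.pstdev(data)            # square root of the variance

print(f"range = {data_range}, variance = {variance}, SD = {sd}")
```

For this data the squared differences sum to 32, so the variance is 32 / 8 = 4 and the SD is 2.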
Define sensitivity
Proportion of people with the disease who test positive
Probability of a positive test given you have the disease
Define specificity
Proportion of people without the disease who test negative
Probability of a negative test given you don’t have the disease
Define PPV
Of those with a positive test, what proportion have the disease.
Not an intrinsic property of the test itself; influenced by the prevalence of disease.
Define NPV
Of those with a negative test, what proportion don’t have the disease. Like PPV, influenced by the prevalence of disease.
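The four definitions map onto the cells of a 2x2 table. A minimal sketch, with hypothetical counts and made-up function names, that also shows PPV changing with prevalence while sensitivity and specificity stay fixed:

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from 2x2 table counts."""
    return {
        "sensitivity": tp / (tp + fn),  # diseased who test positive
        "specificity": tn / (tn + fp),  # non-diseased who test negative
        "ppv": tp / (tp + fp),          # positive tests that are true disease
        "npv": tn / (tn + fn),          # negative tests that are truly disease-free
    }

def ppv_from_prevalence(sensitivity, specificity, prevalence):
    """PPV for a given prevalence, via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

metrics = diagnostic_metrics(tp=90, fp=50, fn=10, tn=850)  # hypothetical counts
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Same test (90% sensitive, 90% specific) at two different prevalences.
for prevalence in (0.01, 0.20):
    print(prevalence, round(ppv_from_prevalence(0.9, 0.9, prevalence), 3))
```

With a 90% sensitive and 90% specific test, the PPV is roughly 8% at 1% prevalence and roughly 69% at 20% prevalence.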
Define accuracy
How close a given set of measurements (observations or readings) are to their true value.
What is a continuous diagnostic test
A test which gives a continuous measure.
We decide where to put the cut-off (somewhat arbitrary).
If we alter the cut-off we change both sensitivity and specificity, in opposite directions.
Increasing the cut-off value increases specificity and decreases sensitivity (when higher values indicate disease).
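The trade-off can be seen by moving the cut-off across a small data set. A sketch with invented test values, assuming that higher values indicate disease and that a result at or above the cut-off counts as positive:

```python
# Hypothetical test values; higher values are assumed to indicate disease.
diseased = [6.1, 7.4, 5.2, 8.0, 6.8, 4.9]
healthy = [3.2, 4.1, 5.5, 2.8, 4.6, 3.9]

def sens_spec(cutoff):
    """Sensitivity and specificity when 'value >= cutoff' is a positive test."""
    sensitivity = sum(v >= cutoff for v in diseased) / len(diseased)
    specificity = sum(v < cutoff for v in healthy) / len(healthy)
    return sensitivity, specificity

for cutoff in (4.0, 5.0, 6.0):
    sensitivity, specificity = sens_spec(cutoff)
    print(f"cut-off {cutoff}: sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
```

As the cut-off rises from 4.0 to 6.0, sensitivity falls from 1.00 to 0.67 while specificity rises from 0.50 to 1.00.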
What is a sampling distribution
The distribution of a statistic (e.g. the sample mean), considered as a random variable, when derived from a random sample of size n.
How do you calculate the SE
SE = Standard deviation / square root of sample size
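The formula is short enough to write out directly. A minimal sketch with hypothetical data; the function names are illustrative, and the sample SD (n - 1 denominator) is assumed as the estimate of the population SD:

```python
import math
import statistics

def standard_error(sample):
    """SE of the mean: sample SD divided by the square root of the sample size."""
    # statistics.stdev is the sample SD (n - 1 denominator).
    return statistics.stdev(sample) / math.sqrt(len(sample))

def standard_error_from_sd(sd, n):
    """Same formula when the SD is already known."""
    return sd / math.sqrt(n)

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements
print(f"SE of the mean = {standard_error(sample):.3f}")

# Quadrupling the sample size halves the SE.
print(standard_error_from_sd(10, 25), standard_error_from_sd(10, 100))
```

With an SD of 10, going from 25 to 100 observations takes the SE from 2.0 to 1.0.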
If there is no true difference between populations then what is the mean of the sampling distribution of the difference in sample means
0
What is the P value
The probability we would observe a difference in the sample means this large or larger just by chance, i.e. if there were no true difference
Define the null hypothesis
No true difference.
Define the alternative hypothesis
True difference
What are the steps in a hypothesis test
- Set up a null and alternative hypothesis
- Set the significance level (0.05 usually)
- Calculate the likelihood of the observed effect under the assumption that the null hypothesis is true (p-value)
- If the data are too unusual to be consistent with the null hypothesis, conclude that it is not true: p < 0.05 reject the null hypothesis, p >= 0.05 do not reject the null hypothesis
(we don’t reject the alternative or accept the null)
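The steps above can be sketched end to end. The cards do not name a specific test, so this example swaps in an exact permutation test, which needs only the standard library; the data and the function name permutation_p_value are made up for illustration.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    extreme = 0
    count = 0
    # Under the null hypothesis the group labels are interchangeable, so every
    # way of choosing n_a of the pooled values for group A is equally likely.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        count += 1
    return extreme / count

# Step 1: null = no true difference in means; alternative = a true difference.
# Step 2: set the significance level.
alpha = 0.05
# Step 3: p-value under the null hypothesis (hypothetical measurements).
treated = [12, 14, 15, 16, 18]
control = [8, 9, 10, 11, 13]
p = permutation_p_value(treated, control)
# Step 4: decision.
decision = "reject the null hypothesis" if p < alpha else "do not reject the null hypothesis"
print(f"p = {p:.4f}: {decision}")
```

Here only 4 of the 252 possible relabellings give a difference as large as the observed one, so p = 4/252 and the null hypothesis is rejected at the 0.05 level.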
What is type 1 error
The error of concluding there is a difference when there is not (false positive)
What is type 2 error
The error of concluding there is no difference when in fact there is (false negative)
A statistically significant difference could mean either
There is a true difference OR there is no true difference, this study just observed unusual results (Type 1 error, false positive)
No statistically significant difference could mean either
There is no difference
OR
There is a true difference but we did not detect it with this study (Type 2 error, false negative)
What is continuous data
Numerical data that can take any value within a range (measured rather than counted)
e.g. height, weight, serum bilirubin
What is paired data
Two sets of observations in which the same variable has been measured on the same subjects, usually at two different times or under two different conditions.
e.g. before and after a treatment.
In clinical trials unpaired t tests are often used because patients are randomised into separate groups (e.g. type of anaesthetic used for an operation).
Paired t tests remove subject-to-subject variation.
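The effect of pairing shows up in the t statistic itself. A sketch with invented before/after readings; the unpaired version uses Welch's form (no equal-variance assumption), which is a choice made here rather than something the cards specify, and p-values are left out because they need the t distribution.

```python
import math
import statistics

# Hypothetical readings on the same six subjects before and after treatment.
before = [140, 152, 138, 160, 147, 155]
after = [135, 147, 134, 154, 143, 149]

def paired_t(x, y):
    """t statistic on the within-subject differences."""
    diffs = [a - b for a, b in zip(x, y)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se

def unpaired_t(x, y):
    """Welch t statistic, treating the two sets as independent groups."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se

print(f"paired t   = {paired_t(before, after):.2f}")
print(f"unpaired t = {unpaired_t(before, after):.2f}")
```

Every subject drops by 4 to 6 units, so the differences vary very little and the paired statistic is large. Treated as independent groups, the same 5-unit mean difference is swamped by the spread between subjects.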
What are ordinal scales
Have mutually exclusive classes but there is an order between them.
I.e. can be ranked or ordered; falls between 2 extremes
Can be given as frequencies
Mean should be calculated with caution.
What is nominal data
Also known as categorical or qualitative.
Consists of classifying the observations into mutually exclusive classes.
I.e. can be put into various categories but no specific hierarchy exists
e.g. sex, colour
Can be given as frequencies
Mean cannot be calculated.
What is central tendency and what are two measures commonly used. What are limitations to each of them.
Central tendency
- The average
- Commonly used ones are the mean and the median; the mode is used less frequently.
- Median: Middle number when the observations are put in order. Little affected by outliers, but does not use the size of every observation.
- Mean: Arithmetic mean. Best used when observations are symmetrical (i.e. evenly distributed); not as good when there are outliers, which can skew the result.
- Mode: Most frequent number.
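The effect of an outlier on each measure is easy to see on a small example. A sketch with made-up numbers:

```python
import statistics

values = [3, 4, 4, 5, 6, 7, 8]   # hypothetical, roughly symmetrical
with_outlier = values + [90]     # same data plus one extreme value

for label, data in (("no outlier", values), ("with outlier", with_outlier)):
    print(
        f"{label}: mean = {statistics.mean(data):.2f}, "
        f"median = {statistics.median(data)}, "
        f"mode = {statistics.mode(data)}"
    )
```

Adding the single value 90 moves the mean from about 5.3 to about 15.9, while the median only moves from 5 to 5.5 and the mode stays at 4.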
How is nominal data (and most of ordinal data) best expressed
Series of relative frequencies
Left tailed test
A left-tailed test is used when the alternative hypothesis states that the true value of the parameter specified in the null hypothesis is less than the null hypothesis claims.
Critical value will be negative for a z or t statistic.
What is a right tailed test
A right-tailed test is used when the alternative hypothesis states that the true value of the parameter specified in the null hypothesis is greater than the null hypothesis claims
Critical value will be positive for a z or t statistic
(the critical value is a threshold that is used to determine whether or not to reject the null hypothesis).
Direction of test is indicated in the alternative hypothesis and not in the null hypothesis.
Two tailed test
Non directional
Used when the alternative hypothesis states that the population parameter is DIFFERENT from the hypothesised value, in either direction.
Usually has 2 critical values
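The critical values for the three kinds of test can be computed for a z statistic. A minimal sketch, assuming a standard normal test statistic and a significance level of 0.05; the helper reject_null is illustrative only.

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist()  # standard normal: mean 0, SD 1

left_critical = z.inv_cdf(alpha)        # all of alpha in the left tail
right_critical = z.inv_cdf(1 - alpha)   # all of alpha in the right tail
two_tailed = (z.inv_cdf(alpha / 2), z.inv_cdf(1 - alpha / 2))  # alpha split in two

def reject_null(z_stat, tail):
    """Compare a z statistic with the critical value(s) for the chosen tail."""
    if tail == "left":
        return z_stat < left_critical
    if tail == "right":
        return z_stat > right_critical
    return z_stat < two_tailed[0] or z_stat > two_tailed[1]

print(f"left-tailed:  {left_critical:.3f}")
print(f"right-tailed: {right_critical:.3f}")
print(f"two-tailed:   {two_tailed[0]:.3f} and {two_tailed[1]:.3f}")
```

The one-tailed critical values are about -1.645 and +1.645, and the two-tailed pair is about -1.96 and +1.96. A z statistic of 1.8 would therefore be significant in a right-tailed test but not in a two-tailed one.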
What is a measure of skewness
The closer the mean and median are together, the more symmetrical the distribution.
We can get a crude measure of skewness by subtracting the median from the mean. A positive result suggests a right (positive) skew, a negative result a left (negative) skew.
What are measures of dispersion (including ones used in parametric and non-parametric data)
Explain how the observations are spread around the central measure.
Parametric - SD describes the dispersion of values around the mean.
Non-parametric - Percentiles describe the dispersion of values around the median.
How do you describe dispersion in non-normal data
Interquartile range.
It is defined as the difference between the 75th and 25th percentiles of the data.
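A short sketch of the calculation on invented, skewed data. It relies on the default "exclusive" method of statistics.quantiles; other quartile conventions can give slightly different cut points.

```python
import statistics

data = [2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 95]  # hypothetical, one extreme value

# n=4 splits the data into quarters and returns the three cut points.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # 75th percentile minus 25th percentile

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"SD for comparison = {statistics.stdev(data):.1f}")
```

The IQR here is 6, unaffected by the value 95, whereas the SD is inflated to several times that by the same single observation.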
Correlation coefficient
The degree of association between 2 variables,
expressed on a scale from -1 to +1
-1 = perfect negative correlation
0 = no correlation
+1 = perfect positive correlation
The correlation coefficient is a mathematical interpretation that is devoid of any cause or effect implications.
It is best to regard the correlation technique as a type of investigative analysis because it suggests areas for further research, rather than as testing hypotheses.
When should you question the use of standard deviation
A standard deviation that is greater than one-half of the value of the mean should raise questions about the adequacy of the standard deviation as a summary statistic.
How do we represent the probability of making a type 1 and type 2 error
Alpha = probability of a type 1 error
Beta = probability of a type 2 error
What is multiple testing for statistical significance
Any instance that involves the simultaneous testing of more than one hypothesis.
What is the power of a test
Expressed by the statistic 1 - beta.
Reflects the ability to reject a false null hypothesis (usually set between 70% and 90%).
What is delta
The difference in response rates between the groups that would be of biological or clinical interest.
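Alpha, beta, power and delta come together in a power calculation. This is a rough sketch, assuming a two-sided z test comparing two response rates with equal group sizes and a normal approximation (the negligible chance of rejecting in the wrong direction is ignored); the response rates and the function name approximate_power are hypothetical.

```python
import math
from statistics import NormalDist

def approximate_power(p_control, p_treatment, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) to detect delta = p_treatment - p_control."""
    delta = abs(p_treatment - p_control)
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(
        p_control * (1 - p_control) / n_per_group
        + p_treatment * (1 - p_treatment) / n_per_group
    )
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    return NormalDist().cdf(delta / se - z_crit)

power = approximate_power(0.5, 0.7, n_per_group=100)  # delta = 0.2
beta = 1 - power
print(f"power = {power:.2f}, beta = {beta:.2f}")
```

With 100 subjects per group and a delta of 0.2, the approximate power is about 0.84. A smaller delta or a smaller sample lowers it.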
Concordance
Agreement between measurements refers to the degree of concordance between two (or more) sets of measurements.
Statistical methods to test agreement are used to assess inter-rater variability or to decide whether one technique for measuring a variable can substitute another.
It is evaluated by tests such as Kendall’s tau.
Agreement means that measurements made by two (sometimes more than two) different observers, or by two different techniques, produce similar results.
Impact of multiple comparison and how to control
Multiple comparison
- The more statistical tests you do the more likely it is you’ll get a false positive result.
- Can use the Bonferroni correction, Šidák correction, Holm’s procedure or Tukey’s procedure to correct for multiple comparisons
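Both points can be put into numbers. A minimal sketch, assuming independent tests for the false positive calculation and using only the Bonferroni correction; the p-values and function names are made up.

```python
def family_wise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_significant(p_values, alpha=0.05):
    """Bonferroni: compare each p-value with alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

for m in (1, 5, 20):
    print(f"{m} tests: chance of a false positive = {family_wise_error_rate(0.05, m):.2f}")

p_values = [0.001, 0.012, 0.03, 0.2]  # hypothetical results of four tests
uncorrected = [p < 0.05 for p in p_values]
corrected = bonferroni_significant(p_values)
print(uncorrected, corrected)
```

With 20 independent tests at alpha = 0.05 the chance of at least one false positive is about 64%. For the four p-values above, the Bonferroni threshold is 0.0125, so 0.03 is no longer counted as significant.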
What is positive correlation
An increase in one variable is associated with an increase in the other.
What is negative correlation
An increase in one variable is associated with a decrease in the other.
Tests used for parametric and non-parametric correlation
Parametric - Pearson
Non-parametric - Spearman
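The difference between the two coefficients shows up on data that rise steadily but not in a straight line. A hand-rolled sketch with invented data; Spearman is computed here as Pearson on the ranks, with tied values sharing their average rank.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks starting at 1 for the smallest; ties get their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # hypothetical: always increasing, but curved

print(f"Pearson  = {pearson(x, y):.3f}")
print(f"Spearman = {spearman(x, y):.3f}")
```

Spearman is 1 because the ordering of x and y agrees perfectly. Pearson is about 0.98 because the relationship is not a straight line.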
Linear Regression
The regression equation represents how much y changes with any given change of x, and can be used to construct a regression line on a scatter diagram.
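A regression line y = intercept + slope * x can be fitted by ordinary least squares. A minimal sketch with invented points; the function name fit_line is illustrative.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx              # change in y per unit change in x
    intercept = my - slope * mx    # the line passes through (mean x, mean y)
    return slope, intercept

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical observations

slope, intercept = fit_line(x, y)
print(f"y = {intercept:.2f} + {slope:.2f} * x")
```

For these points the slope is 1.99 and the intercept 0.05, so each unit increase in x goes with an increase of about 2 in y.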
Multiple linear regression
Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable. You can use multiple linear regression when you want to know:
- How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
- The value of the dependent variable at a certain value of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
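Both uses can be sketched with two predictors. This is a pure-Python illustration that solves the least-squares normal equations; in practice a numerical library would normally be used. The data are generated from a known equation so the recovered coefficients can be checked, and all names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_multiple(rows, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefficients, row):
    """Expected value of the dependent variable for one set of predictor values."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], row))

# Hypothetical (rainfall, fertilizer) pairs; yield generated as 1 + 2*x1 + 3*x2.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 2)]
crop_yield = [1 + 2 * x1 + 3 * x2 for x1, x2 in predictors]

coefficients = fit_multiple(predictors, crop_yield)
print([round(c, 6) for c in coefficients])
print(round(predict(coefficients, (2, 2)), 6))
```

The fitted coefficients describe how strongly each predictor relates to the outcome, and predict gives the expected outcome at chosen predictor values. Real data would not fit exactly, and the residual scatter would also need to be examined.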