1B Statistics Flashcards
combining probabilities: OR
- If events A and B are mutually exclusive (e.g. the outcomes of a single dice roll):
p(A or B) = p(A) + p(B)
- If events A and B are not mutually exclusive:
p(A or B) = p(A) + p(B) - p(A and B)
(unless you subtract p(A and B), the probability of both occurring is counted twice, wrongly inflating the result)
Combining probabilities: AND
p(A and B) = p(A) x p(B|A)
p(B|A) = probability of B given A has occurred. If A has no impact on B then p(B|A) = p(B)
If a smoking cessation intervention results in a 0.4 chance of quitting, and Adam and Ben (who never meet) both receive it, what is the probability of at least one of them quitting?
Events are not mutually exclusive, so p(A or B) = p(A) + p(B) - p(A and B)
Because Adam and Ben never meet, the events are independent, so p(A and B) = 0.4 x 0.4 = 0.16
= 0.4 + 0.4 - 0.16
= 0.64
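A minimal sketch in Python (not part of the original cards) verifying this calculation:
```python
# Probability that at least one of two independent people quits,
# given each has a 0.4 chance (values from the card above).
p_a = 0.4
p_b = 0.4

# Independent events, so p(A and B) = p(A) * p(B)
p_a_and_b = p_a * p_b            # 0.16

# Not mutually exclusive, so subtract the overlap once
p_a_or_b = p_a + p_b - p_a_and_b
print(p_a_or_b)                  # 0.64
```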
Sampling error
Sampling error is chance variation (as long as the study is unbiased) between the values obtained for the study sample and the values which would be obtained if measuring the whole population.
The most common method for measuring the likely sampling error is to calculate the standard error
Standard error
- estimates how precisely a population parameter (e.g. mean, proportion, difference between means) is estimated by the equivalent statistic in the sample
The standard error is the standard deviation of the sampling distribution of the statistic
The method of calculating the standard error therefore depends on the data type and the statistic being used (e.g. is the data continuous or binary, and is the statistic a mean or a proportion? Each combination requires a different formula for the standard error)
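As an illustration (not from the cards), the most common case is the standard error of a mean, SE = SD / sqrt(n); a sketch with made-up values:
```python
import math
import statistics

# Hypothetical sample of a continuous variable
sample = [4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 4.7, 5.0]

sd = statistics.stdev(sample)        # sample SD (n - 1 denominator)
se = sd / math.sqrt(len(sample))     # standard error of the mean
print(round(se, 3))
```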
sampling distribution
Could be created by drawing many random samples of the same size from the same population and calculating the same sample statistic. The frequency distribution of all these sample statistics is a sampling distribution.
These distributions (e.g. a normal distribution) help you understand how a sample statistic differs from sample to sample and are the basis for making inferences from sample to population.
The shape of the sampling distribution depends on the type of statistic (e.g. continuous data and a mean = normal distribution; binary data and a proportion = binomial distribution)
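A sampling distribution can be illustrated by simulation; this sketch (assuming numpy is available, values arbitrary) draws many samples of the same size and collects the same statistic from each:
```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # deliberately skewed population

# Draw many samples of the same size and record the same statistic (the mean)
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

# The frequency distribution of these means is the sampling distribution;
# despite the skewed population it is approximately normal
print(np.mean(sample_means), np.std(sample_means))
```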
How are confidence intervals calculated and how do they relate to standard error and the sampling distribution
- the sampling distribution of a mean is a normal distribution, and those of other statistics (e.g. a proportion or rate) can be approximated by a normal distribution
- in a sampling distribution, the mean value is equivalent to the true population parameter
- its standard deviation is equivalent to the standard error of the sample statistic
- therefore 95% of sample statistics would lie within 1.96 standard errors of the true population parameter
- from this we can infer that there is a 95% chance that the true population parameter lies within 1.96 standard errors above or below a sample statistic
value used for 99% confidence intervals
+/- 2.58x standard error
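A minimal sketch (illustrative numbers, not from the cards) of building 95% and 99% confidence intervals from a sample statistic and its standard error:
```python
sample_mean = 120.0   # hypothetical point estimate
se = 2.5              # hypothetical standard error

ci_95 = (sample_mean - 1.96 * se, sample_mean + 1.96 * se)
ci_99 = (sample_mean - 2.58 * se, sample_mean + 2.58 * se)
print(ci_95)   # (115.1, 124.9)
print(ci_99)   # (113.55, 126.45)
```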
interpret 3 possible scenarios when comparing 2 confidence intervals
- CIs do not overlap –> significant at the 5% level
- CIs overlap but neither point estimate is within the other's confidence interval –> unclear; a significance test is needed
- Either point estimate is within the confidence interval of the other –> not significant at the 5% level
formula for conditional probability (the p(B|A))
p(B|A) = probability of B given A has occurred
we know:
p(A and B) = p(A) x p(B|A)
rearranged:
p(B|A) = p(A and B) / p(A)
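A small numerical check of the rearranged formula (the figures are made up for illustration):
```python
# Hypothetical values: 30% of patients smoke (A), 12% both smoke and have COPD (A and B)
p_a = 0.30
p_a_and_b = 0.12

# Conditional probability of COPD given smoking
p_b_given_a = p_a_and_b / p_a
print(p_b_given_a)   # 0.4
```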
what is a statistical distribution?
A function that shows all the possible values of a variable and the frequency with which they occur
Statistical distributions: the normal distribution
- symmetrical bell shaped curve
- described by 2 parameters:
variance (SD squared)
mean
- the standard normal distribution has a mean of 0 and a variance of 1
- any normally distributed variable can be converted to a standard normal distribution
- the normal distribution is very useful as many variables in biology follow a normal distribution
- the sampling distribution of a mean follows a normal distribution
- with large enough samples other distributions approximate to the normal distribution
Standard statistical distribution: binomial distribution
- PROPORTIONS
-the binomial distribution shows the frequency of events that have 2 possible outcomes
ie success and fail
-it is constructed using 2 parameters:
n (sample size)
pi (true probability)
- when sample size is large it approximates to the normal distribution
- used for:
discrete data with 2 possible outcomes
sampling distribution for proportions
- since proportions or probabilities cannot be negative it has no negative values
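A sketch of a binomial distribution built from n and pi (scipy is an assumption; the cards don't specify software, and the numbers are invented):
```python
from scipy.stats import binom

n, pi = 20, 0.3   # hypothetical sample size and true probability of "success"

# Probability of exactly 6 successes, and of 6 or fewer, out of 20
print(binom.pmf(6, n, pi))   # P(X = 6)
print(binom.cdf(6, n, pi))   # P(X <= 6)

# Mean and variance of the distribution: n*pi and n*pi*(1 - pi)
print(binom.mean(n, pi), binom.var(n, pi))
```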
In the normal distribution what percentage of the area under the curve is within:
1 standard deviation
1.96 standard deviations
2.58 standard deviations
1 SD = 68%
1.96 SD = 95%
2.58 SD = 99%
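These areas can be checked with the standard normal CDF (scipy assumed available):
```python
from scipy.stats import norm

for z in (1.0, 1.96, 2.58):
    # Area between -z and +z under the standard normal curve
    area = norm.cdf(z) - norm.cdf(-z)
    print(z, round(area, 3))   # ~0.683, ~0.950, ~0.990
```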
standard statistical distributions: Poisson distributions
- RATES/COUNTS
- deals with the frequency with which an event occurs over a given time, e.g. deaths from MI over a month
- used in the analysis of rates
- assumes that the data are discrete, events occur at random and are independent
- described by a single parameter: the mean (FOR THE POISSON DISTRIBUTION THE MEAN AND THE VARIANCE ARE THE SAME, so this parameter is also the variance)
- small samples give an asymmetric distribution and large samples approximate to the normal distribution
- no negative values as a rate cannot be negative
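A sketch (scipy assumed, numbers invented) of a Poisson distribution for a count over a fixed period:
```python
from scipy.stats import poisson

mu = 4   # hypothetical mean number of MI deaths per month

print(poisson.pmf(2, mu))                  # probability of exactly 2 deaths in a month
print(poisson.cdf(6, mu))                  # probability of 6 or fewer
print(poisson.mean(mu), poisson.var(mu))   # mean and variance are both 4
```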
Standard statistical distributions: Students T distribution
- SMALL SAMPLE SIZE
- Bell shaped like a normal distribution but tails are more spread out
- Single parameter: degrees of freedom
- as the degrees of freedom increase it approaches the normal distribution
Standard statistical distributions: Chi squared distribution
- right skewed shape
- parameter: degrees of freedom
- as degrees of freedom increase it becomes more like the normal distribution
-used in chi squared tests which are used for analysing categorical variables (comparing expected and observed event frequencies)
Standard statistical distributions; F distribution
- right skewed
- values are positive
- parameter: the degrees of freedom of the numerator and the denominator of the ratio
- uses: ANOVA tests
Degrees of freedom
number of independent pieces of information used to calculate a statistic
what is the difference between standard deviation and variance?
- both describe how spread out the values in a data set are around the mean
- the variance is the average squared deviation of the values from the mean
- SD is the square root of the variance
- SD is in the same units as the data whereas the variance is not
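A quick illustration with Python's statistics module (made-up values, sample formulas with an n - 1 denominator):
```python
import math
import statistics

data = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]   # made-up values

var = statistics.variance(data)   # sample variance (squared units)
sd = statistics.stdev(data)       # sample SD, same units as the data
print(var, sd)
print(math.isclose(sd ** 2, var))   # SD is the square root of the variance
```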
Sampling distribution shape:
Outcome variable= continuous
statistic type = mean
Normal shaped sampling distribution
Sampling distribution shape:
Outcome variable= binary
statistic type = proportion/risk
Binomial distribution
Sampling distribution shape:
Outcome variable= binary over time
statistic type = rate
Poisson distribution
what is inference?
The process of drawing conclusions for a population based on observations collected from a sample
what are the 2 main methods of inference?
- Estimation
point estimation (mean, proportion)
Interval estimation - expresses the uncertainty associated with a point estimate, e.g. confidence intervals
-hypothesis testing
assess the likelihood that a given observation in a sample would have occurred due to chance
both estimation and hypothesis testing are based on the standard error
measures of data location (5)
- arithmetic mean
- geometric mean
- mode
- median
- percentiles
measures of data dispersion (5)
- range
- interquartile range
- variance
- standard deviation
- coefficient of variation
measures of data location: arithmetic mean ( how to calculate, advantages and disadvantages)
- all values summed and divided by n
- for a sample, the arithmetic mean is denoted by x-bar
- for a population, it is denoted by mu
-advantages: amenable to statistical analysis
- disadvantages: not good for asymmetric distributions, affected by outliers
measures of data location: geometric mean (how to calculate, advantages and disadvantages)
- the nth root of the product of all the values
- advantages: more appropriate for positively skewed distributions
- disadvantages: cannot be used if any values are zero or negative
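A sketch (Python 3.8+, invented values) of the geometric mean as the nth root of the product:
```python
import math
import statistics

values = [2.0, 8.0, 4.0]   # hypothetical positive values

# nth root of the product of the values
by_hand = math.prod(values) ** (1 / len(values))
print(by_hand)                              # ~4.0 (allowing for floating point)
print(statistics.geometric_mean(values))    # same result via the standard library
```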
measures of data location: median ( how to calculate, advantages and disadvantages)
- the middle value when the data are ranked (for an even number of values, the mean of the two middle values)
- advantages: unaffected by extreme outliers, good for skewed distributions
- disadvantages: value determined solely by rank so gives no information on any other values
measures of data location: mode ( how to calculate, advantages and disadvantages)
- most commonly occurring value
- advantages: not generally affected by extreme outliers
-disadvantages: there may not always be a mode, not amenable to statistical analysis
measures of data location: percentiles (how to calculate, advantages and disadvantages)
- the data are ranked and divided into 100 groups, where the 100th percentile is the largest
- advantages: useful for comparing measurements (BMI, child height etc)
- disadvantages: comparisons at the extreme ends of the spectrum are less useful than those in the middle
measure of data dispersion: range ( how to calculate, advantages and disadvantages)
- highest value minus lowest
- advantages: simple, intuitive
- disadvantages: sensitive to size of sample and outliers
measure of data dispersion: interquartile range (how to calculate, advantages and disadvantages)
- the middle 50% of the sample
- calculated as the upper quartile- lower quartile
- advantages: more stable than the range as sample size increases
- disadvantages: unstable for small samples, does not allow for further mathematical manipulation
measure of data dispersion: variance ( how to calculate, advantages and disadvantages)
- average squared deviation of each value from its mean
- the formula differs slightly depending on whether calculated for a sample (divided by n-1) or a population (divided by n)
- advantages: takes all values into account, useful for making inferences about population
- disadvantages: units differ from that of the data
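A sketch of the sample (n - 1) versus population (n) formulas, using the standard library (made-up values):
```python
import statistics

data = [3, 5, 7, 9, 11]   # made-up values

print(statistics.variance(data))    # sample variance, divides by n - 1  -> 10
print(statistics.pvariance(data))   # population variance, divides by n  -> 8
```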
measures of data dispersion: standard deviation (how to calculate, advantages and disadvantages)
- square root of variance
- advantages: most commonly used, units are the same as data, useful for making inferences about the population
- disadvantages: sensitive to some extent to extreme values
measures of data dispersion: coefficient of variation (how to calculate, advantages and disadvantages)
- ratio of standard deviation to the mean
- gives an idea of the size of the variance relative to the size of the observation
- advantages: allows comparison of the variation of populations that have significantly different values
- disadvantages: where the mean value is near 0 the coefficient of variation is highly sensitive to changes in standard deviation
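A small sketch (made-up data) of the coefficient of variation as SD divided by the mean:
```python
import statistics

data = [50, 55, 60, 65, 70]   # hypothetical observations

cv = statistics.stdev(data) / statistics.mean(data)
print(round(cv, 3))   # dimensionless, so comparable across very different scales
```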
6 key elements to mention when describing a graph in the exam
- type of graph
- the axes
- the data displayed (ie mortality)
- the units
- any obvious findings
- what interpretation, if any, can be made from the findings (remember very unlikely to be able to conclude causality from a graph)
Displaying categorical data: 2 types of graph
- bar graph
- pie chart
categorical data: bar graph
- bars can show frequency (total count) or relative frequency (percentage)
categorical data: Pie chart
- start at the 12 o'clock position and wedges should descend clockwise in order of size (ie biggest –> smallest clockwise)
continuous data: 6 types of chart
- stem and leaf display
- box plot
- histogram
- frequency polygon
- frequency distribution
- cumulative frequency distribution
continuous data: stem and leaf display (what is it, advantages and disadvantages)
- a quick technique for displaying numerical data graphically
- a vertical stem is drawn consisting of the first few significant figures of values in a dataset
- any subsequent figures are the leaf
- back to back stem and leaf displays can be used to display multiple data sets
advantages:
1. simple quick and easy
2. actual values are retained
disadvantages:
1. hard to display large data sets
continuous data: box plot (what is it, advantages and disadvantages)
- gives a measure of central location (MEDIAN)
- shows 25th and 75th percentiles so gives range and interquartile range
-Advantages:
1. box element contains a lot of information
2. good for comparing 2 datasets
Disadvantage:
1. actual values are not retained
continuous data: histogram (what is it, advantages and disadvantages)
- divides the sample values into many intervals which are called bins
- bars then display the number of values in that bin
- most histograms use bins that are roughly equal in width, but bins can instead be sized so each contains an approximately equal number of samples (this can result in bins that are too narrow to see!)
advantages:
1. gives an idea of the data's central tendency
2. demonstrates skewness and the shape of the frequency distribution
disadvantages:
1. cannot read exact values as in intervals
2. more difficult to compare 2 data sets
3. can only be used with continuous data
continuous data: frequency polygon (what is it, advantages and disadvantages)
constructed by joining the midpoint of the top of each bar in the histogram
continuous data: frequency distribution (what is it, advantages and disadvantages)
-essentially the frequency polygon that would be drawn for a histogram with a very large number of bins
-leads to a smooth line
remember you describe the skewness of a graph according to WHERE THE TAIL IS
continuous data: cumulative frequency (what is it and advantages disadvantages)
- a running count starting with the lowest value and showing how the number of observations accumulate
continuous data: Showing association between 2 variables: which graph type?
- bivariate data is almost always best shown using a scatter plot
continuous data: scatter plot for showing association between 2 variables
- data from 2 variables are plotted against each other to explore the relationship between them
- trend line is drawn to explore whether any correlation is:
- positive, negative, or non-existent
- linear or non-linear
- strong, moderate or weak
advantages:
1. data values and data set are retained
2. shows a trend in data relationship
3. shows minimum maximum and outliers
disadvantages:
- data from both variables must be continuous
- hard to visualise large data sets
Z test: what is it and how is it used
Used to compare proportions/means between 2 groups.
Different formulas for testing different things but all include the standard error
The z value is looked up in a z-distribution table which gives a P value.
The test can be used for paired data. To do this, the difference in the observations for each pair is calculated and each pair is then treated as a single observation
z test value of significance
a z score greater than 1.96 (ignoring sign, i.e. |z| > 1.96) is significant at the 5% level
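A sketch (scipy assumed, figures invented) of turning a z value into a two-sided P value:
```python
from scipy.stats import norm

# Hypothetical: difference between two sample means and its standard error
diff = 4.0
se = 1.8

z = diff / se                     # ~2.22
p = 2 * (1 - norm.cdf(abs(z)))    # two-sided P value
print(round(z, 2), round(p, 3))   # |z| > 1.96, so significant at the 5% level
```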
T test: what is it, when is it used
Used to compare means/proportions between 2 groups when the sample size is small (normally less than 60)
Based on a T distribution rather than a normal distribution.
T values are looked up in a T distribution table in order to discern the P value.
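A sketch (scipy assumed, invented small samples) of a two-sample t test comparing the means of two groups:
```python
from scipy.stats import ttest_ind

# Hypothetical small samples of a continuous outcome in two groups
group_a = [5.1, 4.8, 6.2, 5.5, 4.9, 5.8]
group_b = [6.3, 6.8, 5.9, 7.1, 6.5, 6.0]

t_stat, p_value = ttest_ind(group_a, group_b)   # compares the two means
print(round(t_stat, 2), round(p_value, 4))
```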