Lecture 12, 13 and 14 - Biostatistics Flashcards
Define sample
A sample is a group taken from the overall population, which we use to make estimates and generalisations about the population
The sample has to be representative of the population
If the method of sampling we use gives us an unrepresentative sample, the results won’t be the true population value.
Reports have a margin of error and confidence interval
Define population
The entire group of people or things that we want information about. If the whole population is measured, the report gives the true value rather than an estimate with a margin of error.
General overview of the process of sampling
Use a representative sample of the population to make conclusions about the population.
A smaller sample group is used to represent the population
Involves summarising data using tables and graphs as well as making inferences
Census
Collecting data from every member of the population rather than from a sample. It is time consuming and expensive to test the whole population.
Statistics deals with
uncertainty
Why do we take a sample from a population?
Because collecting data from the whole population is difficult and very costly
Proportion
Proportion = number with characteristic/ total number
Percent
Percent = 100% x number with characteristic/ Total number
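Worked example (a minimal sketch; the counts are made up for illustration and are not from the lecture). It applies the two formulas above to the same data to show that the percent is just the proportion on a 0 to 100 scale.

```python
# Hypothetical counts, chosen only for illustration.
number_with_characteristic = 18
total_number = 60

# Proportion: a value between 0 and 1.
proportion = number_with_characteristic / total_number

# Percent: the same quantity expressed out of 100.
percent = 100 * proportion

print(f"proportion = {proportion:.2f}")
print(f"percent    = {percent:.0f}%")
```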
True proportion/ true population value
The true population value is the value we would get if we could test the entire population
Increasing sample size…
Gives us more certainty in our results
Categorical variables
A categorical variable has values that you can put into a countable number of distinct groups based on a characteristic.
For example - eye colour, stages of cancer, the colour of M&Ms
What can we summarise categorical variables as?
Summarise these types of variables by the number in each category and the percent (or proportion)
Continuous variables
A continuous variable can take any value within a range.
For example - height, weight, age and blood pressure
Mean
A measure of central tendency. To find the mean, add up the values in the data set and then divide by the number of values that you added.
Mean = Sum of all/total number of observations
Central value of a discrete set of numbers
What does sampling look like for a continuous variable?
A histogram is often used as it shows the distribution/spread of data.
How to present categorical…
If categorical, we can present proportions or percentages
How to present continuous …
If continuous, we usually want to know where the centre is (central tendency) and how spread out the data is. Often we use the mean (central tendency) and the standard deviation (spread or variability) for this.
Standard deviation
Standard deviation is a number used to tell how measurements for a group are spread out from the average (mean), or expected value. A low standard deviation means that most of the numbers are close to the average. A high standard deviation means that the numbers are more spread out.
Spread of the distribution is determined by the standard deviation
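Worked example (a sketch with two made-up sets of heights; the numbers are hypothetical). Both sets have the same mean, but one is tightly clustered and the other is spread out, so the standard deviations differ. Note that `statistics.stdev` computes the sample standard deviation (it divides by n - 1).

```python
import statistics

# Two hypothetical samples of heights (cm) with the same centre.
clustered_heights = [168, 169, 170, 171, 172]
spread_out_heights = [160, 165, 170, 175, 180]

def mean_and_sd(values):
    """Return (mean, sample standard deviation) of a list of numbers."""
    mean = sum(values) / len(values)      # sum of all / number of observations
    sd = statistics.stdev(values)         # sample SD (divides by n - 1)
    return mean, sd

clustered_mean, clustered_sd = mean_and_sd(clustered_heights)
spread_mean, spread_sd = mean_and_sd(spread_out_heights)

print(f"clustered:  mean = {clustered_mean:.1f}, sd = {clustered_sd:.2f}")
print(f"spread out: mean = {spread_mean:.1f}, sd = {spread_sd:.2f}")
```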
The main purpose of collecting a sample is …
to make an inference
Parameters
measures which describe a population, such as mean, median, IQR
What is bias and how do you avoid it?
Sampling bias is where there is a preference towards one group over others being selected for the sample
An unbiased sample means samples are taken at random, with no preference for certain groups in the population, and everyone has a fair chance of being chosen for the study
The sample is not representative if it has bias: it has too many people from a particular group within the population, or a group is completely excluded
To avoid bias, the study must give everyone a chance of being included in the sample, so that the sample is fair and representative of the population
Simple random sampling
The basic sampling technique where we select a group of subjects (a sample) for study from a larger group (a population). Each individual is chosen entirely by chance and each member of the population has an equal chance of being included in the sample.
Systematic sampling
Systematic sampling is a statistical method involving the selection of elements at regular intervals (for example every 10th entry) from an ordered sampling frame. The sampling frame is a list of all those within a population who can be sampled, and may include individuals, households or institutions.
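Worked example (a minimal sketch, assuming a made-up sampling frame of 100 people). It draws a simple random sample and a systematic sample from the same frame. The systematic version here uses the common "random start, then every k-th entry" rule, which is an assumption about how the selection is done and is not spelled out on the card.

```python
import random

random.seed(1)  # fixed seed so the example is repeatable

# Hypothetical sampling frame: an ordered list of everyone who can be sampled.
sampling_frame = [f"person_{i:03d}" for i in range(1, 101)]
sample_size = 10

# Simple random sampling: every member has an equal chance of selection.
simple_random_sample = random.sample(sampling_frame, sample_size)

# Systematic sampling: random start, then every k-th entry of the ordered frame.
interval = len(sampling_frame) // sample_size   # k = 10 here
start = random.randrange(interval)              # random start in 0..k-1
systematic_sample = sampling_frame[start::interval]

print("simple random:", simple_random_sample)
print("systematic:   ", systematic_sample)
```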
What happens to the bell curve when a sample size is increased?
The curve gets narrower: each sample gives more certainty because it contains more information.
What are the two different sorts of errors?
There are two different sorts of errors…
1- Errors that make our answers more uncertain i.e. more variability
2- Errors that move us away from the truth i.e. we get the wrong answer and this is often called bias
You can’t avoid the 1st type (taking a sample, measuring things imperfectly). However, it is really important to avoid the 2nd type, as we do not want to undertake a study and get the wrong answer because it takes us away from the true value.
A random sample from the whole population (as long as everyone takes part) avoids the 2nd type (bias); the 1st type can only be reduced, for example by increasing the sample size.
Random sample
A random sample means that everyone/everything has an equal chance of being chosen
Does changing the sample size affect bias?
Increasing the sample size does not help with dealing with bias
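Illustration (a simulation sketch; the population, group means and sample sizes are all made up). A biased sampling method that never selects one group stays far from the true mean however large the sample, while a random sample gets closer as n grows.

```python
import random
import statistics

random.seed(2)

# Hypothetical population made of two groups with different typical values.
group_a = [random.gauss(160, 5) for _ in range(5000)]
group_b = [random.gauss(180, 5) for _ in range(5000)]
population = group_a + group_b
true_mean = statistics.mean(population)

results = {}
for n in (50, 500, 5000):
    # Unbiased: everyone in the population can be chosen.
    unbiased_mean = statistics.mean(random.sample(population, n))
    # Biased: group B is completely excluded from selection.
    biased_mean = statistics.mean(random.sample(group_a, n))
    results[n] = (unbiased_mean, biased_mean)
    print(f"n = {n:4d}: random sample mean = {unbiased_mean:6.1f}, "
          f"biased sample mean = {biased_mean:6.1f}")

print(f"true population mean = {true_mean:.1f}")
```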
How do sampling methods affect a sample?
The sampling method must match the target population in order to get representative results.
What is it important to consider when deciding if a sample is representative or not?
When wondering if a sample is representative it is important to consider the people who won’t take part.
Continuous variables - population is described by … (terminology)
Population mean (μ) and population standard deviation (σ)
Continuous variable - sample is described by …. (terminology)
Sample mean (x̄), sample size (n) and sample standard deviation (s), which is the standard deviation of the observations in a sample
Continuous variable - sampling distribution (terminology)
Each circle on a sampling distribution is a sample mean (one mean for each different sample)
The variability of the sampling distribution is called the standard error (SE); it is the standard deviation of the sample means
The sampling distribution is centred on the population mean (when there is no bias)
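Illustration (a simulation sketch with a made-up population; the mean of 170, SD of 10 and sample size of 50 are assumptions for the example). It takes many random samples, records each sample mean, and then looks at the centre and spread of those sample means. The spread is the standard error.

```python
import random
import statistics

random.seed(3)

# Hypothetical population of 20,000 measurements.
population = [random.gauss(170, 10) for _ in range(20000)]
population_mean = statistics.mean(population)

n = 50            # size of each sample
repeats = 2000    # number of repeated samples

# One sample mean per repeated sample: together they form the sampling distribution.
sample_means = [statistics.mean(random.sample(population, n))
                for _ in range(repeats)]

centre = statistics.mean(sample_means)           # should sit near the population mean
standard_error = statistics.stdev(sample_means)  # SD of the sample means

print(f"population mean              = {population_mean:.2f}")
print(f"centre of sample means       = {centre:.2f}")
print(f"standard error (SD of means) = {standard_error:.2f}")
```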
Binary variables/categorical variables - population is described by (terminology)
Population proportion
Binary variables/categorical variables - sample is described by (terminology)
Sample proportion
Binary variables/categorical variables - sampling distribution is described by (terminology)
Centre = population proportion, as the sampling distribution is centred on the population proportion (when there is no bias)
Standard error = variability/ standard deviation of the sampling distribution/spread of different proportions
Normal distribution
Symmetric bell shaped curve. We keep seeing that the sampling distribution follows the shape of a ‘normal distribution’
If we have the mean and standard deviation we can draw its shape (precise curve that only depends on the mean and standard deviation)
The normal distribution is the symmetric bell shaped curve that we keep seeing when we take repeated random samples from a population when the sample size is large. One of the key properties of the normal distribution is that 95% of the observations lie within 1.96 standard deviations of the mean. This is due to the shape of the normal distribution and this property is very useful when making an inference back to the population.
Mean is always at the centre of a normal distribution/a normal distribution is always symmetric and centred at the mean
Where does 95% of the data lie?
About 95% of the data lie between 2 standard deviations below the mean and 2 standard deviations above the mean (within 2 standard deviations of the mean; 2 is a rounding of 1.96)
For large samples around 30 or more…
The sampling distribution will follow normal distribution (symmetric bell shaped curve) AND 95% of the sample means lie within +/- 2 standard errors of the population mean
As the sample size increases, the spread of the sampling distribution …
Decreases
Standard error
If our sample is large (n is greater than 30), we know the sampling distribution will be normal (symmetric bell curve). The standard error can then be estimated from the sample using the following equation …
SE = s / √n, where s is the sample standard deviation and n is the sample size
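Worked example (a minimal sketch; the standard deviation of 12 and the sample sizes are made up). It applies SE = s / √n for three sample sizes and shows that quadrupling the sample size halves the standard error.

```python
import math

def standard_error(sd, n):
    """Estimate the standard error from a sample SD and a sample size."""
    return sd / math.sqrt(n)

sample_sd = 12.0  # hypothetical sample standard deviation

for n in (36, 144, 576):
    print(f"n = {n:3d}: SE = {standard_error(sample_sd, n):.2f}")
# n =  36: SE = 2.00
# n = 144: SE = 1.00
# n = 576: SE = 0.50
```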
95% confidence intervals
General formula: X ± 2 × s / √n
Where X represents the estimate, s represents the standard deviation, and s / √n is the standard error
This formula ensures that if we did repeated sampling, 95% of intervals would contain the true population value. Using this formula you can find the upper and lower confidence interval limits.
The interval runs from 2 standard errors below the estimate to 2 standard errors above it (estimate - 2SE to estimate + 2SE)
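Worked example (a minimal sketch; the sample mean, standard deviation and sample size are made up). It computes the lower and upper 95% confidence limits as the estimate minus and plus 2 standard errors, using the rounded multiplier of 2 from these notes rather than 1.96.

```python
import math

def confidence_interval_95(estimate, sd, n, multiplier=2.0):
    """Return (lower, upper) limits: estimate -/+ multiplier * SE."""
    se = sd / math.sqrt(n)   # standard error of the mean
    return estimate - multiplier * se, estimate + multiplier * se

# Hypothetical sample summary: mean 170, SD 12, n = 36, so SE = 2.
lower, upper = confidence_interval_95(estimate=170.0, sd=12.0, n=36)
print(f"95% CI: {lower:.1f} to {upper:.1f}")   # 95% CI: 166.0 to 174.0
```

Passing `multiplier=1.96` gives the slightly narrower interval based on the exact normal value.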
Confidence intervals
Confidence intervals are a very useful way of understanding how much uncertainty we have in the mean or proportion. They reflect the width of the sampling distribution. Because we don’t know if our sample is one of the extreme ones or closer to the middle of the sampling distribution, we do not know if our confidence interval contains the true population mean or proportion. All we know is that if we took repeated samples, then 95% of the confidence intervals would contain the true mean or proportion, and 5% would not. This leads to the use of the phrase ‘We are 95% confident’ which means if we did this repeatedly, 95% of the intervals would contain the true population mean or proportion.
The 95% confidence interval is very useful for interpreting our results; if it is wide then we don’t have much certainty about the estimate. If you end up working in clinical practice and you’re looking at the results of how effective a new drug is, the first thing you would want to look at would be the size of the confidence interval, followed by the study design and whether the results are even applicable to your patients.
What happens to confidence intervals if we increase the sample size?
About 95% of the confidence intervals still contain the true population mean. However, the intervals are now narrower, because a larger sample gives us more information and therefore more certainty in the estimate.
Does the proportion of confidence intervals that contain the population mean change much as the sample size increases?
No - there are small differences due to random variation, but we expect that 95% of all the confidence intervals will contain the population mean
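Illustration (a simulation sketch; the population mean of 100, SD of 15 and the two sample sizes are made up). For each sample size it draws many samples, builds an interval of the mean plus or minus 2 standard errors for each one, and records how often the interval contains the population mean and how wide the intervals are on average.

```python
import math
import random
import statistics

random.seed(4)

POP_MEAN, POP_SD = 100, 15   # hypothetical population values

def coverage_and_width(n, repeats=2000):
    """Return (share of intervals containing POP_MEAN, average interval width)."""
    hits = 0
    total_width = 0.0
    for _ in range(repeats):
        sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(n)]
        mean = statistics.mean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)
        lower, upper = mean - 2 * se, mean + 2 * se
        if lower <= POP_MEAN <= upper:
            hits += 1
        total_width += upper - lower
    return hits / repeats, total_width / repeats

summary = {n: coverage_and_width(n) for n in (40, 160)}
for n, (coverage, width) in summary.items():
    print(f"n = {n:3d}: coverage = {coverage:.3f}, average width = {width:.2f}")
```

The coverage should stay close to 95% for both sample sizes, while the average width shrinks for the larger samples.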
Interpretation of a confidence interval statement
We are 95% confident that the true population mean lies between the lower and upper confidence limit
OR
We are 95% confident that the true proportion lies between the lower and upper confidence limit
Features of a box plot
The ‘25th percentile’ - 25% of the sample is below this point and 75% is above this point
The ‘median’ - 50% of the sample is above this point and 50% is below this point
The ‘75th percentile’ - 75% of the sample is below this point and 25% is above this point
IQR (interquartile range) is the range between the 25th percentile and 75th percentile and it contains the central 50% of the observations (heights, in this example)
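Worked example (a sketch using 11 made-up heights). It finds the three values a box plot is built from and the IQR. Different software uses slightly different percentile conventions, so other tools may give slightly different quartiles for the same data; this uses the default method of Python's `statistics.quantiles`.

```python
import statistics

# Hypothetical sample of 11 heights (cm), already sorted for readability.
heights_cm = [150, 155, 160, 162, 165, 168, 170, 172, 175, 180, 185]

# n=4 splits the data into quarters and returns the three cut points.
q1, median, q3 = statistics.quantiles(heights_cm, n=4)
iqr = q3 - q1   # range covering the central 50% of the sample

print(f"25th percentile = {q1:.0f}")      # 160
print(f"median          = {median:.0f}")  # 168
print(f"75th percentile = {q3:.0f}")      # 175
print(f"IQR             = {iqr:.0f}")     # 15
```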
Confidence intervals coverage - proportions
We can apply the general formula to proportions using … proportion ± 2 × SE
Always use the proportion NOT the percentage (i.e. write it between 0-1)
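Worked example (a minimal sketch; the counts are made up). The card does not give the standard error formula for a proportion, so this assumes the standard large-sample formula SE = √(p(1 - p) / n), with p written as a proportion between 0 and 1.

```python
import math

def proportion_ci_95(number_with_characteristic, n, multiplier=2.0):
    """Return (p, lower, upper) using p -/+ multiplier * SE."""
    p = number_with_characteristic / n    # proportion, NOT percent
    se = math.sqrt(p * (1 - p) / n)       # assumed large-sample SE formula
    return p, p - multiplier * se, p + multiplier * se

# Hypothetical data: 40 of 100 people have the characteristic.
p, lower, upper = proportion_ci_95(40, 100)
print(f"proportion = {p:.2f}, 95% CI: {lower:.3f} to {upper:.3f}")
```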
What does a wide confidence interval mean?
It means that there are lots of plausible values for the population mean (or proportion), so we don’t know much about the population. The interval becomes narrower, and the estimate more precise, as the sample size increases. The narrower the confidence interval, the more certainty we have about the size of the population mean.
Comparing groups
If there is no difference between the groups, the difference in means (or proportions) will sit at zero.
To find the difference between two groups, subtract the proportion in one group from the proportion in the other.
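Worked example (a minimal sketch; the counts are made up). It subtracts the two proportions and puts a rough 95% confidence interval around the difference. The standard error formula for a difference, √(p1(1 - p1)/n1 + p2(1 - p2)/n2), is the standard large-sample formula and is an assumption here since it is not on the card.

```python
import math

def difference_in_proportions(x1, n1, x2, n2, multiplier=2.0):
    """Return (difference, lower, upper) for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    difference = p1 - p2
    # Assumed large-sample SE for the difference between two proportions.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return difference, difference - multiplier * se, difference + multiplier * se

# Hypothetical groups: 30 of 100 versus 20 of 100 have the characteristic.
difference, lower, upper = difference_in_proportions(30, 100, 20, 100)
print(f"difference = {difference:.2f}, 95% CI: {lower:.3f} to {upper:.3f}")

# If the interval includes zero, "no difference" is one of the plausible values.
includes_zero = lower <= 0 <= upper
print("interval includes zero:", includes_zero)
```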
What happens to the sampling distribution when sample size is increased?
If sample size is increased then the sampling distribution gets narrower
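What happens to mean +/- standard deviation and to 95% confidence intervals when sample size is increased?
The mean +/- standard deviation stays about the same, as it describes the spread of the individual observations. The 95% confidence interval gets narrower, as it depends on the standard error, which decreases as the sample size increases.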
Technical variation
Variation that is a result of how a sample is obtained e.g. how measurements are taken, angle of tape measure when height is taken etc.
Biological variation
Sources of variation as a result of biological features such as genetics, nutrition, mutations etc.
What is the best way to present a small amount of data?
A dot plot is good for a small amount of data as it shows the data exactly. With just a few data points, the dot plot displays the data most clearly: you can see exactly what the values are, and the shape is not as important. A histogram tends to be very spiky and a boxplot hides the fact that there are very few data points.
When can you remove outliers from a set of data collected?
You need to be certain that they are truly errors, and that they will cause more bias if you leave them in than if you exclude them. Ideally you would correct the errors instead of just removing them altogether.
As the sample size increases, what happens to the mean of all the sample means?
The mean of all the sample means doesn’t change much - all are estimating the population mean
As the sample size increases, what happens to the standard deviation of all the sample means (standard error)?
Standard error decreases as the sample size increases, because with larger samples there is less variation between the sample means (SE = s / √n, so a larger n gives a smaller SE)
Regression lines
It is simply a line that best fits the data
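y = mx + c, where m is the slope (gradient) and c is the intercept (the value of y when x = 0)
There is less variation between fitted regression lines as sample size increases, so you get a much better sense of what the true relationship is between the variables being investigated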
Small samples don’t have as much reliability as shown by the variations in regression lines
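Illustration (a simulation sketch; the true line y = 2x + 1, the noise level and the sample sizes are all made up). It fits a least-squares line to many small samples and many large samples, then compares how much the fitted slopes vary. Least squares is used here as one common way to get the "best fit" line; the card does not say which method the lecture used.

```python
import random
import statistics

random.seed(5)

def fit_line(xs, ys):
    """Least-squares fit of y = m*x + c; returns (m, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

def simulate_slopes(n, repeats=200, true_m=2.0, true_c=1.0, noise_sd=3.0):
    """Fit a line to `repeats` random samples of size n; return the slopes."""
    slopes = []
    for _ in range(repeats):
        xs = [random.uniform(0, 10) for _ in range(n)]
        ys = [true_m * x + true_c + random.gauss(0, noise_sd) for x in xs]
        slopes.append(fit_line(xs, ys)[0])
    return slopes

spread_small = statistics.stdev(simulate_slopes(10))    # small samples
spread_large = statistics.stdev(simulate_slopes(200))   # large samples

print(f"SD of fitted slopes, n = 10:  {spread_small:.3f}")
print(f"SD of fitted slopes, n = 200: {spread_large:.3f}")
```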