Lecture 12, 13 and 14 - Biostatistics Flashcards

1
Q

Define sample

A

A sample is a group taken from the overall population, which we use to make estimates and generalisations about the population

The sample has to be representative of the population

If the method of sampling we use gives us an unrepresentative sample, the results won’t reflect the true population value.

Reports have a margin of error and confidence interval

2
Q

Define population

A

The entire group of people or things that we want information about. Reports based on the entire population are a true representation of opinion.

3
Q

General overview of the process of sampling

A

Use a representative sample of the population to make conclusions about the population.

Uses a smaller sample group to represent the population

Involves summarising data using tables and graphs, as well as making inferences

4
Q

Census

A

Collecting data from every member of a population, rather than from a sample. Time-consuming and expensive, because the whole population has to be tested.

5
Q

Statistics deals with

A

uncertainty

6
Q

Why do we take a sample from a population?

A

Because collecting data from the whole population is difficult and very costly

7
Q

Proportion

A

Proportion = number with characteristic/ total number

8
Q

Percent

A

Percent = 100% x number with characteristic/ Total number
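
A minimal sketch of the proportion and percent formulas. The function names and the eye-colour counts are made up for illustration.

```python
def proportion(count_with_characteristic, total):
    """Proportion = number with characteristic / total number."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count_with_characteristic / total


def percent(count_with_characteristic, total):
    """Percent = 100 x proportion."""
    return 100 * proportion(count_with_characteristic, total)


# Hypothetical sample: 12 of 48 people have blue eyes.
p = proportion(12, 48)
pct = percent(12, 48)
print(p, pct)  # 0.25 25.0
```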

9
Q

True proportion/ true population value

A

The true population value is the value we would get if we could test the entire population

10
Q

Increasing sample size…

A

Would mean there is more certainty with our results

11
Q

Categorical variables

A

A categorical variable has values that you can put into a countable number of distinct groups based on a characteristic.
For example - eye colour, stages of cancer, the colour of M&Ms

12
Q

What can we summarise categorical variables as?

A

Summarise these types of variables by the number in each category and the percent (or proportion)

13
Q

Continuous variables

A

A continuous variable can take on any value within a range.

For example - height, weight, age and blood pressure

14
Q

Mean

A

The mean is a measure of central tendency. To find the mean, add up the values in the data set and then divide by the number of values that you added.

Mean = sum of all observations / total number of observations

Central value of a discrete set of numbers
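
A short sketch of the mean calculation, done by hand and checked against Python's `statistics.mean`. The heights are made-up values.

```python
import math
from statistics import mean

# Hypothetical heights in cm for a sample of five people.
heights_cm = [160, 172, 168, 181, 175]

# Mean = sum of all observations / total number of observations.
manual_mean = sum(heights_cm) / len(heights_cm)

# The standard library gives the same answer.
library_mean = mean(heights_cm)

print(manual_mean)
assert math.isclose(manual_mean, library_mean)
```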

15
Q

What does sampling look like for a continuous variable?

A

A histogram is often used as it shows the distribution/spread of data.

16
Q

How to present categorical…

A

If categorical, we can present proportions or percentages

17
Q

How to present continuous …

A

If continuous, we usually want to know where the centre is (central tendency/mean) and how spread out the data is. Often you use the mean (central tendency) and standard deviation (spread or variability) for this.

18
Q

Standard deviation

A

Standard deviation is a number used to tell how measurements for a group are spread out from the average (mean), or expected value. A low standard deviation means that most of the numbers are close to the average. A high standard deviation means that the numbers are more spread out.

The spread of a distribution is determined by the standard deviation
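
To illustrate low versus high standard deviation, here is a sketch with two made-up groups that share the same mean. It uses `statistics.stdev`, which is the sample standard deviation (divides by n - 1); that choice is an assumption, as the card does not say which version the lecture used.

```python
import math
from statistics import mean, stdev

# Two hypothetical groups with the same mean but different spread.
tight = [49, 50, 51]
spread = [30, 50, 70]

print(mean(tight), stdev(tight))    # values close to the mean: low SD
print(mean(spread), stdev(spread))  # values far from the mean: high SD
```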

19
Q

The main purpose of collecting a sample is …

A

to make an inference

20
Q

Parameters

A

measures which describe a population, such as mean, median, IQR

21
Q

What is bias and how do you avoid it?

A

Sampling bias is where there is a specific preference towards one group over others being selected for the sample

An unbiased sample means samples are taken at random, with no preference for certain groups in the population, and everyone has a fair chance of being chosen for the study

The sample is not representative if it has bias: it has too many people from a particular group within the population, or a group is completely excluded

To avoid bias, the study must give everyone a chance of being included in the sample for it to be fair and representative of the population

22
Q

Simple random sampling

A

The basic sampling technique where we select a group of subjects (a sample) for study from a larger group (a population). Each individual is chosen entirely by chance and each member of the population has an equal chance of being included in the sample.
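
A minimal sketch of simple random sampling using Python's `random.sample`, which draws without replacement so every member has an equal chance of selection. The population of ID numbers, the sample size and the seed are all made up.

```python
import random

# Hypothetical population: ID numbers 1 to 100.
population = list(range(1, 101))

# A seeded generator so the example is repeatable.
rng = random.Random(42)

# Each member has an equal chance of being chosen; no one is picked twice.
sample = rng.sample(population, 10)
print(sorted(sample))
```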

23
Q

Systematic sampling

A

Systematic sampling is a statistical method that selects elements at regular intervals from an ordered sampling frame. The sampling frame is a list of all those within a population who can be sampled, and may include individuals, households or institutions.
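
A sketch of one common form of systematic sampling: pick a random starting point, then take every k-th entry from the ordered frame. The card does not spell out this rule, so treat the "every k-th" scheme, the function name and the frame of household IDs as assumptions.

```python
import random


def systematic_sample(frame, n, rng=None):
    """Take n items from an ordered frame at a fixed interval k."""
    rng = rng or random.Random()
    k = len(frame) // n  # sampling interval
    if k < 1:
        raise ValueError("sample size cannot exceed the frame size")
    start = rng.randrange(k)  # random start within the first interval
    return frame[start::k][:n]


# Hypothetical ordered sampling frame of 100 household IDs.
frame = list(range(1, 101))
chosen = systematic_sample(frame, 10, rng=random.Random(7))
print(chosen)
```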

24
Q

What happens to the bell curve when a sample size is increased?

A

With a larger sample size, each sample gives us more information and therefore more certainty, so the curve is narrower.

25
Q

What are the two different sorts of errors?

A

There are two different sorts of errors…
1- Errors that make our answers more uncertain i.e. more variability
2- Errors that move us away from the truth i.e. we get the wrong answer and this is often called bias

You can’t avoid the 1st type (it comes from taking a sample and measuring things imperfectly). However, it is really important to avoid the 2nd type, as we do not want to undertake a study and get the wrong answer because the estimate has been pulled away from the truth.

A random sample from the whole population (as long as everyone takes part) can avoid the 2nd type.

26
Q

Random sample

A

A random sample means that everyone/everything has an equal chance of being chosen

27
Q

Does changing the sample size affect bias?

A

Increasing the sample size does not help with dealing with bias

28
Q

How do sampling methods affect a sample?

A

The sampling method must match the target population in order to get representative results.

29
Q

What is important to consider when deciding if a sample is representative or not?

A

When wondering if a sample is representative it is important to consider the people who won’t take part.

30
Q

Continuous variables - population is described by … (terminology)

A
Population mean (μ)
Population standard deviation (σ)
31
Q

Continuous variable - sample is described by …. (terminology)

A
Sample mean (x̄)
Sample size (n) 
Sample standard deviation (s) = standard deviation of the observations in a sample
32
Q

Continuous variable - sampling distribution (terminology)

A

Each circle on a sampling distribution is a sample mean (one mean for each different sample)
The variability of the sampling distribution is called the standard error (SE); it is the standard deviation of the sample means
The sampling distribution is centred on the population mean (when there is no bias)

33
Q

Binary variables/categorical variables - population is described by (terminology)

A

(population) proportion

34
Q

Binary variables/categorical variables - sample is described by (terminology)

A

(sample) proportion

35
Q

Binary variables/categorical variables - sampling distribution is described by (terminology)

A

Proportion = population proportion, as the sampling distribution is centred on the population proportion (when there is no bias)

Standard error = variability/ standard deviation of the sampling distribution/spread of different proportions

36
Q

Normal distribution

A

Symmetric bell shaped curve, we keep seeing that the sampling distribution follows the shape of a ‘normal distribution’

If we have the mean and standard deviation we can draw its shape (precise curve that only depends on the mean and standard deviation)

The normal distribution is the symmetric bell shaped curve that we keep seeing when we take repeated random samples from a population when the sample size is large. One of the key properties of the normal distribution is that 95% of the observations lie within 1.96 standard deviations of the mean. This is due to the shape of the normal distribution and this property is very useful when making an inference back to the population.

Mean is always at the centre of a normal distribution/a normal distribution is always symmetric and centred at the mean

37
Q

Where does 95% of the data lie?

A

About 95% of the data lies between 2 standard deviations below the mean and 2 standard deviations above the mean (within approximately 2 standard deviations of the mean; the exact figure is 1.96)
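
The cards quote both 1.96 and 2 standard deviations. A quick check with the standard library's `statistics.NormalDist` shows that 1.96 gives 95% and that 2 is a convenient rounding. The variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist()

# Share of observations within 1.96 and within 2 SDs of the mean.
within_196 = z.cdf(1.96) - z.cdf(-1.96)
within_2 = z.cdf(2) - z.cdf(-2)

print(round(within_196, 4), round(within_2, 4))
```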

38
Q

For large samples around 30 or more…

A

The sampling distribution will follow a normal distribution (symmetric bell shaped curve) AND 95% of the sample means lie within +/- 2 standard errors of the population mean

39
Q

As the sample size increases, the spread of the sampling distribution …

A

Decreases

40
Q

Standard error

A

If our sample is large (n is greater than 30) then we know the sampling distribution will be normal (symmetric bell curve), then the standard error can be estimated from the sample using the following equation …

SE = s / √n, where s is the sample standard deviation and n is the sample size
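
A minimal sketch of the standard error formula. The function name and the data are made up, and a sample of three is far too small for real use; it just keeps the arithmetic easy to follow.

```python
import math
from statistics import stdev


def standard_error(sample):
    """SE = sample standard deviation / square root of the sample size."""
    return stdev(sample) / math.sqrt(len(sample))


# Hypothetical sample with standard deviation 20 and n = 3.
se = standard_error([30, 50, 70])
print(se)
```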

41
Q

95% confidence intervals

A

General formula: 95% confidence interval = x̄ ± 2 × s/√n

where x̄ is the estimate (the sample mean), s is the sample standard deviation, and s/√n is the standard error

This formula ensures that if we did repeated sampling, 95% of intervals would contain the true population value. Using this formula you can find the upper and lower confidence interval limits.

The interval runs from 2 standard errors below the estimate to 2 standard errors above it (estimate - 2SE to estimate + 2SE)
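
A sketch of the confidence interval calculation for a mean, using estimate ± 2 × SE with SE = s/√n as described in these cards. The function name, the simulated heights and the seed are assumptions for illustration only.

```python
import math
import random
from statistics import mean, stdev


def ci95_mean(sample, multiplier=2.0):
    """Return (lower, upper) limits: estimate -/+ multiplier x SE."""
    n = len(sample)
    estimate = mean(sample)
    se = stdev(sample) / math.sqrt(n)  # standard error of the mean
    return estimate - multiplier * se, estimate + multiplier * se


# Hypothetical sample of 50 heights (cm), simulated for the example.
rng = random.Random(0)
heights = [rng.gauss(170, 10) for _ in range(50)]

lower, upper = ci95_mean(heights)
print(f"mean {mean(heights):.1f}, 95% CI {lower:.1f} to {upper:.1f}")
```

Passing `multiplier=1.96` gives the slightly narrower interval based on the exact normal figure.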

42
Q

Confidence intervals

A

Confidence intervals are a very useful way of understanding how much uncertainty we have in the mean or proportion. They reflect the width of the sampling distribution. Because we don’t know if our sample is one of the extreme ones or closer to the middle of the sampling distribution, we do not know if our confidence interval contains the true population mean or proportion. All we know is that if we took repeated samples, then 95% of the confidence intervals would contain the true mean or proportion, and 5% would not. This leads to the use of the phrase ‘We are 95% confident’ which means if we did this repeatedly, 95% of the intervals would contain the true population mean or proportion.

The 95% confidence interval is very useful for interpreting our results; if it is wide then we don’t have much certainty about the estimate. If you end up working in clinical practice and you’re looking at the results of how effective a new drug is, the first thing you would want to look at would be the size of the confidence interval, followed by the study design and whether the results are even applicable to your patients.

43
Q

What happens to confidence intervals if we increase the sample size?

A

95% of the intervals still contain the true population mean. However, the confidence intervals will now be narrower, as a larger sample size gives us more information and therefore more certainty about the estimate.

44
Q

Does the proportion of confidence intervals that contain the population mean change much as the sample size increases?

A

No - there are small differences due to random variation, but we expect that 95% of all the confidence intervals will contain the population mean
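
The repeated-sampling idea can be checked with a small simulation. This sketch assumes a normally distributed population with a made-up mean of 170 and standard deviation of 10, and uses estimate ± 2 × SE for each interval. Coverage should stay near 95% while the average width shrinks as n grows.

```python
import math
import random


def mean_and_se(sample):
    """Sample mean and its standard error (s / sqrt(n))."""
    n = len(sample)
    m = sum(sample) / n
    variance = sum((x - m) ** 2 for x in sample) / (n - 1)
    return m, math.sqrt(variance) / math.sqrt(n)


def simulate_coverage(n, repeats=2000, pop_mean=170.0, pop_sd=10.0, seed=1):
    """Share of intervals containing pop_mean, and their average width."""
    rng = random.Random(seed)
    hits = 0
    total_width = 0.0
    for _ in range(repeats):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        m, se = mean_and_se(sample)
        lower, upper = m - 2 * se, m + 2 * se
        hits += lower <= pop_mean <= upper
        total_width += upper - lower
    return hits / repeats, total_width / repeats


results = {n: simulate_coverage(n) for n in (30, 100, 400)}
for n, (coverage, width) in results.items():
    print(f"n={n}: coverage {coverage:.3f}, average width {width:.2f}")
```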

45
Q

Interpretation of a confidence interval statement

A

We are 95% confident that the true population mean lies between the lower and upper confidence limit

OR

We are 95% confident that the true proportion lies between the lower and upper confidence limit

46
Q

Features of a box plot

A

The ‘25th percentile’ - 25% of the sample is below this point and 75% is above this point
The ‘median’ - 50% of the sample is above this point and 50% is below this point
The ‘75th percentile’ - 75% of the sample is below this point and 25% is above this point
The IQR (interquartile range) is the range between the 25th percentile and the 75th percentile, and it contains the central 50% of the sample
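
A sketch of the box plot quantities using `statistics.quantiles`. The heights are made up, and the default "exclusive" method is an assumption; other quartile methods can give slightly different values on small samples.

```python
from statistics import quantiles

# Hypothetical heights in cm for a sample of 11 people.
heights_cm = [150, 155, 158, 160, 163, 165, 168, 170, 174, 178, 185]

# n=4 splits the data into quarters: 25th percentile, median, 75th percentile.
q1, median, q3 = quantiles(heights_cm, n=4)

# IQR = 75th percentile - 25th percentile (the central 50% of the sample).
iqr = q3 - q1
print(q1, median, q3, iqr)  # 158.0 165.0 174.0 16.0
```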

Check boxplot image on desktop

47
Q

Confidence intervals coverage - proportions

A

We can apply the general formula to proportions using: proportion ± 2 × SE
Always use the proportion NOT the percentage (i.e. write it between 0-1)
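
A sketch of the interval for a proportion. The cards do not give the standard error formula for a proportion; this example assumes the usual large-sample form SE = √(p(1 - p)/n). The function name and the counts are made up.

```python
import math


def ci95_proportion(count, n, multiplier=2.0):
    """Return (p, lower, upper) using p -/+ multiplier x SE."""
    p = count / n  # work with the proportion (0 to 1), not the percentage
    se = math.sqrt(p * (1 - p) / n)  # assumed large-sample SE formula
    return p, p - multiplier * se, p + multiplier * se


# Hypothetical sample: 40 of 200 people have the characteristic.
p, lower, upper = ci95_proportion(40, 200)
print(f"p = {p:.2f}, 95% CI {lower:.3f} to {upper:.3f}")
```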

48
Q

What does a wide confidence interval mean?

A

It means that there are lots of plausible values for the population mean, so we don’t know much about the population. The interval gets narrower, and the estimate more precise, with a larger sample size. The narrower the confidence interval, the more certainty we have about the size of the population mean.

49
Q

Comparing groups

A

If there is no difference between the groups, the difference in means will sit on zero.
To find the difference between two groups, subtract one group's proportion from the other's.
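
A sketch of comparing two groups by the difference in proportions. The interval around the difference is not given in the cards; this example assumes the usual large-sample standard error, √(SE₁² + SE₂²), and made-up counts. An interval that includes zero is consistent with no difference.

```python
import math


def difference_in_proportions(count1, n1, count2, n2, multiplier=2.0):
    """Return (difference, lower, upper) for group 1 minus group 2."""
    p1, p2 = count1 / n1, count2 / n2
    diff = p1 - p2
    # Assumed large-sample SE of the difference between two proportions.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - multiplier * se, diff + multiplier * se


# Hypothetical groups: 30 of 100 versus 20 of 100 have the characteristic.
diff, lower, upper = difference_in_proportions(30, 100, 20, 100)
print(f"difference {diff:.2f}, 95% CI {lower:.3f} to {upper:.3f}")
print("interval includes zero:", lower <= 0 <= upper)
```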

50
Q

What happens to the sampling distribution when sample size is increased?

A

If sample size is increased then the sampling distribution gets narrower

51
Q

What happens to
mean +/- standard deviation and …
95% confidence intervals
when sample size is increased?

A

The first stays about the same and the second gets narrower

52
Q

Technical variation

A

Variation that is a result of how a sample is obtained e.g. how measurements are taken, angle of tape measure when height is taken etc.

53
Q

Biological variation

A

Sources of variation as a result of biological features such as genetics, nutrition, mutations etc.

54
Q

What is the best way to present a small amount of data?

A

A dot plot is good for a small amount of data as it shows the data exactly. With just a few data points, the dot plot displays the data more clearly - you can see exactly what the values are; the shape is not as important. A histogram tends to be very spiky and a boxplot hides the fact that there are very few data points.

55
Q

When can you remove outliers from a set of data collected?

A

You need to be certain that they are truly errors, and that they will cause more bias if you leave them in than if you exclude them. Ideally you would correct the errors instead of just removing them altogether.

56
Q

As the sample size increases, what happens to the mean of all the sample means?

A

The mean of all the sample means doesn’t change much - all are estimating the population mean

57
Q

As the sample size increases, what happens to the standard deviation of all the sample means (standard error)?

A

Standard error decreases as the sample size increases, because with a larger sample size there is less variation between the sample means
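
A quick numerical illustration using SE = s / √n. The standard deviation of 10 and the sample sizes are made up; the point is that quadrupling the sample size halves the standard error.

```python
import math

# Hypothetical sample standard deviation, held fixed across sample sizes.
s = 10.0

# SE = s / sqrt(n) for increasing sample sizes.
se_by_n = {n: s / math.sqrt(n) for n in (25, 100, 400)}
print(se_by_n)  # {25: 2.0, 100: 1.0, 400: 0.5}
```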

58
Q

Regression lines

A

It is simply the line that best fits the data
y = mx + c, where m is the slope and c is the intercept
There is less variation in regression lines as sample size increases, and you get a much better sense of the true relationship between the variables being investigated
Small samples are less reliable, as shown by the variation in their regression lines
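
A sketch of fitting y = mx + c by ordinary least squares, which is the usual meaning of "best fit" but is an assumption here because the card does not name the method. The function name and data points are made up.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope m and intercept c for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx            # slope
    c = mean_y - m * mean_x  # intercept: the line passes through the means
    return m, c


# Hypothetical points lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
m, c = fit_line(xs, ys)
print(m, c)  # 2.0 1.0
```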