Statistics Flashcards

1
Q

What different types of data are there?

A

Quantitative - ‘values you can do maths on’, which is further subdivided into Continuous and Discrete:

Continuous data values depend on the unit and precision of measurement. E.g. height, BP, etc.

Discrete data values are whole numbers. E.g. number of children, length of stay in days

Quantitative data can also be described as:
interval data, where a value of zero does not mean ‘no measurement’, e.g. temperature in °C
vs
ratio data, where zero means no measurement, e.g. heart rate, height, etc.

Qualitative - ‘values you can’t do maths on’, which is further subdivided into Ordinal and Nominal:

Ordinal data values can be arranged in a particular order or ranking, but the increments between groups are not equal, not known, or not measurable. E.g. ASA grade

Nominal data has groups with different labels, but no order or ranking. E.g. Gender

2
Q

How do you choose which statistical test to use to analyse data?

A

In order to analyse data with appropriate statistical tests, we must first look at the characteristics of the data.

  • Nature of the data (Quantitative, either continuous or discrete vs Qualitative, either ordinal or nominal)
  • Distribution of the data (Normal/Parametric vs. Non-normal/Non-parametric)
  • Number of groups (2 vs more than 2 groups)
  • Paired vs unpaired groups
    (Paired data is when comparison is made of the same group under two separate scenarios, whereas unpaired data is when comparison is made of two unrelated or independent groups)
3
Q

What is normal/parametric and non-normal/non-parametric data?

A

These terms refer to the distribution of data within a dataset.

Qualitative data is always non-normal/non-parametric.

For quantitative data, a distribution curve can be created by plotting observed values on the x axis, and frequency on the y axis. If data is normally distributed, the curve is symmetrical and bell-shaped, and the mean, median and mode are the same.
If the data is non-normally distributed it is asymmetrical (such as skew in either direction) or not bell-shaped (such as bimodal distribution).

With skew, if the tail extends to the right this is rightward or positive skew, and if the tail extends to the left this is leftward or negative skew. For skewed data the mode, median and mean will not be the same. The mode is the most frequently occurring value and is the peak of the curve. The median has equal numbers of values above and below it, and moves towards the tail of the skewed data. The mean is the average of all the recorded values and also lies towards the tail of the skewed data.
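As an illustration, a small right-skewed sample (hypothetical numbers) shows the mean pulled furthest towards the tail, with the median in between and the mode at the peak:

```python
import statistics

# Illustrative right-skewed sample: the tail extends to the right.
data = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

mode = statistics.mode(data)      # most frequent value, the peak of the curve
median = statistics.median(data)  # equal numbers of values above and below
mean = statistics.mean(data)      # pulled furthest towards the tail

# For rightward (positive) skew: mode <= median <= mean
assert mode <= median <= mean
```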

4
Q

How would you describe data?

A

Quantitative data can be described according to its:

Central tendency (i.e. mean, median, and mode)
Mean (x̄) = Σx/n; Median = middle value; Mode = most common/frequent value.

For parametric data, we usually use the mean. For non-parametric data, we usually use the median value.

Spread (i.e. variance and standard deviations) of parametric data
Variance = Σ(x-x̄)^2 / (n-1)
[in original units squared]
SD = √ Σ(x-x̄)^2 / (n-1) [in original units]; ±1 SD includes 68% of the data, ±2 SD includes 95% (i.e. 2.5% excluded at either end), ±3 SD includes 99.7%.

Range (the lowest and highest values recorded) and interquartile range (IQR) for non-parametric data.

5
Q

How would you display qualitative data in graphical form?

A

Data is non-numerical, and each group has a label. Data can first be displayed in a frequency table, and then in a bar or pie chart. Each group can be expressed as a percentage of the total observations.

6
Q

How can you transform non-parametric data into parametric data?

A

Transformation of data refers to the application of mathematical functions to a dataset in an attempt to transform non-parametric data into parametric data, allowing more powerful statistical tests to be applied.

Right skew - √ or Log
Left skew - x^2
Exponential - Log or 1/x
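A quick sketch (illustrative numbers) of a log transform reducing rightward skew, measured here simply as the gap between mean and median:

```python
import math
import statistics

# Illustrative right-skewed data: log-transforming pulls the long tail in,
# bringing the mean and median closer together.
data = [1, 2, 2, 3, 4, 5, 10, 50]
logged = [math.log(x) for x in data]

skew_before = statistics.mean(data) - statistics.median(data)
skew_after = statistics.mean(logged) - statistics.median(logged)

# The transformed data is far less skewed by this crude measure.
assert skew_after < skew_before
```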

7
Q

How do you calculate variance?

A

Variance = Σ(x-x̄)^2 / (n-1)
[in original units squared]

Calculate the mean. Calculate all measured values minus the mean and square these values so that they are all positive values. Take the sum of all these values and divide by the degrees of freedom. The answer will be in the original units squared.
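The steps above can be sketched directly (illustrative data), and checked against the stdlib's sample variance:

```python
import statistics

# Worked example of the variance formula (illustrative values).
data = [4.0, 6.0, 8.0, 10.0, 12.0]

mean = sum(data) / len(data)                    # x-bar
squared_devs = [(x - mean) ** 2 for x in data]  # (x - x-bar)^2, all positive
variance = sum(squared_devs) / (len(data) - 1)  # divide by degrees of freedom

# Matches the stdlib's sample variance (which also uses n - 1)
assert abs(variance - statistics.variance(data)) < 1e-12
```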

8
Q

How do you calculate standard deviation?

A

The standard deviation of parametric data is calculated as the square root of the variance.

SD = √ Σ(x-x̄)^2 / (n-1) [in original units]

±1 SD includes 68% of the data, ±2 SD includes 95% (i.e. 2.5% excluded at either end), and ±3 SD includes 99.7%.

9
Q

What is meant by the standard error of the mean?

A

The standard error of the mean is used to determine whether the mean of the sample reflects the mean of the whole population.

It is inherent that the larger the sample size, the more likely the mean of the sample will reflect the mean of the whole population.

Also if the standard deviation is small, and hence the variance around the mean is small, then again you can be more confident that the mean of the sample is close to the mean of the whole population.

The standard error of the mean is calculated by dividing the standard deviation by the square root of the sample size.

SE = SD / √n = (√ Σ(x-x̄)^2 / (n-1)) / √n

The standard error of the mean can be thought of as the standard deviation of the sample means, so it can be said that 68% of sample means will lie within 1 SE of the true population mean, 95% within 2 SE, and 99.7% within 3 SE.
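A minimal sketch of the calculation (illustrative heart-rate values), using the conventional form SEM = SD/√n:

```python
import math
import statistics

# SEM sketch: SEM = SD / sqrt(n), where SD is the sample standard
# deviation (n - 1 denominator). Illustrative heart rates.
data = [70, 72, 74, 76, 78]

sd = statistics.stdev(data)
sem = sd / math.sqrt(len(data))

# Larger samples and smaller SDs both shrink the SEM.
assert sem < sd
```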

10
Q

What are confidence limits?

A

Confidence limits are used to describe the range over which the likely true answer will fall, OR the range of plausible values based on the observed sample. [It is NOT the range over which we can be 95% confident the true value lies]

Confidence limits are related to the SE of the mean. The range between 2 SE’s above the mean and 2SE’s below the mean is called the confidence interval (CI), and values at either end the confidence limits.

Strictly speaking a 95% confidence interval means that if we were to take 100 different samples and compute a 95% confidence interval for each sample, then approximately 95 of the 100 confidence intervals will contain the true mean value (μ).

The confidence limits are in the same units as the data measurements, which makes them much easier to interpret.
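A sketch of the 95% CI for a sample mean (illustrative data), using the exact 1.96 multiplier rather than the rounded 2 SE:

```python
import math
import statistics

# 95% CI for the mean: mean +/- 1.96 * SEM (illustrative values).
data = [70, 72, 74, 76, 78]

mean = statistics.mean(data)
sem = statistics.stdev(data) / math.sqrt(len(data))
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem  # confidence limits

# The limits bracket the sample mean, in the original units.
assert lower < mean < upper
```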

11
Q

Can you use the SE of the mean for non-parametric data?

A

No!

For data that is skewed, the standard deviation does not accurately reflect the variation of data around the mean. Therefore the SE of the mean cannot be meaningfully calculated.

Instead, for non-parametric data we tend to quote the median, the range, and the interquartile range within which the middle 50% (i.e. quartiles 2 and 3) of the results lie.

12
Q

What do we mean by the p value?

A

It is compared against the pre-specified alpha (the accepted risk of a type 1 error): if p falls below alpha, the null hypothesis is rejected.

The p value is the probability of getting the observed data, or something more extreme, when the null hypothesis is true.

How odd/surprising is the result I have observed if the null hypothesis is true?

If the result is sufficiently odd, we would assess the size of the difference, the reproducibility of the findings in other studies, and try and determine whether we should reject the null hypothesis.

[It is NOT the probability the null hypothesis is false - if you compared two identical interventions and there was a natural variation in observed results, the p value would clearly not be the probability that the difference was caused by the observed intervention]

13
Q

What is the null hypothesis?

A

When comparing groups, the null hypothesis states that there is no difference between them with respect to a particular variable. A research study will then go on to try and disprove the null hypothesis.

14
Q

What is a type 1 error?

A

A type 1 error, also referred to as an alpha error, is:
- a false positive
- the null hypothesis is wrongly rejected
- a difference is found when there is none

15
Q

What is a type 2 error?

A

A type 2 error, also referred to as a beta error, is:
- a false negative
- the null hypothesis is accepted when there is actually a difference between groups

The risk of a type 2 error increases with a small sample size, wide variation in the study population, and when a small difference is clinically important.

16
Q

What is meant by the power of a study?

A

Power = 1 - β

It can be thought of as, “How hard do I have to look, in order that when I don’t find a difference, I can believe that it’s because it is not there, and not simply because I didn’t look hard enough.”
It’s not “what do I have to do to find the difference?”

It is calculated prior to a study in order to determine the sample size.

You need to decide:
1. The size of the difference between groups you would consider clinically significant. (Analogy: the needle)

2. The spread of values you would expect to see, i.e. the standard deviation (obtained from previous studies). (Analogy: the haystack)

If the size of the difference you are looking for is very small, and the spread of that variable is very wide, you will need a large sample size. Otherwise when you come back and say there is no difference there, we will say you didn't look hard enough.

3. The alpha error - what p value am I looking for in order to accept that I've found a difference. If the alpha error is very small, a large sample size will be required.

4. The test. Parametric tests are more powerful than non-parametric tests.
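These ingredients come together in the standard two-group sample-size approximation (not stated on the card, so treat the formula and names here as an illustrative sketch): n per group ≈ 2·SD²·(z_α + z_β)² / δ².

```python
import math

# Hedged sketch of the usual two-group sample-size formula.
# z_alpha = 1.96 for a two-sided alpha of 0.05; z_beta = 0.8416 for 80% power.
def n_per_group(delta: float, sd: float,
                z_alpha: float = 1.96,
                z_beta: float = 0.8416) -> int:
    return math.ceil(2 * (sd ** 2) * (z_alpha + z_beta) ** 2 / delta ** 2)

# A smaller difference (the needle) or a wider spread (the haystack)
# both demand a larger sample.
assert n_per_group(delta=5, sd=10) < n_per_group(delta=2, sd=10)
assert n_per_group(delta=5, sd=10) < n_per_group(delta=5, sd=20)
```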
17
Q

What tests would you use to analyse quantitative data?

A

First you must determine whether the quantitative data is normal or non-normal in distribution, and how many groups there are.

For normally distributed data in 2 groups, Student's t-test should be used. This can be either paired or unpaired.

For normally distributed data in 3 or more groups, ANOVA (analysis of variance), or alternatively multiple t-tests with a Bonferroni correction, should be used.

For non-normally distributed quantitative data in 2 groups, the Mann-Whitney U test (unpaired) or the Wilcoxon signed-rank test (paired) should be used.

For non-normally distributed quantitative data in 3 or more groups, it is the Kruskal-Wallis test.

Of note, you can use non-parametric tests for parametric data (although they are less powerful). You should NOT use parametric tests for non-parametric data.
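The choices above can be encoded in a small helper (the function name is hypothetical; the test pairings follow the card):

```python
# Hypothetical helper walking the decision tree for quantitative data:
# distribution -> number of groups -> paired vs unpaired.
def choose_quantitative_test(normal: bool, groups: int, paired: bool) -> str:
    if normal:
        if groups == 2:
            return "paired t-test" if paired else "unpaired t-test"
        return "ANOVA"  # 3 or more groups
    if groups == 2:
        return "Wilcoxon signed-rank" if paired else "Mann-Whitney U"
    return "Kruskal-Wallis"  # 3 or more non-normal groups

assert choose_quantitative_test(True, 2, paired=True) == "paired t-test"
assert choose_quantitative_test(False, 3, paired=False) == "Kruskal-Wallis"
```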

18
Q

What tests would you use to analyse qualitative data?

A

Qualitative data is non-normal. It should be analysed by the chi-squared test, or for very low numbers Fisher's exact test can be used.

For the Chi Squared test, a contingency table is created which compares observed frequency against expected frequency if there were no difference between the groups.

The chi-squared statistic is the sum, over all cells, of the observed minus expected value squared, divided by the expected value.

X^2 = Σ (O-E)^2 / E

The degrees of freedom is calculated as the number of rows minus 1 times the number of columns minus 1.

dof = (rows - 1) * (columns - 1)

The X^2 value and dof are then referenced against a statistical table or entered into statistical software to get a p value.
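A worked example on a 2x2 contingency table (illustrative counts), compared against the 5% critical value for 1 degree of freedom (3.841):

```python
# Chi-squared statistic for a 2x2 table of illustrative counts.
observed = [[30, 20],   # group A: event / no event
            [10, 40]]   # group B: event / no event

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand  # expected frequency
        chi2 += (o - e) ** 2 / e                   # sum of (O - E)^2 / E

dof = (len(observed) - 1) * (len(observed[0]) - 1)

# For dof = 1 the 5% critical value is 3.841; here chi2 comfortably exceeds it.
assert dof == 1 and chi2 > 3.841
```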

19
Q

When testing correlation, what tests can we use and in what situations?

A

A correlation coefficient, r, is equal to 0 when there is no correlation between two variables, +1 if there is ‘perfect’ positive correlation, and -1 if there is ‘perfect’ negative correlation. For small amounts of data, a high r value is needed to be considered significant, but with huge amounts of data low r values can still be significant.

The coefficient of determination, r^2, gives the percentage of the variation in y (dependent variable) that can be attributed to variation in x (independent variable).

Pearson’s correlation coefficient can be used when both variables are parametric.

Spearman’s rho correlation coefficient is used when either variable is non-parametric.
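Pearson's r can be computed directly from its definition (illustrative data roughly following y = 2x):

```python
import math

# Pearson's correlation coefficient from first principles.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly y = 2x: strong positive correlation

mx, my = sum(x) / len(x), sum(y) / len(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))

r_squared = r ** 2  # proportion of variation in y attributable to x

assert 0.99 < r <= 1.0  # near-perfect positive correlation
```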

20
Q

In linear regression, what is least squares regression?

A

You take the mean of x (x̄) and the mean of y (ȳ) as the fulcrum (the fitted line always passes through this point), and then find the gradient of the line that minimises the sum of the squared vertical distances of all points from the line.
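A minimal sketch using the closed-form least-squares slope, with the intercept chosen so the line passes through the fulcrum (x̄, ȳ); the data are illustrative and lie exactly on y = 2x + 1:

```python
# Least-squares regression line through the fulcrum (x-bar, y-bar).
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]  # exactly y = 2x + 1

mx, my = sum(x) / len(x), sum(y) / len(y)
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
intercept = my - slope * mx  # forces the line through (x-bar, y-bar)

assert abs(slope - 2.0) < 1e-9 and abs(intercept - 1.0) < 1e-9
```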

21
Q

What is a Bland Altman plot?

A

A Bland Altman plot compares 2 measuring devices. It assumes neither device is a gold standard, and takes the best estimate of the true value to be the average of the two measurements.

A graph is plotted with the average of each pair of measured values on the x axis, and the difference between each pair of values on the y axis.

Once all values are plotted, the mean difference (the bias) is drawn, along with limits of agreement at ±2 SDs of the differences, so the agreement of the two devices can be visualised and the range in which agreement is clinically acceptable can be determined.
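The quantities plotted can be sketched as follows (illustrative paired readings; ±2 SD used for the limits as on the card):

```python
import statistics

# Bland-Altman quantities for paired readings from two devices.
device_a = [100, 105, 110, 120, 130]  # illustrative readings
device_b = [102, 104, 113, 118, 133]

averages = [(a + b) / 2 for a, b in zip(device_a, device_b)]  # x axis
differences = [a - b for a, b in zip(device_a, device_b)]     # y axis

bias = statistics.mean(differences)      # mean difference between devices
sd = statistics.stdev(differences)
limits = (bias - 2 * sd, bias + 2 * sd)  # limits of agreement

assert limits[0] < bias < limits[1]
```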

22
Q

Tell me about risk, absolute risk, absolute risk reduction and number needed to treat.

A

Risk is the chance of something happening.

Absolute risk is the event rate, and is usually calculated for each group in a study. E.g.
Event rate in the control group = ARc
and
Event rate in the treatment group = ARt

Absolute risk reduction (ARR) is the event rate in the control group minus the event rate in the treatment group.
ARR = ARc - ARt

Number needed to treat is the reciprocal of the ARR.
NNT = 1/ARR
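A worked example with illustrative event counts:

```python
# ARR and NNT from illustrative event rates.
events_control, n_control = 20, 100
events_treat, n_treat = 10, 100

ar_c = events_control / n_control  # absolute risk (event rate), control
ar_t = events_treat / n_treat      # absolute risk (event rate), treatment

arr = ar_c - ar_t  # absolute risk reduction
nnt = 1 / arr      # number needed to treat

assert abs(arr - 0.10) < 1e-9  # 10% absolute risk reduction
assert abs(nnt - 10.0) < 1e-9  # treat 10 patients to prevent one event
```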

23
Q

Define relative risk and relative risk reduction.

A

Relative risk, or the risk ratio, is the event rate in the treatment group divided by the event rate in the control group.

RR = ARt / ARc

A relative risk of 1 means no association, of <1 means a negative association, and of >1 means a positive association.

Relative risk reduction is the absolute risk reduction divided by the event rate in the control group.

RRR = (ARc - ARt) / ARc

It is also 1 - RR.

Both relative risk and relative risk reduction don’t, in themselves, inform you of the magnitude of the risk. That is to say, you should also look at the baseline prevalence to interpret them.

24
Q

Tell me about odds and odds ratio.

A

Odds is the number of events divided by the number of non-events. For example, rolling a fair die, you would expect the odds of rolling a 6 to be 1:5.

It will be calculated for both the control group and the treatment group.

The odds ratio is the odds in the treatment group divided by the odds in the control group.

OR is useful as it can be calculated in case control studies, whereas RR can only be used in cohort studies.

When prevalence is low, OR approximates to RR.
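The low-prevalence approximation can be checked numerically (illustrative counts):

```python
# Odds ratio vs relative risk with low event rates.
# 2x2 table: rows = treatment/control, columns = event/no event.
treat_event, treat_no = 10, 990
ctrl_event, ctrl_no = 20, 980

rr = (treat_event / (treat_event + treat_no)) / \
     (ctrl_event / (ctrl_event + ctrl_no))        # relative risk
odds_ratio = (treat_event / treat_no) / (ctrl_event / ctrl_no)

# When prevalence is low, the OR closely approximates the RR.
assert abs(odds_ratio - rr) < 0.02
```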