20 Flashcards

1
Q

What is the most important point to remember when selecting a study design, including the choice of statistical test? Outline what this entails.

A

The statistics must test the hypothesis.

The patient population or study sample (i.e. inclusion and exclusion criteria) must be selected to allow for a comparison that will test the hypothesis. The patient outcome measures (i.e. variables) must also be useful for testing the hypothesis. Once the data has been collected (or even before), the best way to select the most appropriate statistical test for the hypothesis is to use a statistical decision tree.

2
Q

Outline the statistical decision tree

A
3
Q

What is another way to look at a 'correlation' and a 'comparison between groups'?

A

Similarities and differences: a correlation looks for similarities (a relationship) between variables, whereas a comparison between groups looks for differences.

4
Q

Describe the different types of data

A

The second question in the decision tree asks what type of data the measured variable(s) are, distinguishing between continuous and discrete. Continuous and discrete variables are the two types of quantitative data.

However, in addition to quantitative data, there is also categorical and qualitative data (see figure 2 below). It is important to note that the statistical decision tree can only be used for statistical testing of quantitative data. The process for analysing qualitative data is very different and something that we will explore in the summer term within the RDS workshops.

Quantitative data is numerical information about quantities; it can be subdivided into discrete (counted) and continuous (measured) data.

Discrete data is a count that cannot be made more precise. Continuous data can be divided and reduced to finer and finer levels. There is an overlap between discrete and continuous data. For example, age is a discrete variable if going by the number of years, and continuous if looking for the exact age in months, days, hours, minutes or seconds.

Qualitative data is information about qualities, that is, information that can't actually be measured. Qualitative data deals with descriptive information, such as free-text comments to an open-ended question or responses in an interview.

Categorical data sits in between quantitative and qualitative data, because ordinal categories can easily be converted into numerical data. For example, a happiness scale can be given in numbers instead of words. Nominal categorical data, by contrast, is more like qualitative data, but it consists of individual terms rather than the sentences found in qualitative data.

There are some variables that could be measured quantitatively or qualitatively. For example, eye colour can be measured quantitatively by assessing the RGB scale, or qualitatively by categorising it as blue, brown, green, etc.

Broadly speaking, when you measure something and give it a number value, you create quantitative data. When you classify something, you create categorical data, and when you judge something, you create qualitative data. So far, so good. But this is just the highest level of data; there are also different types of quantitative data which you must understand to be able to use the tree.

5
Q

Describe the difference between nominal and ordinal categorical data.

A

Categorical data can also be subdivided into two types: nominal and ordinal. As a general rule, nominal data are unordered and ordinal data are (yes, you guessed it) ordered.

Nominal data are items that are assigned individual named categories that do not have an implicit or natural value or rank. For example, gender (male or female) or fracture incidence (yes or no).

Ordinal data are items which are assigned to categories that do have some kind of implicit or natural order, such as ‘small, medium, or large’. Ordinal variables are often used to describe a patient’s characteristics e.g. stage of hypertension, pain level, and satisfaction.

6
Q

Why is understanding the distribution of data so important?

A

Normality is the most fundamental assumption to test when choosing a statistical test. Simply put, the mathematics underpinning most statistical tests rely on the data having a normal distribution (i.e. roughly two-thirds of the data is within one standard deviation of the mean), and on that distribution being symmetrical (i.e. 50% of the data is above the mean average and 50% is below).

7
Q

What does normality measure

A

Normality measures the central tendency and dispersion of data and is used to decide how to describe the properties of large data-sets, i.e. which descriptive statistics are presented instead of the raw data.

8
Q

Why is it important to be able to identify whether data is normally distributed or not?

A

It is important to know the distribution when choosing a statistical test: normally distributed data can be analysed with parametric tests, whereas data that are not normally distributed require non-parametric tests.

9
Q

Graph used to determine whether data is normal

A

histogram

10
Q

Describe the three basic distributions of data.

A

The red histogram is a sample from a normal distribution, which has a symmetric distribution with well-behaved tails, i.e. many data points in the central region of the range and a symmetrical distribution either side. You will sometimes hear a normal distribution described as 'bell curved' or 'Gaussian'. The green and blue histograms are from samples that are not normally distributed.

Skewness

Skewed data (blue curve) is asymmetric, with many data points in the high or low end of the range and an uneven tail (long on one side and short on the other).

A left-skewed distribution has a long left tail. Left-skewed distributions are also called negatively-skewed distributions, because there is a long tail in the negative direction on the number line. The mean and median are also to the left of the peak (see figure 4).

A right-skewed distribution has a long right tail. Right-skewed distributions are also called positively-skewed distributions, because there is a long tail in the positive direction on the number line. The mean and median are also to the right of the peak.

Kurtosis

Kurtosis describes data that are heavy-tailed or light-tailed relative to a normal distribution. Data sets with high kurtosis tend to have heavy tails, or outliers, creating a very wide distribution. Data sets with low kurtosis tend to have light tails, or a lack of outliers, creating a very narrow distribution.

11
Q

How does one assess the distribution of their data?

A

Normality can be visually assessed by evaluating a frequency bar-chart or histogram. Both graphs will quickly reveal, by eye, whether the data is normal or not. There are also more accurate, mathematical ways to determine whether your data is normally distributed. The statistical tests of normality are:

Shapiro-Wilk test: used to test for normality with small sample sizes (n<50)

Kolmogorov-Smirnov: used to test for normality with large sample sizes (n>50)

Where n is the number of samples in the data set (i.e. the sample size). These tests can be undertaken using statistical software (SPSS, Stata, R) and a p-value <0.05 is considered to indicate a violation of normality (i.e. the data are NOT normally distributed). There are also websites that will allow you to do this for free, e.g. online Shapiro-Wilk or Kolmogorov-Smirnov calculators.

12
Q

Summarise the need for descriptive statistics

A

Once the data has been collected it needs to be organised into a format that will make it easier to understand. Descriptive statistics are used to condense large data-sets into a tangible format. The most basic type of descriptive statistic is a measure of central tendency, which can be the mean, median or mode. These measures of central tendency and dispersion are commonly used to describe the properties of large data-sets, i.e. the descriptive statistics are presented instead of the raw data. For example, male systolic blood pressure (sBP) could be presented as 117.6 ± 12.2 mmHg (mean ± SD) and female sBP as 106.2 ± 2.4 mmHg (mean ± SEM).

13
Q

Outline the different measures of dispersion

A

The main measures of dispersion are: the range (the difference between the largest and smallest values); the interquartile range (the spread of the middle 50% of the data); the variance; the standard deviation (SD, the square root of the variance); and the standard error of the mean (SEM, the SD divided by the square root of the sample size).

14
Q

Differentiate between dependent and independent data, and explain what longitudinal and cross-sectional studies are.

A

The last question in the statistical decision tree asks whether the groups that are being compared are dependent or independent, i.e. whether each group is composed of the same subjects of interest or of different ones. Consider the example below. On the left-hand side we can see paired data, an example of dependent groups. The fourth observation in each group is from Sally, and it is important for the test to take this into account, even though we're testing between the average measurements. This helps to take into account other factors that can be controlled for and that may have affected the measurements (e.g. genetics, height, perseverance). The right-hand side shows unpaired data, with two different groups containing different individuals: an example of independent groups.

Typically, paired observations arise from measuring the same variable in the same subject at different time-points (this could be referred to as a longitudinal experiment). Unpaired, or independent, observations are seen when comparing two groups with no common factors (this could be referred to as a cross-sectional study).

15
Q

Explain which type of tests are better when testing for differences

A

Wherever possible it is better to use parametric rather than non-parametric statistics. Parametric tests are easier to understand, the analyses are more powerful, and they are less likely to incorrectly reject or fail to reject a hypothesis.

16
Q

For non-parametric data with three groups or more, state the test you would use if the data is paired and if the data is unpaired.

A

Non-parametric, 3 groups or more, paired – Friedman test

Non-parametric, 3 groups or more, unpaired – Kruskal-Wallis

17
Q

Explain the basis behind different parametric tests

A

The most commonly used parametric statistic is the t-test and quite often it is used incorrectly! There are two main variants of the t-test:

Paired t-test: two related measurements of the same variable (e.g. before and after) are compared from the same sample

Unpaired (independent) t-test: the same variable is compared but from different samples

T-tests should only be used to compare means from two samples with a normal distribution, and although a t-test provides various pieces of information, you only need to report the p-value.

One-way ANOVA

This test is used to compare the means from more than two samples with a normal distribution. Once again, you only need to present the p-value that is generated. However, it is important to note that a one-way ANOVA will only tell you if a difference exists between your samples, e.g. it will inform you that sample A, sample B and sample C have different means, but it will not tell you where the difference is, i.e. is it between A & B, A & C or B & C? To calculate exactly where the difference is you will need to undertake a post hoc test, such as a Tukey or a Bonferroni post hoc test (see the sketch below).

There are also variants of the ANOVA test:

Repeated measures one-way ANOVA: this is essentially the ANOVA 'equivalent' of a paired t-test. The same sample is used, but it is measured more than twice.

Two-way ANOVA: whilst the one-way ANOVA relates one independent variable to a dependent variable, the two-way ANOVA includes two independent variables. Two-way ANOVAs are quite complex and beyond the scope of this course.

MANOVA: this is a multivariate ANOVA test, which once again is beyond the scope of this course.

18
Q

Explain the importance of the different values in Pearson's and Spearman's rank correlation.

A

Pearson’s correlation

A Pearson correlation coefficient will give you an r-value, which tells you how strong the relationship is. It varies between -1 (a perfect negative correlation) and +1 (a perfect positive correlation).

A general rule of thumb that will help when considering the strength of the correlation is:

(±) 0-0.2: very low correlation

(±) 0.2-0.4: low correlation

(±) 0.4-0.6: reasonable correlation

(±) 0.6-0.8: high correlation

(±) 0.8-1.0: very high correlation

A Pearson correlation may also give you a p-value. The p-value in this case tells you how reliable the r-value is. The smaller the p-value, the more reliable the r-value.

The r²-value can also be reported from a Pearson correlation. This represents how closely your data are fitted to the correlation line. A similar rule of thumb applies to both the r- and r²-values, i.e. the higher the r²-value, the more reliable your conclusion can be.

Spearman rank correlation

If, however, your data are not normally distributed, you should use a Spearman's correlation test to identify monotonic (consistently increasing or decreasing) trends. The statistical output provides two key pieces of information:

A correlation coefficient (Spearman’s rho, denoted by ρ) is the equivalent of the Pearson r-value.

The p-value, once again, tells you how reliable the rho-value is. The smaller the p-value, the more reliable the rho-value.

19
Q

Explain the difference between correlation and regression.

A

Correlation and regression are very easily confused. In the simplest terms, correlation indicates the strength of the relationship between two variables. Regression quantifies the association between the two variables, i.e. it tells us the impact that changing one variable will have on the other variable.

It is defined by a simple equation: y = a + bx

Where:

a = the y-axis intercept value

b = the gradient of the line, i.e. the regression coefficient

20
Q

Explain what the chi-squared test is and what it is used for.

A

The chi-squared test is primarily used to test for differences between two discrete/categorical variables; however, it can also be used to look for similarities.

It is a measure of the differences between observed and expected frequencies. Although the test is called the chi-squared test and is represented by the Greek symbol χ (or χ²), the statistic reported is the Roman symbol Χ (or Χ²).

If the observed and the expected frequencies are the same then Χ² = 0; the higher the value of Χ², the bigger the difference between the observed and the expected frequencies. There are various factors that impact the Χ² value and it can therefore be difficult to interpret. Hence the p-value is once again used to report a significant difference.