2 Flashcards
What is the most important point to remember when selecting a study design, including the choice of statistical test? Outline what this entails.
The statistics must test the hypothesis.
The patient population or study sample (i.e. inclusions and exclusion criteria) must be selected to allow for a comparison that will test the hypothesis. The patient outcome measures (i.e. variables) must also be useful for testing the hypothesis. Once the data has been collected (or even before) the best way to select the most appropriate statistical test for the hypothesis is to use a statistical decision tree.
Outline the statistical decision tree
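What is another way to look at a ‘correlation’ and a ‘comparison between groups’?
Similarities and differences: a correlation looks for similarities (relationships) between variables, while a comparison between groups looks for differences.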
Describe the different types of data
The second question in the decision tree asks what type of data the measured variable(s) are and it distinguishes between continuous and discrete. Continuous and discrete variables are two types of quantitative data.
However, in addition to quantitative data, there are also categorical and qualitative data (see figure 2 below). It is important to note that the statistical decision tree can only be used for statistical testing of quantitative data. The process for analysing qualitative data is very different and is something that we will explore in the summer term within the RDS workshops.
Quantitative data is numerical information about quantities. It can be subdivided into:
discrete (counted) and continuous (measured). Discrete data is a count that cannot be made more precise. Continuous data can be divided and reduced to finer and finer levels. There is an overlap between discrete and continuous data. For example, age is a discrete variable if going by the number of years and continuous if looking for the exact age in months, days, hours, minutes or seconds.
Qualitative data is information about qualities; that is, information that can’t actually be measured. Qualitative data deals with descriptive information such as free-text comments in response to an open-ended question or responses to an interview.
Categorical data sits between quantitative and qualitative data. Ordinal categories can easily be converted into numerical data; for example, a happiness scale can be given in numbers instead of words. Nominal categorical data is closer to qualitative data, but it consists of individual terms rather than the full sentences found in qualitative data.
There are some variables that could be measured quantitatively or qualitatively. For example, eye colour can be measured quantitatively by assessing the RGB scale or qualitatively by categorising into blue, brown or green etc.
Broadly speaking, when you measure something and give it a number value, you create quantitative data. When you classify something, you create categorical data, and when you judge something, you create qualitative data. So far, so good. But this is just the highest level of data; there are also different types of quantitative data, which you must understand to be able to use the tree.
Describe the difference between nominal and ordinal categorical data
Categorical data can also be subdivided into two types: nominal and ordinal. As a general rule nominal data are unordered and ordinal data are (yes you guessed it) ordered.
Nominal data are items that are assigned individual named categories that do not have an implicit or natural value or rank. For example, gender (male or female) or fracture incidence (yes or no).
Ordinal data are items which are assigned to categories that do have some kind of implicit or natural order, such as ‘small, medium, or large’. Ordinal variables are often used to describe a patient’s characteristics e.g. stage of hypertension, pain level, and satisfaction.
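The nominal/ordinal distinction can be illustrated with a minimal Python sketch. The pain scale, its rank numbers and the observations below are all hypothetical, invented for illustration: ordinal categories can be mapped to ranks and sorted, whereas nominal categories can only be counted.

```python
from collections import Counter

# Hypothetical ordinal scale: the order of the categories carries meaning.
PAIN_RANK = {"mild": 1, "moderate": 2, "severe": 3}

# Hypothetical observations, invented for illustration.
pain_levels = ["severe", "mild", "moderate", "mild"]  # ordinal
fracture = ["yes", "no", "no", "yes"]                 # nominal: no natural order

# Ordinal data can be converted to ranks and sorted meaningfully.
pain_ranks = [PAIN_RANK[level] for level in pain_levels]
sorted_pain = sorted(pain_levels, key=PAIN_RANK.get)

# Nominal data can only be tallied per category.
fracture_counts = Counter(fracture)
```

Note that the ranks only encode order: the gap between "mild" and "moderate" is not necessarily the same size as the gap between "moderate" and "severe".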
Why is understanding the distribution of data so important?
Normality is the most fundamental assumption to test when choosing a statistical test. Simply put, the mathematics underpinning most statistical tests relies on the data having a normal distribution (i.e. roughly two-thirds, about 68%, of the data lies within one standard deviation of the mean) and on the distribution being symmetrical (i.e. 50% of the data is above the mean and 50% is below).
What does normality measure?
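The two properties above can be checked empirically. The following is a small sketch, assuming a simulated sample (the mean of 120, the standard deviation of 12 and the seed are arbitrary choices, not values from the course material):

```python
import random
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical normally distributed sample (values are simulated, not real data).
sample = [rng.gauss(120, 12) for _ in range(10_000)]

m = mean(sample)
sd = stdev(sample)

# Proportion of observations within one standard deviation of the mean.
within_one_sd = sum(m - sd <= x <= m + sd for x in sample) / len(sample)

# Proportion of observations above the mean (symmetry check).
above_mean = sum(x > m for x in sample) / len(sample)

print(f"within one SD: {within_one_sd:.1%}, above the mean: {above_mean:.1%}")
```

For a sample drawn from a normal distribution, the first proportion should be close to 68% and the second close to 50%; a skewed sample would typically fail the second check.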
Normality measures the central tendency and dispersion of data and is used to decide how to describe the properties of large data-sets i.e. the descriptive statistics which are presented instead of the raw data.
Why is it important to be able to identify whether data is normally or not normally distributed?
It is important to know the distribution when choosing a statistical test.
Graph used to determine whether data is normal
histogram
Describe the three basic distributions of data
The red histogram is a sample from a normal distribution, which is symmetric with well-behaved tails, i.e. many data points in the central region of the range and a symmetrical distribution on either side. You will sometimes hear a normal distribution described as ‘bell curved’ or ‘Gaussian’. The green and blue histograms are from samples that are not normally distributed.
Skewness
Skewed data (blue curve) is asymmetric, with many data points at the high or low end of the range and an uneven tail (long on one side and short on the other). A left-skewed distribution has a long left tail. Left-skewed distributions are also called negatively skewed distributions, because the long tail points in the negative direction on the number line. The mean and median are also to the left of the peak (see figure 4). A right-skewed distribution has a long right tail. Right-skewed distributions are also called positively skewed distributions, because the long tail points in the positive direction on the number line. The mean and median are also to the right of the peak. Kurtosis describes data that are heavy-tailed or light-tailed relative to a normal distribution. Data sets with high kurtosis tend to have heavy tails, or outliers, which create a very wide distribution. Data sets with low kurtosis tend to have light tails, or a lack of outliers, which creates a very narrow distribution.
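Skewness and kurtosis can also be expressed as numbers. The sketch below uses the simple moment-based definitions (statistical packages often apply small-sample corrections, so their values may differ slightly); the function names and the example data sets are hypothetical.

```python
from statistics import fmean

def skewness(data):
    """Moment-based skewness: negative = long left tail, positive = long right tail."""
    m = fmean(data)
    m2 = fmean([(x - m) ** 2 for x in data])
    m3 = fmean([(x - m) ** 3 for x in data])
    return m3 / m2 ** 1.5

def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so a normal distribution scores about 0."""
    m = fmean(data)
    m2 = fmean([(x - m) ** 2 for x in data])
    m4 = fmean([(x - m) ** 4 for x in data])
    return m4 / m2 ** 2 - 3

# Hypothetical data sets, invented for illustration.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]         # long right tail
left_skewed = [-x for x in right_skewed]         # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]
heavy_tailed = [-10, -1, -1, 0, 0, 0, 0, 1, 1, 10]  # outliers at both ends
```

With these definitions, `skewness(right_skewed)` is positive, `skewness(left_skewed)` is negative, and `excess_kurtosis(heavy_tailed)` is positive, while `excess_kurtosis(symmetric)` is negative because the data have no outliers.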
How does one assess the distribution of their data?
Normality can be visually assessed by evaluating a frequency bar chart or histogram. Both graphs will quickly reveal, by eye, whether the data is normal or not normal. For a more accurate assessment, there are statistical tests of normality:
Shapiro-Wilk test: used to test for normality with small sample sizes (n<50)
Kolmogorov-Smirnov test: used to test for normality with large sample sizes (n>50)
Where n is the number of observations in the data set (i.e. the sample size). These tests can be undertaken using statistical software (SPSS, Stata, R), and a p-value <0.05 is considered to indicate a violation of normality (i.e. the data are NOT normally distributed). There are also websites that will allow you to do this for free, e.g. for the Shapiro-Wilk or Kolmogorov-Smirnov test.
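To give a feel for what a normality test does, here is a rough sketch of the idea behind the Kolmogorov-Smirnov test using only the Python standard library. It is not a substitute for statistical software: it returns no p-value, and the threshold 1.36 / √n is the large-sample 5% critical value for a fully specified distribution, which is only approximate (and conservative) here because the mean and standard deviation are estimated from the same data. The function names and the simulated samples are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def ks_statistic_normal(data):
    """Largest gap between the empirical CDF and a normal CDF fitted to the data."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def looks_non_normal(data, coefficient=1.36):
    """Approximate 5% decision rule: flag the data if the gap exceeds 1.36 / sqrt(n)."""
    return ks_statistic_normal(data) > coefficient / math.sqrt(len(data))

rng = random.Random(1)  # fixed seed so the sketch is repeatable
normal_sample = [rng.gauss(0, 1) for _ in range(500)]       # simulated normal data
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]  # simulated right-skewed data

print(looks_non_normal(normal_sample), looks_non_normal(skewed_sample))
```

In practice the statistical packages named above report the test statistic together with a p-value, and the p<0.05 rule is applied to that p-value rather than to a hand-computed threshold.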
Summarise the need for descriptive statistics
Once the data has been collected it needs to be organised into a format that will make it easier to understand. Descriptive statistics are used to categorise large data-sets into a tangible format. The most basic type of descriptive statistic is a measure of central tendency, which can be the mean, the mode or the median. Measures of central tendency, together with measures of dispersion, are commonly used to describe the properties of large data-sets, i.e. the descriptive statistics are presented instead of the raw data. For example, male systolic blood pressure (sBP) could be presented as 117.6 ± 12.2 mmHg (mean ± SD) and female sBP could be presented as 106.2 ± 2.4 mmHg (mean ± SEM).
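As a minimal sketch of how such a summary is produced, the following computes the mean with both the standard deviation (SD) and the standard error of the mean (SEM). The blood pressure readings are hypothetical values invented for illustration, not the data behind the figures quoted above.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical systolic blood pressure readings in mmHg (invented values).
sbp = [112, 125, 118, 131, 109, 121, 116, 127]

m = mean(sbp)
sd = stdev(sbp)            # sample standard deviation: spread of the readings
sem = sd / sqrt(len(sbp))  # standard error of the mean: precision of the mean

print(f"{m:.1f} ± {sd:.1f} mmHg (mean ± SD)")
print(f"{m:.1f} ± {sem:.1f} mmHg (mean ± SEM)")
```

Because the SEM is the SD divided by √n, it is always smaller than the SD for more than one observation, which is why a summary should state which of the two is being reported.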
Outline the different measures of dispersion
light work
Differentiate between dependent and independent data and exemplify what longitudinal and cross-sectional studies are
The last question in the statistical decision tree asks whether the groups that are being compared are dependent or independent. This question is asking whether each group is composed of the same subjects of interest, or if they are different. Consider the example below. On the left-hand side we can see paired data, an example of dependent groups. The fourth observation in each group is from Sally, and it is important for the test to take this into account, even though we’re testing between the average measurements. It helps to take into account other factors that can be controlled for that may have affected measurements (e.g. genetics, height, perseverance). The right-hand examples show unpaired data: two different groups containing different individuals, an example of independent groups.
Typically, paired observations arise from measuring the same variable in the same subject at different time-points (this could be referred to as a longitudinal experiment). Unpaired, or independent, observations are seen when comparing two groups with no common factors (this could be referred to as a cross-sectional study).
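The sketch below shows why the pairing matters, assuming hypothetical measurements on the same five subjects at two time-points (all values are invented). It computes a paired t statistic from each subject's own change, and then, for contrast, an unpaired (Welch-style) t statistic that wrongly treats the two lists as unrelated groups.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical measurements for the same five subjects at two time-points.
before = [140, 152, 138, 160, 147]
after = [135, 147, 134, 154, 143]

# Dependent (paired): analyse each subject's own change.
differences = [b - a for b, a in zip(before, after)]
paired_t = mean(differences) / (stdev(differences) / sqrt(len(differences)))

# Independent (unpaired, Welch form): ignores which subject each value came from.
unpaired_t = (mean(before) - mean(after)) / sqrt(
    variance(before) / len(before) + variance(after) / len(after)
)

print(f"paired t = {paired_t:.2f}, unpaired t = {unpaired_t:.2f}")
```

In this example the paired statistic is much larger, because the large differences between subjects cancel out when each subject is compared with themselves. Treating dependent data as independent would hide a consistent change.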
Explain which type of tests are better when testing for differences
Wherever possible it is better to use parametric rather than non-parametric statistics. Parametric tests are easier to understand, the analyses are more powerful, and they are less likely to incorrectly reject or fail to reject a hypothesis.