Variables and Distributions Flashcards
What do we call the different aspects of the individual in the study that are measured?
These are called variables.
What are the two main types of variables?
Categorical and continuous variables.
What are categorical variables?
With categorical variables the response can be categorised into a number of distinct groups or categories. For example a variable on smoking status may have the categories ‘current smoker’, ‘ex-smoker’ and ‘never-smoker’.
When there are only two possible responses, e.g. yes or no, diseased or not diseased, the variable is called binary.
When there are three or more categories and there is some logical ‘order’ to the categories (e.g. never drinker, occasional drinker, regular drinker), the variable is called an ordinal categorical variable. If there is no ordering (e.g. ethnic group, sex), this is called a nominal categorical variable.
What is a binary categorical variable?
A binary categorical variable is a categorical variable for which there are only two possible responses. For example yes or no.
What is an ordinal categorical variable?
An ordinal categorical variable is a categorical variable for which there are three or more categories and there is some logical order to the categories (e.g. never drinker, occasional drinker, regular drinker).
What is a nominal categorical variable?
A nominal categorical variable is a categorical variable for which there are three or more categories but no ordering (e.g. ethnic group, sex).
What is a continuous variable?
For continuous variables the responses are numerical and may take any value on a well-defined continuous scale, e.g. height, weight, blood pressure, pack-years of smoking. Note it is possible to recode continuous data into a categorical variable: for example, weight may be recoded as an ordered categorical variable (underweight, optimal weight, pre-obese or obese), or as a binary variable (underweight or not). However, this will result in a loss of information and may restrict the statistical tests which can be carried out on the new variable, so make sure this is beneficial before you do it.
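The recoding described above can be sketched in Python. This is only an illustration under stated assumptions: it uses body mass index rather than raw weight, the cut-offs (18.5, 25 and 30 kg/m²) are the commonly quoted adult bands and are not given in these flashcards, and the function names are hypothetical.

```python
def bmi_category(bmi):
    """Recode a continuous BMI value (kg/m^2) into an ordered category.

    The cut-offs are assumed for illustration; they do not come from the flashcards.
    """
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "optimal weight"
    if bmi < 30:
        return "pre-obese"
    return "obese"


def is_underweight(bmi):
    """Recode the same continuous value into a binary variable."""
    return bmi < 18.5


bmis = [17.9, 22.4, 24.8, 27.1, 31.6]  # made-up values
categories = [bmi_category(b) for b in bmis]
binary = [is_underweight(b) for b in bmis]
print(categories)
print(binary)
```

Note that 22.4 and 24.8 end up in the same category, which is the loss of information the answer warns about.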
How should one go about summarising categorical variables?
To summarise categorical data (including binary data), we simply count up the number of observations in each category; these counts are called frequencies. We usually express these as proportions or percentages of the total number of individuals.
For example we may study 200 people and see that 100 (0.5 or 50%) are never smokers, 60 (0.3 or 30%) are ex smokers and 40 (0.2 or 20%) are current smokers.
We can then present these numbers either in table format or graphically. However you choose to present the data, remember it should be clear and immediately evident to the reader what they are looking at. Use clear headings for columns and rows in tables, and informative legends for charts or figures. Categorical data are plotted either using a bar chart or a pie chart. A bar chart consists of a bar for each category where the lengths of the bars are proportional to the frequencies. Note the bars do not touch as the data are not continuous but fall into distinct categories. In a pie chart the area of each segment is proportional to the frequency in that category.
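As a rough sketch, the counting step can be done in Python with `collections.Counter`. The data reproduce the 200-person smoking example above; the function name `summarise_categories` is hypothetical.

```python
from collections import Counter


def summarise_categories(responses):
    """Return {category: (frequency, proportion, percentage)}."""
    counts = Counter(responses)
    total = len(responses)
    return {
        category: (n, n / total, 100 * n / total)
        for category, n in counts.items()
    }


# The smoking example from the text: 100 never, 60 ex and 40 current smokers.
responses = ["never"] * 100 + ["ex"] * 60 + ["current"] * 40
summary = summarise_categories(responses)

for category, (freq, proportion, percent) in summary.items():
    print(f"{category:8s} {freq:4d} {proportion:5.2f} {percent:5.1f}%")
```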
How should one go about summarising continuous variables?
For continuous variables, we can summarise the data graphically using a histogram or box plots, and we can also compute summary measures of data location (mean, median, etc.) and spread (range, standard deviation, etc.).
To produce a histogram, we need to first group the data into ranges, and then count the number of observations in each group. These counts are called the frequency distribution. Identifying the lowest and highest values helps you decide on how the data should be grouped. Too few groups will mean detail is lost but too many groups will result in hardly any observations in each group. Having formed the frequency distribution, we can plot the number in each range to get a histogram. In a histogram the bars touch each other (unlike a bar chart) to indicate that the data are continuous. Note the area of each bar is proportional to the number of people in that range. Do not use 3D graphs, as the added perspective distorts the apparent sizes of the bars.
A wide range of human characteristics follow a Normal distribution. Characteristics of the Normal distribution are:
- it is symmetrical and bell shaped
- the two extremes of the distribution are plus and minus infinity; they get very close to the x-axis but never quite reach it
- 95% of the data lie within 2 (or more precisely 1.96) standard deviations of the mean
- when plotted with fraction on the Y-axis, the area under the curve is equal to 1
Not all continuous variables follow a Normal distribution. Often data are skewed, either to the right (positively skewed, longer tail on the right), or to the left (negatively skewed, longer tail on the left).
Another way of summarising a continuous variable is to compute summary measures, namely a measure of where the data are located, and a measure of the spread of the data. In other words, for our height data: in general, how tall were the children, and how varied were the measurements?
The three measures of data location are:
- the mean: the sum of all the values divided by the total number of people
- the mode: the most common value
- the median: the middle value (50th centile) when the values are placed in ascending numerical order, i.e. the median is the 1/2(n+1)th value. If there is an even number of readings, the mean of the middle two values is taken as the median.
Describe how you would go about making a histogram for continuous data.
To produce a histogram, we need to first group the data into ranges, and then count the number of observations in each group. These counts are called the frequency distribution. Identifying the lowest and highest values helps you decide on how the data should be grouped. Too few groups will mean detail is lost but too many groups will result in hardly any observations in each group. Having formed the frequency distribution, we can plot the number in each range to get a histogram. In a histogram the bars touch each other (unlike a bar chart) to indicate that the data are continuous. Note the area of each bar is proportional to the number of people in that range. Do not use 3D graphs, as the added perspective distorts the apparent sizes of the bars.
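The grouping and counting steps can be sketched in plain Python. The heights are made-up values, and the starting point, group width and function name are assumptions chosen for illustration; each range includes its lower bound and excludes its upper bound.

```python
def frequency_distribution(values, start, width, n_groups):
    """Count how many values fall in each range [lower, upper).

    Values outside the overall span are ignored, so choose start and
    n_groups after looking at the lowest and highest values.
    """
    counts = [0] * n_groups
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_groups:
            counts[index] += 1
    return [
        (start + i * width, start + (i + 1) * width, counts[i])
        for i in range(n_groups)
    ]


# Made-up heights in cm; lowest is 121 and highest is 142.
heights = [121, 124, 127, 128, 130, 131, 131, 133, 134, 136, 138, 142]
table = frequency_distribution(heights, start=120, width=5, n_groups=5)

# A crude text histogram: one '#' per observation in each range.
for lower, upper, count in table:
    print(f"{lower}-{upper}: {'#' * count}")
```

With equal-width groups, as here, bar height and bar area are proportional to the same count.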
What are the characteristics of a normal distribution?
A wide range of human characteristics follow a Normal distribution. Characteristics of the Normal distribution are:
- it is symmetrical and bell shaped
- the two extremes of the distribution are plus and minus infinity; they get very close to the x-axis but never quite reach it
- 95% of the data lie within 2 (or more precisely 1.96) standard deviations of the mean
- when plotted with fraction on the Y-axis, the area under the curve is equal to 1
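The 95% figure can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `fraction_within` is hypothetical.

```python
from statistics import NormalDist

# Standard Normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def fraction_within(k):
    """Fraction of a Normal distribution lying within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_1_96 = fraction_within(1.96)
within_2 = fraction_within(2)
print(f"within 1.96 SD: {within_1_96:.4f}")
print(f"within 2 SD:    {within_2:.4f}")
```

The result does not depend on the particular mean or standard deviation, because the distance from the mean is measured in standard deviations.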
How would we describe continuous data if the plotted histogram of said data has a longer tail on:
A). The right
B). The left
A). Positively skewed data
B). Negatively skewed data
What are the three measures of data location?
The three measures of data location are:
- the mean: the sum of all the values divided by the total number of people
- the mode: the most common value
- the median: the middle value (50th centile) when the values are placed in ascending numerical order, i.e. the median is the 1/2(n+1)th value. If there is an even number of readings, the mean of the middle two values is taken as the median.
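A small illustration of the three measures using the Python standard library `statistics` module. The readings are made up.

```python
import statistics

readings = [3, 5, 5, 6, 8, 9, 13]  # seven made-up readings, already in order

mean = statistics.mean(readings)      # sum (49) divided by the number of readings (7)
mode = statistics.mode(readings)      # the most common value
median = statistics.median(readings)  # the 1/2(7+1) = 4th value

# With an even number of readings the two middle values are averaged.
even_median = statistics.median([3, 5, 6, 8])

print(mean, mode, median, even_median)
```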
What can be used to measure the spread or variation of observations in continuous data?
The spread or variation in the observations can be measured using the following:
- Range - the lowest to highest values
- Interquartile range - the 25th to 75th centile. The 25th centile is the value for which a quarter or 25% of observations fall below when put in order, and the 75th centile is the value for which three quarters of observations fall below. In the same way as the median is calculated, the quartile can be identified as follows:
Lower quartile (QL) = 1/4(n+1)th value
Upper quartile (QU) = 3/4(n+1)th value
Interquartile range = QU - QL
- Standard deviation - the measure of spread used in conjunction with the mean. The standard deviation is derived from the difference between each individual reading and the mean of all the readings.
To work out the standard deviation:
1). First work out the difference between each measurement and the overall mean - these are called the deviations.
2). Square the deviations. This makes all the results positive, so that negative and positive deviations do not cancel each other out.
3). Add up all the squared deviations and divide by n-1 where n is the number of measurements. This produces what is known as the variance.
4). Take the square root of the variance to give the standard deviation.
When the data are normally distributed the mean and standard deviation are the most appropriate measures of location and spread. The standard deviation tells us about the spread of data because if the data are normally distributed:
1). Approximately 68% of the readings fall within 1 standard deviation either side of the mean.
2). Approximately 95% of the readings lie within 2 standard deviations of the mean.
If the data are skewed, we usually present the median and IQR as they are less influenced by very high or very low values.
What are the most appropriate measures of location and spread of continuous data if the data are skewed?
If the data are skewed, we usually present the median and IQR as they are less influenced by very high or very low values.
What are the most appropriate measures of location and spread if the data are normally distributed?
When the data are normally distributed the mean and standard deviation are the most appropriate measures of location and spread. The standard deviation tells us about the spread of data because if the data are normally distributed:
1). Approximately 68% of the readings fall within 1 standard deviation either side of the mean.
2). Approximately 95% of the readings lie within 2 standard deviations of the mean.
What is the interquartile range and how is it calculated?
Interquartile range - the 25th to 75th centile. The 25th centile is the value for which a quarter or 25% of observations fall below when put in order, and the 75th centile is the value for which three quarters of observations fall below. In the same way as the median is calculated, the quartiles can be identified as follows:
Lower quartile (QL) = 1/4(n+1)th value
Upper quartile (QU) = 3/4(n+1)th value
Interquartile range = QU - QL
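The (n+1) rule can be sketched in Python as follows. The readings are made up and the function name is hypothetical. The flashcards do not say what to do when the position is not a whole number; interpolating linearly between the two neighbouring values is an assumption made here.

```python
def centile_value(sorted_values, fraction):
    """Value at position fraction * (n + 1), counting from 1.

    If the position is not a whole number, interpolate linearly between
    the two neighbouring values (an assumption; the text does not say).
    """
    n = len(sorted_values)
    position = fraction * (n + 1)
    whole = int(position)
    remainder = position - whole
    if whole < 1:
        return sorted_values[0]
    if whole >= n:
        return sorted_values[-1]
    below = sorted_values[whole - 1]
    above = sorted_values[whole]
    return below + remainder * (above - below)


# Eleven made-up readings, so the quartiles are the 3rd and 9th values.
data = sorted([12, 14, 15, 18, 21, 22, 25, 27, 30, 33, 35])
q_lower = centile_value(data, 0.25)
q_upper = centile_value(data, 0.75)
iqr = q_upper - q_lower
print(q_lower, q_upper, iqr)
```

Statistical packages use several slightly different quartile conventions, so results for small data sets may differ a little from this rule.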
What is standard deviation?
Standard deviation - the measure of spread used in conjunction with the mean. The standard deviation is derived from the difference between each individual reading and the mean of all the readings.
How do you calculate the standard deviation?
To work out the standard deviation:
1). First work out the difference between each measurement and the overall mean - these are called the deviations.
2). Square the deviations. This makes all the results positive, so that negative and positive deviations do not cancel each other out.
3). Add up all the squared deviations and divide by n-1 where n is the number of measurements. This produces what is known as the variance.
4). Take the square root of the variance to give the standard deviation.
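The four steps can be followed one by one in a short Python sketch. The measurements are made up and the function name is hypothetical; the result is compared with `statistics.stdev`, which also divides by n-1.

```python
import math
import statistics


def sample_standard_deviation(measurements):
    """Standard deviation computed by the four steps in the text."""
    n = len(measurements)
    mean = sum(measurements) / n
    deviations = [x - mean for x in measurements]  # step 1: deviations from the mean
    squared = [d ** 2 for d in deviations]         # step 2: square the deviations
    variance = sum(squared) / (n - 1)              # step 3: divide the total by n-1
    return math.sqrt(variance)                     # step 4: square root of the variance


measurements = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up readings with mean 5
sd = sample_standard_deviation(measurements)
print(f"standard deviation: {sd:.3f}")
print(f"statistics.stdev:   {statistics.stdev(measurements):.3f}")
```

Here the squared deviations add up to 32, so the variance is 32/7 and the standard deviation is its square root.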
What characteristic of the median and interquartile range makes them much more appropriate for describing skewed data than the mean and standard deviation?
The median and interquartile range are often described as being resistant to outliers. This property makes them much more appropriate for describing skewed data, as they better reflect where the majority of the observations lie within the distribution. This is not the case for the mean and standard deviation.
For symmetrical distributions, the ______ and _______ of the data will be the same.
For symmetrical distributions, the mean and median of the data will be the same.
For positively skewed distributions the ______ will be greater than the _______.
For positively skewed data the mean will be greater than the median.
For negatively skewed data the _______ will be greater than the _______.
For negatively skewed data the median will be greater than the mean.
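A tiny made-up illustration of how the mean is pulled towards the longer tail while the median stays put:

```python
import statistics

# Three made-up data sets that share the same middle value.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 20]   # one very high value: longer tail on the right
left_skewed = [-14, 2, 3, 4, 5]   # one very low value: longer tail on the left

for name, data in [
    ("symmetric", symmetric),
    ("right skewed", right_skewed),
    ("left skewed", left_skewed),
]:
    print(f"{name:13s} mean={statistics.mean(data):5.1f} median={statistics.median(data)}")
```

The median is 3 in all three cases, while the mean moves from 3 to 6 and to 0.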
What is one advantage of a stem and leaf plot over a histogram?
One advantage of the stem and leaf plot over the histogram is that the stem and leaf plot displays not only the frequency for each interval, but also all of the individual values within that interval.
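A minimal sketch of a stem and leaf plot in Python, assuming non-negative whole numbers where the tens form the stem and the units digit is the leaf. The ages are made up and the function name is hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group whole numbers by stem (tens); each leaf is the units digit."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)


ages = [23, 25, 31, 34, 34, 38, 42, 47, 51]  # made-up ages
plot = stem_and_leaf(ages)

# Print every stem in the range, including any with no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Each row's length gives the frequency for that interval, as a histogram bar would, while the digits themselves preserve the individual values.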
What is a box and whisker plot especially useful for?
A box and whisker plot does not show a distribution in as much detail as a stem and leaf or histogram plot does, but is especially useful for indicating whether a distribution is skewed and whether there are potential unusual observations or outliers in the data set.
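The numbers behind a box and whisker plot can be sketched as below. Two assumptions are made that are not stated in these flashcards: quartiles come from `statistics.quantiles` with `method="exclusive"`, and observations more than 1.5 times the interquartile range beyond the quartiles are flagged as potential outliers, which is a common convention. The data and the function name are made up.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, IQR and potential outliers for a box and whisker plot."""
    ordered = sorted(values)
    q_lower, median, q_upper = statistics.quantiles(ordered, n=4, method="exclusive")
    iqr = q_upper - q_lower
    # Assumed convention: points beyond 1.5 x IQR from the box are flagged.
    low_fence = q_lower - 1.5 * iqr
    high_fence = q_upper + 1.5 * iqr
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "q_lower": q_lower,
        "median": median,
        "q_upper": q_upper,
        "iqr": iqr,
        "outliers": outliers,
    }


data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49, 50]  # made-up readings
summary = box_plot_summary(data)
print(summary)
```

In this example the median sits nearer the upper quartile and both flagged values are low, which suggests a longer tail on the left.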