VARIABLES AND DISTRIBUTIONS - LEARNING OUTCOMES Flashcards
What are the two main types of variables?
Categorical Variables and Continuous Variables.
What are categorical variables?
For a categorical variable the response can be categorised into a number of distinct groups.
What is a binary categorical variable?
A binary categorical variable is a categorical variable for which there are only two possible responses.
What is an ordinal categorical variable?
An ordinal categorical variable is a categorical variable for which there are 3 or more categories and there is some logical ordering to those categories.
What is a nominal categorical variable?
A nominal categorical variable is a categorical variable for which there are 3 or more categories and there is no logical ordering to those categories.
What are continuous variables?
Continuous variables are those for which the responses are numerical and may take any value on a well-defined continuous scale, e.g. height or weight.
How might you recode a continuous variable?
It is possible to recode continuous data as an ordered categorical variable. For example, weight may be recoded as an ordered categorical variable with the categories underweight, optimal weight, pre-obese and obese, or as a binary variable (underweight or not). However, this results in a loss of information and may restrict the statistical tests that can be carried out on the new variable.
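As a rough sketch of such a recoding, the snippet below maps a continuous measurement onto the four ordered categories and then onto a binary variable. It assumes the measurement is body mass index (BMI) in kg/m^2 and uses the commonly cited cut-offs of 18.5, 25 and 30; the function names and sample values are illustrative only and are not part of the flashcards.

```python
def recode_bmi(bmi):
    """Recode a continuous BMI value (kg/m^2) as an ordered category.

    The cut-offs (18.5, 25, 30) are an assumption made for illustration.
    """
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "optimal weight"
    if bmi < 30:
        return "pre-obese"
    return "obese"


def recode_binary(bmi):
    """Collapse further to a binary variable: underweight or not."""
    return recode_bmi(bmi) == "underweight"


values = [17.2, 22.0, 27.5, 31.4]  # made-up example values
categories = [recode_bmi(v) for v in values]
underweight = [recode_binary(v) for v in values]
```

Two different values such as 19.0 and 24.0 both become "optimal weight", so the difference between them can no longer be recovered from the recoded variable. This is the loss of information mentioned above.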
Describe how we can summarise categorical variables.
To summarise categorical data (including binary data), we simply count up the number of observations in each category; these counts are called frequencies. We usually express these as proportions or percentages of the total number of individuals. We can then present these numbers either in table format or graphically.
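The counting step can be sketched in a few lines. The example below uses made-up blood group responses (not taken from the flashcards) and Python's standard `collections.Counter` to obtain the frequencies, then converts them to percentages of the total.

```python
from collections import Counter

responses = ["A", "O", "B", "O", "A", "O", "AB", "O"]  # made-up data

# Frequencies: number of observations in each category
frequencies = Counter(responses)
total = sum(frequencies.values())

# Express each frequency as a percentage of the total
percentages = {cat: 100 * count / total for cat, count in frequencies.items()}

# Present the result as a simple table, most common category first
for cat, count in frequencies.most_common():
    print(f"{cat:>2}  n={count}  {percentages[cat]:.1f}%")
```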
Describe how we can summarise continuous variables. What summary measures can we compute for continuous variables?
For continuous variables, we can summarise the data graphically using a histogram or a box plot.
For continuous variables we can also compute summary measures of data location (mean, median etc) and spread (range, standard deviation etc).
Describe how you would go about creating a histogram representative of a continuous variable.
To produce a histogram, we first need to group the data into ranges and then count the number of observations in each group. Together, these counts form the frequency distribution. Identifying the lowest and highest values first helps you decide how the data should be grouped. Remember, too few groups will mean detail is lost, but too many groups will leave hardly any observations in each group. Having formed the frequency distribution, we can plot the number in each range to get a histogram. In a histogram the bars touch each other (unlike in a bar chart) to indicate that the data are continuous.
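The grouping and counting steps above can be sketched as follows. This is a minimal illustration that assumes equal-width groups spanning the lowest to the highest value; the function name `frequency_distribution`, the choice of three groups and the height values are all made up for the example.

```python
def frequency_distribution(data, n_groups):
    """Count observations in n_groups equal-width ranges from min to max.

    Assumes the data contain at least two distinct values.
    """
    lowest, highest = min(data), max(data)
    width = (highest - lowest) / n_groups
    counts = [0] * n_groups
    for x in data:
        # The highest value would index one past the end, so clamp it
        # into the last group.
        index = min(int((x - lowest) / width), n_groups - 1)
        counts[index] += 1
    edges = [lowest + i * width for i in range(n_groups + 1)]
    return edges, counts


heights_cm = [150, 152, 158, 160, 161, 165, 167, 170, 172, 180]  # made-up
edges, counts = frequency_distribution(heights_cm, 3)

# Crude text histogram: one bar per range, bar length = frequency
for i, count in enumerate(counts):
    print(f"{edges[i]:.0f}-{edges[i + 1]:.0f}: {'#' * count}")
```

Rerunning with a much larger `n_groups` leaves most groups empty or holding a single observation, which is the "too many groups" problem described above.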
What are the characteristics of a normal distribution as seen on a histogram?
- It is symmetrical and bell shaped
- The two tails of the distribution extend to plus and minus infinity; the curve gets very close to the x-axis but never quite reaches it
- 95% of the data lie within 1.96 standard deviations of the mean
- When plotted with fraction on the y axis, the area under the curve is equal to 1
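The 95% property can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; the mean of 170 and standard deviation of 10 are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

mean, sd = 170.0, 10.0  # hypothetical mean and standard deviation
dist = NormalDist(mu=mean, sigma=sd)

lower = mean - 1.96 * sd
upper = mean + 1.96 * sd

# Area under the curve between the two limits
within = dist.cdf(upper) - dist.cdf(lower)
print(f"Proportion within 1.96 SD of the mean: {within:.4f}")

# Symmetry: half of the area lies below the mean
below_mean = dist.cdf(mean)
```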
What would a positively skewed distribution look like?
A positively skewed distribution would have a longer tail to the right.
What would a negatively skewed distribution look like?
A negatively skewed distribution would have a longer tail on the left.
Other than drawing a histogram how else may we summarise a continuous variable?
By computing summary measures, namely a measure of where the data is located, and a measure of the spread of the data.
What are the 3 summary measures of data location for continuous variables?
- The mean: the sum of all the values divided by the total number of observations
- The mode: the most common value
- The median: the middle value (50th centile) when the values are placed in ascending numerical order, i.e. the median is the (n+1)/2 th value. If there is an even number of readings, the mean of the middle two values is taken as the median
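As a small worked example, the three measures of location can be computed with Python's standard `statistics` module. The readings are made up; there are six of them, so the median falls at position (6+1)/2 = 3.5 and is taken as the mean of the 3rd and 4th values.

```python
import statistics

readings = [2, 3, 3, 5, 7, 10]  # made-up values, already in ascending order

mean = statistics.mean(readings)      # sum of the values / number of values
mode = statistics.mode(readings)      # the most common value
median = statistics.median(readings)  # mean of the 3rd and 4th values here

print(mean, mode, median)
```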
What are the 3 measures of spread or variation that can be calculated for continuous variables?
- Range - the lowest to the highest value, often reported as the difference between the two
- Interquartile range - the 25th to 75th centile. The 25th centile is the value below which 25% of the observations fall when put in order, and the 75th centile is the value below which 75% of the observations fall.
- Standard deviation - the measure of spread used in conjunction with the mean. The standard deviation is derived from the difference between each individual reading and the mean of all the readings.
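The range and standard deviation can be sketched as below. The flashcards do not give a formula for the standard deviation, so this example assumes the usual sample version (square root of the sum of squared differences from the mean, divided by n - 1); the readings are made up.

```python
import statistics

readings = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values

# Range: distance from the lowest to the highest value
data_range = max(readings) - min(readings)

# Standard deviation, built from each reading's difference from the mean
mean = statistics.mean(readings)
squared_diffs = [(x - mean) ** 2 for x in readings]
# Assumption: sample standard deviation, so divide by n - 1
sd = (sum(squared_diffs) / (len(readings) - 1)) ** 0.5

print(f"range = {data_range}, mean = {mean}, sd = {sd:.2f}")
```

The hand-computed value should agree with `statistics.stdev(readings)`, which uses the same n - 1 divisor.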
How do you work out the interquartile range for a data set of continuous data?
Lower quartile (QL) = 1/4 (n+1)th value
Upper quartile (QU) = 3/4 (n+1)th value
IQR = QU - QL
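The quartile positions can be turned into a short calculation. The sketch below follows the (n+1) rule; when a position is not a whole number it interpolates linearly between the two neighbouring values, which is an assumption since the flashcards do not cover that case. The function names and readings are made up for illustration.

```python
def value_at_position(sorted_data, position):
    """Return the value at a 1-based, possibly fractional, position.

    Fractional positions are resolved by linear interpolation between
    neighbours (an assumption; the flashcards do not specify this step).
    """
    n = len(sorted_data)
    position = max(1.0, min(float(n), position))  # keep inside the data
    below = int(position)        # 1-based index of the lower neighbour
    fraction = position - below
    if below == n:
        return sorted_data[-1]
    lower_value = sorted_data[below - 1]
    upper_value = sorted_data[below]
    return lower_value + fraction * (upper_value - lower_value)


def interquartile_range(data):
    """Return (QL, QU, IQR) using the (n+1) position rule."""
    ordered = sorted(data)
    n = len(ordered)
    q_lower = value_at_position(ordered, (n + 1) / 4)
    q_upper = value_at_position(ordered, 3 * (n + 1) / 4)
    return q_lower, q_upper, q_upper - q_lower


readings = [1, 3, 4, 6, 7, 9, 10]  # made-up values, n = 7
q_lower, q_upper, iqr = interquartile_range(readings)
print(q_lower, q_upper, iqr)
```

With n = 7 the positions are whole numbers (2nd and 6th values), so no interpolation is needed in this example.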