VARIABLES AND DISTRIBUTIONS - LEARNING OUTCOMES Flashcards

1
Q

What are the two main types of variables?

A

Categorical Variables and Continuous Variables.

2
Q

What are categorical variables?

A

For a categorical variable the response can be categorised into a number of distinct groups.

3
Q

What is a binary categorical variable?

A

A binary categorical variable is a categorical variable for which there are only two possible responses.

4
Q

What is an ordinal categorical variable?

A

An ordinal categorical variable is a categorical variable for which there are 3 or more categories and there is some logical order.

5
Q

What is a nominal categorical variable?

A

A nominal categorical variable is a categorical variable for which there are 3 or more categories and there is no logical ordering to those categories.

6
Q

What are continuous variables?

A

Continuous variables are those for which the responses are numerical and may take any value on a well-defined continuous scale, e.g. height or weight.

7
Q

How might you recode a continuous variable?

A

It is possible to recode continuous data as an ordered categorical variable. For example, weight may be recoded as an ordered categorical variable (underweight, optimal weight, pre-obese or obese), or as a binary variable (underweight or not). However, this will result in a loss of information and may restrict the statistical tests which can be carried out on the new variable.
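
As a rough illustration of recoding, the sketch below maps a continuous measurement onto the four ordered categories named above, and then onto a binary variable. It assumes the measurement is body mass index and that the cut-offs are 18.5, 25 and 30; neither the measure nor the cut-offs are given in the card, so treat them as placeholders.

```python
def recode_bmi(bmi: float) -> str:
    """Recode a continuous value as an ordered category.

    The cut-offs below are assumed for illustration only.
    """
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "optimal weight"
    if bmi < 30.0:
        return "pre-obese"
    return "obese"


def is_underweight(bmi: float) -> bool:
    """Recode the same value as a binary variable."""
    return recode_bmi(bmi) == "underweight"


values = [17.2, 22.4, 27.9, 31.5]  # made-up example values
categories = [recode_bmi(v) for v in values]
print(categories)
```

The loss of information is visible here: values of 25.1 and 29.9 both become "pre-obese", so the difference between them can no longer be recovered from the recoded variable.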

8
Q

Describe how we can summarise categorical variables.

A

To summarise categorical data (including binary data), we simply count up the number of observations in each category; these counts are called frequencies. We usually express these as proportions or percentages of the total number of individuals. We can then present these numbers either in table format or graphically.
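
A minimal sketch of this kind of summary, using made-up responses (the blood-group data and the function name are illustrative assumptions, not part of the card):

```python
from collections import Counter


def summarise_categorical(responses):
    """Return {category: (frequency, percentage of total)}."""
    counts = Counter(responses)
    total = len(responses)
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()}


blood_groups = ["A", "O", "O", "B", "A", "O", "AB", "O"]  # made-up data
summary = summarise_categorical(blood_groups)

# Present the frequencies and percentages as a small table.
for category, (frequency, percent) in sorted(summary.items()):
    print(f"{category:>2}: {frequency} ({percent:.1f}%)")
```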

9
Q

Describe how we can summarise continuous variables. What summary measures can we compute for continuous variables?

A

For continuous variables, we can summarise the data graphically using a histogram or a box plot.

For continuous variables we can also compute summary measures of data location (mean, median etc) and spread (range, standard deviation etc).

10
Q

Describe how you would go about creating a histogram representative of a continuous variable.

A

To produce a histogram, we first need to group the data into ranges and then count the number of observations in each group. Together, these counts are called the frequency distribution. Identifying the lowest and highest values first helps you decide how the data should be grouped. Remember, too few groups will mean detail is lost, but too many groups will result in hardly any information in each group. Having formed the frequency distribution, we can plot the number in each range to get a histogram. In a histogram the bars touch each other (unlike a bar chart) to indicate that the data are continuous.
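
The grouping step can be sketched as follows. The heights, the bin width of 10 and the function name are assumptions chosen for illustration; in practice the bin width is a judgement call, as the card notes.

```python
import math


def frequency_distribution(values, bin_width):
    """Group values into equal-width ranges and count each range.

    Returns a list of (lower_edge, upper_edge, count) tuples, where a
    value v belongs to a range when lower_edge <= v < upper_edge.
    """
    lowest, highest = min(values), max(values)
    # Start the first range at a multiple of bin_width at or below the minimum.
    start = math.floor(lowest / bin_width) * bin_width
    n_bins = math.floor((highest - start) / bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[math.floor((v - start) / bin_width)] += 1
    return [
        (start + i * bin_width, start + (i + 1) * bin_width, c)
        for i, c in enumerate(counts)
    ]


heights_cm = [152, 158, 161, 163, 165, 167, 168, 171, 174, 183]  # made-up data
distribution = frequency_distribution(heights_cm, bin_width=10)

# A text histogram: adjacent ranges share an edge, so the bars "touch".
for lower, upper, count in distribution:
    print(f"{lower}-{upper}: {'#' * count}")
```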

11
Q

What are the characteristics of a normal distribution as seen on a histogram?

A
  • It is symmetrical and bell shaped
  • The distribution extends to plus and minus infinity; the tails get very close to the x-axis but never quite reach it
  • 95% of the data lie within 1.96 standard deviations of the mean
  • When plotted with fraction on the y axis, the area under the curve is equal to 1
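
The 1.96 figure can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `proportion_within` is an assumption made for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def proportion_within(k: float) -> float:
    """Proportion of a normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


within_196 = proportion_within(1.96)
# The z value that leaves 2.5% in each tail, i.e. 95% in the middle.
z_for_95 = standard_normal.inv_cdf(0.975)

for k in (1.0, 1.96, 2.0):
    print(f"within {k} SD: {proportion_within(k):.4f}")
```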
12
Q

What would a positively skewed distribution look like?

A

A positively skewed distribution would have a longer tail to the right.

13
Q

What would a negatively skewed distribution look like?

A

A negatively skewed distribution would have a longer tail on the left.

14
Q

Other than drawing a histogram how else may we summarise a continuous variable?

A

By computing summary measures, namely a measure of where the data is located, and a measure of the spread of the data.

15
Q

What are the 3 summary measures of data location for continuous variables?

A
  1. The mean: the sum of all the values divided by the total number of observations
  2. The mode: the most common value
  3. The median: the middle value (50th centile) when the values are placed in ascending numerical order, i.e. the median = 1/2 (n+1)th value. If there is an even number of readings, the mean of the middle two values is taken as the median
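
A short sketch of the three measures using Python's standard `statistics` module; the readings are made up for illustration and have an even count, so the median is the mean of the middle two values.

```python
import statistics

readings = [4, 7, 7, 9, 12, 15]  # made-up data, already in ascending order

mean = statistics.mean(readings)      # sum of values / number of values
mode = statistics.mode(readings)      # most common value
median = statistics.median(readings)  # mean of the 3rd and 4th values here

print(mean, mode, median)
```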
16
Q

What are the 3 measures of spread or variation that can be calculated for continuous variables?

A
  1. Range - the difference between the highest and lowest values
  2. Interquartile range - the 25th to 75th centile. The 25th centile is the value below which 25% of the observations fall when put in order, and the 75th centile is the value below which 75% of observations fall.
  3. Standard deviation - the measure of spread used in conjunction with the mean. The standard deviation is derived from the difference between each individual reading and the mean of all the readings.
17
Q

How do you work out the interquartile range for a data set of continuous data?

A
Lower quartile (QL) = 1/4 (n+1)th value
Upper quartile (QU) = 3/4 (n+1)th value

IQR = QU - QL
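
The (n+1) rule above can be sketched as below. When the position is not a whole number, this example interpolates linearly between the two neighbouring values; that choice, the function names and the data are assumptions, since the card does not say how fractional positions are handled.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based, possibly fractional, position in sorted data.

    Fractional positions are resolved by linear interpolation (an
    assumption made for this sketch).
    """
    n = len(sorted_values)
    position = min(max(position, 1), n)  # keep the position inside the data
    lower = int(position)
    fraction = position - lower
    if lower >= n:
        return sorted_values[n - 1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + fraction * (above - below)


def interquartile_range(values):
    """Return (QL, QU, IQR) using the 1/4 (n+1) and 3/4 (n+1) positions."""
    ordered = sorted(values)
    n = len(ordered)
    q_lower = value_at_position(ordered, (n + 1) / 4)
    q_upper = value_at_position(ordered, 3 * (n + 1) / 4)
    return q_lower, q_upper, q_upper - q_lower


data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 13, 15]  # made-up data, n = 11
q_lower, q_upper, iqr = interquartile_range(data)
print(q_lower, q_upper, iqr)
```

With n = 11 the positions are the 3rd and 9th values, so no interpolation is needed in this example.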

18
Q

How do you work out the standard deviation for a data set of continuous data?

A
  1. Work out the difference between each measurement and the overall mean - these are called the deviations.
  2. Square the deviations. This has the impact of making all the results positive and thereby removes negative values.
  3. Add up all the squared deviations and divide by n-1 where n is the number of measurements. This produces what is known as the variance.
  4. Take the square root of the variance to give the standard deviation.
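
The four steps above can be sketched directly in code. The readings are made up, and the function name is an assumption for this example.

```python
import math


def sample_standard_deviation(measurements):
    n = len(measurements)
    mean = sum(measurements) / n
    # Step 1: deviations from the overall mean.
    deviations = [x - mean for x in measurements]
    # Step 2: square the deviations so that all terms are non-negative.
    squared = [d ** 2 for d in deviations]
    # Step 3: divide the sum of squares by n - 1 to get the variance.
    variance = sum(squared) / (n - 1)
    # Step 4: the square root of the variance is the standard deviation.
    return math.sqrt(variance)


readings = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data, mean = 5
sd = sample_standard_deviation(readings)
print(round(sd, 3))
```

For these readings the squared deviations sum to 32, so the variance is 32/7.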
19
Q

What is the formula for calculating standard deviation?

A

sd = sqrt( [sum of (measurements - mean)^2] / [number of measurements - 1] )

20
Q

What is the formula for calculating variance?

A

variance = [sum of (measurements - mean)^2] / [number of measurements - 1]
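
In symbols, the sample variance and standard deviation described in these cards can be written as below. The notation is an assumption introduced here: x_i for the individual measurements, x-bar for their mean, n for the number of measurements, and s for the standard deviation.

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},
\qquad
s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
```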

21
Q

For a symmetrical distribution what would you expect of the mean and the median values?

A

You would expect the mean and median values to be similar.

22
Q

What do the following values for skewness indicate:

a. 0
b. 1
c. -1

A

a. a value of 0 for skewness indicates a symmetrical distribution, as is the case for a normal distribution.
b. a positive value for skewness indicates a pile up of scores on the left of the distribution with a long right tail - positive skewing.
c. a negative value for skewness indicates a pile up of scores on the right of the distribution with a long left tail - negative skewing.
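
A sketch of how the sign of skewness behaves, using the moment coefficient of skewness (third central moment divided by the 1.5 power of the second). This is one common definition and is assumed here; statistical packages may apply a small-sample adjustment, so their values can differ slightly. All data are made up.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]         # scores pile up on the left
left_tailed = [-x for x in right_tailed]   # mirror image of the above

print(skewness(symmetric), skewness(right_tailed), skewness(left_tailed))
```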

23
Q

When the data are normally distributed what are the most appropriate measure of location and spread?

A

When the data are normally distributed the mean and standard deviation are the most appropriate measures of location and spread.

The standard deviation tells us about the spread of the data because, if the data are normally distributed, approximately 68% of the readings fall within 1 SD either side of the mean and approximately 95% of readings lie within 2 SDs of the mean.

If data are skewed, we usually present the median and IQR as they are less influenced by very high and very low values.

24
Q

When data are not normally distributed (skewed) what are the most appropriate measures of location and spread?

A

If data are skewed, we usually present the median and IQR as they are less influenced by very high and very low values.

25
Q

What does a kurtosis value of 0 indicate?

A

A normal distribution, provided kurtosis is reported as excess kurtosis (i.e. kurtosis minus 3).

26
Q

What does a positive kurtosis value indicate?

A

Positive kurtosis scores indicate a pointy, heavy-tailed distribution, with more observations in the tails than a normal distribution.

27
Q

What does a negative value of kurtosis indicate?

A

Negative kurtosis scores indicate a flat, light-tailed distribution, with fewer observations in the tails than a normal distribution.
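
A sketch of the sign of kurtosis, assuming the excess-kurtosis convention (fourth central moment divided by the squared second moment, minus 3, so that a normal distribution scores 0). The definition without small-sample adjustment and the made-up data are assumptions for illustration.

```python
def excess_kurtosis(values):
    """Excess kurtosis: m4 / m2 ** 2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3.0


flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]         # evenly spread, no peak
peaked = [-10, -1, 0, 0, 0, 0, 0, 0, 1, 10]    # sharp peak, extreme tail values

print(excess_kurtosis(flat), excess_kurtosis(peaked))
```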

28
Q

True or false - the further away the skewness and kurtosis values are from zero, the more likely it is that data are not normally distributed.

A

True.

29
Q

True or false - the further away the skewness and kurtosis values are from zero, the more likely it is that data are normally distributed.

A

False - the further away the skewness and kurtosis values are from zero, the more likely it is that data are not normally distributed.

30
Q

How can we use the values of skewness and kurtosis to assess whether we have a normal distribution?

A

We first need to convert these actual values of skewness and kurtosis into a z-score. This allows you to compare the score to a standard normal distribution with a mean of 0 and a SD of 1. These z-scores can be compared against values you would expect to get by chance alone (i.e. known values for the standard normal distribution).

If the absolute value of your z-score is greater than 1.96, it is significant at p < 0.05.
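
A minimal sketch of the comparison against 1.96. The card does not say how the conversion is done; this example assumes the usual approach of dividing the statistic by its standard error, and the skewness and standard error values are hypothetical numbers of the kind reported by statistical software.

```python
def z_score(statistic: float, standard_error: float) -> float:
    """Convert a skewness or kurtosis value to a z-score.

    Assumes z = statistic / standard error (testing against 0).
    """
    return statistic / standard_error


def exceeds_critical_value(z: float, critical: float = 1.96) -> bool:
    """True when |z| is beyond the two-sided 5% critical value."""
    return abs(z) > critical


# Hypothetical values for illustration only.
z_skew = z_score(statistic=0.9, standard_error=0.3)
z_small = z_score(statistic=0.2, standard_error=0.3)

print(z_skew, exceeds_critical_value(z_skew))
print(z_small, exceeds_critical_value(z_small))
```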

31
Q

What is a nominal categorical variable?

A

A variable that fits into categories but with no order.

32
Q

What is a discrete numerical variable?

A

The number of something - e.g. number of children

33
Q

What is a continuous numerical variable?

A

A numerical variable that is limited only by the degree to which you can measure it - e.g. height in cm

34
Q

What units do nominal categorical variables have?

A

No units

35
Q

What units do discrete numerical variables have?

A

Counted units

36
Q

What units do continuous numerical variables have?

A

Measured units

37
Q

If a variable has no units and it can’t be put into a meaningful order then what kind of variable is it?

A

Categorical nominal

38
Q

If a variable has no units and can be put into a meaningful order then what kind of variable is it?

A

Categorical ordinal

39
Q

If a variable has units that come from counting then what kind of variable is it?

A

A discrete numerical variable

40
Q

If a variable has units and those units come from measuring then what kind of variable is it?

A

A continuous numerical variable
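
The decision rule running through the last four cards can be written as a small function. The function name and the way the inputs are encoded (`None`, `"counted"`, `"measured"`) are assumptions made for this sketch.

```python
from typing import Optional


def classify_variable(units: Optional[str], ordered: bool = False) -> str:
    """Classify a variable from its units and whether it has an order.

    units: None for no units, "counted" or "measured" otherwise.
    ordered: only relevant when the variable has no units.
    """
    if units is None:
        return "categorical ordinal" if ordered else "categorical nominal"
    if units == "counted":
        return "discrete numerical"
    if units == "measured":
        return "continuous numerical"
    raise ValueError(f"unknown units type: {units!r}")


print(classify_variable(None))                # e.g. eye colour
print(classify_variable(None, ordered=True))  # e.g. pain rated mild/moderate/severe
print(classify_variable("counted"))           # e.g. number of children
print(classify_variable("measured"))          # e.g. height in cm
```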

41
Q

How can we summarise categorical data?

A

Count the number of observations in each group (frequency) - more helpful to express these as percentages/proportion. This can be displayed in a table or graphically.

42
Q

How can we summarise continuous data?

A
  • Summary measures of location - mean, median, mode
  • Summary measures of spread - standard deviation, interquartile range
  • Graphically - histogram, box plot
43
Q

Why are the median and interquartile range more appropriate measures for describing skewed data than the mean and standard deviation?

A

The median and interquartile range are often described as being resistant to outliers. This property makes them much more appropriate for describing skewed data, as they better reflect where the majority of the observations lie within the distribution. This is not the case for the mean and standard deviation.
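
A small demonstration of this resistance, using made-up data in which a single value is replaced by an extreme one:

```python
import statistics

typical = [21, 23, 24, 25, 27]        # made-up data
with_outlier = [21, 23, 24, 25, 270]  # same data with one extreme value

mean_before = statistics.mean(typical)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(typical)
median_after = statistics.median(with_outlier)
sd_before = statistics.stdev(typical)
sd_after = statistics.stdev(with_outlier)

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
print(f"sd:     {sd_before:.1f} -> {sd_after:.1f}")
```

The median stays where most of the observations lie, while the mean and standard deviation are pulled strongly by the one extreme value.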