Variables & Distribution Flashcards

1
Q

What is the defining feature of categorical data?

What are the three sub-types of categorical data? Give examples

A
  • Binary = only two possible responses (e.g. yes/no, disease/no disease)
  • Nominal = three or more categories with no logical order (e.g. blood groups)
  • Ordinal = three or more categories with a logical order (e.g. tumour stage, which increases in severity from 1 to 4)

Defining feature of categorical data = observations fall into categories, so the values have no units

2
Q

What are the two types of numerical data? Give examples

A
  • Discrete (counted units) e.g. number of children, number of times something happens
  • Continuous (measured units) e.g. height, weight
3
Q

How do we summarise categorical data?

A
  • Count up the no. of observations in each category (= frequency)
  • Express these as proportions/percentages of the total no. of individuals
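
The two steps above can be sketched in a few lines of Python. This is only an illustration: the `frequency_table` helper and the blood group sample are made up for the example and are not part of the original cards.

```python
from collections import Counter


def frequency_table(values):
    """Return {category: (count, percentage of total)} for categorical data."""
    counts = Counter(values)              # step 1: frequency of each category
    total = sum(counts.values())
    # step 2: express each frequency as a percentage of all individuals
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}


# Hypothetical sample of 8 blood groups (illustrative data only)
blood_groups = ["O", "A", "O", "B", "A", "O", "AB", "O"]
table = frequency_table(blood_groups)

for group, (count, pct) in sorted(table.items()):
    print(f"{group}: n={count} ({pct:.1f}%)")
```

The percentages always sum to 100, which is a quick sanity check on any frequency table.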

4
Q

What are the two ways in which we can display categorical data?

A
  1. Table format
  2. Graphically (e.g. bar chart, pie chart)

5
Q

What are the differences between a bar chart and a pie chart?

A

A bar chart consists of one bar per category, with the length of each bar proportional to the frequency in that category. The bars do not touch because the data are not continuous but fall into distinct categories

In a pie chart the area (and therefore the angle) of each segment is proportional to the frequency in that category (e.g. if 50% are smokers then the angle of that segment would be 360°/2 = 180°)
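
The pie chart arithmetic can be written out as a short Python sketch. The `pie_angles` helper and the smoking counts are hypothetical, chosen to reproduce the 50% example above.

```python
def pie_angles(frequencies):
    """Convert category frequencies to pie-segment angles in degrees."""
    total = sum(frequencies.values())
    # each segment gets the same share of 360 degrees as its share of the total
    return {cat: 360 * n / total for cat, n in frequencies.items()}


# Hypothetical counts: half of the sample are smokers
smoking = {"smoker": 50, "non-smoker": 50}
angles = pie_angles(smoking)
print(angles)
```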

6
Q

What are the three ways in which we can summarise continuous data?

A
  1. Summary measures of location (mean/median/mode)
  2. Summary measures of spread (SD, IQR)
  3. Graphically (histogram/boxplot)
7
Q

Describe the production of a histogram

A
  • Group the data into ranges and count the number of observations in each range (= frequency distribution)
  • Plot the number in each range as a bar to form a histogram

The bars touch each other to indicate that the data are continuous. The area of each bar is proportional to the no. of observations in that range
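
A minimal sketch of the grouping step, assuming equal-width ranges. The `frequency_distribution` helper, the heights and the bin settings are all invented for illustration.

```python
def frequency_distribution(values, start, width, n_bins):
    """Count observations in n_bins equal-width ranges beginning at `start`."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)   # which range the value falls in
        if 0 <= idx < n_bins:             # ignore values outside the ranges
            counts[idx] += 1
    return counts


# Hypothetical heights in cm (illustrative data only)
heights_cm = [152, 158, 161, 163, 165, 167, 168, 171, 174, 183]
counts = frequency_distribution(heights_cm, start=150, width=10, n_bins=4)

# Crude text histogram: one '#' per observation in each range
for i, n in enumerate(counts):
    low = 150 + 10 * i
    print(f"{low}-{low + 10} cm: {'#' * n}")
```

With equal-width ranges the bar height alone is proportional to the frequency; with unequal ranges the height would need to be scaled so that the area stays proportional.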

8
Q

A histogram can demonstrate normal (Gaussian) distribution. Describe the characteristics of this

A
  • continuous variables
  • cluster around a central value
  • Symmetrical and bell-shaped
  • 95% of the data lie within 1.96 SDs of the mean
  • when plotted with fraction on the Y-axis, AUC = 1
9
Q

If data is not normally distributed, what is it?

A

Skewed distribution

  • positive skew = longer tail on the right
  • negative skew = longer tail on the left
10
Q

What are the two main elements used to describe data?

A
  1. Location: where on average do the data lie?
  2. Spread: how much variation is there in the data?

11
Q

When should you use mean and when should you use median?

A

If the data are symmetrical (e.g. normally distributed), the mean is fine.

If the data are skewed, the median should be used as it is more resistant to outliers

12
Q

How can skewed data affect the mean and median?

A

If the data are symmetrical, mean = median

If the data are left (negative) skewed, mean < median

If the data are right (positive) skewed, mean > median
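
A quick way to see how a long tail pulls the mean away from the median is to compare the two on small samples. The three lists below are made-up examples, not data from the course.

```python
from statistics import mean, median

# Hypothetical samples (illustrative data only)
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]      # one large value: long right tail
left_skewed = [1, 17, 18, 18, 18, 19, 19, 20]  # one small value: long left tail

for name, data in [("symmetric", symmetric),
                   ("right skew", right_skewed),
                   ("left skew", left_skewed)]:
    print(f"{name}: mean={mean(data):.2f}, median={median(data):.2f}")
```

In each skewed sample the single extreme value drags the mean towards the tail, while the median barely moves.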

13
Q

What are the three main measures of spread with regards to descriptive stats?

A
  1. SD: a measure of how far each observation deviates from the mean
  2. IQR: quartiles separate the data into 4 equally sized groups. The IQR indicates where the middle 50% of the data lie (25th-75th centile)
  3. Range: the difference between the highest and lowest values
14
Q

How do you calculate standard deviation?

A
  1. Work out difference between each measurement and overall mean (=the deviations)
  2. Square the deviations (removes any negatives)
  3. Add up all squared deviations and divide by n-1 (n=no. of measurements). This produces the variance
  4. Take the square root of the variance to obtain the SD
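
The four steps translate directly into code. The sketch below follows them one by one; the `sample_sd` helper and the readings are illustrative assumptions, and the standard library's `statistics.stdev` (which also divides by n-1) is used only as a cross-check.

```python
import math
import statistics


def sample_sd(values):
    """Sample standard deviation, following the four steps on the card."""
    n = len(values)
    overall_mean = sum(values) / n
    deviations = [x - overall_mean for x in values]   # step 1: deviations
    squared = [d ** 2 for d in deviations]            # step 2: square them
    variance = sum(squared) / (n - 1)                 # step 3: variance
    return math.sqrt(variance)                        # step 4: square root


# Hypothetical readings (illustrative data only)
readings = [2, 4, 4, 4, 5, 5, 7, 9]
print(f"SD by hand: {sample_sd(readings):.3f}")
print(f"statistics.stdev: {statistics.stdev(readings):.3f}")
```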
15
Q

How do you calculate IQR?

A
  • lower quartile (QL) = 0.25(n+1)th value
  • upper quartile (QU) = 0.75(n+1)th value

IQR = QU - QL
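
A minimal sketch of the (n+1) rule. The card does not say what to do when 0.25(n+1) is not a whole number, so the linear interpolation between neighbouring values used here is an assumption; the helper names and data are hypothetical.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based position in sorted data.

    Fractional positions are linearly interpolated between neighbours
    (an assumption; the card only gives the position formula).
    """
    n = len(sorted_values)
    position = min(max(position, 1), n)   # clamp to the valid range
    lower = int(position)
    frac = position - lower
    if lower >= n:
        return sorted_values[-1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + frac * (above - below)


def iqr(values):
    """Return (QL, QU, IQR) using the 0.25(n+1) and 0.75(n+1) positions."""
    data = sorted(values)
    n = len(data)
    q_lower = value_at_position(data, 0.25 * (n + 1))
    q_upper = value_at_position(data, 0.75 * (n + 1))
    return q_lower, q_upper, q_upper - q_lower


# Hypothetical sample with n = 11, so the positions are the 3rd and 9th values
sample = [1, 3, 4, 6, 7, 8, 10, 12, 13, 15, 18]
q_l, q_u, spread = iqr(sample)
print(f"QL={q_l}, QU={q_u}, IQR={spread}")
```

Statistical packages use several slightly different quartile conventions, so small differences from software output are expected.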

16
Q

The SD tells us about the spread of the data because if the data are normally distributed………?

A
  • approx 68% of the readings lie within 1 SD of the mean
  • approx 95% of the readings lie within 2 SD of the mean
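
These coverage figures can be checked against the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the `coverage_within` helper is a name introduced for this example.

```python
from statistics import NormalDist


def coverage_within(k):
    """Fraction of a normal distribution lying within k SDs of the mean."""
    z = NormalDist()                 # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)


for k in (1, 1.96, 2):
    print(f"within {k} SD: {coverage_within(k):.4f}")
```

This shows why 1.96 SDs is quoted for exactly 95%, with 2 SDs used as a convenient approximation.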

17
Q

What are the five ways in which data normality can be assessed?

A
  1. Plot a histogram: is it symmetrical around a central cluster?
  2. Produce a boxplot: is there symmetry around the median with whiskers of equal length?
  3. Are the mean and median similar? Expected for symmetrical distribution
  4. What is the value for skewness? Should be zero for normal distribution with positive values indicating long right tail and negative values indicating long left tail
  5. What is the value for kurtosis? Excess kurtosis should be zero for normal distribution. Positive values indicate heavier tails than normal (more extreme observations, often with a sharper peak), negative values indicate lighter tails than normal (fewer extreme observations, often with a flatter peak)
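
Checks 3 to 5 can be computed directly. The sketch below uses the simple moment-based formulas for skewness and excess kurtosis (dividing by n); this is an assumption, since statistical packages often apply small-sample corrections and may report slightly different values. The helper name and the data are hypothetical.

```python
from statistics import mean, median


def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n   # third central moment
    m4 = sum((x - m) ** 4 for x in values) / n   # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3           # 0 for a normal distribution
    return skew, excess_kurtosis


# Hypothetical samples (illustrative data only)
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]

for name, data in [("symmetric", symmetric), ("right skew", right_skewed)]:
    skew, kurt = skewness_and_excess_kurtosis(data)
    print(f"{name}: mean={mean(data):.2f}, median={median(data):.2f}, "
          f"skewness={skew:.2f}, excess kurtosis={kurt:.2f}")
```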
18
Q

What are the three main measures of location with regards to descriptive statistics?

A
  1. Mean: sum of all values divided by the no. of observations
  2. Median: central value when all observations are ordered (= 50th centile, i.e. the 0.5(n+1)th value)
  3. Mode: most commonly occurring value in the dataset
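
All three measures are available in Python's standard `statistics` module. The readings below are made up for the example.

```python
from statistics import mean, median, mode

# Hypothetical readings (illustrative data only), n = 8
readings = [2, 4, 4, 4, 5, 5, 7, 9]

print("mean:", mean(readings))      # sum of values / no. of observations
print("median:", median(readings))  # 0.5(n+1) = 4.5th value: midway between 4th and 5th
print("mode:", mode(readings))      # most commonly occurring value
```

With an even number of observations the median position 0.5(n+1) falls between two values, so their average is taken.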