Variables & Distribution Flashcards

Question 1

Q

What is the defining feature of categorical data?

What are the three sub-types of categorical data? Give examples

Answer

A

Binary = only two possible responses (e.g. yes or no, disease or no disease)
Nominal = three or more categories with no logical order (e.g. blood groups)
Ordinal = three or more categories with a logical order (e.g. tumour stage, which increases in severity from 1 to 4)

Defining feature of categorical data = no units

Question 2

Q

What are the two types of numerical data? Give examples

Answer

A

Discrete (counted units) e.g. number of children, number of times something happens
Continuous (measured units) e.g. height, weight

Question 3

Q

How do we summarise categorical data?

Answer

A

- Count up the number of observations in each category (= frequency)
- Express these as proportions/percentages of the total number of individuals
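
As a minimal sketch of the counting step, the snippet below tallies frequencies and converts them to percentages. The function name `frequency_table` and the blood-group values are hypothetical, invented only for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, percentage)} for a list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

# Hypothetical blood-group observations (not real data).
blood_groups = ["O", "A", "A", "B", "O", "O", "AB", "A", "O", "O"]
table = frequency_table(blood_groups)

for cat, (n, pct) in sorted(table.items()):
    print(f"{cat}: n={n} ({pct:.0f}%)")
```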

Question 4

Q

What are the two ways in which we can display categorical data?

Answer

A

1. Table format

2. Graphically (e.g. bar chart, pie chart)

Question 5

Q

What are the differences between a bar chart and a pie chart?

Answer

A

A bar chart has one bar per category, with the length of each bar proportional to the frequency. The bars do not touch because the data fall into distinct categories rather than being continuous

In a pie chart the area (and angle) of each segment is proportional to the frequency in that category (e.g. if 50% are smokers, the segment angle is 360°/2 = 180°)
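
The angle calculation can be sketched as follows. This is an illustrative helper only (`pie_angles` and the counts are assumptions, not part of the original cards): each category's share of the total is multiplied by 360 degrees.

```python
def pie_angles(frequencies):
    """Map each category to its pie-segment angle in degrees."""
    total = sum(frequencies.values())
    return {cat: 360 * n / total for cat, n in frequencies.items()}

# Hypothetical counts: half the sample are smokers.
smoking = {"smoker": 50, "non-smoker": 50}
angles = pie_angles(smoking)

for cat, angle in angles.items():
    print(f"{cat}: {angle:.0f} degrees")
```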

Question 6

Q

What are the three ways in which we can summarise continuous data?

Answer

A

Summary measures of location (mean/median/mode)
Summary measures of spread (SD, IQR)
Graphically (histogram/boxplot)

Question 7

Q

Describe the production of a histogram

Answer

A

Group the data into ranges and count the number of observations in each group (= frequency distribution)
Plot the number in each range to form a histogram

The bars touch each other to indicate that the data are continuous. The area of each bar is proportional to the number of observations in that range
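
A rough sketch of the grouping step, assuming equal-width ranges that include their lower bound and exclude their upper bound. The function name, the bin settings and the height values are all hypothetical. It prints a simple text histogram rather than using a plotting library.

```python
def frequency_distribution(values, start, width, n_bins):
    """Count observations in equal-width ranges [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if 0 <= i < n_bins:  # values outside the chosen ranges are ignored
            counts[i] += 1
    return counts

# Hypothetical heights in cm (not real data).
heights_cm = [152, 158, 161, 163, 165, 166, 168, 170, 171, 174, 179, 185]
counts = frequency_distribution(heights_cm, start=150, width=10, n_bins=4)

for i, c in enumerate(counts):
    low = 150 + 10 * i
    print(f"{low}-{low + 10}: {'#' * c}")
```

With equal-width ranges, bar height and bar area carry the same information. With unequal widths, the height would need to be divided by the width so that area stays proportional to frequency.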

Question 8

Q

A histogram can demonstrate normal (Gaussian) distribution. Describe the characteristics of this

Answer

A

continuous variables
cluster around a central value
Symmetrical and bell-shaped
95% of the data lie within 1.96 SDs of the mean
when plotted with the fraction of observations on the y-axis, the total area under the curve (AUC) = 1

Question 9

Q

If data is not normally distributed, what is it?

Answer

A

Skewed distribution

positive skew = longer tail on the right
negative skew = longer tail on the left

Question 10

Q

What are the two main elements used to describe data?

Answer

A

1. Location: where on average do the data lie?

2. Spread: how much variation is there in the data?

Question 11

Q

When should you use mean and when should you use median?

Answer

A

If the data are symmetrical (e.g. normally distributed), the mean is appropriate.

If the data are skewed, the median should be used, as it is more resistant to outliers

Question 12

Q

How can skewed data affect the mean and median?

Answer

A

If data is symmetrical, mean = median

If data is left (negative) skew, mean < median

If data is right (positive) skew, mean > median
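
A small illustration of how a long tail pulls the mean away from the median, using Python's standard `statistics` module. The three datasets are made up for the example.

```python
from statistics import mean, median

# Hypothetical datasets chosen to show each case.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]   # one large value gives a long right tail
left_skewed = [-40, 6, 7, 8, 9]   # one small value gives a long left tail

for name, data in [("symmetric", symmetric),
                   ("right skew", right_skewed),
                   ("left skew", left_skewed)]:
    print(f"{name}: mean={mean(data)}, median={median(data)}")
```

In each skewed case only the mean moves toward the tail; the median stays with the bulk of the data.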

Question 13

Q

What are the three main measures of spread with regards to descriptive stats?

Answer

A

SD: a measure of how far, on average, observations deviate from the mean
IQR: quartiles separate the data into 4 equally sized groups. The IQR indicates where the middle 50% of the data lie (25th-75th centile)
Range: the lowest to highest values (or the difference between them)

Question 14

Q

How do you calculate standard deviation?

Answer

A

Work out difference between each measurement and overall mean (=the deviations)
Square the deviations (removes any negatives)
Add up all squared deviations and divide by n-1 (n=no. of measurements). This produces the variance
Take the square root of the variance to obtain the SD
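
The four steps above can be sketched directly in code. The function name `sample_sd` and the readings are hypothetical. Python's built-in `statistics.stdev` uses the same n-1 divisor and should give the same result.

```python
import math

def sample_sd(values):
    """Sample standard deviation, following the four steps one at a time."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]   # step 1: deviations from the mean
    squared = [d ** 2 for d in deviations]    # step 2: square them
    variance = sum(squared) / (n - 1)         # step 3: sum and divide by n-1
    return math.sqrt(variance)                # step 4: square root gives the SD

# Hypothetical readings (mean = 5, sum of squared deviations = 32).
readings = [2, 4, 4, 4, 5, 5, 7, 9]
sd = sample_sd(readings)
print(f"SD = {sd:.3f}")
```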

Question 15

Q

How do you calculate IQR?

Answer

A

lower quartile (QL) = 0.25(n+1)th value
upper quartile (QU) = 0.75(n+1)th value

IQR = QU - QL
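
A sketch of the (n+1) position rule. The card does not say what to do when the position is not a whole number, so linear interpolation between the two neighbouring ordered values is assumed here. The helper names and the ages are hypothetical.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based position; fractional positions are linearly interpolated (assumption)."""
    n = len(sorted_values)
    position = min(max(position, 1), n)  # keep the position inside the data
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return sorted_values[lower - 1]
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)

def iqr(values):
    """Return (lower quartile, upper quartile, IQR) using the (n+1) rule."""
    data = sorted(values)
    n = len(data)
    q_l = value_at_position(data, 0.25 * (n + 1))
    q_u = value_at_position(data, 0.75 * (n + 1))
    return q_l, q_u, q_u - q_l

# Hypothetical ages, n = 11, so QL is the 3rd value and QU is the 9th.
ages = [23, 25, 28, 31, 35, 40, 42, 47, 51, 58, 63]
q_l, q_u, spread = iqr(ages)
print(f"QL={q_l}, QU={q_u}, IQR={spread}")
```

Statistical packages use several different quartile conventions, so their results may differ slightly from this rule for small samples.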

Question 16

Q

The SD tells us about the spread of the data because if the data are normally distributed………?

Answer

A

- approx 68% of the readings lie within 1 SD of the mean
- approx 95% of the readings lie within 2 SD of the mean
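
These proportions can be checked against the theoretical normal curve with `statistics.NormalDist` from the standard library. The helper `proportion_within` is a hypothetical name used only for this sketch.

```python
from statistics import NormalDist

def proportion_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 1.96, 2):
    print(f"within {k} SD: {100 * proportion_within(k):.1f}%")
```

The exact 95% interval corresponds to 1.96 SDs; 2 SDs is the usual rounded rule of thumb.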

Question 17

Q

What are the five ways in which data normality can be assessed?

Answer

A

Plot a histogram: is it symmetrical around a central cluster?
Produce a boxplot: is there symmetry around the median with whiskers of equal length?
Are the mean and median similar? Expected for symmetrical distribution
What is the value for skewness? Should be zero for normal distribution with positive values indicating long right tail and negative values indicating long left tail
What is the value for kurtosis? Excess kurtosis should be zero for a normal distribution. Positive values indicate heavier tails (more observations in the tails, often with a sharper peak); negative values indicate lighter tails (fewer observations in the tails, a flatter distribution)
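
As a rough sketch, skewness and excess kurtosis can be computed from the central moments of the data. This uses the simple population-moment definitions. Statistical packages often apply small-sample corrections, so their values may differ slightly. The function name and datasets are hypothetical.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # subtract 3 so a normal distribution scores 0
    return skew, excess_kurtosis

# Hypothetical datasets.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]

for name, data in [("symmetric", symmetric), ("right skew", right_skewed)]:
    skew, kurt = skewness_and_excess_kurtosis(data)
    print(f"{name}: skewness={skew:.2f}, excess kurtosis={kurt:.2f}")
```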

Question 18

Q

What are the three main measures of location with regards to descriptive statistics?

Answer

A

Mean: sum of all values divided by the no. of observations
Median: central value when all observations are ordered (= 50th centile, the 0.5(n+1)th value)
Mode: most commonly occurring value in the dataset
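
A short sketch of the three measures using Python's standard `statistics` module, with the median also located by its 0.5(n+1) position. The data (number of children per family) are invented for illustration.

```python
from statistics import mean, median, mode

# Hypothetical counts of children per family, already ordered (n = 9).
children = [0, 1, 1, 2, 2, 2, 3, 4, 6]

n = len(children)
central_position = 0.5 * (n + 1)  # 5.0, i.e. the 5th ordered value
median_by_position = sorted(children)[int(central_position) - 1]

print(f"mean   = {mean(children):.2f}")
print(f"median = {median(children)} (position {central_position:.0f})")
print(f"mode   = {mode(children)}")
```

When n is even the position falls halfway between two values, and the median is taken as their average, which `statistics.median` handles automatically.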