Variables & Distribution Flashcards
What is the defining feature of categorical data?
Categorical data have no units; each observation falls into one of a set of categories
What are the three sub-types of categorical data? Give examples
- Binary = only two possible responses (e.g. yes or no, disease or no disease)
- Nominal = three or more categories with no logical order (e.g. blood groups)
- Ordinal = three or more categories with a logical order (e.g. tumour stage, which increases in severity from 1 to 4)
What are the two types of numerical data? Give examples
- Discrete (counted units) e.g. number of children, number of times something happens
- Continuous (measured units) e.g. height, weight
How do we summarise categorical data?
- Count up the no. of observations (=frequency)
- Express these as proportions/percentages of the total no. of individuals
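Worked example (a minimal sketch; the blood-group values and the helper name `summarise_categorical` are made up for illustration): count each category, then express each count as a percentage of the total.

```python
from collections import Counter

def summarise_categorical(values):
    """Return {category: (frequency, percentage of total)}."""
    counts = Counter(values)   # frequency of each category
    total = len(values)        # total number of individuals
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

# Hypothetical sample of 8 blood groups
blood_groups = ["O", "A", "O", "B", "A", "O", "AB", "O"]
summary = summarise_categorical(blood_groups)

for group, (freq, pct) in sorted(summary.items()):
    print(f"{group}: n={freq} ({pct:.1f}%)")
```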
What are the two ways in which we can display categorical data?
- Table format
- Graphically (e.g. bar chart, pie chart)
What are the differences between a bar chart and a pie chart?
A bar chart consists of a bar for each category with length of bars proportional to the frequencies. Bars do not touch as data are not continuous but fall into distinct categories
In a pie chart the area (and so the angle) of each segment is proportional to the frequency in that category (e.g. if 50% are smokers, the angle of that segment is 360/2 = 180 degrees)
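Worked example (a sketch only; `pie_angles` is a hypothetical helper and the counts are invented): each segment's angle is its share of the total multiplied by 360 degrees.

```python
def pie_angles(frequencies):
    """Map each category to its segment angle in degrees."""
    total = sum(frequencies.values())
    return {cat: 360 * n / total for cat, n in frequencies.items()}

# Hypothetical counts: half the sample are smokers
angles = pie_angles({"smoker": 50, "non-smoker": 50})
print(angles)
```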
What are the three ways in which we can summarise continuous data?
- Summary measures of location (mean/median/mode)
- Summary measures of spread (SD, IQR)
- Graphically (histogram/boxplot)
Describe the production of a histogram
- group the data into ranges and then count the number of observations in each group (=frequency distribution)
- plot the number in each range to form a histogram
The bars touch each other to indicate that the data are continuous. The area of each bar is proportional to the number of observations in that range
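Worked example (a minimal sketch with made-up heights; the function name, the bin start and the bin width are assumptions chosen for illustration): group the values into equal-width ranges, count each range, then draw one bar per range.

```python
def frequency_distribution(values, start, width, n_bins):
    """Count observations in n_bins equal-width ranges [lo, lo + width)."""
    counts = [0] * n_bins
    for v in values:
        idx = int((v - start) // width)   # which range the value falls in
        if 0 <= idx < n_bins:             # ignore values outside all ranges
            counts[idx] += 1
    return counts

# Hypothetical heights in cm
heights_cm = [152, 158, 161, 163, 165, 167, 168, 171, 174, 183]
counts = frequency_distribution(heights_cm, start=150, width=10, n_bins=4)

# Text histogram: one row per range, one '#' per observation
for i, c in enumerate(counts):
    lo = 150 + 10 * i
    print(f"{lo} to <{lo + 10}: {'#' * c}")
```

Because the ranges here all have the same width, bar height and bar area are proportional to each other.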
A histogram can demonstrate normal (Gaussian) distribution. Describe the characteristics of this
- continuous variables
- cluster around a central value
- Symmetrical and bell-shaped
- 95% of the data lie within 1.96 SDs of the mean
- when plotted with proportion (density) on the y-axis, the area under the curve (AUC) = 1
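Quick check (a sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative): the proportion of a normal distribution lying within 1.96 SDs of the mean can be read off the cumulative distribution function.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

# Proportion of the distribution between -1.96 and +1.96 SDs
within_196 = z.cdf(1.96) - z.cdf(-1.96)

# Going the other way: the cut-off that leaves 2.5% in the upper tail
cutoff = z.inv_cdf(0.975)

print(f"within 1.96 SD: {within_196:.4f}")
print(f"97.5th centile: {cutoff:.4f}")
```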
If data is not normally distributed, what is it?
Skewed distribution
- positive skew = longer tail on the right
- negative skew = longer tail on the left
What are the two main elements used to describe data?
- Location: where on average do the data lie?
- Spread: how much variation is there in the data?
When should you use mean and when should you use median?
If data is symmetrical (normally distributed), then the mean is fine.
If data is skewed, median should be used as this is more resistant to outliers
How can skewed data affect the mean and median?
If data is symmetrical, mean = median
If data is left (negative) skew, mean < median
If data is right (positive) skew, mean > median
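Quick check (a sketch with invented numbers): a long tail pulls the mean towards it, while the median barely moves. A long right tail therefore puts the mean above the median, and a long left tail puts it below.

```python
from statistics import mean, median

# Hypothetical samples
symmetric    = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]           # long tail on the right
left_skewed  = [-20, -4, -3, -3, -3, -2, -2, -1]   # long tail on the left

for name, data in [("symmetric", symmetric),
                   ("right skew", right_skewed),
                   ("left skew", left_skewed)]:
    print(f"{name}: mean={mean(data)}, median={median(data)}")
```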
What are the three main measures of spread with regards to descriptive stats?
- SD: a measure of how far each observation deviates from the mean
- IQR: quartiles separate the data into 4 equally sized groups. The IQR indicates where the middle 50% of the data lie (25th-75th centile)
- Range: the lowest to highest values (or the difference between them)
How do you calculate standard deviation?
- Work out difference between each measurement and overall mean (=the deviations)
- Square the deviations (removes any negatives)
- Add up all squared deviations and divide by n-1 (n=no. of measurements). This produces the variance
- Take the square root of the variance to obtain the SD
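The four steps above can be sketched as follows (illustrative only; `sample_sd` and the data are made up, and the result is cross-checked against the standard library's `statistics.stdev`, which also divides by n-1):

```python
import math
import statistics

def sample_sd(xs):
    """Sample standard deviation, following the steps one at a time."""
    n = len(xs)
    m = sum(xs) / n                          # overall mean
    deviations = [x - m for x in xs]         # step 1: deviations from the mean
    squared = [d * d for d in deviations]    # step 2: square them
    variance = sum(squared) / (n - 1)        # step 3: sum and divide by n-1
    return math.sqrt(variance)               # step 4: square root gives the SD

# Hypothetical measurements
data = [2, 4, 4, 4, 5, 5, 7, 9]
sd = sample_sd(data)
print(f"SD = {sd:.3f}")
print(f"statistics.stdev = {statistics.stdev(data):.3f}")
```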
How do you calculate IQR?
- lower quartile (QL) = 0.25(n+1)th value of the ordered data
- upper quartile (QU) = 0.75(n+1)th value of the ordered data
IQR = QU - QL
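Worked example (a sketch; `value_at_position` is a hypothetical helper and the data are invented). When 0.25(n+1) or 0.75(n+1) is not a whole number, this sketch assumes linear interpolation between the two neighbouring values, which is one common convention rather than the only one.

```python
import statistics

def value_at_position(sorted_xs, p):
    """Value at the p*(n+1)th position (1-indexed) of already-sorted data.

    Assumption: fractional positions are linearly interpolated.
    """
    n = len(sorted_xs)
    pos = min(max(p * (n + 1), 1), n)   # keep the position inside 1..n
    lower = int(pos)                    # whole part of the position
    frac = pos - lower                  # fractional part
    if lower == n:
        return sorted_xs[-1]
    below, above = sorted_xs[lower - 1], sorted_xs[lower]
    return below + frac * (above - below)

# Hypothetical data, n = 11, so QL is the 3rd value and QU the 9th
data = sorted([2, 4, 5, 7, 8, 9, 10, 12, 15, 18, 21])
ql = value_at_position(data, 0.25)
qu = value_at_position(data, 0.75)
iqr = qu - ql
print(f"QL={ql}, QU={qu}, IQR={iqr}")

# Cross-check: the standard library's default "exclusive" method
# also places quartiles using n+1 positions
print(statistics.quantiles(data, n=4))
```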