Definitions from scratch Flashcards
Two types of variable
Metric variables
Categorical variables
Categorical variables can be
Nominal: relates to named things i.e. it is NOT numeric
It is categorical because we allocate each bit of data to a category e.g. male or female
Ordinal
Nominal data
= categorical nominal variable
Nominal: relates to named things
Category: each data point is placed in a category
Properties of nominal data:
- They do not have units of measurement
- The ordering of categories is arbitrary i.e. does not matter
Example:
Males: 45
Females: 72
Properties of nominal data
Categorical nominal variable
- They do not have units
- The ordering of categories is arbitrary
Example:
Males: 43
Females: 52
OR
Females: 52
Males: 43
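A minimal sketch of the point above, using a made-up sample (the labels and counts are assumptions, not the figures on the card): counting nominal data gives the same information whichever category is listed first.

```python
from collections import Counter

# Hypothetical sample: each observation is a label, not a number.
sexes = ["male", "female", "female", "male", "female"]

counts = Counter(sexes)

# Listing the categories in either order describes the same data.
order_a = [("male", counts["male"]), ("female", counts["female"])]
order_b = [("female", counts["female"]), ("male", counts["male"])]

print(order_a)
print(order_b)
```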
Ordinal data
=categorical ordinal data
Ordinal data is categorical but it can be ordered in a meaningful way i.e. smallest to largest
e.g. Glasgow coma scale
If person A has a GCS of 5 and person B has a GCS of 10, we can conclude that person A’s consciousness is lower, BUT we CANNOT conclude by how much, i.e. we CANNOT say half as much
The difference between adjacent scores is not constant
The seemingly numeric values are NOT numbers, but labels
Properties of ordinal data
Categorical ordinal data
- Does not have units
- CAN be ordered in a meaningful way
- Nearly always integers
- Assessed rather than measured
NOTE: they do not have a numeric value. They seem to have numeric values, but these are actually labels, i.e. a GCS of 3 is saying that the person fits into a category called GCS 3
What you shouldn’t do with ordinal data
YOU SHOULD NOT TREAT THEM AS NUMBERS
i.e. for ordinal data you should not add, divide, or average it
Ordinal data = number labels
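One way to picture "labels with an order" in code. This is only a sketch: the three-level scale and the patient list are hypothetical. The labels are compared by their position on the scale and a middle category is picked out, but nothing is added, divided, or averaged.

```python
# Hypothetical ordinal scale: the labels have an order but no fixed spacing.
SCALE = ["mild", "moderate", "severe"]
position = {label: i for i, label in enumerate(SCALE)}

def is_lower(a, b):
    """True if category a sits below category b on the scale."""
    return position[a] < position[b]

def median_category(labels):
    """Middle label after sorting by scale order (odd-sized samples only)."""
    ordered = sorted(labels, key=position.get)
    return ordered[len(ordered) // 2]

patients = ["severe", "mild", "moderate", "mild", "severe"]
print(is_lower("mild", "severe"))
print(median_category(patients))
```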
Metric variables can be
Discrete: values occur in discrete intervals i.e. 1, 2, 3, 4, 5, ...
- comes from counting i.e. number of operations
- difference between each count is constant (in comparison to ordinal data)
- 4 operations is twice as many as 2 operations
Continuous:
Properties of discrete data
Discrete metric data
- Has units
- Discrete variables come from counting, so they are real numbers that take integer values
Continuous data
Continuous metric data
- Values form a continuum
- Real numbers
- Has units
Frequency table
Used to illustrate descriptive statistics
Frequency distribution
Illustrates the number of events in each category
Relative frequency
= percentages
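A small sketch of a frequency table with relative frequencies, assuming a made-up list of blood groups (the data and the function name `frequency_table` are illustrative only):

```python
from collections import Counter

# Hypothetical data: blood groups of ten patients.
blood_groups = ["O", "A", "O", "B", "A", "O", "AB", "O", "A", "O"]

def frequency_table(values):
    """Return {category: (count, percentage of all observations)}."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

table = frequency_table(blood_groups)
for category, (count, percent) in sorted(table.items()):
    print(f"{category:>2}  {count:>2}  {percent:5.1f}%")
```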
Contingency table
Cross tabulations
Illustrate association between two variables in a single population
Rows hold the categories of one variable and columns hold the categories of the other
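A cross tabulation can be sketched by counting pairs. The records below are invented for illustration, and `contingency_table` is a hypothetical helper, not a library function.

```python
from collections import Counter

# Hypothetical paired observations: (smoking status, disease status).
records = [
    ("smoker", "disease"), ("smoker", "no disease"),
    ("smoker", "disease"), ("non-smoker", "no disease"),
    ("non-smoker", "no disease"), ("non-smoker", "disease"),
]

def contingency_table(pairs):
    """Count how often each (row category, column category) pair occurs."""
    cells = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    return {r: {c: cells[(r, c)] for c in cols} for r in rows}

table = contingency_table(records)
for row, cells in table.items():
    print(row, cells)
```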
Ranking data
Allows assessment of non-parametric data
Order the data by size
Start with the largest value and give it rank 1
Give the next value rank 2
Equal values are tied: each gets the average of the ranks the tied values would have occupied, e.g. 7 8 5 5 5 3 1
8: 1
7: 2
5: =4
5: =4
5: =4 (ranks 3, 4, 5; average = 4)
3: 6
1: 7
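The ranking rule above can be sketched as follows. `rank_descending` is a hypothetical helper written for this card, and it returns the ranks in the same order as the input values.

```python
def rank_descending(values):
    """Rank values from largest (rank 1) down, averaging ranks for ties."""
    ordered = sorted(values, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Find the end of the run of equal values.
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # Positions i+1 .. j are shared; assign their average.
        ranks[ordered[i]] = (i + 1 + j) / 2
        i = j
    return [ranks[v] for v in values]

print(rank_descending([7, 8, 5, 5, 5, 3, 1]))
# [2.0, 1.0, 4.0, 4.0, 4.0, 6.0, 7.0]
```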
Ogive
Pronounced ojive
Cumulative frequency curve with continuous metric data
Curved (no step) chart
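The points of an ogive are running totals. A sketch with invented class boundaries and frequencies (both are assumptions made for illustration):

```python
from itertools import accumulate

# Hypothetical grouped data: upper class boundaries and class frequencies.
upper_bounds = [10, 20, 30, 40, 50]
frequencies = [3, 7, 12, 6, 2]

cumulative = list(accumulate(frequencies))
total = cumulative[-1]
cumulative_percent = [100 * c / total for c in cumulative]

# Each (upper bound, cumulative %) pair is one point on the ogive.
for bound, pct in zip(upper_bounds, cumulative_percent):
    print(f"<= {bound}: {pct:.1f}%")
```

Plotting these points and joining them with a smooth line gives the curve.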
Measures of shape (skew)
Skewness:
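-skewness coefficient: some versions (e.g. the quartile-based Bowley coefficient) are defined from -1 to +1; the moment-based coefficient has no fixed limits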
Kurtosis:
Left skew
= negative skew
Lots of large values
Negative –> peak is further away from y-axis
Right skew
=positive skew
Lots of small values
“Right skew, close to you”
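To check the sign convention, here is a sketch using the moment-based skewness coefficient. That choice is an assumption (the cards do not say which coefficient is meant), and the samples are made up.

```python
def skewness(values):
    """Moment-based (population) skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 1, 2, 2, 3, 10]        # mostly small values, tail to the right
left_skewed = [-x for x in right_skewed]  # mirror image, tail to the left
symmetric = [1, 2, 3, 4, 5]

print(skewness(right_skewed) > 0)  # positive skew
print(skewness(left_skewed) < 0)   # negative skew
```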
Distributions
Symmetric: classic one humped distribution
Bimodal: two peaks
Multimodal: multiple peaks
Kurtosis
Measure of how heavy the tails of a distribution are
Distributions with large kurtosis exhibit tail data exceeding the tails of the normal distribution (e.g., five or more standard deviations from the mean).
Skewness differentiates extreme values in one versus the other tail, kurtosis measures extreme values in either tail.
Kurtosis is not a measure of spread: if you hold the variance the same and increase the kurtosis, more of the distribution moves into the tails rather than the peak getting flatter and broader
Kurtosis value of normal distribution
=3
(excess kurtosis value = 0, i.e. the excess subtracts 3 from the calculation)
(uniform distribution = 1.8, i.e. excess kurtosis = -1.2)
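A sketch of kurtosis and excess kurtosis for small samples, using the population formula m4 / m2². The two data sets are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
def kurtosis(values):
    """Population kurtosis m4 / m2**2 (a normal distribution gives 3)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2

def excess_kurtosis(values):
    """Kurtosis measured relative to the normal distribution's value of 3."""
    return kurtosis(values) - 3

flat = [1, 2, 3, 4, 5]                            # no extreme values
heavy_tailed = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]  # two extreme values

print(kurtosis(flat), excess_kurtosis(flat))
print(kurtosis(heavy_tailed), excess_kurtosis(heavy_tailed))
```

The flat sample comes out below 3 and the heavy-tailed sample above 3, matching the idea that kurtosis responds to extreme values in either tail.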
Mode
Useful in categorical data
Useless in continuous data when no two values likely to be the same
Median
CAN BE USED FOR ORDINAL AND CONTINUOUS DATA
Discards a lot of information
Not as affected by skew vs mean
Not as affected by outliers vs mean
Therefore median is a stable measure
Mean
Uses all the data - each value is included
Therefore subjected to effect from outliers and skew
Cannot be performed on ordinal data
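A quick illustration of how one outlier pulls the mean but barely moves the median, using Python's standard `statistics` module. The lengths of stay are invented.

```python
from statistics import mean, median

# Hypothetical lengths of stay in days; the added value is an outlier.
stays = [2, 3, 3, 4, 5]
stays_with_outlier = stays + [60]

print(mean(stays), median(stays))
print(mean(stays_with_outlier), median(stays_with_outlier))
```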
Percentiles
Values that divide a data set into 100 equal-sized groups
To find the position of a percentile in the ordered data, multiply the percentage (as a decimal) by (n+1)
Where n is equal to number of data points
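A sketch of the (n+1) rule. The card only gives the position; interpolating between the two neighbouring values when the position is not a whole number is an assumption added here, as are the function name and the data.

```python
def percentile(values, percent):
    """Percentile by the (n + 1) position rule, interpolating between neighbours."""
    ordered = sorted(values)
    n = len(ordered)
    position = (percent / 100) * (n + 1)  # 1-based position in the ordered data
    # Clamp so very low or very high percentiles stay inside the data.
    position = min(max(position, 1), n)
    lower = int(position)                 # whole part of the position
    fraction = position - lower
    if lower == n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

data = [3, 5, 7, 8, 9, 11, 13, 15, 17]  # n = 9

print(percentile(data, 50))  # position 0.50 * 10 = 5, the 5th value
print(percentile(data, 25))  # position 0.25 * 10 = 2.5, halfway between 2nd and 3rd
```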
Properties of using range
Lowest to highest value
Not affected by skew
Sensitive to outliers, which can make the range misleading
Interquartile range
Removes 25% from each end
Reduces effect of outliers
Affected by skewed distributions
Limitations of interquartile range
Discards 50% of the data!
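To compare the two measures of spread, here is a sketch using `statistics.quantiles` from the standard library (Python 3.8+), whose default method places quartiles using (n+1) positions. The data are hypothetical.

```python
from statistics import quantiles

def value_range(values):
    """Highest value minus lowest value."""
    return max(values) - min(values)

def interquartile_range(values):
    """Upper quartile minus lower quartile."""
    q1, _, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    return q3 - q1

data = [3, 5, 7, 8, 9, 11, 13, 15, 17]
data_with_outlier = data + [100]

print(value_range(data), interquartile_range(data))
print(value_range(data_with_outlier), interquartile_range(data_with_outlier))
```

Adding the single outlier stretches the range a great deal, while the interquartile range changes only slightly.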