Week Three - Data Types/Variables/Descriptive Statistics Flashcards
What are the 4 main data types?
categorical/nominal
ordinal
interval
ratio
Define Categorical/Nominal Variables
Discrete
An arbitrary label (e.g., male, non-smoker)
Label can be nominal or numerical
Nominal: vanilla, chocolate, strawberry
Numerical: 1, 18, 7
Only valid mathematical operation is counting
Define key characteristics of an Ordinal Scale
Discrete
Inherent order (ranks)
Some information about quantity
Movement along the scale indicates a change in amount, but doesn’t indicate how much change
Can perform logical comparisons on this scale (greater than, less than)
(e.g., kinder, primary, high school, college, bachelor, masters, PhD)
Define key characteristics of an Interval Scale
Interval Scales
Order + equal intervals
Continuous (though measurement may not be)
Mathematical operations (addition, subtraction)
How much more (or less) of something is there?
Does not have true zero
If the scale has zero in it, 0 does not mean absence of the thing.
E.g., temperature (Celsius)
0° vs 5°; 25° vs 30°: the difference is 5° in both cases (0 °C does not mean no heat)
Define the key characteristics of Ratio Scales
Order, equal intervals + a true zero
Physical quantities are ratio scale (mass, length, time, etc.)
0 kg = absence of mass; 0 meters = absence of length
Can calculate ratios of different values
50 kg is twice as much as 25 kg
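A minimal Python sketch of why ratios need a true zero. The masses and temperatures are illustrative values, and the variable names are made up for this example; the Celsius to Kelvin conversion (add 273.15) is standard.

```python
# Ratio scale: mass has a true zero, so a ratio is meaningful.
mass_a_kg, mass_b_kg = 50.0, 25.0
mass_ratio = mass_a_kg / mass_b_kg  # 50 kg really is twice 25 kg

# Interval scale: Celsius has an arbitrary zero.
temp_a_c, temp_b_c = 30.0, 15.0
difference_c = temp_a_c - temp_b_c  # differences are meaningful
naive_ratio = temp_a_c / temp_b_c   # arithmetic works, but 30 °C is not "twice as hot"

# On the Kelvin scale (true zero) the ratio is much smaller.
kelvin_ratio = (temp_a_c + 273.15) / (temp_b_c + 273.15)

print(mass_ratio, difference_c, naive_ratio, round(kelvin_ratio, 3))
```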
What are the 2 forms of discrete variables?
Categorical and ordinal
What is the Mode?
Most commonly occurring value in a set
Sample can have more than one mode
Bimodal = two modal values
Multimodal = more than two modal values
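A short sketch of finding every mode with Python's standard library (statistics.multimode, Python 3.8+). The scores are illustrative data.

```python
from statistics import multimode

scores = [2, 3, 3, 5, 7, 7, 9]  # illustrative data: 3 and 7 each occur twice

modes = multimode(scores)       # list of all values tied for most frequent
is_bimodal = len(modes) == 2

print(modes, is_bimodal)
```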
What is the Median?
Same number of observations below and above the median (middle number)
What is the Mean?
Value around which scores are distributed (average)
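A small sketch of the median and mean using Python's statistics module. The data are illustrative; with an even number of observations, statistics.median averages the two middle values.

```python
from statistics import mean, median

odd_sample = [1, 3, 5]       # illustrative data, odd number of observations
even_sample = [1, 3, 5, 7]   # illustrative data, even number of observations

median_odd = median(odd_sample)    # the single middle value
median_even = median(even_sample)  # average of the two middle values (3 and 5)
mean_even = mean(even_sample)      # sum of values divided by their count

print(median_odd, median_even, mean_even)
```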
What are the 3 most commonly used measures of spread/dispersion?
Range
IQR
Sample SD
Define the ‘Range’. What happens if a range score is an outlier?
Maximum - Minimum
If min and/or max is an outlier, the range overestimates variability in the data
Range tends to increase as sample size increases
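A minimal sketch of how a single outlier inflates the range. The helper name data_range and the values are assumptions for illustration.

```python
def data_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)

typical = [12, 14, 15, 15, 17, 18]  # illustrative data
with_outlier = typical + [60]       # same data plus one extreme value

range_typical = data_range(typical)
range_with_outlier = data_range(with_outlier)

print(range_typical, range_with_outlier)
```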
Define ‘quartiles’
Quartiles are the cut points that divide the ordered data into four equal-sized groups
What is the lower quartile?
What is the upper quartile?
Lower quartile (Q1) = 25th percentile; upper quartile (Q3) = 75th percentile
What is the IQR?
What does it measure?
Bigger IQR = ?
IQR = Q3 - Q1 (the difference between the upper and lower quartiles)
IQR measures the spread of the middle 50% of the data
Bigger IQR = greater dispersion
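A sketch of computing quartiles and the IQR with statistics.quantiles (Python 3.8+). The data are illustrative. Quartile conventions differ between textbooks and software; this uses the function's default "exclusive" method, so other tools may give slightly different values.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # illustrative data

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1  # spread of the middle 50% of the data

print(q1, q2, q3, iqr)
```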
What is variance? What does it measure?
A measurement of the spread between numbers in a data set.
It measures how far each number in the set is from the mean and therefore from every other number in the set.
Variance is roughly the average of the squared differences from the mean (the sample variance divides by n - 1 rather than n)
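A worked sketch of the variance calculation, checked against Python's statistics module. The data are illustrative. Dividing by n gives the population variance; dividing by n - 1 gives the sample variance, whose square root is the sample SD.

```python
from statistics import mean, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data
m = mean(data)

# Squared difference of each value from the mean.
squared_diffs = [(x - m) ** 2 for x in data]

population_variance = sum(squared_diffs) / len(data)    # divide by n
sample_variance = sum(squared_diffs) / (len(data) - 1)  # divide by n - 1
sample_sd = sample_variance ** 0.5

print(population_variance, round(sample_variance, 3), round(sample_sd, 3))
```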
What descriptive statistics can ordinal data be used with?
Median
Quartiles
IQR
Data values tend to be symmetrically clustered around which descriptive statistic?
What distribution does this tend to have?
The mean
Normal bell-shaped distribution
Characteristics of Positive Skewness
right-skewed = data has high values more spread out than low values (the mean is dragged towards the right tail)
Characteristics of Negative Skewness
left-skewed = data has low values more spread out than high values (the mean is dragged to the left end)
What type of graph can be used to assess skewness?
histograms
For symmetric data, mean and median are usually what?
approx equal
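A rough sketch of checking skewness numerically. The skewness function below is one common moment-based definition (third central moment divided by the cubed population SD), chosen here as an assumption; the data sets are illustrative.

```python
from statistics import mean, median

def skewness(values):
    """Moment-based skewness: third central moment over the cubed population SD."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 12]  # long right tail
left_skewed = [-x for x in right_skewed]  # mirror image: long left tail
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(data), 2), "mean:", mean(data), "median:", median(data))
```

Positive values indicate right skew (mean above the median here), negative values indicate left skew, and symmetric data give a value of zero with mean equal to median.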
What is Kurtosis?
The shape of the two tails
Long and fat tails mean? (kurtosis)
high kurtosis (leptokurtic); often accompanied by a sharper peak
Flat distribution and thin tails refers to?
low kurtosis (platykurtic)
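A rough sketch of measuring kurtosis numerically. It uses the common "excess kurtosis" definition (fourth central moment over the squared population variance, minus 3), which is an assumption here since the cards give no formula; a normal distribution scores about 0, heavier tails score above 0 and lighter tails below 0. The data sets are illustrative.

```python
from statistics import mean

def excess_kurtosis(values):
    """Fourth central moment over squared population variance, minus 3."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m4 = sum((x - m) ** 4 for x in values) / n
    return m4 / m2 ** 2 - 3

heavy_tailed = [-10, -1, 0, 0, 0, 0, 0, 1, 10]  # a few extreme values in the tails
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]              # evenly spread, no extreme values

print(round(excess_kurtosis(heavy_tailed), 2), round(excess_kurtosis(flat), 2))
```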
What is EDA? Exploratory Data Analysis
Refers to procedures designed to present data in an informative way, using graphical, pictorial and summary methods
e.g., graphs and tables
What are the 2 ways to present/summarise data for one categorical variable?
Frequency tables and pie charts
Define frequency
Frequency represents the count of observations in each category
What is relative frequency?
Refers to the proportion of the whole represented by the counts in a category
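A small sketch of a frequency table for one categorical variable using collections.Counter. The flavour data are illustrative.

```python
from collections import Counter

flavours = ["vanilla", "chocolate", "vanilla", "strawberry", "vanilla", "chocolate"]

counts = Counter(flavours)    # frequency: count of observations per category
total = sum(counts.values())

# Relative frequency: each count as a proportion of the whole.
relative = {category: count / total for category, count in counts.items()}

for category, count in counts.most_common():
    print(f"{category:<12}{count:>3}{relative[category]:>8.2f}")
```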
Describe a pie chart
graphical representations used for a single categorical variable with, typically, few categories
What can be used to summarise data from one or more categorical variables?
Bar/column graphs
What 3 things can be used to summarise data for one continuous variable?
Stem and leaf plots
histograms
box plots
Describe Stem-and-leaf plots
Group data into intervals of equal length
Actual values of the variable are retained, possibly in a rounded form
Each observation is represented by its last digit
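A minimal sketch of building a stem-and-leaf plot. It assumes non-negative two-digit integers, with the tens digit as the stem and the last digit as the leaf; the function name and data are illustrative.

```python
def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (last digits)."""
    ordered = sorted(values)
    # Include every stem between the smallest and largest, even if empty,
    # so the intervals stay equal in length.
    stems = {stem: [] for stem in range(ordered[0] // 10, ordered[-1] // 10 + 1)}
    for value in ordered:
        stems[value // 10].append(value % 10)
    return stems

data = [12, 15, 21, 23, 23, 28, 34, 37, 41]  # illustrative data
plot = stem_and_leaf(data)

for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```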
Describe Histograms
Group the data, usually into equal-sized intervals
The area of each box in a histogram is proportional to the frequency of values in the corresponding interval
The intervals into which the data are grouped are often called ‘bins’
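A minimal sketch of grouping data into equal-width bins and counting the frequency in each. The function name, bin settings and data are assumptions for illustration. With equal widths, bar height and bar area are both proportional to frequency.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values in bins [start, start + width), [start + width, start + 2 * width), ..."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:  # ignore values outside the binned range
            counts[index] += 1
    return counts

data = [1, 2, 2, 3, 5, 6, 6, 7, 9]  # illustrative data
counts = histogram_counts(data, start=0, width=5, n_bins=2)

for i, count in enumerate(counts):
    print(f"[{i * 5}, {(i + 1) * 5}): {'#' * count}")
```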
Describe Box Plots
Way of presenting continuous data and giving a picture of how the data are distributed
Focus is on the central 50 per cent of the data (median and IQR)
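Whiskers cover the remaining data, extending to the most extreme values within 1.5 × IQR of the box; points beyond that are plotted individually as outliers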
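A sketch of the numbers behind a box plot, assuming the common convention that whiskers stop at the most extreme data values within 1.5 × IQR of the quartiles. The data are illustrative, and the quartiles use the default method of statistics.quantiles, so other software may differ slightly.

```python
from statistics import quantiles

data = sorted([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30])  # illustrative data with one extreme value

q1, q2, q3 = quantiles(data, n=4)  # box edges (q1, q3) and median (q2)
iqr = q3 - q1

# Fences mark the furthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data values inside the fences.
lower_whisker = min(x for x in data if x >= lower_fence)
upper_whisker = max(x for x in data if x <= upper_fence)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, lower_whisker, upper_whisker, outliers)
```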
What can be used to summarise data from one continuous and one categorical variable?
Histograms, box plots, bar graphs and other graphs and plots can be used to compare continuous data across different categories (levels) of a categorical variable
Plots must be drawn on the same scale for the continuous variable so the categories can be compared
What can be used to summarise data from more than one continuous variable?
scatterplots
Describe scatter plots
Used to consider the relationship between two quantitative variables
They are particularly useful for understanding the relationship between a response and an explanatory variable
What are some principles of good graphs?
Images are clear
Lines are smooth and sharp
Font is legible and simple
Units of measurement are provided
Axes are clearly labeled
Examples of bad graphs include?
Graphs containing outright mistakes (such as the percentages in a pie chart not summing to 100)
Three-dimensional graphs in which the third dimension does not represent anything and obscures or distorts the information that should be represented
Bar charts with a scale starting above zero