Lecture 2 - Descriptive Statistics & Exploratory Data Analysis Flashcards
What is Descriptive Statistics?
Uses numerical and graphical methods to summarize and describe data in a clear and understandable way
What is Inferential Statistics?
Draws inferences about a larger population from a sample
What is EDA?
An approach to data analysis to gain useful insight into the data
Inspect data without any assumptions
1st step of data analysis
Use of both numerical and graphical methods to describe data distribution
What are the 4 functions/benefits/aims/purposes of EDA?
Provide descriptive statistics by
- Giving an overall view of data
- Helping to visualize distributions and relationships
Detect anomalies including outliers
Assess the assumptions for confirmatory analysis (i.e. inferential statistics/tests)
Help to decide on the appropriate confirmatory analysis
What are the 2 methods of EDA? Compare them in terms of benefits/usefulness/aims
Numerical methods are more precise and objective
Graphical methods are better for identifying patterns in the data
What are the 5 key characteristics of distribution data to explore? Describe them (simply)
Centre: middle of the range of values, where the most frequently occurring data values are often found
Spread: amount of variability in the values away from the centre
Shape: includes number of peaks and symmetry of distribution
Gaps: segments within the range from minimum to maximum data values without data value(s)
Outliers: data values that are very different from all other values (extremes)
What is Numerical/Summary Statistics?
Numerical values that describe
- Central tendency of data
- Dispersion/variability of data
What is Central tendency of data?
Mean: arithmetic average = sum of all values/number of values (observations)
Median: middle observation (50th percentile)
Mode: most frequent observation
On their own, these measures are not enough to describe a data distribution
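A minimal sketch of the three measures above, using Python's standard `statistics` module. The `scores` values are made up for illustration only.

```python
import statistics

# Hypothetical sample of 7 observations (illustrative values only)
scores = [2, 3, 3, 5, 7, 8, 14]

mean = statistics.mean(scores)      # sum of all values / number of values
median = statistics.median(scores)  # middle observation once sorted
mode = statistics.mode(scores)      # most frequent observation

print("mean:", mean, "median:", median, "mode:", mode)
```

Here the three measures disagree (6, 5 and 3), which is one reason no single measure is enough on its own.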
When to use mean, median and mode?
Mean
- preferred and widely used
- quantitative (interval & ratio) data
- considers both number of values and values themselves
- sensitive to extreme values (outliers)
Median
- considers number of values and rank order of values
- robust to outliers
- ordinal, quantitative data that are highly skewed
Mode
- seldom used
- nominal data
- considers only frequency
- robust to outliers but ignores the rank order and the actual values of all other observations
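The sensitivity of the mean to outliers can be checked with a small sketch. Both lists below are hypothetical; they differ only in the last value.

```python
import statistics

# Hypothetical data: identical samples except for one extreme value
without_outlier = [4, 5, 5, 6, 7]
with_outlier = [4, 5, 5, 6, 70]

for label, data in [("without outlier", without_outlier),
                    ("with outlier", with_outlier)]:
    print(label,
          "mean:", statistics.mean(data),
          "median:", statistics.median(data),
          "mode:", statistics.mode(data))
```

The mean moves from 5.4 to 18, while the median and mode stay at 5.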
2 misuses of descriptive statistics
Inappropriate choice of measure of central tendency
Reporting absolute values where percentages (relative values) are needed, or vice versa; need to consider the context and other information
What are the measures of variability/dispersion of data? Give some of the respective equations/formulas too
Range: difference between the maximum and minimum values
Interquartile range: difference between the values at the 75th and 25th percentile points
Variance: average of squared deviations from the mean
- [sum (X-mean)^2]/(N - 1) = SS/(N - 1), where SS is the sum of squares, and N - 1 is used because a sample loses one degree of freedom when its mean is estimated from the same data
- Why sum of squared deviations and not sum of deviations? Ans: Mean is the balance point of the data values => the +ve deviations (values above the mean) and the -ve deviations (values below the mean) cancel exactly. Therefore, the sum of deviations is always zero
Standard deviation: Square root of variance or Square root of average of squared deviations from the mean i.e. {[sum (X-mean)^2]/(N - 1)}^0.5 = [SS/(N - 1)]^0.5
Coefficient of variation (CV): measure of relative dispersion
- If 2 or more distributions to be compared are expressed in the same units and have similar means, then their variability can be compared directly by comparing their SD
- But if means are very different or are expressed in different units, then need another method
- e.g. mean of 1 (SD 1) vs mean of 100 (SD 1)
- Normalize SD to mean to compare 2 or more distributions
- CV = SD/mean x 100
- No unit and expressed as a percentage
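The variance, SD and CV formulas above can be traced step by step in a short sketch. The sample `x` is hypothetical, and `coefficient_of_variation` is an illustrative helper name, not a library function.

```python
import math
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample
n = len(x)
mean = sum(x) / n

deviations = [xi - mean for xi in x]
sum_of_deviations = sum(deviations)    # +ve and -ve deviations cancel
ss = sum(d ** 2 for d in deviations)   # SS = sum (X - mean)^2
variance = ss / (n - 1)                # SS / (N - 1)
sd = math.sqrt(variance)               # [SS / (N - 1)]^0.5

def coefficient_of_variation(sd, mean):
    """CV = SD / mean x 100, a unitless percentage."""
    return sd / mean * 100

# The example from the notes: same SD, very different means
cv_small_mean = coefficient_of_variation(1, 1)
cv_large_mean = coefficient_of_variation(1, 100)

print(sum_of_deviations, ss, variance, sd)
print(cv_small_mean, cv_large_mean)
```

With the same SD of 1, the CV is 100% for a mean of 1 but only about 1% for a mean of 100, so the first distribution is far more variable relative to its mean.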
When to use range, interquartile range, variance and standard deviation?
Range
- Only considers 2 points of data but easy to calculate
- Not often reported but sensitive to extreme outliers, so it gives an idea of the presence of outliers when used with mean and SD
- Reporting the max and min themselves is an alternative and is preferred
IQR
- Considers only 2 points of data (25th and 75th percentiles), which bound the middle 50% of data
- Used when data are ordinal, or when quantitative data are highly skewed; analogous to median
Variance
- Accurate and detailed estimate of variability
- most commonly used measure of “spread” for statistical calculations but usually not reported as a statistic for variability
Standard Deviation
- Accurate and detailed estimate of variability
- Preferred and commonly reported because same unit as mean unlike that of variance
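A rough sketch comparing how the range, IQR and SD react to a single outlier. The data are hypothetical and `spread_summary` is an illustrative helper. Quartile conventions differ between textbooks and software; this sketch assumes the "inclusive" method of `statistics.quantiles`.

```python
import statistics

def spread_summary(values):
    """Return min, max, range, IQR and sample SD for a list of numbers."""
    # Assumption: 'inclusive' quartiles; other conventions give other values
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "sd": statistics.stdev(values),
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical sample
with_outlier = data[:-1] + [90]      # last value replaced by an outlier

print(spread_summary(data))
print(spread_summary(with_outlier))
```

The range and SD jump when the outlier is introduced, while the IQR is unchanged because it only looks at the middle 50% of the data.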
For quantitative, ordinal and nominal variables, which measures of central tendency and measures of variability should we use?
Quantitative: mean & SD or median (when data is highly skewed) & IQR
Ordinal: median & IQR
Nominal: mode (& NIL)
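The rule of thumb above can be written as a small lookup. This is only a sketch: `summarize`, the `level` labels and the `highly_skewed` flag are hypothetical names, and ordinal categories are assumed to be coded as numbers in rank order.

```python
import statistics

def summarize(values, level, highly_skewed=False):
    """Pick summary statistics according to the type of variable."""
    if level == "nominal":
        return {"mode": statistics.mode(values)}   # no measure of variability
    if level == "ordinal" or (level == "quantitative" and highly_skewed):
        # Assumption: 'inclusive' quartile convention
        q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        return {"median": median, "iqr": q3 - q1}
    if level == "quantitative":
        return {"mean": statistics.mean(values), "sd": statistics.stdev(values)}
    raise ValueError(f"unknown level: {level!r}")

print(summarize(["A", "B", "B", "O"], "nominal"))
print(summarize([1, 2, 3, 4, 5], "ordinal"))
print(summarize([2, 4, 4, 4, 5, 5, 7, 9], "quantitative"))
print(summarize([1, 2, 3, 4, 100], "quantitative", highly_skewed=True))
```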
What are graphical methods?
Help to visualize the 5 key characteristics of data distribution
Include
- histogram
- box plot
- scatter plot
- stem and leaf plot
Very important though often not reported in literature
Not the same as graphs presented in results and discussion in inferential statistics
(What are) histograms?
Commonly used for quantitative variables (bar charts are the counterpart for qualitative variables)
Constructed from simple or grouped frequency distribution
When to use histograms?
Useful for frequency distribution
- good for presentation as it is familiar to most people
Part of EDA to determine normality and outliers
- can display all characteristics of data distributions if a suitable bin size is selected
- particularly useful when there are more than 100 data values
- not reliable if sample size too small as may show spurious patterns
(may not contain enough data points to accurately show the distribution of data –> histogram may be distorted –> not reliable)
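A text-only sketch of a grouped frequency distribution, to show why bin size matters. The `ages` data and the helper `grouped_frequencies` are hypothetical, and integer data are assumed.

```python
from collections import Counter

def grouped_frequencies(values, bin_width, start):
    """Count values per bin [lower, lower + bin_width), including empty bins."""
    counts = Counter((v - start) // bin_width for v in values)
    n_bins = max(counts) + 1
    return [(start + i * bin_width, counts.get(i, 0)) for i in range(n_bins)]

ages = [21, 22, 22, 25, 27, 31, 33, 34, 38, 58]   # hypothetical sample

for width in (10, 20):
    print(f"bin width {width}")
    for lower, count in grouped_frequencies(ages, width, start=20):
        print(f"  {lower}-{lower + width - 1}: {'#' * count}")
```

With a bin width of 10 the empty 40-49 bin shows a gap before the value 58; with a bin width of 20 the gap disappears into a wider bin. The sample is also far too small for a reliable histogram, which is the point made above.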
(What are) Stem and leaf plots?
Shows shape and distribution
Shows all the data (no information lost; allows for inspection of individual values)
When to use Stem and leaf plot?
Alternative to histograms but allows for inspection of individual values
Easy to produce by hand and retains all data so useful for quick check
Less “professional” looking, and the display of all digits can be distracting or confusing, so less commonly used for presentations
More useful when there are fewer than 100 data values
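A minimal sketch of a stem and leaf plot, assuming non-negative whole numbers with tens as stems and units as leaves. The `ages` data and the `stem_and_leaf` helper are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: [leaves]} with tens as stems and units as leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    # Keep empty stems so that gaps in the data remain visible
    return {s: plot.get(s, []) for s in range(min(plot), max(plot) + 1)}

ages = [21, 22, 22, 25, 27, 31, 33, 34, 38, 58]   # hypothetical sample

for stem, leaves in stem_and_leaf(ages).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Every original value can be read back from the plot (e.g. stem 3 with leaf 8 is 38), which a histogram does not allow.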
(What is) Box plot?
Median
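25th percentile or first (lower) quartile, and 75th percentile or third (upper) quartile
Lower whisker: smallest non-outlier = smallest data value at or above (first quartile - 1.5 x IQR); equals the min when there are no low outliers
Upper whisker: largest non-outlier = largest data value at or below (third quartile + 1.5 x IQR); equals the max when there are no high outliers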
Outliers (anything beyond whiskers)
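The components of a box plot can be computed directly. This is a sketch under two assumptions: quartiles use the "inclusive" method of `statistics.quantiles`, and each whisker ends at the most extreme data value still inside the 1.5 x IQR fence (a common plotting convention). The data and the `box_plot_stats` helper are hypothetical.

```python
import statistics

def box_plot_stats(values):
    """Return the numbers needed to draw a box plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": inside[0],    # smallest non-outlier
        "upper_whisker": inside[-1],   # largest non-outlier
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 90]   # hypothetical sample with one outlier
print(box_plot_stats(data))
```

For this sample the fences are at -3 and 13, so the whiskers end at 1 and 8 and the value 90 is flagged as an outlier.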
When to use box plots?
useful for both large and small data set
excellent tool for conveying the centre, spread, shape of and outliers in data distributions, especially when comparing different groups of data
part of EDA to determine normality and outliers
disadvantage: does not convey gaps in data distribution
Scatter plot (What is and when to use?)
Display relationship/association between 2 continuous variables
Show features of the relationship
- strength
- shape (linear/curve)
- direction
- outliers
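The notes describe these features graphically. As a numerical companion (not part of the notes), Pearson's correlation coefficient r gives the direction (sign) and strength (size) of a linear relationship only; curved shapes and outliers still need the plot. The helper `pearson_r` and the data are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: sign = direction, magnitude = linear strength."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]      # perfectly linear, increasing
y_down = [10, 8, 6, 4, 2]    # perfectly linear, decreasing

print(pearson_r(x, y_up), pearson_r(x, y_down))
```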
3 points to mention when describing shape of distribution
symmetry vs skewness (compare mean and median)
kurtosis
number of modes
3 types of kurtosis (pointiness/peakedness and tailedness)
leptokurtic: slender; +ve kurtosis; higher peak and heavier tails (many scores at tails)
mesokurtic: bell-shaped (as in the normal distribution); zero kurtosis
platykurtic: broad; -ve kurtosis; flatter peak, lighter tails
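Skewness and kurtosis can also be put into numbers. The sketch below assumes the simple moment-based definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3, with m_k the k-th central moment divided by N); statistical packages often apply small-sample corrections, so their values may differ. All data and function names are hypothetical.

```python
def central_moment(values, k):
    """k-th central moment: mean of (x - mean)^k, dividing by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** k for v in values) / n

def skewness(values):
    """0 for symmetric data, +ve for a long right tail, -ve for a long left tail."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """About 0 mesokurtic, +ve leptokurtic, -ve platykurtic."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]
heavy_tailed = [0, 0, 0, 0, 0, 0, 0, 0, -10, 10]

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric), excess_kurtosis(heavy_tailed))
```

The flat sample `symmetric` gives a negative excess kurtosis (platykurtic), while `heavy_tailed`, with most values at the centre and a few far out, gives a positive one (leptokurtic).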