Lecture 2 - Descriptive Statistics & Exploratory Data Analysis Flashcards
What is Descriptive Statistics?
Uses numerical and graphical methods to summarize and describe data in a clear and understandable way
What is Inferential Statistics?
Draws inferences about a larger population from a sample
What is EDA?
An approach to data analysis to gain useful insight into the data
Inspect data without any assumptions
1st step of data analysis
Use of both numerical and graphical methods to describe data distribution
What are the 4 functions/benefits/aims/purposes of EDA?
Provide descriptive statistics by
- Giving an overall view of data
- Helping to visualize distributions and relationships
Detect anomalies including outliers
Assess the assumptions for confirmatory analysis (i.e. inferential statistics/tests)
Help to decide on the appropriate confirmatory analysis
What are the 2 methods of EDA? Compare them in terms of benefits/usefulness/aims
Numerical methods are more precise and objective
Graphical methods are better for identifying patterns in the data
What are the 5 key characteristics of a data distribution to explore? Describe them (simply)
Centre: middle of the range of values, where the most frequently occurring data values are often found
Spread: amount of variability in the values away from the centre
Shape: includes number of peaks and symmetry of distribution
Gaps: segments within the range from minimum to maximum data values without data value(s)
Outliers: data values that are very different from all other values (extremes)
What is Numerical/Summary Statistics?
Numerical values that describe
- Central tendency of data
- Dispersion/variability of data
What is Central tendency of data?
Mean: arithmetic average = sum of all values/number of values (observations)
Median: middle observation (50th percentile)
Mode: most frequent observation
Alone, not enough to describe a data distribution
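Example (a minimal sketch in Python; the data values are hypothetical and only illustrate the three definitions above, using the standard `statistics` module):

```python
import statistics

# Hypothetical sample: number of clinic visits for 7 patients
values = [2, 3, 3, 4, 5, 6, 12]

mean = statistics.mean(values)      # sum of all values / number of values
median = statistics.median(values)  # middle observation after sorting
mode = statistics.mode(values)      # most frequent observation

print("mean:", mean, "median:", median, "mode:", mode)
```

The three measures differ here (5, 4 and 3), which is one reason a single measure of centre alone does not describe the distribution.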
When to use mean, median and mode?
Mean
- preferred and widely used
- quantitative (interval & ratio) data
- considers both number of values and values themselves
- sensitive to extreme values (outliers)
Median
- considers number of values and rank order of values
- robust to outliers
- ordinal data, or quantitative data that are highly skewed
Mode
- seldom used
- nominal data
- consider only frequency
- robust to outliers, but ignores the number and rank order of values and all values other than the most frequent one
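Example (a small sketch with made-up numbers, assuming a single extreme value is added to an otherwise tight sample) of why the mean is called sensitive to outliers and the median robust:

```python
import statistics

# Hypothetical values (e.g. incomes in thousands)
incomes = [30, 32, 35, 38, 40]
with_outlier = incomes + [500]  # add one extreme value

mean_before = statistics.mean(incomes)
mean_after = statistics.mean(with_outlier)

median_before = statistics.median(incomes)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The mean moves from 35 to 112.5, while the median only moves from 35 to 36.5.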
2 misuses of descriptive statistics
Inappropriate choice of measure of central tendency
Reporting absolute values where percentages (relative values) are needed, or vice versa; need to consider the context and other information
What are measures of variability/dispersion of data? Give some of the respective equations/formulas too
Range: difference between the maximum and minimum values
Interquartile range: difference between the values at the 75th and 25th percentile points
Variance: average of squared deviations from the mean
- [sum (X-mean)^2]/(N - 1) = SS/(N - 1), where SS is the sum of squares, and N - 1 because of the reduced degree of freedom for a sample
- Why sum of squared deviations and not sum of deviations? Ans: The mean is the balance point of the data values, so the +ve deviations (values above the mean) and the -ve deviations (values below the mean) cancel each other out exactly. Therefore, sum of deviations would always be zero
Standard deviation: Square root of variance or Square root of average of squared deviations from the mean i.e. {[sum (X-mean)^2]/(N - 1)}^0.5 = [SS/(N - 1)]^0.5
Coefficient of variation (CV): measure of relative dispersion
- If 2 or more distributions to be compared are expressed in the same units and have similar means, then their variability can be compared directly by comparing their SD
- But if the means are very different, or the data are expressed in different units, then another method is needed
- e.g. mean of 1 (SD 1) vs mean of 100 (SD 1)
- Normalize SD to mean to compare 2 or more distributions
- CV = (SD/mean) x 100%
- No unit and expressed as a percentage
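Example (a minimal sketch of the formulas above in Python; the function names and data are hypothetical, chosen only for illustration). It checks that the deviations sum to zero, computes SS/(N - 1) and its square root, and reproduces the "mean of 1 (SD 1) vs mean of 100 (SD 1)" comparison with CV:

```python
import math

def sample_variance(values):
    """Variance = SS / (N - 1), where SS is the sum of squared deviations."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)
    return ss / (n - 1)

def sample_sd(values):
    """Standard deviation = square root of the variance."""
    return math.sqrt(sample_variance(values))

def coefficient_of_variation(values):
    """CV = (SD / mean) x 100, expressed as a percentage."""
    mean = sum(values) / len(values)
    return sample_sd(values) / mean * 100

# Hypothetical data with mean 5
values = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(values) / len(values)

# Plain deviations cancel out, which is why they are squared first
sum_of_deviations = sum(x - mean for x in values)
print("sum of deviations:", sum_of_deviations)
print("variance:", sample_variance(values))
print("SD:", sample_sd(values))

# Same SD (1) but very different means (1 vs 100)
small = [0, 1, 2]
large = [99, 100, 101]
print("CV small:", coefficient_of_variation(small), "%")
print("CV large:", coefficient_of_variation(large), "%")
```

Both small and large have SD = 1, but the CV is 100% for the first and 1% for the second, so the relative dispersion is very different.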
When to use range, interquartile range, variance and standard deviation?
Range
- Only considers 2 points of data but easy to calculate
- Not often provided but sensitive to extreme outliers, therefore gives an idea of the presence of outliers when used with mean and SD
- Max and min can be reported instead, and this is preferred
IQR
- Considers only 2 points of data (the 25th and 75th percentiles), which bound the middle 50% of data
- Used when data are ordinal, or when quantitative data are highly skewed; analogous to median
Variance
- Accurate and detailed estimate of variability
- most commonly used measure of “spread” for statistical calculations but usually not reported as a statistic for variability
Standard Deviation
- Accurate and detailed estimate of variability
- Preferred and commonly reported because same unit as mean unlike that of variance
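Example (a rough sketch with hypothetical data) comparing range and IQR when one extreme value is present. Note that several conventions exist for computing percentiles; the "inclusive" method of Python's `statistics.quantiles` is used here as an assumption, and other software may give slightly different quartiles:

```python
import statistics

def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

def iqr(values):
    """IQR = value at 75th percentile - value at 25th percentile."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical scores, then the same scores with one extreme value
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
scores_with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]

print("range:", value_range(scores), "->", value_range(scores_with_outlier))
print("IQR:  ", iqr(scores), "->", iqr(scores_with_outlier))
```

The range jumps from 8 to 89, while the IQR stays at 4, which is why IQR pairs naturally with the median for skewed data.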
For quantitative, ordinal and nominal variables, which measures of central tendency and measures of variability should we use?
Quantitative: mean & SD or median (when data is highly skewed) & IQR
Ordinal: median & IQR
Nominal: mode (& NIL, i.e. no measure of variability)
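Example (a sketch only; the function `summarize`, its parameters and the sample data are hypothetical) of encoding the pairing above. Whether data are "highly skewed" is left to the caller, since no cut-off is given here, and ordinal values are assumed to be coded as numbers:

```python
import statistics

def summarize(values, level, highly_skewed=False):
    """Pick measures of centre and spread based on the type of variable."""
    if level == "nominal":
        # Mode only; no measure of variability
        return {"centre": ("mode", statistics.mode(values)), "spread": None}
    if level == "ordinal" or (level == "quantitative" and highly_skewed):
        # Quartile convention is an assumption (see statistics.quantiles)
        q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
        return {"centre": ("median", statistics.median(values)),
                "spread": ("IQR", q3 - q1)}
    if level == "quantitative":
        return {"centre": ("mean", statistics.mean(values)),
                "spread": ("SD", statistics.stdev(values))}
    raise ValueError(f"unknown level of measurement: {level}")

blood_groups = ["A", "B", "B", "O"]            # nominal
pain_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]      # ordinal, coded as numbers
heights_minus_170 = [0, 1, 2]                  # quantitative

print(summarize(blood_groups, "nominal"))
print(summarize(pain_scores, "ordinal"))
print(summarize(heights_minus_170, "quantitative"))
```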
What are graphical methods?
Help to visualize the 5 key characteristics of data distribution
Include
- histogram
- box plot
- scatter plot
- stem and leaf plot
Very important though often not reported in literature
Not the same as graphs presented in results and discussion in inferential statistics
(What are) histograms?
Commonly used for quantitative variables (for qualitative variables, the analogous plot is a bar chart)
Constructed from simple or grouped frequency distribution
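Example (a minimal text-only sketch; the ages and the class width of 10 are hypothetical choices) of building a grouped frequency distribution and drawing it as a histogram:

```python
from collections import Counter

# Hypothetical ages of 12 participants
ages = [21, 23, 25, 28, 31, 34, 35, 36, 38, 42, 47, 55]
bin_width = 10  # assumed class interval width

# Grouped frequency distribution: lower bound of each class -> count
freq = Counter((age // bin_width) * bin_width for age in ages)

# One row per class interval, one star per observation
for lower in sorted(freq):
    upper = lower + bin_width - 1
    print(f"{lower}-{upper} | {'*' * freq[lower]}")
```

Each row is one class interval, and the bar length is its frequency; centre, spread, shape, gaps and outliers can be read off the same picture.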