Chapter 3 Flashcards
Definition of an Outlier
An outlier is a value that lies significantly outside the pattern of the data.
Standard formula to detect outliers:
Outlier < Q₁ - 1.5(Q₃ - Q₁)
or
Outlier > Q₃ + 1.5(Q₃ - Q₁)
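Worked sketch of the IQR rule above, in Python. The data values are made up, and the quartiles use linear interpolation (the "inclusive" method of the standard library's statistics.quantiles), which is an assumption: your textbook's quartile rule may give slightly different values.

```python
from statistics import quantiles

def iqr_outliers(data):
    """Return (outliers, (lower_fence, upper_fence)) using the 1.5 x IQR rule."""
    # Assumption: quartiles found by linear interpolation ("inclusive" method).
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return outliers, (lower_fence, upper_fence)

values = [2, 4, 5, 6, 7, 8, 9, 10, 30]  # made-up data for illustration
outliers, fences = iqr_outliers(values)
print(outliers, fences)  # [30] (-1.0, 15.0)
```

Here Q₁ = 5 and Q₃ = 9, so the fences are 5 - 1.5(4) = -1 and 9 + 1.5(4) = 15, and only 30 is flagged.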
Alternative method (mean 𝑥̄ and standard deviation σ):
Outlier < 𝑥̄ - 2σ
or
Outlier > 𝑥̄ + 2σ
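Worked sketch of the mean and standard deviation rule, in Python. The data values are made up, and σ is taken to be the population standard deviation (statistics.pstdev), which is an assumption about the intended definition.

```python
from statistics import mean, pstdev

def sd_outliers(data, k=2):
    """Return values more than k standard deviations from the mean."""
    centre = mean(data)
    sigma = pstdev(data)  # assumption: population standard deviation
    lower = centre - k * sigma
    upper = centre + k * sigma
    return [x for x in data if x < lower or x > upper]

values = [2, 4, 5, 6, 7, 8, 9, 10, 30]  # made-up data for illustration
flagged = sd_outliers(values)
print(mean(values), round(pstdev(values), 2), flagged)  # 9 7.79 [30]
```

With mean 9 and σ ≈ 7.79, the limits are roughly -6.58 and 24.58, so only 30 is flagged.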
Removing Anomalies (Cleaning Data)
The process of removing errors or irrelevant data is called cleaning the data.
Steps in cleaning data:
Identify outliers using IQR or standard deviation.
Decide whether to keep or remove:
Keep if valid.
Remove if an error.
Justify your decision.
Box Plots
A box plot represents:
Minimum (non-outlier)
Lower Quartile (Q₁)
Median (Q₂)
Upper Quartile (Q₃)
Maximum (non-outlier)
Outliers plotted separately
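A minimal Python sketch of the values a box plot needs. It assumes the 1.5 × IQR rule for outliers and interpolated quartiles (the "inclusive" method of statistics.quantiles); the function name and data are made up for illustration.

```python
from statistics import quantiles

def box_plot_summary(data):
    """Return the values needed to draw a box plot with outliers shown separately."""
    # Assumption: quartiles by linear interpolation, outliers by the 1.5 x IQR rule.
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "min": min(inside),      # end of lower whisker (smallest non-outlier)
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(inside),      # end of upper whisker (largest non-outlier)
        "outliers": sorted(x for x in data if x < lower_fence or x > upper_fence),
    }

summary = box_plot_summary([2, 4, 5, 6, 7, 8, 9, 10, 30])
print(summary)
```

For this data the whiskers run from 2 to 10, the box from 5 to 9 with the median at 7, and 30 is marked as a separate point.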
Comparing Box Plots
Compare the medians (center of data).
Compare the IQRs (spread of middle 50%).
Check for skewness.
Look for outliers.
Cumulative Frequency & Quartiles
A cumulative frequency graph helps estimate (n = total frequency):
Median at n/2
Lower Quartile (Q₁) at n/4
Upper Quartile (Q₃) at 3n/4
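Reading a value off a cumulative frequency graph can be imitated with linear interpolation. The sketch below is in Python; the class boundaries, cumulative frequencies and function name are made up, and it assumes the graph is drawn as straight lines between plotted points.

```python
def estimate_from_cf(boundaries, cum_freq, target):
    """Estimate the data value at a given cumulative frequency.

    boundaries: class boundaries, starting with the lowest lower boundary
    cum_freq:   cumulative frequency at each boundary (the first is 0)
    """
    if not 0 < target <= cum_freq[-1]:
        raise ValueError("target must be between 0 and the total frequency")
    for i in range(1, len(boundaries)):
        if cum_freq[i] >= target:
            lo_x, hi_x = boundaries[i - 1], boundaries[i]
            lo_f, hi_f = cum_freq[i - 1], cum_freq[i]
            # Straight-line interpolation inside the class containing the target
            return lo_x + (target - lo_f) / (hi_f - lo_f) * (hi_x - lo_x)

boundaries = [0, 10, 20, 30, 40]   # made-up class boundaries
cum_freq = [0, 8, 20, 34, 40]      # made-up cumulative frequencies
n = cum_freq[-1]

q1 = estimate_from_cf(boundaries, cum_freq, n / 4)
median = estimate_from_cf(boundaries, cum_freq, n / 2)
q3 = estimate_from_cf(boundaries, cum_freq, 3 * n / 4)
print(round(q1, 2), round(median, 2), round(q3, 2))  # 11.67 20.0 27.14
```

For example, Q₁ sits at cumulative frequency 10, which is 2 of the 12 values into the 10 to 20 class, giving 10 + (2 ÷ 12) × 10 ≈ 11.67.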
Interquartile Range (IQR)
IQR = Q₃ - Q₁
Measures spread of middle 50% of data.
Not affected by outliers, because it ignores the lowest 25% and highest 25% of values.
Percentile Range
10th to 90th percentile range:
P₉₀ - P₁₀
Ignores the lowest 10% and highest 10% of values, so it is less affected by extreme values than the range.
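A short Python sketch comparing the 10th to 90th percentile range with the full range. The data is made up, and the percentiles are found by linear interpolation (the "inclusive" method of statistics.quantiles), which is an assumption about the method.

```python
from statistics import quantiles

data = [10, 12, 13, 15, 16, 18, 19, 21, 22, 24, 60]  # made-up data, one extreme value

# n=10 gives the nine cut points P10, P20, ..., P90
deciles = quantiles(data, n=10, method="inclusive")
p10, p90 = deciles[0], deciles[-1]

percentile_range = p90 - p10
full_range = max(data) - min(data)
print(p10, p90, percentile_range, full_range)  # 12.0 24.0 12.0 50
```

The single extreme value 60 stretches the range to 50, while the 10th to 90th percentile range stays at 12.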
Histograms
A histogram represents continuous data.
Formula for Frequency Density:
Frequency Density = Frequency ÷ Class Width
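Worked sketch of the frequency density formula, in Python. The classes and frequencies are made up for illustration.

```python
# Each class is (lower boundary, upper boundary, frequency); made-up data
classes = [(0, 10, 8), (10, 20, 12), (20, 30, 14), (30, 50, 6)]

densities = []
for lower, upper, freq in classes:
    class_width = upper - lower
    densities.append(freq / class_width)  # frequency density = frequency / class width

print(densities)  # [0.8, 1.2, 1.4, 0.3]
```

The last class is twice as wide as the others, so its bar is drawn at height 6 ÷ 20 = 0.3 rather than 0.6.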
Properties of Histograms
No gaps between bars.
Area of bar is proportional to frequency.
Used for continuous data.
Estimating from a Histogram
Find the total area of the bars, which represents the total frequency.
Estimate the frequency in an interval from the area of the bars (or parts of bars) over that interval.
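A Python sketch of estimating a frequency from part of a histogram. It assumes area equals frequency exactly (scale factor 1) and that values are spread evenly within each class; the classes and function name are made up.

```python
def estimate_frequency(classes, a, b):
    """Estimate how many values lie between a and b from grouped data.

    classes: list of (lower boundary, upper boundary, frequency)
    Assumes area = frequency and values evenly spread within each class.
    """
    total = 0.0
    for lower, upper, freq in classes:
        density = freq / (upper - lower)           # bar height
        overlap = min(b, upper) - max(a, lower)    # width of bar inside [a, b]
        if overlap > 0:
            total += density * overlap             # area of that part of the bar
    return total

classes = [(0, 10, 8), (10, 20, 12), (20, 30, 14), (30, 50, 6)]  # made-up data
estimate = estimate_frequency(classes, 15, 35)
print(round(estimate, 1))  # 21.5
```

Between 15 and 35 the areas are 5 × 1.2 + 10 × 1.4 + 5 × 0.3 = 6 + 14 + 1.5 = 21.5.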
Comparing Histograms
When comparing histograms:
Compare areas (frequencies), not just bar heights, because the height of a bar is its frequency density.
Compare spread.
Compare peak values.
Skewness
Symmetric (mean ≈ median ≈ mode)
Positive skew (mean > median > mode)
Negative skew (mode > median > mean)
One formula for skewness (positive result = positive skew, negative result = negative skew):
Skewness = 3 × (𝑥̄ - Median) ÷ σ
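Worked sketch of the skewness formula above, in Python. The data is made up, and σ is taken to be the population standard deviation (statistics.pstdev), which is an assumption.

```python
from statistics import mean, median, pstdev

def skewness(data):
    """Skewness = 3 x (mean - median) / standard deviation."""
    # Assumption: sigma is the population standard deviation
    return 3 * (mean(data) - median(data)) / pstdev(data)

values = [2, 4, 5, 6, 7, 8, 9, 10, 30]  # made-up data with a long right tail
skew = skewness(values)
print(round(skew, 2))  # 0.77
```

The mean (9) is above the median (7), so the result is positive, which matches positive skew.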
Comparing Distributions
Compare location (median, mean).
Compare spread (IQR, range).
Look for skewness.
Comment on outliers.
Exam Tips for Data Representation
Use correct formulas.
Check for outliers.
Use linear interpolation to estimate quartiles from grouped data.
For histograms, use frequency density.
Use box plots for comparisons.
Cumulative frequency graphs estimate medians and quartiles.