Outliers Flashcards
An _ is a data point that differs significantly from other values in a dataset.
Outlier
A single data point that significantly deviates from the rest of the dataset.
Point Outlier
This type of outlier is an isolated data point that is far away from the main body of the data.
Global outlier
It is often easy to identify and remove.
This type of outlier is a data point that is unusual in a specific context but may not be an outlier in a different context.
Contextual outlier
It is often more difficult to identify and may require additional information or domain knowledge to determine its significance.
What causes outliers?
- variability in data
- measurement errors
- novel phenomena
Why is it important to identify and handle outliers?
Outliers can skew results and affect the performance of machine learning models. By identifying and removing or handling outliers effectively, we can prevent them from biasing the model, reducing its performance, and hindering its interpretability.
The analysis of outlier data is referred to as _.
Outlier Analysis or Outlier Mining
_ plays a crucial role in ensuring the quality and accuracy of machine learning models.
Outlier detection
What are some techniques to detect outliers?
- statistical tests
1. Z-Score
2. Interquartile Range (IQR)
- visualization techniques
1. box plots
2. histograms
- machine learning algorithms
What are some ways to handle outliers?
- removing outliers
- transforming data
- using models that are robust to outliers
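The three options above can be sketched on a tiny made-up dataset. The values and the `cutoff` threshold are assumptions for illustration only; in practice the cutoff would come from a detection step such as z-scores or the IQR.

```python
import math
import statistics

values = [12, 15, 14, 13, 16, 15, 14, 250]  # made-up data; 250 is the extreme value
cutoff = 100  # hypothetical threshold, assumed to come from a prior detection step

# Option 1: remove the outliers.
removed = [v for v in values if v <= cutoff]

# Option 2: transform the data (a log compresses large values).
logged = [math.log10(v) for v in values]

# Option 3: summarize with a statistic that is robust to outliers.
center_mean = statistics.mean(values)      # pulled upward by 250
center_median = statistics.median(values)  # barely affected by 250

print(removed, round(max(logged), 2), center_mean, center_median)
```

Which option fits depends on whether the extreme value is an error or a genuine observation.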
A _ is relatively unaffected by extreme values, such as the median.
resistant statistic
A statistic is resistant if it is relatively unaffected by extreme values.
Which statistic is resistant, the mean or the median?
The median (middle value) is resistant while the mean (average) is not.
df.cgpa.mean()
Example: World Gross (in millions)
With Harry Potter
Mean = $150,742,300
Median = $76,658,500
Without Harry Potter
Mean = $141,889,900
Median = $75,009,000
What should you do if an outlier is not a mistake?
Run the analysis twice: once with the outlier and once without, to assess its impact.
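A minimal sketch of running the same summary with and without a suspect value. The numbers are made up (they are not the movie data above), with 950 playing the role of the blockbuster.

```python
import statistics

# Made-up values in millions of dollars; 950 is the suspect value.
grosses = [40, 55, 60, 75, 80, 95, 110, 950]
without_outlier = [g for g in grosses if g != 950]

mean_with = statistics.mean(grosses)
mean_without = statistics.mean(without_outlier)
median_with = statistics.median(grosses)
median_without = statistics.median(without_outlier)

# How far each statistic moves when the suspect value is dropped.
mean_shift = mean_with - mean_without
median_shift = median_with - median_without
print(f"mean shift: {mean_shift:.1f}, median shift: {median_shift:.1f}")
```

The mean moves by more than 100 while the median moves by 2.5, which is what "resistant" means in practice.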
This measures the spread of the data from the mean.
Standard deviation
df.cgpa.std()
Sample standard deviation: s
Population standard deviation: σ (“sigma”)
What does a larger standard deviation indicate?
A larger standard deviation indicates more variability and that the data are more spread out.
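To illustrate, here are two made-up datasets with the same mean but very different spread. `statistics.stdev` is the sample standard deviation (s) and `statistics.pstdev` is the population version (σ).

```python
import statistics

tight = [48, 49, 50, 51, 52]     # values close to the mean
spread = [10, 30, 50, 70, 90]    # same mean, values far from it

s_tight = statistics.stdev(tight)      # sample standard deviation (divides by n - 1)
s_spread = statistics.stdev(spread)
sigma_spread = statistics.pstdev(spread)  # population standard deviation (divides by n)

print(statistics.mean(tight), statistics.mean(spread))
print(round(s_tight, 2), round(s_spread, 2), round(sigma_spread, 2))
```

Note that pandas `.std()` uses the sample formula by default.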
For a bell-shaped distribution, about _ of the data falls within two standard deviations of the mean.
95%
For a population, about 95% of the data will be between µ – 2σ and µ + 2σ
σ = sigma symbol
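The 95% rule can be checked with a quick simulation. This is only a sketch: the mean of 100, standard deviation of 15, sample size, and seed are arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
data = [random.gauss(100, 15) for _ in range(10_000)]  # simulated bell-shaped data

mu = statistics.mean(data)
sigma = statistics.pstdev(data)

# Count values between mu - 2*sigma and mu + 2*sigma.
within = sum(mu - 2 * sigma <= x <= mu + 2 * sigma for x in data)
share = within / len(data)
print(f"{share:.1%} of values lie within two standard deviations")
```

The share should land close to 95%; it will not be exact because the data are random.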
A _ indicates how many standard deviations a value is from the mean.
Z-score
Z = (X - mean) / Standard Deviation
For a population, x̄ is replaced with µ and s is replaced with σ
σ = sigma symbol
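A worked example of the z-score formula, using made-up CGPA values and the sample standard deviation s.

```python
import statistics

scores = [6.2, 6.8, 7.0, 7.1, 7.4, 7.9, 8.3]  # made-up CGPA values
x = 8.3                                        # the value to standardize

mean = statistics.mean(scores)
s = statistics.stdev(scores)   # sample standard deviation
z = (x - mean) / s             # standard deviations above (+) or below (-) the mean

print(f"z = {z:.2f}")
```

Here z is about 1.5, so 8.3 sits roughly one and a half standard deviations above the mean.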
Remove outliers using z-score
from scipy import stats
# Add a z-score column, then keep only rows within 3 standard deviations of the mean
df['cgpa_zscore'] = stats.zscore(df.cgpa)
df[(df.cgpa_zscore > -3) & (df.cgpa_zscore < 3)]
Advantages of Z-score
- Straightforward
- Easy to use
- Useful for normally distributed data
- Quantifies deviation
Disadvantages of Z-score
- Sensitive to non-uniform or skewed data distributions
- Not effective in datasets with many outliers
- Assumes data follows a normal distribution, making it less reliable for non-normal distributions
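One way these weaknesses show up: in a small dataset, an extreme value inflates the very mean and standard deviation used to judge it, so a 3-standard-deviation rule can miss it. A sketch with made-up numbers:

```python
import statistics

values = [10, 12, 11, 13, 12, 100]  # made-up data; 100 is clearly unusual

mu = statistics.mean(values)
sigma = statistics.pstdev(values)   # population formula (divides by n)
z_scores = [(v - mu) / sigma for v in values]

# Apply the usual "beyond 3 standard deviations" rule.
flagged = [v for v, z in zip(values, z_scores) if abs(z) > 3]
largest = max(abs(z) for z in z_scores)

print(flagged, round(largest, 2))
```

Nothing is flagged: the largest z-score stays below 3 even though 100 is far from every other value.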
What is considered an extreme z-score?
A z-score below -2 or above 2
The _ divide data into four equal parts.
Quartiles
Q1 is the median of the lower half, Q3 is the median of the upper half.
What is a five-number summary?
minimum (Min) = smallest data value
Q1 = median of the values below m
median (m) = middle data value
Q3 = median of the values above m
maximum (Max) = largest data value
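The five-number summary can be computed directly from these definitions. The helper name `five_number_summary` and the marks are made up for illustration. Library functions such as pandas `quantile` interpolate between values by default, so their quartiles may differ slightly from this median-of-halves rule.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Assumes at least two data values.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]            # values below the median position
    upper_half = ordered[half + n % 2:]    # skips the middle value when n is odd
    return (
        ordered[0],
        statistics.median(lower_half),
        statistics.median(ordered),
        statistics.median(upper_half),
        ordered[-1],
    )

marks = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]  # made-up exam marks
summary = five_number_summary(marks)
print(summary)  # (7, 36, 40.5, 43, 49)
```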
The _ is the value which is greater than P% of the data
Pth percentile
We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better.
We could also have used percentiles:
ACT score of 28: 91st percentile
SAT score of 2100: 97th percentile
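A percentile can be estimated by counting how much of the data lies below a value. The scores below are made up (they are not real ACT or SAT data), and `percentile_rank` is a hypothetical helper name.

```python
def percentile_rank(data, value):
    """Percent of data points strictly below `value`."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

# Made-up test scores for 20 students.
scores = [14, 17, 18, 19, 20, 21, 21, 22, 24, 25,
          26, 27, 28, 29, 30, 31, 32, 33, 34, 36]

rank = percentile_rank(scores, 34)
print(rank)  # 90.0
```

A score of 34 is greater than 18 of the 20 scores, so it sits at the 90th percentile of this small sample.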
The _ is Q3 - Q1, representing the middle 50% of the data.
Interquartile Range (IQR)
Remove outliers using IQR
q1 = df['placement-exam-marks'].quantile(0.25)
q3 = df['placement-exam-marks'].quantile(0.75)
iqr = q3 - q1
iqr
upper = q3 + (1.5 * iqr)
lower = q1 - (1.5 * iqr)
df[(df['placement-exam-marks'] < upper) & (df['placement-exam-marks'] > lower)]
Advantages of IQR
- Robust in non-normal distributions
- Less affected by extreme values or outliers
- Effective for skewed datasets
Disadvantages of IQR
- Will not detect outliers effectively in small datasets
- May miss outliers in the tails if the distribution is highly skewed
Is the IQR resistant to outliers?
Yes, the IQR is resistant to outliers.
How can outliers be identified using the IQR?
Outliers are data points smaller than Q1 − 1.5(IQR) or larger than Q3 + 1.5(IQR).
upper = q3 + (1.5 * iqr)
lower = q1 - (1.5 * iqr)
df[(df['placement-exam-marks'] < upper) & (df['placement-exam-marks'] > lower)]
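A self-contained version of the same rule, using only the standard library and made-up marks. `statistics.quantiles` with `method="inclusive"` interpolates linearly between data points, which should correspond to the default behavior of pandas `quantile`.

```python
import statistics

marks = [8, 30, 32, 33, 35, 36, 38, 40, 41, 95]  # made-up exam marks

# Quartiles by linear interpolation between data points.
q1, _, q3 = statistics.quantiles(marks, n=4, method="inclusive")
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # lower fence
upper = q3 + 1.5 * iqr   # upper fence

outliers = [m for m in marks if m < lower or m > upper]
kept = [m for m in marks if lower <= m <= upper]

print(q1, q3, lower, upper)
print(outliers)  # [8, 95]
```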
A _ visualizes the five-number summary and shows outliers as points beyond the whiskers.
Boxplot
import seaborn as sb
sb.boxplot(x=df['placement-exam-marks'])
Lines (“whiskers”) extend from each quartile to the most extreme value that is not an outlier
What is the relationship between range and outliers?
The range is not resistant to outliers because it depends entirely on extreme values.
Range = Max – Min
IQR = Q3 – Q1
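A short comparison of how the range and the IQR react to a single extreme value. The data are made up, and the quartiles use linear interpolation, which is one of several conventions.

```python
import statistics

def data_range(data):
    """Range = Max - Min."""
    return max(data) - min(data)

def iqr(data):
    """IQR = Q3 - Q1, with quartiles by linear interpolation."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

base = [30, 32, 33, 35, 36, 38, 40, 41]   # made-up values
with_outlier = base + [95]                # same data plus one extreme value

range_change = data_range(with_outlier) - data_range(base)
iqr_change = iqr(with_outlier) - iqr(base)
print(range_change, iqr_change)  # 54 1.25
```

One added value stretches the range by 54 but changes the IQR by only 1.25.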
When should you use the mean and standard deviation?
Use the mean and standard deviation when you want to incorporate all data points into the analysis, but be cautious of outliers.
What percentile is a data value at if it is greater than 75% of the data?
The 75th percentile (or Q3)