Outliers Flashcards
An _ is a data point that differs significantly from other values in a dataset.
Outlier
A single data point that significantly deviates from the rest of the dataset.
Point Outlier
This type of outlier is an isolated data point that is far away from the main body of the data.
Global outlier
It is often easy to identify and remove.
This type of outlier is a data point that is unusual in a specific context but may not be an outlier in a different context.
Contextual outlier
It is often more difficult to identify and may require additional information or domain knowledge to determine its significance.
What causes outliers?
- variability in data
- measurement errors
- novel phenomena
Why is it important to identify and handle outliers?
Outliers can skew results and affect the performance of machine learning models. By identifying and removing or handling outliers effectively, we can prevent them from biasing the model, reducing its performance, and hindering its interpretability.
The analysis of outlier data is referred to as _.
Outlier Analysis or Outlier Mining
_ plays a crucial role in ensuring the quality and accuracy of machine learning models.
Outlier detection
What are some techniques to detect outliers?
- statistical tests
  1. Z-Score
  2. Interquartile Range (IQR)
- visualization techniques
  1. box plots
  2. histograms
- machine learning algorithms
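The two statistical tests named above can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation: the function names, the sample values, the z-score threshold, and the 1.5 multiplier on the IQR are assumptions chosen for the example (common conventions, but not stated in these cards).

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / sd > threshold]

def iqr_outliers(values, k=1.5):
    """Return values outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical CGPA-like values with one obvious extreme (12.5)
data = [6.8, 7.1, 7.0, 6.9, 7.2, 7.3, 6.7, 7.0, 7.1, 12.5]

flagged_by_iqr = iqr_outliers(data)
flagged_by_z = zscore_outliers(data, threshold=2.5)
print(flagged_by_iqr, flagged_by_z)
```

Note that in a sample this small the extreme value inflates the standard deviation itself, so its z-score stays below 3; that is why the example call lowers the threshold, and it is one reason the IQR rule is often preferred for small datasets.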
What are some ways to handle outliers?
- removing outliers
- transforming data
- using models that are robust to outliers
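The three options above might look like this on a small list. It is only a sketch under stated assumptions: the values are hypothetical, the cutoff of 100.0 is an arbitrary choice for the example, the log transform is one of many possible transformations, and the median stands in for a robust method.

```python
import math
import statistics

# Hypothetical measurements; 250.0 is the extreme value
values = [12.0, 15.0, 14.0, 13.0, 16.0, 250.0]

# 1. Removing: drop values beyond a chosen cutoff (the cutoff is an assumption)
cutoff = 100.0
trimmed = [v for v in values if v <= cutoff]

# 2. Transforming: a log transform compresses large values toward the rest
logged = [math.log10(v) for v in values]

# 3. Robust method: the median is barely moved by the extreme value
center = statistics.median(values)

print(trimmed)
print([round(x, 2) for x in logged])
print(center)
```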
A _, such as the median, is relatively unaffected by extreme values.
resistant statistic
A statistic is resistant if it is relatively unaffected by extreme values.
Which statistic is resistant, the mean or the median?
The median (middle value) is resistant while the mean (average) is not.
df.cgpa.mean()
Example: World Gross (in millions)
With Harry Potter
Mean = $150,742,300
Median = $76,658,500
Without Harry Potter
Mean = $141,889,900
Median = $75,009,000
What should you do if an outlier is not a mistake?
Run the analysis twice: once with the outlier and once without, to assess its impact.
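Running the analysis twice can be as simple as computing the same summary on both versions of the data and comparing. The sketch below uses hypothetical values (not the movie data from the example above), and the helper name `summarize` is made up for illustration.

```python
import statistics

def summarize(values):
    """Return the mean and median of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }

# Hypothetical values; 950 plays the role of a legitimate but extreme point
values = [60, 70, 75, 80, 90, 950]
suspect = 950

with_outlier = summarize(values)
without_outlier = summarize([v for v in values if v != suspect])

# How much each statistic moves when the suspect point is dropped
mean_shift = with_outlier["mean"] - without_outlier["mean"]
median_shift = with_outlier["median"] - without_outlier["median"]
print(f"mean shift: {mean_shift:.2f}, median shift: {median_shift:.2f}")
```

The mean moves far more than the median, which is the same pattern the resistant-statistic cards describe.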
This measures how far the data are spread out from the mean.
Standard deviation
df.cgpa.std()
Sample standard deviation: s
Population standard deviation: σ (“sigma”)
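The difference between the two is the divisor: the sample version divides the squared deviations by n - 1, the population version by n. A small sketch with the standard library, using hypothetical values; note that pandas' `std()` (as in `df.cgpa.std()`) computes the sample version by default.

```python
import statistics

# Hypothetical values with mean 5
data = [2, 4, 4, 4, 5, 5, 7, 9]

s = statistics.stdev(data)       # sample standard deviation, divides by n - 1
sigma = statistics.pstdev(data)  # population standard deviation, divides by n

print(f"s = {s:.3f}, sigma = {sigma:.3f}")
```

For the same data, s is always at least as large as the population value, and the gap shrinks as n grows.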
What does a larger standard deviation indicate?
A larger standard deviation indicates more variability and that the data are more spread out.