Lecture 7 - Outlier Detection, Feature Selection, Similar Items, Recommender Systems, Naive Bayes Classifiers, Class Imbalance Flashcards
What is an outlier?
An outlier is a data object that deviates significantly from normal objects as if it were generated by a different mechanism.
What can cause outliers?
- Measurement errors
- Data entry errors
- Contamination of data from different sources
- Rare events
True or false: You should always try to remove outliers because it makes the machine learning algorithm better.
FALSE: If the number of outliers is small, it is generally okay to remove them.
However, if the number is large, you have to consider whether the outliers mean something before deciding whether it is okay to remove them.
What are some different methods to detect outliers?
- Model-based
- Graphical Approaches
- Cluster-based
- Distance-based
- Supervised-learning
In model based outlier detection, in broad terms, how do we detect outliers?
- We fit a probabilistic model
- Outliers are cases with low probability
Example:
- Assume data follows normal distribution
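- The z-score for 1D data is given by: z_i = (x_i - μ) / σ, where μ is the mean and σ is the standard deviation; points with a large |z_i| have low probability under the model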
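A minimal sketch of the z-score idea above, assuming 1D data and the usual definition z = (x - mean) / std. The function name `zscore_outliers`, the sample data, and the cutoff of 3 are illustrative assumptions, not part of the lecture.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return (value, z) pairs whose |z| exceeds the threshold.

    threshold=3.0 is a common rule of thumb, used here as an assumption.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing can be flagged
    flagged = []
    for x in values:
        z = (x - mu) / sigma
        if abs(z) > threshold:
            flagged.append((x, z))
    return flagged

# Made-up data: twelve values near 10 and one extreme value.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 50]
flagged = zscore_outliers(data)
print(flagged)
```

Note that the extreme value itself inflates the mean and standard deviation, so in small samples a z-score can never get very large; this is one reason the normality assumption should be checked before relying on this test.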
What is the difference between a global and a local outlier?
A data point is a global outlier when it is out of the normal data range (i.e., most points are in a data range, and this data point is far out).
Now suppose we have two clusters of points and a single isolated point between them. We call this a local outlier: it lies within the normal data range (there are data points on both sides of it), but it cannot reasonably be assigned to either cluster.
Explain the approach of the Graphical Outlier Detection
We plot the data and look for unusual points.
A human decides whether a data point is an outlier.
Can outliers be represented by groups?
Yes, they can. But remember: if the group has a relatively large number of data points, they might not be outliers. They might instead describe an “unusual” event that is worth capturing in the data (so do not remove them).
What are some graphical representations that we can use to detect outliers by eye?
Boxplot - plot one variable at a time (outliers show up as isolated points beyond the whiskers; you can also analyze the interquartile range)
Scatterplot - plot two variables at a time (is able to capture more complex patterns)
How does cluster-based outlier detection work?
What are the main algorithms that can help in doing that?
Cluster data and then find points that do not belong to any of the clusters.
- K-means: find points that are far from their cluster mean (even though they have been assigned to that cluster), or find clusters that contain only a small number of data points
- Density-based clustering: outliers are the points that have not been assigned to any cluster
- Hierarchical clustering: outliers take longer to be assigned to a group
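The K-means variant above can be sketched as follows. This is a minimal illustration, not a production implementation: the helper `kmeans`, the fixed starting centroids, and the sample points are all assumptions chosen so that the run is deterministic.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd iterations from the given starting centroids."""
    centroids = list(centroids)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Made-up data: two tight groups plus one point far from both.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20)]
centroids, labels = kmeans(points, centroids=[(0, 0), (10, 10)])

# Outlier score: distance from each point to the mean of its own cluster.
dist_to_mean = [math.dist(p, centroids[lab]) for p, lab in zip(points, labels)]
farthest = points[max(range(len(points)), key=lambda i: dist_to_mean[i])]
print(farthest)
```

The far point is still assigned to a cluster and pulls that cluster's mean toward itself, which is why the score is the distance to the mean rather than cluster membership.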
How does distance-based outlier detection work? (KNN outlier detection)
For each data point, compute the average distance to its k nearest neighbors (KNN).
Sort the set of N average distances.
Choose the biggest values as outliers.
Note: KNN-based detection is reported to work well for detecting global outliers.
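The three steps above can be sketched in plain Python. The function name `knn_outlier_scores`, the choice k=2, the sample points, and the number of outliers to keep are illustrative assumptions.

```python
import math

def knn_outlier_scores(points, k=2):
    """Average Euclidean distance from each point to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Made-up data: four points on a unit square and one point far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(points, k=2)

# Sort by score, largest first, and keep the top n_outliers.
n_outliers = 1
ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
outliers = [points[i] for i in ranked[:n_outliers]]
print(outliers)
```

This brute-force version compares every pair of points, so it is only practical for small datasets.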
How does supervised outlier detection work?
You start from a training dataset that has a label column saying whether each example x is an outlier or not, and you fit a classifier on it.
You can then use the fitted model to detect outliers in new data.
When you want to find outliers graphically, you can analyze the IQR (interquartile range) in a box plot.
What is special about this IQR when the dataset has outliers?
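The IQR only depends on the middle 50% of the data, so it is barely affected by extreme values. When a dataset has outliers, the IQR can therefore still summarize the variability of the data, whereas the range or the standard deviation gets inflated.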
IQR = Q3 - Q1
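A small sketch of the IQR in use. The 1.5 × IQR fences are the usual boxplot convention and are an assumption here, as are the function name `iqr_fences` and the sample data.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return (lower fence, upper fence, IQR) using the k * IQR boxplot rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr, iqr

# Made-up data: eight small values and one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
low, high, iqr = iqr_fences(data)
outliers = [x for x in data if x < low or x > high]

print("IQR:", iqr)                       # driven by the middle of the data
print("range:", max(data) - min(data))   # dominated by the extreme value
print("outliers:", outliers)
```

Comparing the IQR with the range on the same data shows the point of the flashcard: the single extreme value changes the range drastically but leaves the IQR almost untouched.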
What are the advantages and disadvantages to supervised outlier detection?
Advantage:
- Can find very complicated outlier patterns
Disadvantages:
- Is supervised, i.e. needs a column labeled “outlier”
- Cannot detect new “types” of outliers
How can we define the process of “Feature Selection”?
Feature selection works by selecting the features that are “relevant” for predicting the target variable (i.e., variables that have a strong relationship with the target variable).
Name the different approaches of feature selection?
- Association approach
- Regression weight approach
- Search and Score methods
- Forward Selection
- Backward Selection
- Recursive Feature Elimination
Explain how the Association approach of feature selection works.
- For each feature compute the correlation between the feature values and the target value y
- Say that the feature is relevant if the correlation is above a positive threshold or below a negative one (for example, above 0.9 or below -0.9, i.e. |correlation| > 0.9)
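A minimal sketch of the association approach, assuming Pearson correlation as the measure of association. The helper names `pearson` and `select_by_correlation`, the toy features, and the 0.9 threshold are illustrative assumptions.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treated as uncorrelated (an assumption)
    return cov / (sx * sy)

def select_by_correlation(features, y, threshold=0.9):
    """Keep features whose correlation with y is strong in either direction."""
    selected = {}
    for name, column in features.items():
        r = pearson(column, y)
        if abs(r) >= threshold:
            selected[name] = r
    return selected

# Made-up data: one target and three candidate features.
y = [1, 2, 3, 4, 5]
features = {
    "x_pos": [2, 4, 6, 8, 10],    # rises with y
    "x_neg": [5, 4, 3, 2, 1],     # falls as y rises
    "x_noise": [3, 1, 4, 1, 5],   # only weakly related to y
}
selected = select_by_correlation(features, y)
print(sorted(selected))
```

Because each feature is scored on its own, this approach can miss features that are only useful in combination with others.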
True or False. The Association approach in feature selection is basically a sequential “hypothesis testing” process of the correlation between the variables.
True