outliers and text mining Flashcards
what are outliers
An outlier is a data object that deviates significantly from the normal objects in the data set
what are the types of outliers
Global outlier: a point that deviates significantly from the rest of the data (e.g. a very high or very low speed)
Contextual outlier: an object that deviates with respect to a selected context (e.g. is 30 °C an outlier? Is 40 km/h on the highway an outlier? The answer depends on the context)
Collective outlier: a subset of objects that collectively deviates from the rest of the data, even if the individual objects are not outliers
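The difference between a global and a contextual outlier can be sketched with a small example. The z-score rule, the threshold of 2 and the temperature readings below are assumptions made for illustration only; they are not part of the definitions above.

```python
from statistics import mean, pstdev

# Hypothetical temperature readings in °C, grouped by context (season).
readings = {
    "winter": [1.0, 3.0, 2.0, 0.0, 4.0],
    "summer": [28.0, 31.0, 30.0, 29.0, 32.0],
}

def z_score(value, sample):
    """Number of standard deviations between value and the sample mean."""
    return abs(value - mean(sample)) / pstdev(sample)

def is_global_outlier(value, threshold=2.0):
    """Compare the value against all readings, ignoring the context."""
    everything = [x for group in readings.values() for x in group]
    return z_score(value, everything) > threshold

def is_contextual_outlier(value, context, threshold=2.0):
    """Compare the value only against readings from the same context."""
    return z_score(value, readings[context]) > threshold

print(is_global_outlier(30.0))                # False
print(is_contextual_outlier(30.0, "winter"))  # True
print(is_contextual_outlier(30.0, "summer"))  # False
```

With this data, 30 °C is unremarkable over the whole year but stands out once the context is restricted to winter.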
what are the challenges that might be faced in outlier detection
modelling normal objects and outliers properly
application-specific outlier detection
handling noise in outlier detection
understandability of outliers
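explain briefly the distance-based outlier detection
Given a distance threshold r and a fraction threshold π, an object o is an outlier if at most a fraction π of the objects in the data set lie within distance r of o. In practice this can be checked through the distance between o and its k-th nearest neighbour, with k = ⌈π·n⌉: if that distance is greater than r, fewer than k objects lie in the r-neighbourhood of o and o is an outlier.
The downside of this approach is that a straightforward nested-loop implementation has O(n^2) time complexity in the worst case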
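A minimal sketch of the nested-loop check, assuming Euclidean distance and assuming that the object itself is not counted among its own neighbours. The function name and the sample points are hypothetical. The two nested passes over the data are where the quadratic cost comes from.

```python
from math import dist

def distance_based_outliers(points, r, pi):
    """Return the points with at most a fraction pi of the data within distance r."""
    n = len(points)
    outliers = []
    for i, o in enumerate(points):
        # Count the other objects inside the r-neighbourhood of o.
        neighbours = sum(
            1 for j, other in enumerate(points)
            if j != i and dist(o, other) <= r
        )
        if neighbours / n <= pi:
            outliers.append(o)
    return outliers

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(distance_based_outliers(points, r=2, pi=0.2))  # [(10, 10)]
```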
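explain briefly grid-based outlier detection
The data space is partitioned into a grid in which each cell is a hypercube with a diagonal length of r/2, so any two objects in the same cell or in adjacent (level-1) cells are at most r apart.
For a cell C, let a be the number of objects in C, b1 the number of objects in its level-1 cells and b2 the number of objects in its level-2 cells. Two pruning rules then avoid checking objects one by one:
Level-1 pruning: all objects in the cells marked as 1 are definitely neighbours of the objects in C, so if a + b1 > ⌈π·n⌉, no object in C is an outlier
Level-2 pruning: if a + b1 + b2 < ⌈π·n⌉ + 1, too few objects can lie within distance r, so all objects in C are outliers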
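The two pruning rules can be sketched as a single function. The names are hypothetical, the counts a, b1 and b2 are assumed to have been computed from the grid already, and rounding π·n up with ceil is an assumption for the case where π·n is not an integer.

```python
from math import ceil, sqrt

def cell_side_length(r, dims):
    """Side length of a hypercube cell whose diagonal is r / 2."""
    return r / (2 * sqrt(dims))

def prune_cell(a, b1, b2, pi, n):
    """Classify a whole cell with the two pruning rules.

    a  : number of objects in the cell itself
    b1 : number of objects in its level-1 cells
    b2 : number of objects in its level-2 cells
    """
    k = ceil(pi * n)
    if a + b1 > k:
        return "no outliers"       # level-1 rule
    if a + b1 + b2 < k + 1:
        return "all outliers"      # level-2 rule
    return "check each object"     # neither rule applies

print(prune_cell(4, 3, 0, pi=0.25, n=20))  # no outliers
print(prune_cell(1, 1, 2, pi=0.25, n=20))  # all outliers
print(prune_cell(1, 2, 5, pi=0.25, n=20))  # check each object
```

Only the cells that fall into the last case need an object-by-object distance check.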
explain density-based outlier detection
Outliers are judged relative to their local neighbourhood: the density around an outlier is significantly different from the density around its neighbours. The relative density of an object against its neighbours is therefore used as the indicator of the degree to which the object is an outlier.
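One way to make the relative density concrete is sketched below. It is a simplified score in the spirit of the local outlier factor, not the full LOF definition: density is assumed to be the inverse of the mean distance to the k nearest neighbours, all points are assumed to be distinct, and the function names and data are hypothetical.

```python
from math import dist

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i]."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: dist(points[i], points[j]))[:k]

def density(points, i, k):
    """Inverse of the mean distance from points[i] to its k nearest neighbours."""
    neighbours = knn(points, i, k)
    return k / sum(dist(points[i], points[j]) for j in neighbours)

def relative_density_score(points, i, k):
    """Mean density of the neighbours divided by the density of points[i]."""
    neighbours = knn(points, i, k)
    neighbour_density = sum(density(points, j, k) for j in neighbours) / k
    return neighbour_density / density(points, i, k)

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
for i, p in enumerate(points):
    print(p, round(relative_density_score(points, i, k=2), 2))
```

A score close to 1 means the object is about as dense as its neighbours; the isolated point (5, 5) gets a score well above 1.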
explain clustering-based outlier detection
An object is an outlier if
• (1) it does not belong to any cluster,
• (2) there is a large distance between the object and its closest cluster,
• or (3) it belongs to a small or sparse cluster
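The three criteria above can be sketched as follows, assuming a clustering has already been computed by some algorithm and is given as a list of clusters. The function names, the thresholds max_distance and min_size, and the use of centroids to measure the distance to a cluster are all assumptions for the example. An object that belongs to no cluster, as in criterion (1), is judged here by its distance to the closest large cluster.

```python
from math import dist

def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def is_cluster_outlier(point, clusters, max_distance, min_size):
    """Decide whether point is an outlier, given a finished clustering."""
    # Criterion (3): the point sits in a cluster that is too small.
    for cluster in clusters:
        if point in cluster and len(cluster) < min_size:
            return True
    # Criterion (2): the point is far from the closest large cluster.
    # Assumes at least one cluster has min_size points or more.
    large = [c for c in clusters if len(c) >= min_size]
    nearest = min(dist(point, centroid(c)) for c in large)
    return nearest > max_distance

clusters = [[(0, 0), (0, 1), (1, 0), (1, 1)], [(9, 9)]]
print(is_cluster_outlier((0, 0), clusters, max_distance=2, min_size=2))  # False
print(is_cluster_outlier((9, 9), clusters, max_distance=2, min_size=2))  # True
print(is_cluster_outlier((4, 4), clusters, max_distance=2, min_size=2))  # True
```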
what are the strengths and weaknesses of clustering-based methods
Strengths: they can detect outliers without requiring any labelled data and work well for many types of data. Weakness: their effectiveness depends highly on the clustering algorithm chosen.
explain classification-based outlier detection
Idea: train a classification model that can distinguish “normal” data from outliers
• Requires many abnormal samples.
• Abnormal samples might not cluster well.
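A minimal sketch of the idea, using a 1-nearest-neighbour classifier only because it fits in a few lines; any classifier could take its place. The function names, labels and training samples are hypothetical.

```python
from math import dist

def train(samples, labels):
    """'Training' a 1-NN model just stores the labelled samples."""
    return list(zip(samples, labels))

def predict(model, point):
    """Return the label of the stored sample closest to point."""
    _, label = min(model, key=lambda pair: dist(pair[0], point))
    return label

samples = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (-7, 3)]
labels = ["normal"] * 4 + ["outlier"] * 2
model = train(samples, labels)

print(predict(model, (0.5, 0.4)))  # normal
print(predict(model, (7, 9)))      # outlier
print(predict(model, (20, -20)))   # normal
```

The last call shows the first limitation listed above: (20, -20) is far from everything, but it is unlike any abnormal sample seen in training, so the model labels it as normal.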