Outlier Identification and Removal Flashcards
WHEN CAN WE USE STD OF SAMPLE AS CUT-OFF FOR IDENTIFYING OUTLIERS? P73
When the distribution is Gaussian or Gaussian-like, so that the 68-95-99.7 rule applies.
WHAT ARE THE CUT-OFF VALUES FOR OUTLIERS IN GAUSSIAN/GAUSSIAN-LIKE DISTRIBUTION? P73
Three standard deviations from the mean is a common cut-off in practice for identifying outliers in a Gaussian or Gaussian-like distribution. For smaller samples, a value of 2 standard deviations (about 95 percent of the data) can be used, and for larger samples, a value of 4 standard deviations (about 99.99 percent) can be used.
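The coverage behind each cut-off can be checked with a short calculation. The sketch below assumes an exactly Gaussian distribution, and the helper name coverage is made up for illustration; it uses the identity that the fraction of values within k standard deviations of the mean is erf(k / sqrt(2)).

```python
from math import erf, sqrt

def coverage(k):
    """Fraction of a Gaussian that lies within k standard deviations of the mean."""
    return erf(k / sqrt(2))

# Print the coverage for the cut-offs mentioned in the card.
for k in (1, 2, 3, 4):
    print(f"{k} std: {coverage(k) * 100:.2f} percent")
```

Under this assumption, 2, 3, and 4 standard deviations cover roughly 95.4, 99.7, and 99.99 percent of the data.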
HOW CAN WE COMPUTE CUT-OFF FOR OUTLIERS IN GAUSSIAN/GAUSSIAN-LIKE DISTRIBUTION? (code) P74
data_mean, data_std = mean(data), std(data)
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off
Values lower than lower or higher than upper are outliers.
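As a runnable illustration of the standard deviation method, here is a minimal sketch using NumPy. The synthetic data (mean 50, standard deviation 5) and the two appended extreme values are assumptions made for the example, not data from the book.

```python
import numpy as np

# Synthetic Gaussian sample (assumed for illustration), plus two obvious extremes.
rng = np.random.default_rng(1)
data = 5 * rng.standard_normal(10000) + 50
data = np.append(data, [100.0, 0.0])

# Cut-off at three standard deviations from the mean.
data_mean, data_std = data.mean(), data.std()
cut_off = data_std * 3
lower, upper = data_mean - cut_off, data_mean + cut_off

# Split the sample into outliers and the rows to keep.
is_outlier = (data < lower) | (data > upper)
outliers = data[is_outlier]
inliers = data[~is_outlier]
print(f"Identified outliers: {len(outliers)}")
print(f"Non-outlier observations: {len(inliers)}")
```

Replacing the factor 3 with 2 or 4 gives the tighter or looser cut-offs suggested for smaller or larger samples.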
WHAT IS A GOOD WAY FOR HANDLING OUTLIERS IN NON-GAUSSIAN DISTRIBUTED DATA MANUALLY? AND HOW IS IT CALCULATED?
Use the interquartile range (IQR), which is the 75th percentile minus the 25th percentile. Outliers are identified by defining limits on the sample values that are a factor k of the IQR below the 25th percentile or above the 75th percentile. The usual value of k is 1.5.
HOW TO FIND 25TH AND 75TH PERCENTILE? AND CALCULATE CUTOFF? CODE P76
q25, q75 = percentile(data, 25), percentile(data, 75)
iqr = q75 - q25
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off
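The same steps can be put together as a runnable sketch with NumPy. The skewed (exponential) sample and the appended extreme value 500.0 are assumptions chosen to show the method on non-Gaussian data.

```python
import numpy as np

# Skewed, non-Gaussian sample (assumed for illustration), plus one extreme value.
rng = np.random.default_rng(1)
data = rng.exponential(scale=10.0, size=1000)
data = np.append(data, [500.0])

# Interquartile range from the 25th and 75th percentiles.
q25, q75 = np.percentile(data, 25), np.percentile(data, 75)
iqr = q75 - q25

# Limits at k = 1.5 times the IQR beyond each quartile.
cut_off = iqr * 1.5
lower, upper = q25 - cut_off, q75 + cut_off

is_outlier = (data < lower) | (data > upper)
outliers = data[is_outlier]
inliers = data[~is_outlier]
print(f"Identified outliers: {len(outliers)}")
print(f"Non-outlier observations: {len(inliers)}")
```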
WITH WHAT CLASS IN SKLEARN CAN WE AUTOMATICALLY DETECT OUTLIERS? WHAT IS ITS WEAKNESS? HOW DOES IT WORK? P77 (WORKED EXAMPLE P78 P79)
LocalOutlierFactor (in sklearn.neighbors).
It works well for feature spaces with low dimensionality (few features), but it becomes less reliable as the number of features increases, an effect of the curse of dimensionality.
Each example is assigned a score of how isolated it is, or how likely it is to be an outlier, based on the size of its local neighborhood. Calling fit_predict() marks each row in the training dataset as normal (1) or an outlier (-1).
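A minimal sketch of automatic detection with LocalOutlierFactor follows. The two-feature synthetic dataset and the single far-away point are assumptions for illustration, and the model is left at its default settings; the worked example in the book may use a different dataset and pipeline.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic low-dimensional data (assumed), plus one clearly isolated row.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
X = np.vstack([X, [[10.0, 10.0]]])

# fit_predict returns 1 for normal rows and -1 for outliers.
lof = LocalOutlierFactor()
labels = lof.fit_predict(X)

# Keep only the rows that were not marked as outliers.
mask = labels != -1
X_clean = X[mask]
print(f"Rows before: {X.shape[0]}, rows after: {X_clean.shape[0]}")
```

When the data is split into train and test sets, a typical design choice is to fit and remove outliers on the training rows only, so the test set is left untouched for evaluation.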