Basic Data Preparation Flashcards
What is an outlier?
a data point that’s very different from the rest
point outlier
a value far from the rest of the data
contextual outlier
value isn’t far from the rest overall, but is far from points nearby in time
collective outlier
something is missing in a range of points, but can’t tell exactly where
box and whisker plot
visual to find outliers in one dimension
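A minimal sketch of the rule a box and whisker plot usually draws, assuming the common convention that whiskers extend 1.5 times the interquartile range beyond the quartiles (the flashcards do not state which convention is used). The function name `box_plot_outliers`, the parameter `k`, and the sample values are made up for illustration.

```python
import statistics

def box_plot_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR whisker rule."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                    # height of the box
    lower_fence = q1 - k * iqr       # assumed whisker limit below the box
    upper_fence = q3 + k * iqr       # assumed whisker limit above the box
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

# Hypothetical one-dimensional sample with one suspicious value.
sample = [10, 12, 11, 13, 12, 11, 14, 12, 13, 40]
low, high, flagged = box_plot_outliers(sample)
print(flagged)  # -> [40]
```

Points beyond the fences are only candidates; whether they are bad data or real data still needs investigating.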
another approach to outlier detection
fit an exponential smoothing model
-points with very large error might be outliers
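The idea can be sketched with simple (single) exponential smoothing: forecast each point from the smoothed level so far, then look for unusually large forecast errors. The smoothing constant `alpha`, the cutoff of 5 times the median absolute error, the function names, and the series itself are all assumptions chosen for illustration, not values from the flashcards.

```python
import statistics

def smoothing_errors(series, alpha=0.3):
    """One-step-ahead forecast errors from simple exponential smoothing."""
    level = series[0]      # initialise the level with the first observation
    errors = [0.0]         # the first point has no forecast to compare against
    for x in series[1:]:
        errors.append(x - level)                  # the forecast is the previous level
        level = alpha * x + (1 - alpha) * level   # update the smoothed level
    return errors

def flag_large_errors(series, alpha=0.3, k=5.0):
    """Indices whose absolute error exceeds k times the median absolute error."""
    abs_errors = [abs(e) for e in smoothing_errors(series, alpha)]
    scale = statistics.median(abs_errors[1:])     # skip the placeholder first error
    return [i for i, e in enumerate(abs_errors) if i > 0 and e > k * scale]

# Hypothetical series with one spike at index 7.
series = [20, 21, 19, 20, 22, 21, 20, 45, 21, 20, 19, 21]
flagged_idx = flag_large_errors(series)
print(flagged_idx)
```

A spike also pulls the smoothed level upward, so the errors right after it are inflated too. With a lower cutoff those following points would be flagged as well.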
What causes outliers?
-bad data: sensor failure, contaminated experiment, wrong data input
-or it could be real data
- need to investigate
Dealing with outliers
bad data - omit the data point, or use imputation (fill in a better value)
real data - outliers are expected in large data sets
ex: for normally distributed data, about 4.6% of points fall outside 2 standard deviations
-with 1 million data points, more than 2,000 (about 2,700) would be outside 3 standard deviations
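The tail fractions for a normal distribution can be checked directly. This short sketch uses the identity P(|Z| > k) = erfc(k / sqrt(2)) for a standard normal Z; the function name `fraction_outside` is made up for illustration.

```python
import math

def fraction_outside(k):
    """Two-sided tail probability P(|Z| > k) for a standard normal Z."""
    return math.erfc(k / math.sqrt(2))

print(round(fraction_outside(2) * 100, 2))     # percent outside 2 sd -> 4.55
print(round(fraction_outside(3) * 1_000_000))  # expected count in 1 million points -> 2700
```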
Dangers of dealing with outliers
-removing real outliers can make the model too optimistic
-ex:
time to transport perishable medicine from the US to Africa
-outlying data points: weather events or political issues… should these be included in your model?
alternative way to deal with outliers
logistic regression model - estimate the probability that an outlier occurs under different conditions
second model - estimate the length of delivery under normal conditions
-use data without outliers
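A rough sketch of the two-model idea, using only the standard library. Everything here is hypothetical: the toy records, the single `storm_season` condition flag, the hand-rolled gradient-descent fit in `fit_logistic`, and the use of a plain mean as a stand-in for the second model. A real analysis would use actual delivery data and a proper modelling library.

```python
import math
import statistics

# Hypothetical toy records: (storm_season flag, delivery days, is_outlier label).
records = [
    (0, 5.0, 0), (0, 6.0, 0), (0, 5.5, 0), (0, 6.5, 0), (0, 21.0, 1),
    (1, 6.0, 0), (1, 19.0, 1), (1, 5.0, 0), (1, 23.0, 1), (1, 20.0, 1),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, steps=5000):
    """Fit P(y=1 | x) = sigmoid(b0 + b1 * x) by plain gradient descent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            g0 += p - y            # gradient of the log-loss w.r.t. b0
            g1 += (p - y) * x      # gradient of the log-loss w.r.t. b1
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Model 1: probability that a delivery is an outlier under each condition.
b0, b1 = fit_logistic([r[0] for r in records], [r[2] for r in records])
p_clear = sigmoid(b0)        # estimated outlier probability outside storm season
p_storm = sigmoid(b0 + b1)   # estimated outlier probability during storm season

# Model 2: delivery length under normal conditions, fitted without the outliers.
# A mean is the simplest possible stand-in for this second model.
normal_days = statistics.mean(r[1] for r in records if r[2] == 0)

print(round(p_clear, 2), round(p_storm, 2), round(normal_days, 2))
```

Keeping the two estimates separate means the rare long delays are neither thrown away nor allowed to distort the estimate for a typical delivery.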
draw box and whisker plot and name the different parts of it