22 Oct - Processing II (Luca) Flashcards
What do we define as outliers?
Data points that differ significantly from others
What kind of impact can outliers have in machine learning?
KNN: overfitted partition of the space
Linear regression: very sensitive to outliers, which decrease the quality of the fit
Decision trees: more robust to outliers, but outliers can still cause overfitting
How can we deal with outliers?
Trimming / truncation:
drop values above/below a certain value or percentile
Winsorizing / winsorization:
set values above/below a certain value or percentile to that threshold (the closest retained value) instead of dropping them
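The difference between the two approaches can be seen in a minimal sketch. The thresholds `low` and `high` and the sample data are illustrative assumptions, not values from the lecture; in practice they would often come from percentiles of the data.

```python
def trim(values, low, high):
    """Trimming: drop every value outside [low, high]."""
    return [v for v in values if low <= v <= high]


def winsorize(values, low, high):
    """Winsorizing: keep every value, but clamp it into [low, high]."""
    return [min(max(v, low), high) for v in values]


data = [1, 12, 14, 15, 13, 16, 95]  # 1 and 95 look like outliers
low, high = 10, 20                  # hypothetical thresholds for this example

trimmed = trim(data, low, high)          # fewer records: outliers are removed
winsorized = winsorize(data, low, high)  # same number of records: outliers are clamped
```

Trimming changes the number of records, while winsorizing keeps the dataset size and only pulls the extreme values in.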
What is an isolation tree?
A simple yet effective anomaly detection method
* Uses a principle similar to decision trees
* Algorithm:
  * Pick a random split between min and max
  * Count how many points are in each partition
  * Points in singleton partitions are marked with the number of splits done so far
  * Repeat the split on each partition
  * Stop when all points are in singleton partitions
* A point’s number of splits is its anomaly score: points isolated after few splits are more anomalous
* The score needs normalization (see original paper)
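The algorithm above can be sketched for one-dimensional data as follows. This is a rough illustration, not the reference implementation: the function name `isolation_depths`, the sample data, and the choice to average over 200 trees are assumptions, and the normalization step from the original paper is left out.

```python
import random


def isolation_depths(points, rng, depth=0, out=None):
    """Return {point: number of splits needed to isolate it} for distinct 1-D points."""
    if out is None:
        out = {}
    lo, hi = min(points), max(points)
    if len(points) == 1 or lo == hi:
        # Singleton partition (or identical values): record the split count.
        for p in points:
            out[p] = depth
        return out
    while True:
        split = rng.uniform(lo, hi)  # random split between min and max
        left = [p for p in points if p < split]
        right = [p for p in points if p >= split]
        if left and right:           # retry the rare split that separates nothing
            break
    isolation_depths(left, rng, depth + 1, out)
    isolation_depths(right, rng, depth + 1, out)
    return out


rng = random.Random(0)             # fixed seed so the run is repeatable
data = [10, 11, 12, 13, 14, 100]   # 100 is far from the rest
n_trees = 200

# A single tree is noisy, so average the split counts over many random trees.
totals = {p: 0 for p in data}
for _ in range(n_trees):
    for p, d in isolation_depths(data, rng).items():
        totals[p] += d
avg_depth = {p: totals[p] / n_trees for p in data}
```

In this sketch the point 100 ends up with a much smaller average split count than the points in the cluster, which is what flags it as anomalous.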
How do we build an isolation forest on n-dimensional data?
- N-dimensional data (X1, …, Xn)
- Pick a random dimension Xi
- Pick a threshold t on dimension Xi (at random between its min and max, as in the one-dimensional case)
- Split the data points on t
- Repeat from picking a random dimension, recursively on each partition
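A minimal sketch of these steps, assuming points are given as tuples of numbers. The helper names `isolate` and `average_depths` are hypothetical, and restricting the random choice to dimensions whose values actually differ is an assumption of this sketch to avoid splits that separate nothing. Details such as subsampling and score normalization are not covered here.

```python
import random


def isolate(indices, rows, rng, depth, depths):
    """Recursively split `indices` until each point is alone; record split counts."""
    n_dims = len(rows[0])
    # Assumption of this sketch: only pick dimensions that can separate points.
    usable = [
        d for d in range(n_dims)
        if min(rows[i][d] for i in indices) < max(rows[i][d] for i in indices)
    ]
    if len(indices) <= 1 or not usable:
        for i in indices:
            depths[i] = depth
        return
    dim = rng.choice(usable)                     # pick a random dimension Xi
    lo = min(rows[i][dim] for i in indices)
    hi = max(rows[i][dim] for i in indices)
    while True:
        t = rng.uniform(lo, hi)                  # pick a threshold t on Xi
        left = [i for i in indices if rows[i][dim] < t]
        right = [i for i in indices if rows[i][dim] >= t]
        if left and right:
            break
    isolate(left, rows, rng, depth + 1, depths)   # repeat on each partition
    isolate(right, rows, rng, depth + 1, depths)


def average_depths(rows, n_trees=200, seed=0):
    """Average the per-point split counts over several random trees."""
    rng = random.Random(seed)
    totals = [0] * len(rows)
    for _ in range(n_trees):
        depths = {}
        isolate(list(range(len(rows))), rows, rng, 0, depths)
        for i, d in depths.items():
            totals[i] += d
    return [t / n_trees for t in totals]


rows = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (0.8, 0.8), (9.0, 9.0)]
avg = average_depths(rows)  # the last point should be isolated fastest
```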
Why should we remember to handle outliers carefully?
Outliers can be:
* Noise, which we want to remove
* Originated by processes we are not interested in modeling (e.g., Twitter bots)
* Originated by processes that differ from the main one we are focusing on,
but are still relevant to the analysis (power users)
How to handle missing data?
- Where is it?
  - Everywhere in real-world datasets
- Why is it missing?
  - Value is unknown (e.g., missing measurement)
  - Value is zero or does not apply
- How is it represented?
  - Empty, None, “N/A”, “Null” …
- Who should deal with it?
  - If you are performing the analysis, you should
  - If you are preparing the dataset, your users should, but you should document what missing values mean
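Because missing values show up under several spellings, a common first step is to map them to one marker. The sketch below is only an illustration: the token set `MISSING_TOKENS` and the function `normalize_missing` are hypothetical, and the right list of tokens depends on the dataset and its documentation.

```python
# Hypothetical set of spellings treated as "missing"; extend it per dataset.
MISSING_TOKENS = {"", "n/a", "na", "null", "none"}


def normalize_missing(value):
    """Map the various spellings of 'missing' to None; leave real values alone."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in MISSING_TOKENS:
        return None
    return value


raw = ["42", "", "N/A", None, "Null", "0", "7.5"]
clean = [normalize_missing(v) for v in raw]
```

Note that `"0"` is kept as a real value in this sketch. Whether a zero actually encodes "missing" is dataset-specific, which is why the meaning of missing values should be documented.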
What are the three types of missing data?
* Missing Completely At Random (MCAR)
* Missing At Random (MAR)
* Missing Not At Random (MNAR)
How can we deal with missing data?
Deletion
* Delete records with missing values
* Suitable when
  * Data is Missing Completely At Random (otherwise bias is introduced)
  * A small % of records is missing
  * There is no reliable way to infer the value (see next)
* Deletion of entire column when
  * A large % of records is missing
  * Column is not crucial for the analysis
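Both kinds of deletion can be sketched on a small list of records. The sample records, the helper names, and the 50% cut-off for dropping a column are assumptions made for illustration, not values from the lecture.

```python
records = [
    {"age": 31, "city": "Oslo", "income": None},
    {"age": None, "city": "Rome", "income": None},
    {"age": 45, "city": "Lima", "income": 52000},
    {"age": 28, "city": "Kyiv", "income": None},
]


def drop_incomplete_rows(rows):
    """Record deletion: keep only rows with no missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def missing_fraction(rows, column):
    """Share of rows in which `column` is missing."""
    return sum(r[column] is None for r in rows) / len(rows)


def drop_sparse_columns(rows, max_missing=0.5):
    """Column deletion: drop columns missing in more than `max_missing` of rows."""
    keep = [c for c in rows[0] if missing_fraction(rows, c) <= max_missing]
    return [{c: r[c] for c in keep} for r in rows]


complete_rows = drop_incomplete_rows(records)           # only one row survives
without_sparse = drop_sparse_columns(records)           # "income" is dropped
rows_after_both = drop_incomplete_rows(without_sparse)  # more rows survive
```

Dropping the mostly empty column first keeps more records than deleting incomplete rows right away, which is why the share of missing values per column is worth checking before deleting records.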
Single imputation – average
* Filling missing values with some global criterion
  * Default value
  * Fill with average/median/mode
* Suitable when
  * Data is Missing Completely At Random
  * A small % of records is missing
* Introduces bias in distribution
* Variability is reduced
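The loss of variability can be checked on a toy example. This is a sketch with made-up numbers; it uses the mean as the global criterion and compares the population variance of the observed values with that of the imputed column.

```python
import statistics

column = [4.0, None, 6.0, 8.0, None, 10.0]  # None marks a missing value

observed = [v for v in column if v is not None]
fill = statistics.mean(observed)            # global criterion: the average

imputed = [fill if v is None else v for v in column]

var_before = statistics.pvariance(observed)  # spread of the observed values only
var_after = statistics.pvariance(imputed)    # spread after filling the gaps
```

The mean is unchanged by construction, but every filled value sits exactly at the centre of the distribution, so `var_after` is smaller than `var_before`.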
Single imputation – local average
* Filling missing values using local information
* Applicable to data in which records are not independent, and for which the assumption “what is close is similar” holds
  * Time series
  * Audio
  * Spatial data
  * Images
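For a time series, one simple form of local imputation is to fill each gap with the average of the nearest observed value on each side. The function `fill_local_average` and the sample readings are hypothetical; other local schemes (interpolation, rolling windows, spatial neighbours) follow the same idea.

```python
def fill_local_average(series):
    """Fill each None with the mean of the nearest observed neighbours."""
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        # Nearest observed value before and after position i (if any).
        prev_vals = [x for x in reversed(series[:i]) if x is not None]
        next_vals = [x for x in series[i + 1:] if x is not None]
        neighbours = prev_vals[:1] + next_vals[:1]
        if neighbours:
            filled[i] = sum(neighbours) / len(neighbours)
    return filled


temps = [20.0, 21.0, None, 23.0, 35.0, None, 37.0]  # made-up readings
local = fill_local_average(temps)

observed = [v for v in temps if v is not None]
global_mean = sum(observed) / len(observed)  # what a global average would use
```

The two gaps are filled with different values that match their surroundings, whereas the global mean would put the same number in both places even though the series changes level between them.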