22 Oct - Processing II (Luca) Flashcards

Question 1

Q

What do we define as outliers?

Answer

A

Data points that differ significantly from others

Question 2

Q

What kind of impact can outliers have in machine learning?

Answer

A

KNN: outliers distort the partition of the space, which leads to overfitting
Linear regression: very sensitive to outliers, which decrease the quality of the fit
Decision trees: more robust to outliers, but they can still introduce overfitting
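
To make the linear regression case concrete, here is a minimal sketch of an ordinary least squares fit with and without a single outlier. The data and the helper name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5]
ys_clean = [2, 4, 6, 8, 10]      # exactly y = 2x
ys_outlier = [2, 4, 6, 8, 40]    # last point replaced by an outlier

clean_slope, clean_intercept = fit_line(xs, ys_clean)        # slope 2.0, intercept 0.0
outlier_slope, outlier_intercept = fit_line(xs, ys_outlier)  # slope 8.0, intercept -12.0
```

One extreme point out of five is enough to quadruple the slope, so the fitted line no longer describes the other four points.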

Question 3

Q

How can we deal with outliers?

Answer

A

Trimming / truncation:
drop values above/below a certain value or percentile

Winsorizing / winsorization:
set values above/below a certain value or percentile to that threshold, i.e. the closest retained value
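
The difference between the two approaches can be sketched as follows. The fixed bounds 0 and 10 and the function names are assumptions for illustration; in practice the bounds are often percentiles of the data.

```python
def trim(values, low, high):
    """Trimming: drop every value outside [low, high]."""
    return [v for v in values if low <= v <= high]

def winsorize(values, low, high):
    """Winsorizing: clamp every value outside [low, high] to the nearest bound."""
    return [min(max(v, low), high) for v in values]

data = [-50, 3, 4, 5, 6, 7, 120]

trimmed = trim(data, 0, 10)          # [3, 4, 5, 6, 7]
winsorized = winsorize(data, 0, 10)  # [0, 3, 4, 5, 6, 7, 10]
```

Trimming changes the number of records, while winsorizing keeps every record and only changes the extreme values.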

Question 4

Q

What is an isolation tree?

Answer

A

Simple yet effective anomaly detection
* Uses a principle similar to Decision Trees
* Algorithm:
  * Pick a random split between min and max
  * Count how many points are in each partition
  * Points in singleton partitions are marked with the number of splits done so far
  * Repeat the split on each partition
  * Stop when all points are in singleton partitions
* A point’s number of splits is its anomaly score: the fewer splits needed to isolate it, the more anomalous it is
* The score needs normalization (see original paper)
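
The steps above can be sketched for one-dimensional data as follows. This is a simplified illustration, not the reference implementation: the function name `isolation_depths`, the data, and the choice of averaging over 200 trees are assumptions, and the raw split counts are left unnormalized.

```python
import random

def isolation_depths(points, rng, depth=0, out=None):
    """Map each 1-D point to the number of random splits needed to isolate it."""
    if out is None:
        out = {}
    if len(points) <= 1 or min(points) == max(points):
        for p in points:              # singleton partition: record the split count
            out[p] = depth
        return out
    split = rng.uniform(min(points), max(points))  # random split between min and max
    left = [p for p in points if p < split]
    right = [p for p in points if p >= split]
    if not left or not right:         # degenerate draw on the boundary: redraw
        return isolation_depths(points, rng, depth, out)
    isolation_depths(left, rng, depth + 1, out)
    isolation_depths(right, rng, depth + 1, out)
    return out

data = [10, 11, 12, 13, 14, 100]      # 100 is the obvious anomaly
rng = random.Random(0)
n_trees = 200

# Average the split counts over many random trees to smooth out the randomness.
avg_depth = {p: 0.0 for p in data}
for _ in range(n_trees):
    for p, d in isolation_depths(data, rng).items():
        avg_depth[p] += d / n_trees
```

The point 100 is usually isolated by the very first split, while a point in the middle of the cluster, such as 12, always needs at least two.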

Question 5

Q

How does an isolation forest work on n-dimensional data?

Answer

A

N-dimensional data (X1, …, Xn)
1. Pick a random dimension Xi
2. Pick a random threshold t on dimension Xi
3. Split the datapoints on t
4. Repeat from point 1. recursively on each partition
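
A minimal sketch of these steps, under some simplifying assumptions: instead of building the full tree, it follows only the partition that contains the point of interest, since a point's split count depends only on its own path. The names `isolation_depth` and `average_depth`, the data, and the 200-tree average are hypothetical choices.

```python
import random

def isolation_depth(points, target, rng, depth=0):
    """Number of random splits needed to isolate `target` among `points`."""
    if len(points) <= 1:
        return depth
    # Only dimensions that still have more than one distinct value can be split.
    dims = [d for d in range(len(target)) if len({p[d] for p in points}) > 1]
    if not dims:                          # only duplicates left
        return depth
    dim = rng.choice(dims)                # pick a random dimension Xi
    values = [p[dim] for p in points]
    t = rng.uniform(min(values), max(values))  # pick a random threshold t
    # Split on t and keep the side that contains the target.
    same_side = [p for p in points if (p[dim] < t) == (target[dim] < t)]
    if len(same_side) == len(points):     # degenerate draw on the boundary: redraw
        return isolation_depth(points, target, rng, depth)
    return isolation_depth(same_side, target, rng, depth + 1)

def average_depth(points, target, n_trees=200, seed=0):
    """Average split count of `target` over several random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(points, target, rng) for _ in range(n_trees)) / n_trees

points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2), (0.4, 0.4), (10.0, 10.0)]
outlier_depth = average_depth(points, (10.0, 10.0))
inlier_depth = average_depth(points, (0.1, 0.3))
```

The point far from the cluster ends up with a clearly lower average split count than the point inside it.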

Question 6

Q

Why should we remember to handle outliers carefully?

Answer

A

Outliers can be:
* Noise, which we want to remove
* Generated by processes we are not interested in modeling (e.g., Twitter bots)
* Generated by processes that differ from the main one we are focusing on, but are still relevant to the analysis (e.g., power users)

Question 7

Q

How to handle missing data?

Answer

A

Where is it?
* Everywhere in real-world datasets
Why is it missing?
* Value is unknown (e.g., missing measurement)
* Value is zero or does not apply
How is it represented?
* Empty, None, “N/A”, “Null” …
Who should deal with it?
* If you are performing the analysis, you should
* If you are preparing the dataset, your users should, but you should document what missing values mean
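
Because missing values come in several representations, a common first step is to map them all to a single marker. The sketch below is one possible way to do that; the token set `MISSING_TOKENS` and the function name are assumptions, and the right set depends on the dataset's documentation.

```python
# Assumed set of strings that mean "missing" in this hypothetical dataset.
MISSING_TOKENS = {"", "n/a", "null", "none"}

def normalize_missing(value):
    """Map every known representation of a missing value to None."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in MISSING_TOKENS:
        return None
    return value

row = ["42", "", "N/A", None, "Null", "0"]
cleaned = [normalize_missing(v) for v in row]  # ['42', None, None, None, None, '0']
```

Note that "0" is kept: whether a zero means "missing" or a real measurement is exactly what the dataset documentation should state.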

Question 8

Q

What are the three types of missing data?

Answer

A

* Missing Completely At Random (MCAR)
* Missing At Random (MAR)
* Missing Not At Random (MNAR)

Question 9

Q

How can we deal with missing data?

Answer

A

Deletion
* Delete records with missing values
* Suitable when
  * Data is Missing Completely At Random; otherwise, bias is introduced
  * A small % of records is missing
  * There is no reliable way to infer the value (see next)
* Deletion of entire column when
  * A large % of records is missing
  * Column is not crucial for the analysis
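
The bias introduced by deletion can be illustrated with a small simulation. This is a sketch on synthetic data: the normal distribution, the 20% missing rate, and the cutoff of 60 are all arbitrary assumptions.

```python
import random
from statistics import mean

rng = random.Random(0)
values = [rng.gauss(50, 10) for _ in range(10_000)]
true_mean = mean(values)

# Completely at random: every value has the same 20% chance of being missing.
mcar = [v if rng.random() > 0.2 else None for v in values]
# Not at random: values above 60 are the ones that go missing.
mnar = [v if v <= 60 else None for v in values]

def drop_missing(records):
    """Deletion: keep only the records that have a value."""
    return [v for v in records if v is not None]

mcar_mean = mean(drop_missing(mcar))   # stays close to true_mean
mnar_mean = mean(drop_missing(mnar))   # noticeably lower than true_mean
```

Deleting records leaves the mean roughly unchanged when values are missing completely at random, but shifts it when missingness depends on the value itself.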

Single imputation – average
* Filling missing values with some global criterion
  * Default value
  * Fill with average/median/mode
* Suitable when
  * Data is Missing Completely At Random
  * A small % of records is missing
* Introduces bias in distribution
* Variability is reduced
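
A minimal sketch of mean imputation on made-up values, showing why variability is reduced: every filled value sits exactly at the mean, so the spread shrinks.

```python
from statistics import mean, pvariance

values = [2.0, None, 4.0, 6.0, None, 8.0]

observed = [v for v in values if v is not None]
fill = mean(observed)                                  # 5.0
imputed = [fill if v is None else v for v in values]   # [2.0, 5.0, 4.0, 6.0, 5.0, 8.0]

variance_before = pvariance(observed)   # 5.0
variance_after = pvariance(imputed)     # about 3.33
```

The mean is preserved, but the variance drops because the imputed records contribute no deviation from it.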

Single imputation – local average
* Filling missing values using local information
* Applicable to data in which records are not independent, and for which the assumption “what is close is similar” holds
  * Time series
  * Audio
  * Spatial data
  * Images
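
For a time series, one simple form of local imputation is to fill each gap with the mean of the nearest known values on either side. The sketch below assumes this particular rule; the function name `fill_local` and the data are hypothetical, and other local schemes (e.g. interpolation) are equally valid.

```python
def nearest_known(series, start, step):
    """Return the first non-missing value from `start`, moving by `step`."""
    i = start
    while 0 <= i < len(series):
        if series[i] is not None:
            return series[i]
        i += step
    return None

def fill_local(series):
    """Replace each None with the mean of its nearest known neighbours."""
    filled = list(series)
    for i, v in enumerate(series):
        if v is None:
            left = nearest_known(series, i - 1, -1)
            right = nearest_known(series, i + 1, +1)
            neighbours = [x for x in (left, right) if x is not None]
            if neighbours:               # leave the gap if nothing is known
                filled[i] = sum(neighbours) / len(neighbours)
    return filled

temps = [10.0, None, 14.0, 15.0, None, None, 21.0]
filled = fill_local(temps)   # [10.0, 12.0, 14.0, 15.0, 18.0, 18.0, 21.0]
```

Unlike a global average, each filled value depends only on its neighbours, so local trends in the series are preserved.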

Question 10

Q