22 Oct - Processing II (Luca) Flashcards

1
Q

What do we define as outliers?

A

Data points that differ significantly from others

2
Q

What kind of impact can outliers have in machine learning?

A

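KNN: outliers lead to an overfitted partition of the feature space
Linear regression: very sensitive to outliers, which pull the fitted line towards them and decrease the quality of the fit
Decision trees: more robust to outliers, but outliers can still introduce overfitting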
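
The sensitivity of linear regression can be illustrated with a minimal sketch. The data and the helper name `ols_slope` are made up for illustration; the slope is the ordinary least-squares estimate for one feature.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x for a single feature."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

xs = list(range(10))
ys_clean = [2 * x for x in xs]          # points lie exactly on y = 2x
ys_outlier = ys_clean[:-1] + [100]      # same data, last point replaced by an outlier

slope_clean = ols_slope(xs, ys_clean)
slope_outlier = ols_slope(xs, ys_outlier)
print(slope_clean, slope_outlier)
```

A single extreme point is enough to change the fitted slope substantially, because squared errors give large residuals a large weight.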

3
Q

How can we deal with outliers?

A

Trimming / truncation:
drop values above/below a chosen threshold value or percentile

Winsorizing / winsorization:
replace values above/below a chosen threshold value or percentile with that threshold, i.e. the closest value that is kept
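
A minimal sketch of both approaches on a plain list. The function names, the nearest-rank percentile rule, and the 10th/90th percentile cut-offs are assumptions chosen for illustration, not part of the lecture material.

```python
def percentile(sorted_vals, p):
    """Nearest-rank style percentile (p in 0..100) of an already sorted list."""
    k = (p * (len(sorted_vals) - 1) + 50) // 100
    return sorted_vals[k]

def trim(values, low_p=5, high_p=95):
    """Trimming: drop every value outside the [low_p, high_p] percentile range."""
    s = sorted(values)
    lo, hi = percentile(s, low_p), percentile(s, high_p)
    return [v for v in values if lo <= v <= hi]

def winsorize(values, low_p=5, high_p=95):
    """Winsorizing: clamp every value into the [low_p, high_p] percentile range."""
    s = sorted(values)
    lo, hi = percentile(s, low_p), percentile(s, high_p)
    return [min(max(v, lo), hi) for v in values]

data = [1, 12, 13, 14, 15, 16, 17, 18, 19, 20, 500]
trimmed = trim(data, 10, 90)          # extreme records are removed
winsorized = winsorize(data, 10, 90)  # extreme records are kept but capped
print(trimmed)
print(winsorized)
```

Trimming shrinks the dataset, while winsorizing keeps the number of records unchanged and only caps the extreme values.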

4
Q

What is an isolation tree?

A

Simple yet effective anomaly detection method
* Uses a principle similar to decision trees
* Algorithm:
  * Pick a random split point between the min and max
  * Count how many points fall in each partition
  * Mark each point in a singleton partition with the number of splits done so far
  * Repeat the split on each remaining partition
  * Stop when all points are in singleton partitions
* A point’s number of splits is its anomaly score: the fewer splits needed to isolate it, the more anomalous the point
* The score needs normalization (see original paper)
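
The algorithm above can be sketched for 1-D data as follows. This is an illustrative sketch only: it assumes distinct values, skips the normalization step from the original paper, and averages the raw split counts over several random trees. The function name `isolation_depths` is hypothetical.

```python
import random

def isolation_depths(values, rng, depth=0, out=None):
    """Split `values` at random thresholds and record, for each value,
    the number of splits done when it ends up alone in its partition."""
    if out is None:
        out = {}
    lo, hi = min(values), max(values)
    if lo == hi:                      # singleton partition: nothing left to split
        for v in values:
            out[v] = depth
        return out
    left = []
    while not left:                   # retry if the threshold leaves one side empty
        t = rng.uniform(lo, hi)
        left = [v for v in values if v < t]
    right = [v for v in values if v >= t]
    isolation_depths(left, rng, depth + 1, out)
    isolation_depths(right, rng, depth + 1, out)
    return out

data = [10, 11, 12, 13, 14, 15, 100]
rng = random.Random(0)                # fixed seed so the run is repeatable
n_trees = 200
totals = {v: 0 for v in data}
for _ in range(n_trees):
    for v, d in isolation_depths(data, rng).items():
        totals[v] += d
avg_depth = {v: totals[v] / n_trees for v in data}
print(avg_depth)
```

The value 100 sits far from the rest, so a random split usually isolates it almost immediately, while values in the dense region need more splits.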

5
Q

How do we build an isolation forest on n-dimensional data?

A
  1. Start from n-dimensional data (X1, …, Xn)
  2. Pick a random dimension Xi
  3. Pick a random threshold t on dimension Xi
  4. Split the data points on t
  5. Repeat from step 2 recursively on each partition
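
A sketch of these steps on 2-D points, using only the standard library. Two choices here are assumptions for illustration: the random dimension is drawn only from dimensions in which the partition still has some spread, and raw path lengths are averaged over trees without the normalization used in the original paper. The name `path_lengths` is hypothetical.

```python
import random

def path_lengths(points, rng, idx=None, depth=0, out=None):
    """Return, for each point, the number of splits needed to isolate it in one random tree."""
    if idx is None:
        idx = list(range(len(points)))
        out = [0] * len(points)
    n_dims = len(points[0])
    # Dimensions along which this partition can still be split
    dims = [d for d in range(n_dims)
            if min(points[i][d] for i in idx) < max(points[i][d] for i in idx)]
    if not dims:                      # single point (or identical points): stop
        for i in idx:
            out[i] = depth
        return out
    d = rng.choice(dims)              # step 2: random dimension
    lo = min(points[i][d] for i in idx)
    hi = max(points[i][d] for i in idx)
    left = []
    while not left:                   # step 3: random threshold, retried if one side is empty
        t = rng.uniform(lo, hi)
        left = [i for i in idx if points[i][d] < t]
    right = [i for i in idx if points[i][d] >= t]   # step 4: split on t
    path_lengths(points, rng, left, depth + 1, out)  # step 5: recurse
    path_lengths(points, rng, right, depth + 1, out)
    return out

points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (10, 10)]
rng = random.Random(0)
n_trees = 200
sums = [0] * len(points)
for _ in range(n_trees):
    for i, d in enumerate(path_lengths(points, rng)):
        sums[i] += d
avg_path = [s / n_trees for s in sums]
print(avg_path)
```

The point (10, 10) should get the shortest average path, which marks it as the most anomalous point in this toy dataset.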
6
Q

Why should we remember to handle outliers carefully?

A

Outliers can be:
* Noise, which we want to remove
* Generated by processes we are not interested in modeling (e.g., Twitter bots)
* Generated by processes that differ from the main one we are focusing on, but are still relevant to the analysis (e.g., power users)

6
Q

How to handle missing data?

A
  • Where is it?
    • Everywhere in real-world datasets
  • Why is it missing?
    • Value is unknown (e.g., missing measurement)
    • Value is zero or does not apply
  • How is it represented?
    • Empty, None, “N/A”, “Null” …
  • Who should deal with it?
    • If you are performing the analysis, you should
    • If you are preparing the dataset, your users should, but you should document what missing values mean
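
Because missing values come in several representations, a common first step is to map them all to one marker. A minimal sketch follows; the token list in `MISSING_TOKENS` and the function name are assumptions and should be adapted to what the dataset documentation says.

```python
# Assumed spellings of "missing"; extend to match the dataset at hand
MISSING_TOKENS = {"", "n/a", "na", "null", "none", "nan"}

def normalize_missing(value):
    """Map the different textual spellings of a missing value to None."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in MISSING_TOKENS:
        return None
    return value

raw = ["3.2", "N/A", "", None, "Null", "0"]
clean = [normalize_missing(v) for v in raw]
print(clean)
```

Note that "0" is kept as a value: whether zero means "missing" or a genuine zero depends on the dataset, which is why the meaning of missing values should be documented.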
7
Q

What are the three types of missing data?

A

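* Missing Completely At Random (MCAR): missingness is unrelated to any data, observed or unobserved
* Missing At Random (MAR): missingness depends only on observed values
* Missing Not At Random (MNAR): missingness depends on the missing value itself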

8
Q

How can we deal with missing data?

A

Deletion
* Delete records with missing values
  * Suitable when
    * Data is Missing Completely At Random; otherwise, bias is introduced
    * Only a small % of records have missing values
    * There is no reliable way to infer the value (see imputation below)
* Delete the entire column when
  * A large % of its values are missing
  * The column is not crucial for the analysis
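
A sketch of both kinds of deletion on a list of dict records. The toy records, the function names, and the 50% threshold for dropping a column are assumptions for illustration.

```python
def drop_sparse_columns(rows, max_missing_fraction=0.5):
    """Drop columns whose share of missing (None) values exceeds the threshold."""
    keep = [c for c in rows[0]
            if sum(r[c] is None for r in rows) / len(rows) <= max_missing_fraction]
    return [{c: r[c] for c in keep} for r in rows]

def drop_incomplete_rows(rows):
    """Drop records that still contain at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]

rows = [
    {"age": 31, "income": 40, "fax": None},
    {"age": None, "income": 52, "fax": None},
    {"age": 45, "income": 61, "fax": None},
    {"age": 28, "income": 38, "fax": "555"},
]
# Drop the mostly-empty column first, so that it does not force
# the deletion of otherwise complete records.
result = drop_incomplete_rows(drop_sparse_columns(rows))
print(result)
```

The order matters: deleting rows first would leave only one record here, because the "fax" column is missing almost everywhere.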

Single imputation – average
* Fill missing values using some global criterion
  * Default value
  * Average/median/mode
* Suitable when
  * Data is Missing Completely At Random
  * Only a small % of records have missing values
* Introduces bias in the distribution
  * Variability is reduced
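
A small sketch of mean imputation that also shows the reduced variability. The values and the function name `impute_mean` are made up for illustration.

```python
import statistics

def impute_mean(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

values = [2.0, 4.0, None, 6.0, None, 8.0]
imputed = impute_mean(values)

observed = [v for v in values if v is not None]
var_observed = statistics.variance(observed)   # sample variance before imputation
var_imputed = statistics.variance(imputed)     # sample variance after imputation
print(imputed, var_observed, var_imputed)
```

Every imputed record sits exactly at the mean, so the spread of the column shrinks even though the mean itself is unchanged.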

Single imputation – local average
* Fill missing values using local information
* Applicable to data in which records are not independent, and for which the assumption “what is close is similar” holds
  * Time series
  * Audio
  * Spatial data
  * Images
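
For a time series, one simple form of local imputation is to fill each gap with the mean of the nearest observed neighbours. This is a sketch under that assumption; the function name `impute_local` and the data are hypothetical, and other local schemes (e.g. interpolation) are equally possible.

```python
def impute_local(series):
    """Fill each None with the mean of the nearest observed value on each side."""
    filled = list(series)
    for i, v in enumerate(series):
        if v is None:
            # Nearest observed value to the left and to the right, if any
            left = next((series[j] for j in range(i - 1, -1, -1)
                         if series[j] is not None), None)
            right = next((series[j] for j in range(i + 1, len(series))
                          if series[j] is not None), None)
            neighbours = [x for x in (left, right) if x is not None]
            if neighbours:
                filled[i] = sum(neighbours) / len(neighbours)
    return filled

series = [10.0, None, 14.0, 15.0, None, None, 21.0]
filled = impute_local(series)
print(filled)
```

Unlike a global average, the filled values follow the local trend of the series.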
