! S7: Outlier Detection, Feature Selection & Recommender System Flashcards
1
Q
Outlier - Definition
A
- Data object that deviates significantly from normal objects, as if it were generated by a different mechanism
- Detection can be supervised or unsupervised
2
Q
Outlier - issue
A
- hard to define precisely
3
Q
Outlier - causes
A
- Measurement error
- Data entry error
- Contamination of data from different sources
- Rare event
4
Q
Outlier - Application
A
- Data Cleaning
- Security & Fraud detection
- Detecting natural disasters
- Astronomy
- Genetics
5
Q
Local vs Global vs Collective Outlier
A
- Local = value within the range of the entire dataset, but unusually high or low compared to its surrounding points
- Global = very high or very low value relative to all values in the dataset
- Collective (outlier group) = a collection of points that deviates significantly from the entire dataset, although the individual points are not global or local outliers
6
Q
Outlier - Detection Methods
A
- Graphical
- Cluster based
- Model based
- Distance based
- Supervised learning
7
Q
Outlier - Detection Methods - Graphical
A
- Plot the data & look for weird points (a human decides whether a value is an outlier)
- e.g. Box Plot (1 variable at a time) & Scatterplot (2 variables)
8
Q
Outlier - Detection Methods - Cluster based
A
- Cluster the data & find points that do not belong to any cluster
- K-Means: points far away from any mean, or clusters with a small number of points
- Density-based: points not assigned to any cluster
- Hierarchical: points that take longer to merge with other groups
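The K-Means bullet ("points far away from any mean") can be sketched as below. This is only an illustration: the centroids are assumed to have been computed by an earlier K-Means run, and `far_from_centroids` and `max_dist` are hypothetical names and parameters, not part of the card.

```python
import math

def far_from_centroids(points, centroids, max_dist):
    """Return the points whose nearest centroid is farther away than max_dist."""
    return [
        p for p in points
        if min(math.dist(p, c) for c in centroids) > max_dist
    ]

# Assumed output of a previous K-Means run (made-up values for illustration).
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(0.5, 0.2), (9.8, 10.1), (5.0, 5.0), (-0.3, 0.1)]

flagged = far_from_centroids(points, centroids, max_dist=2.0)
print(flagged)
```

The cutoff `max_dist` has to be chosen by hand here, which is one reason the result still needs a human check.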
9
Q
Outlier - Detection Methods - Model-based
A
- Fit a probabilistic model -> outliers = examples with low probability (e.g. z-score = number of standard deviations away from the mean)
- Con: mean & variance are sensitive to outliers (-> solution: use quantiles, or remove outliers sequentially)
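A minimal sketch of the z-score rule and a quantile-based alternative, using only the Python standard library. The function names, the thresholds (3 standard deviations, 1.5 × IQR) and the sample data are assumptions chosen for illustration.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return []  # all values identical, nothing can be flagged
    return [v for v in values if abs(v - mean) / sd > threshold]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (quantile-based rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [10, 11, 9, 10, 12, 11, 10, 9, 11, 100]

print(zscore_outliers(data))  # the value 100 inflates mean and sd, so it is not flagged
print(iqr_outliers(data))     # the quantile rule is not distorted by the extreme value
```

On this sample the extreme value pulls the mean and standard deviation toward itself, so its z-score stays just below 3, while the quantile rule still flags it.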
10
Q
Outlier - Detection Methods - Distance-based
A
- Measure distances between points & flag points that are far from the others as outliers
11
Q
Outlier - Detection Methods - Distance-based - Global Outliers
A
- KNN:
- compute the average distance from each point to its KNN
- flag the points that are farthest from their KNNs
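The two KNN steps above can be sketched as follows, assuming Euclidean distance and a brute-force neighbor search. The function name `knn_avg_distances` and the sample points are made up for illustration.

```python
import math

def knn_avg_distances(points, k):
    """For each point, return the average distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_avg_distances(points, k=2)

# The point with the largest score is the strongest global outlier candidate.
most_distant = max(range(len(points)), key=lambda i: scores[i])
print(points[most_distant], scores[most_distant])
```

How many of the top-scoring points count as outliers is a separate choice (a fixed number, or a cutoff on the score).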
12
Q
Outlier - Detection Methods - Distance-based - Local Outliers
A
- x = (avg. distance of point “i” to its KNNs) / (avg. distance of the neighbors of “i” to their KNNs)
- x > 1 = point “i” is farther from its neighbors than they are, on average, from theirs -> outlier
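A sketch of the ratio x, again assuming Euclidean distance and a brute-force search. The helper names are hypothetical, and the code assumes no duplicate points (otherwise the denominator could be zero).

```python
import math

def knn_indices(points, i, k):
    """Indices of the k nearest neighbors of point i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def avg_knn_distance(points, i, k):
    """Average distance of point i to its k nearest neighbors."""
    return sum(math.dist(points[i], points[j]) for j in knn_indices(points, i, k)) / k

def local_outlier_ratio(points, i, k):
    """x = own avg. KNN distance / mean avg. KNN distance of i's neighbors."""
    own = avg_knn_distance(points, i, k)
    neighbor_avgs = [avg_knn_distance(points, j, k) for j in knn_indices(points, i, k)]
    return own / (sum(neighbor_avgs) / len(neighbor_avgs))

points = [(0, 0), (0, 1), (1, 0), (1, 1), (3, 3)]
ratios = [local_outlier_ratio(points, i, k=2) for i in range(len(points))]
print(ratios)
```

Points inside the tight square get a ratio close to 1, while (3, 3) gets a ratio well above 1 because its neighbors sit much closer to each other than to it.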
13
Q
Feature Selection - Goal
A
- Selecting the features that are relevant for predicting yi from xi
- Tradeoff: don't lose information, but gain speed / save memory
14
Q
Feature Selection Approaches
A
- Association
- Regression Weight
- Search & Score Methods
- Forward Selection
15
Q
Feature Selection Approaches - Association
A
- Hypothesis test for each feature to decide whether to select it
- For each feature j: compute the correlation between the feature values xj and y
- Select j if correlation > 0.9 or < -0.9
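The selection rule above can be sketched with a hand-written Pearson correlation. The feature names and values are invented for illustration, and the 0.9 cutoff is taken from the card.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)

def select_by_correlation(columns, y, threshold=0.9):
    """Keep feature j if corr(xj, y) > threshold or < -threshold."""
    return [name for name, xs in columns.items() if abs(pearson(xs, y)) > threshold]

y = [1, 2, 3, 4, 5]
columns = {
    "f_pos": [2, 4, 6, 8, 10.5],   # rises with y
    "f_neg": [5, 4, 3, 2, 1],      # falls as y rises
    "f_noise": [3, 1, 3, 1, 3],    # no linear relation to y
}

selected = select_by_correlation(columns, y)
print(selected)
```

Correlation only captures linear, one-feature-at-a-time relationships, so a feature that matters only in combination with others would be missed by this rule.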