S7: Outlier Detection, Feature Selection & Recommender System Flashcards by Linda Caro

Outlier - Definition

A data object that deviates significantly from the normal objects, as if it were generated by a different mechanism
Detection can be supervised or unsupervised

Outlier - issue

hard to define precisely

Outlier - causes

Measurement error
Data entry error
Contamination of data from different sources
Rare event

Outlier - Application

Data Cleaning
Security & Fraud detection
Detecting natural disasters
Astronomy
Genetics

Local vs Global Outlier

Local = value within the range of the entire dataset, but unusually high or low relative to its surrounding points
Global = very high or very low value relative to all values in the dataset
Collective (outlier group) = a collection of points that deviates significantly from the entire dataset, although the individual points are not global or local outliers

Outlier - Detection Methods

Graphical
Cluster based
Model based
Distance based
Supervised learning

Outlier - Detection Methods - Graphical

Plot the data & look for weird points (a human decides whether a value is an outlier)
e.g. box plot (1 variable at a time) & scatterplot (2 variables)
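
The box-plot check can also be approximated numerically. The sketch below assumes the common convention that points more than 1.5 × IQR beyond the quartiles are drawn as outliers; the factor 1.5, the function name `iqr_outliers`, and the sample data are illustrative assumptions, not part of the card.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside the box-plot fences (assumed rule: k * IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [10, 12, 11, 13, 12, 11, 10, 95]  # hypothetical measurements
print(iqr_outliers(data))
```

A human still has to decide whether a flagged value is a real outlier or a legitimate rare event.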

Outlier - Detection Methods - Cluster based

Cluster the data & find points that don't belong to any cluster
K-Means: points far away from any mean, or clusters with a small number of points
Density-based: points not assigned to any cluster
Hierarchical: points that take longer to join other groups

Outlier - Detection Methods - Model-based

Fit a probabilistic model -> outliers = examples with low probability (e.g. z-score = number of standard deviations away from the mean)
Con: mean & variance are sensitive to outliers (-> solution: use quantiles, or remove outliers sequentially)
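
A minimal sketch of why the sensitivity matters, assuming a cutoff of 3 for both scores. The robust variant swaps in the median and the median absolute deviation (MAD) as one possible quantile-based alternative; the function names, cutoff, and data are illustrative assumptions.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [v for v in values if sd > 0 and abs(v - mean) / sd > threshold]

def robust_outliers(values, threshold=3.0):
    """Same idea, but with median and MAD, which outliers barely move."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [v for v in values if mad > 0 and abs(v - med) / mad > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 95]  # hypothetical measurements
print(zscore_outliers(data))  # the extreme value inflates the sd and hides itself
print(robust_outliers(data))
```

With this small sample the value 95 pulls the mean and standard deviation so far that its own z-score stays below 3, while the median-based score flags it.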

Outlier - Detection Methods - Distance-based

Measure distances between points & flag distant points as outliers

Outlier - Detection Methods - Distance-based - Global Outliers

KNN:
compute the average distance from each point to its k nearest neighbors (KNNs)
flag the points that are farthest from their KNNs
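
The two KNN steps can be sketched as follows, assuming Euclidean distance and k = 2; the function name `knn_scores` and the sample points are illustrative.

```python
import math

def knn_scores(points, k=2):
    """Average Euclidean distance from each point to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]  # hypothetical data
scores = knn_scores(points, k=2)
# The point with the largest score is the strongest global outlier candidate.
most_outlying = max(range(len(points)), key=scores.__getitem__)
print(points[most_outlying], scores[most_outlying])
```

How many of the top-scoring points to treat as outliers (or which score cutoff to use) remains a judgment call.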

Outlier - Detection Methods - Distance-based - Local Outliers

x = (avg. distance of point “i” to its KNNs) / (avg. distance of the neighbors of “i” to their KNNs)
x > 1 = point is farther from its neighbors than they are from theirs, on average -> outlier
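
A minimal sketch of this ratio, assuming Euclidean distance and k = 2. The helper names and the data (one dense cluster, one sparse cluster, one point just outside the dense cluster) are illustrative assumptions.

```python
import math

def knn(points, i, k):
    """(distance, index) pairs for the k nearest neighbors of point i."""
    ranked = sorted((math.dist(points[i], q), j)
                    for j, q in enumerate(points) if j != i)
    return ranked[:k]

def avg_knn_dist(points, i, k):
    return sum(d for d, _ in knn(points, i, k)) / k

def local_ratio(points, i, k=2):
    """x = own average KNN distance / neighbors' average KNN distance."""
    own = avg_knn_dist(points, i, k)
    neighbors = [j for _, j in knn(points, i, k)]
    theirs = sum(avg_knn_dist(points, j, k) for j in neighbors) / k
    return own / theirs  # assumes no duplicate points, so theirs > 0

points = [(0, 0), (0, 1), (1, 0), (1, 1),          # dense cluster
          (3, 3),                                   # just outside the dense cluster
          (20, 20), (20, 30), (30, 20), (30, 30)]   # sparse cluster
print(local_ratio(points, 4))  # well above 1
print(local_ratio(points, 5))  # about 1
```

The point (3, 3) has a smaller raw KNN distance than the sparse-cluster points, so a global KNN score would rank it below them; the ratio exposes it because its neighbors sit much closer to each other.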

Feature Selection - Goal

selecting the features that are relevant for predicting yi from xi
tradeoff: don't lose information, but gain speed / save memory

Feature Selection Approaches

1. Association
2. Regression Weight
3. Search & Score Methods
4. Forward Selection

Feature Selection Approaches - Association

hypothesis testing for each feature to select
For each feature j: compute the correlation between the feature values xj and y
Select j if correlation > 0.9 or < -0.9 (i.e. |correlation| > 0.9)
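
A minimal sketch of the association approach, assuming Pearson correlation and the 0.9 cutoff from the card; the feature names and values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_by_correlation(features, y, threshold=0.9):
    """Keep each feature whose |correlation| with y exceeds the threshold."""
    return [name for name, xs in features.items()
            if abs(pearson(xs, y)) > threshold]

features = {  # hypothetical feature columns
    "dose": [1, 2, 3, 4, 5],
    "neg_dose": [5, 4, 3, 2, 1],
    "noise": [2, 9, 1, 8, 3],
}
y = [2.1, 3.9, 6.2, 8.1, 9.8]
print(select_by_correlation(features, y))
```

Each feature is scored on its own, so two redundant features (here `dose` and `neg_dose`) are both kept.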

Feature Selection Approaches - Association - Con

ignores variable interactions
-> includes irrelevant variables (e.g. Taco Tuesday -> sick)
-> excludes relevant variables (e.g. Coke + Peanuts -> sick)

Feature Selection Approaches - Search & Score Methods

Define a score function f(S) that measures the quality of a set of features “S”
sequentially compute the score for different features (and combinations of features)
select the best feature(s) / combination
Compute the cost of each combination including an L0-norm penalty: cost function = cost + lambda * ||w||_0 (||w||_0 = number of non-zero elements in the weight vector w) -> penalty for using many features

Feature Selection Approaches - Search & Score Methods - Con

large number of variable sets (2^d sets of variables for d variables) -> hard to find the best “S”
prone to false positives (overfitting)
solution: use complexity penalties (score = f(S) + lambda*k, with k = number of selected features; this is the L0 norm for linear models)
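
A minimal sketch of exhaustive search with a complexity penalty, assuming f(S) is a cost (lower is better). `toy_cost` is a hypothetical stand-in for a training error: features "a" and "b" are truly useful, while every other feature shaves off a tiny amount by overfitting.

```python
from itertools import combinations

def best_subset(features, cost, lam):
    """Try all 2^d subsets; return the one minimizing cost(S) + lam * |S|."""
    best, best_score = None, float("inf")
    for k in range(len(features) + 1):
        for subset in combinations(features, k):
            score = cost(set(subset)) + lam * k  # L0-style penalty on subset size
            if score < best_score:
                best, best_score = set(subset), score
    return best

def toy_cost(S):
    # Hypothetical training error, not a real model fit.
    return 10 - 4 * ("a" in S) - 4 * ("b" in S) - 0.1 * len(S - {"a", "b"})

features = ["a", "b", "c", "d"]
print(best_subset(features, toy_cost, lam=0.0))  # keeps the irrelevant features too
print(best_subset(features, toy_cost, lam=0.5))  # penalty removes them
```

The loop visits every subset, so it is only practical for small d; that cost is what motivates greedy procedures.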

Feature Selection Approaches - Forward Selection (greedy search procedure)

1. Start with an empty set of features S
2. For each possible feature j: compute the score of the features in S combined with j
3. If the best j doesn't improve the score: stop; otherwise add it to S & go back to step 2
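
The greedy procedure can be sketched as below, assuming the score is a cost to be minimized, so "improve" means "lower". `toy_cost` is a hypothetical penalized cost used only to make the example runnable.

```python
def forward_selection(features, cost):
    """Greedy forward selection: add the best feature until nothing improves."""
    selected = set()
    best_cost = cost(selected)
    while True:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Score S combined with each remaining feature j and take the best j.
        j = min(candidates, key=lambda f: cost(selected | {f}))
        new_cost = cost(selected | {j})
        if new_cost >= best_cost:  # best candidate does not improve the score
            break
        selected.add(j)
        best_cost = new_cost
    return selected

def toy_cost(S):
    # Hypothetical error plus a 0.5-per-feature complexity penalty.
    error = 10 - 4 * ("a" in S) - 4 * ("b" in S) - 0.1 * len(S - {"a", "b"})
    return error + 0.5 * len(S)

print(forward_selection(["a", "b", "c", "d"], toy_cost))
```

Because the search is greedy, it is not guaranteed to find the best subset, for example when two features only help in combination.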

Feature Selection Approaches - Forward Selection (greedy search procedure) - Pro

cheaper: O(d^2) score evaluations instead of 2^d
overfits less
fewer false positives (better than naïve methods)

Feature Selection Approaches - Backward Selection & Recursive feature elimination (RFE)

Start with all features
repeatedly remove the feature whose removal most decreases the cost (stagewise)
RFE (a type of backward selection): fit the parameters of a regression model, delete the feature with the smallest regression weight & repeat

Feature Selection Approaches - Regression Weight Approach

train a regression model (e.g. linear) on all features & select the relevant ones by looking at the weights (only use features where the absolute value of the weight > threshold)
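
A minimal sketch of the regression weight approach, assuming a linear model without intercept fitted by batch gradient descent and features on comparable scales (otherwise the weights are not comparable). The data, the threshold 0.5, and the function names are illustrative assumptions.

```python
def fit_linear(X, y, lr=0.1, epochs=2000):
    """Least-squares weights (no intercept) via batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for row, target in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, row)) - target
            for j in range(d):
                grad[j] += err * row[j] / n
        w = [wj - lr * g for wj, g in zip(w, grad)]
    return w

def select_by_weight(w, threshold):
    """Indices of features whose absolute weight exceeds the threshold."""
    return [j for j, wj in enumerate(w) if abs(wj) > threshold]

# Hypothetical data: y depends on features 0 and 1, feature 2 is irrelevant.
X = [(1, 0, 1), (0, 1, 1), (1, 1, 0), (-1, 0, 1),
     (0, -1, -1), (-1, -1, 0), (1, -1, 1), (-1, 1, -1)]
y = [2 * x0 - 3 * x1 for x0, x1, _ in X]

w = fit_linear(X, y)
print([round(wj, 3) for wj in w])
print(select_by_weight(w, threshold=0.5))
```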

Feature Selection Approaches - Univariate Approach

considers the relationship between each feature and the target variable independently
based on statistical significance (e.g. chi-squared test or ANOVA)

Recommender System - Goal

recommend items based on user ratings of sold items

Recommender System - Scenarios

- Recommend items given an item (e.g. Amazon)
- Recommend items given a user (e.g. Amazon / Netflix)
- Combination (personalized item-based recommendation)

User-Product Matrix

- each row = 1 user / customer
- each column = 1 product

Recommender System - Issues

- Diversity: how different are the recommendations?
- Persistence: how long should recommendations last?
- Trust: tell the user why a recommendation was made
- Freshness: people tend to get more excited about new / surprising things

Recommender System - Types

- Content Filtering
- Collaborative Filtering

Recommender System - Types - Content Filtering

- Assumes access to side info about items (e.g. Pandora)
- Supervised: extract features xi of user & item -> build a model that predicts the rating yi given xi & apply it

Recommender System - Types - Collaborative Filtering

- Doesn't assume access to side info about items (e.g. Netflix)
- Unsupervised: have labels yij (rating of user i for movie j) but no features
- Assumes personal tastes correlate (e.g. Laura likes movies A & B, Mike likes A -> Mike should also like B)

Recommender System - Types - Collaborative Filtering- Types

- Neighborhood: find neighbors with similar movie preferences -> recommend their movies
- Latent factor: assume movies & users live in a low-dimensional space describing their properties -> recommend a movie based on its proximity to the user
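
A minimal sketch of the neighborhood idea, assuming cosine similarity with unrated movies treated as 0 (one simple choice among several). Laura and Mike follow the example used in this deck; the user "Nina", the numeric ratings, and the function names are hypothetical.

```python
import math

ratings = {  # user -> {movie: rating}
    "Laura": {"A": 5, "B": 4},
    "Mike": {"A": 5},
    "Nina": {"A": 1, "C": 5},
}

def cosine(u, v):
    """Cosine similarity of two rating dicts; unrated movies count as 0."""
    dot = sum(u[m] * v[m] for m in u.keys() & v.keys())
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend(user, ratings):
    """Recommend the movies of the most similar user that `user` has not rated."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda name: cosine(ratings[user], ratings[name]))
    return sorted(m for m in ratings[nearest] if m not in ratings[user])

print(recommend("Mike", ratings))
```

Real systems usually average over several neighbors and weight them by similarity; a single nearest neighbor keeps the sketch short.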

Recommender System - Types - Collaborative Filtering- Types - Matrix Factorization

- Way to define a model + objective function (optimized with stochastic gradient descent)
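
A minimal sketch of unconstrained matrix factorization trained with stochastic gradient descent, assuming a squared-error objective with a small L2 penalty and the model y_ij ≈ U[i] · V[j]. The hyperparameters, the function names, and the rating triples are illustrative assumptions, not tuned values.

```python
import random

def predict(U, V, i, j):
    """Predicted rating: dot product of user i's and item j's latent factors."""
    return sum(u * v for u, v in zip(U[i], V[j]))

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.01,
              epochs=2000, seed=0):
    """Fit latent factors to observed (user, item, rating) triples with SGD."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    data = list(ratings)  # copy so shuffling does not reorder the caller's list
    for _ in range(epochs):
        rng.shuffle(data)
        for i, j, y in data:
            err = y - predict(U, V, i, j)
            for f in range(k):
                u, v = U[i][f], V[j][f]
                # Gradient step on squared error plus L2 penalty.
                U[i][f] += lr * (err * v - reg * u)
                V[j][f] += lr * (err * u - reg * v)
    return U, V

# Hypothetical observed ratings: (user index, item index, rating).
observed = [(0, 0, 5), (0, 1, 4), (1, 0, 5), (2, 0, 1), (2, 2, 5)]
U, V = factorize(observed, n_users=3, n_items=3)
print(round(predict(U, V, 1, 1), 2))  # estimate for an unobserved (user, item) pair
```

Only observed entries enter the updates; the unobserved cells of the user-product matrix are what the learned factors are then used to estimate.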

Recommender System - Types - Collaborative Filtering- Types - Matrix Factorization - Classes

- Unconstrained MF
- SVD
- Non-negative MF (e.g. Fb has a "Like" but no "Dislike" button)

Feature Selection Approaches - Regression Weight Approach - Con

- Collinearity problem: if 2 variables are similar, it's hard to say which one to keep

(34 cards)