S7: Outlier Detection, Feature Selection & Recommender System Flashcards

1
Q

Outlier - Definition

A
  • data object that deviates significantly from normal objects, as if it were generated by a different mechanism
  • detection can be supervised or unsupervised
2
Q

Outlier - issue

A
  • hard to define precisely
3
Q

Outlier - causes

A
  • Measurement error
  • Data entry error
  • Contamination of data from different sources
  • Rare event
4
Q

Outlier - Application

A
  • Data Cleaning
  • Security & Fraud detection
  • Detecting natural disasters
  • Astronomy
  • Genetics
5
Q

Local vs Global Outlier

A
  • Local = value within the normal range for the entire dataset, but unusually high or low relative to surrounding points
  • Global = very high or very low value relative to all values in the dataset
  • Collective = a collection of points that deviates significantly from the entire dataset, although the individual points are not global or local outliers (also called an outlier group)
6
Q

Outlier - Detection Methods

A
  • Graphical
  • Cluster based
  • Model based
  • Distance based
  • Supervised learning
7
Q

Outlier - Detection Methods - Graphical

A
  • Plot data & look for weird points (a human decides whether a value is an outlier)
  • e.g. Box Plot (1 variable at a time) & Scatterplot (2 variables)
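The box-plot rule above can be sketched in a few lines. This is a minimal sketch assuming the common 1.5×IQR whisker rule and a simple interpolated quantile; the function name and sample data are made up for illustration:

```python
def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (the box-plot whisker rule)."""
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        # linear interpolation between the two nearest order statistics
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

print(iqr_outliers([1, 2, 2, 3, 3, 3, 4, 4, 100]))  # only 100 falls outside the whiskers
```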
8
Q

Outlier - Detection Methods - Cluster based

A
  • Cluster data & find points that do not belong to any cluster
  • K-Means: points far from every mean, or clusters with a small number of points
  • Density-based: points not assigned to any cluster
  • Hierarchical: points that take long to join other groups
9
Q

Outlier - Detection Methods - Model-based

A
  • Fit probabilistic model -> outliers = examples with low probability (e.g. z-score = number of standard deviations away from the mean)
  • Con: mean & variance are sensitive to outliers (-> solution: use quantiles, or remove outliers sequentially)
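The z-score rule can be sketched as follows (function name and data are illustrative). Note the con from the card: a large outlier inflates the standard deviation and can mask itself, which is why the example uses a threshold of 2.5 rather than the textbook 3:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

data = [10, 11, 12] * 3 + [100]
print(zscore_outliers(data, threshold=2.5))  # [100]
```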
10
Q

Outlier - Detection Methods - Distance-based

A
  • Measure distances between points & flag the most distant points as outliers
11
Q

Outlier - Detection Methods - Distance-based - Global Outliers

A
  • KNN:
  • compute the average distance from each point to its k nearest neighbors
  • flag the points that are farthest from their KNNs
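A minimal sketch of the global kNN score, using 1-D points and absolute distance to keep it short (real data would be vectors with Euclidean distance); names and data are made up:

```python
def knn_outlier_scores(points, k=2):
    """Score each point by the average distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

pts = [1.0, 1.1, 1.2, 0.9, 10.0]
scores = knn_outlier_scores(pts, k=2)
# the last point gets by far the largest score -> global outlier
```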
12
Q

Outlier - Detection Methods - Distance-based - Local Outliers

A
  • x = (avg. distance of point "i" to its KNNs) / (avg. distance of the neighbors of "i" to their KNNs)
  • x > 1 = point is farther away than its neighbors are on average -> local outlier
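The ratio above can be sketched directly (again with 1-D toy points for brevity; assumes no duplicate points, otherwise the denominator can be zero):

```python
def local_outlier_ratio(points, k=2):
    """For each point: avg distance to its kNNs, divided by the average of that
    same quantity over the point's own k nearest neighbors (ratio > 1 -> local outlier)."""
    n = len(points)
    neigh, avg = [], []
    for i in range(n):
        order = sorted((abs(points[i] - points[j]), j) for j in range(n) if j != i)
        neigh.append([j for _, j in order[:k]])          # indices of i's kNNs
        avg.append(sum(d for d, _ in order[:k]) / k)     # avg distance to kNNs
    return [avg[i] / (sum(avg[j] for j in neigh[i]) / k) for i in range(n)]

# 0.9 sits inside the global range but away from its tight neighborhood
ratios = local_outlier_ratio([0.0, 0.1, 0.2, 0.9], k=2)
```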
13
Q

Feature Selection - Goal

A
  • selecting the features that are relevant for predicting yi from xi
  • tradeoff: don't lose information, but increase speed & reduce memory usage
14
Q

Feature Selection Approaches

A
    1. Association
    2. Regression Weight
    3. Search & Score Methods
    4. Forward Selection
14
Q

Feature Selection Approaches - Association

A
  • hypothesis testing for each feature to decide whether to select it
  • for each feature j: compute the correlation between feature values xj and y
  • select j if correlation > 0.9 or < -0.9
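A minimal sketch of the association approach: Pearson correlation per feature, keep features above the threshold. Function names and the tiny dataset are made up for illustration:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def select_by_association(X, y, threshold=0.9):
    """Keep feature j if |corr(x_j, y)| exceeds the threshold."""
    return [j for j in range(len(X[0]))
            if abs(pearson([row[j] for row in X], y)) > threshold]

# feature 0 tracks y perfectly, feature 1 is noise -> only feature 0 survives
X = [[1, 5], [2, 3], [3, 8], [4, 1]]
y = [1, 2, 3, 4]
print(select_by_association(X, y))  # [0]
```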
15
Q

Feature Selection Approaches - Association - Con

A
  • ignores variable interactions
  • -> includes irrelevant variables (e.g. Taco Tuesday -> sick)
  • -> excludes relevant variables (e.g. Coke + Peanuts -> sick)
16
Q

Feature Selection Approaches - Search & Score Methods

A
  • Define a score function f(S) that measures the quality of a set of features "S"
  • sequentially compute the score of different feature combinations
  • select the best feature combination
  • compute the cost of each combination including an L0-norm penalty: cost function = cost + lambda * ||w||_0 (||w||_0 = number of non-zero elements in the weight vector) -> penalizes using many features
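The exhaustive search-and-score loop can be sketched as below. The toy score function is hypothetical (feature "a" is very predictive, "b" barely, "c" not at all); note the 2^d subsets, which is exactly the con mentioned on the next card:

```python
from itertools import combinations

def search_and_score(features, score_fn, lam=0.1):
    """Score every subset S of the features; total cost = score_fn(S) + lam * |S|,
    where lam * |S| is the L0-norm penalty that discourages using many features."""
    best_set, best_cost = frozenset(), float("inf")
    for r in range(len(features) + 1):           # 2^d subsets in total
        for subset in combinations(features, r):
            cost = score_fn(set(subset)) + lam * len(subset)
            if cost < best_cost:
                best_set, best_cost = frozenset(subset), cost
    return best_set, best_cost

# hypothetical validation error: "a" helps a lot, "b" a little, "c" not at all
def toy_error(S):
    return 1.0 - 0.8 * ("a" in S) - 0.02 * ("b" in S)

best, cost = search_and_score(["a", "b", "c"], toy_error, lam=0.1)
# "b" and "c" are not worth their penalty, so only {"a"} survives
```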
17
Q

Feature Selection Approaches - Search & Score Methods - Con

A
  • large number of variable sets (2^d sets for d variables) -> hard to find the best "S"
  • prone to false positives (overfitting)
  • solution: use complexity penalties (score = f(S) + lambda * k, i.e. the L0 norm for linear models)
18
Q

Feature Selection Approaches - Forward Selection (greedy search procedure)

A
  • start with an empty set of features S
  • for each possible feature j: compute the score of S combined with j
  • if no j improves the score: stop; otherwise add the best j to S & repeat from step 2
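The greedy loop above can be sketched as follows, reusing a hypothetical score function (with an L0-style complexity penalty baked in, so adding a useless feature makes the score worse and the loop stops):

```python
def forward_selection(features, score_fn):
    """Greedy search: start with an empty set S; at each step add the single
    feature that most improves the score; stop as soon as no feature helps."""
    S, best = set(), score_fn(set())
    while True:
        candidates = [(score_fn(S | {j}), j) for j in features if j not in S]
        if not candidates:
            return S
        score, j = min(candidates)
        if score >= best:   # no candidate improves the score -> stop
            return S
        S.add(j)
        best = score

# hypothetical score = error + 0.1 * |S| complexity penalty
def toy_score(S):
    return 1.0 - 0.8 * ("a" in S) - 0.02 * ("b" in S) + 0.1 * len(S)

print(forward_selection(["a", "b", "c"], toy_score))  # {'a'}
```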
19
Q

Feature Selection Approaches - Forward Selection (greedy search procedure) - Pro

A
  • cheaper (O(d^2))
  • overfits less
  • fewer false positives (than naïve search & score)
20
Q

Feature Selection Approaches - Backward Selection & Recursive feature elimination (RFE)

A
  • start with all features
  • remove the feature whose removal most decreases the cost score (stagewise)
  • RFE (a type of backward selection): fit the parameters of a regression model, delete the feature with the smallest regression weight & repeat
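RFE can be sketched independently of any particular model by passing in a weight-fitting function. The fixed `fitted` weights below stand in for a real regression fit and are purely illustrative:

```python
def rfe(features, fit_weights, n_keep):
    """Recursive feature elimination: fit a model on the current feature set,
    drop the feature with the smallest absolute weight, refit, and repeat
    until only n_keep features remain."""
    S = set(features)
    while len(S) > n_keep:
        weights = fit_weights(S)  # {feature: fitted weight} for features in S
        weakest = min(S, key=lambda j: abs(weights[j]))
        S.remove(weakest)
    return S

# stand-in for a real model fit: pretend these weights come from a regression
fitted = {"a": 2.0, "b": -1.5, "c": 0.01}
keep = rfe(["a", "b", "c"], lambda S: {j: fitted[j] for j in S}, n_keep=2)
# "c" has a near-zero weight and is eliminated first
```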
21
Q

Feature Selection Approaches - Regression Weight Approach

A
  • a regression model (e.g. linear) is trained using all features & relevant features are selected by looking at the weights (only use features where the absolute value of the weight > threshold)
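The selection step is just a threshold over trained weights; the weight values below are hypothetical:

```python
def select_by_weight(weights, threshold=0.5):
    """Keep features whose trained absolute weight exceeds the threshold."""
    return sorted(j for j, w in weights.items() if abs(w) > threshold)

# hypothetical weights from a linear model trained on all features
w = {"age": 1.8, "noise1": 0.03, "income": -0.9, "noise2": -0.1}
print(select_by_weight(w))  # ['age', 'income']
```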
22
Q

Feature Selection Approaches - Univariate Approach

A
  • considers the relationship between each feature and the target variable independently
  • based on statistical significance (chi-squared test or ANOVA)
23
Q

Recommender System - Goal

A

recommend items based on user ratings of previously sold items

24
Q

Recommender System - Scenarios

A
  • Recommend items given item (e.g. Amazon)
  • Recommend items given user (e.g. Amazon / Netflix)
  • combination (personalized item-based recommendation)
25
Q

User-Product Matrix

A
  • each row = one user / customer
  • each column = one product
26
Q

Recommender System - Issues

A
  • Diversity: How different are the recommendations?
  • Persistence: How long should recommendations last?
  • Trust: tell the user why a recommendation was made
  • Freshness: people tend to get more excited about new / surprising things
27
Q

Recommender System - Types

A
  • Content Filtering
  • Collaborative Filtering
28
Q

Recommender System - Types - Content Filtering

A
  • assumes access to side info about items
  • e.g. Pandora
  • supervised: extract features xi of user & item -> build & apply a model predicting rating yi given xi
29
Q

Recommender System - Types - Collaborative Filtering

A
  • doesn't assume access to side info about items (e.g. Netflix)
  • unsupervised: have labels yij (rating of user i for movie j) but no features
  • assumes personal tastes correlate (e.g. Laura likes movies A & B, Mike likes A -> he should also like B)
30
Q

Recommender System - Types - Collaborative Filtering- Types

A
  • Neighborhood: find neighbors based on similar movie preferences -> recommend their movies
  • Latent Factor: assume movies & users live in a low-dimensional space describing their properties -> recommend a movie based on its proximity to the user
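The neighborhood idea can be sketched with cosine similarity over rows of the user-product matrix (0 = unrated); the matrix and function name are made up for illustration:

```python
def recommend(ratings, user, n=1):
    """Neighborhood collaborative filtering: find the most similar other user
    (cosine similarity of rating rows) and suggest the items they rated
    highest that `user` has not rated yet (0 = unrated)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    me = ratings[user]
    neighbor = max((u for u in range(len(ratings)) if u != user),
                   key=lambda u: cos(me, ratings[u]))
    unseen = [j for j in range(len(me)) if me[j] == 0]
    return sorted(unseen, key=lambda j: -ratings[neighbor][j])[:n]

# user 0 is most similar to user 1, who rated item 3 highly -> recommend item 3
R = [[5, 4, 0, 0],
     [5, 5, 1, 4],
     [1, 0, 5, 5]]
print(recommend(R, 0))  # [3]
```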
31
Q

Recommender System - Types - Collaborative Filtering- Types - Matrix Factorization

A
  • a way to define the model + objective function (optimized with stochastic gradient descent)
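A minimal sketch of unconstrained matrix factorization trained by SGD, assuming squared-error loss with L2 regularization; hyperparameters and the four toy ratings are illustrative, not tuned values from the course:

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.02, lam=0.01, epochs=2000):
    """Learn U (users x k) and V (items x k) so that U[i] . V[j] approximates
    each observed rating y_ij, by SGD on the regularized squared error."""
    random.seed(0)
    U = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for i, j, y in ratings:
            pred = sum(U[i][f] * V[j][f] for f in range(k))
            err = y - pred
            for f in range(k):
                u, v = U[i][f], V[j][f]
                U[i][f] += lr * (err * v - lam * u)  # step along -gradient
                V[j][f] += lr * (err * u - lam * v)
    return U, V

# four observed (user, item, rating) entries of a 2x2 user-item matrix
obs = [(0, 0, 5), (0, 1, 1), (1, 0, 1), (1, 1, 5)]
U, V = factorize(obs, n_users=2, n_items=2)
# after training, U[i] . V[j] reconstructs the observed ratings closely
```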
32
Q

Recommender System - Types - Collaborative Filtering- Types - Matrix Factorization - Classes

A
  • Unconstrained MF
  • SVD
  • Non-negative MF (e.g. Facebook has a "like" but no "dislike" button)
33
Q

Feature Selection Approaches - Regression Weight Approach - Con

A
  • collinearity problem: if 2 variables are similar, it's hard to say which one to keep