! S7: Outlier Detection, Feature Selection & Recommender System Flashcards
Outlier - Definition
- data object that deviates significantly from normal objects, as if it were generated by a different mechanism
- detection can be supervised or unsupervised
Outlier - issue
- hard to define precisely
Outlier - causes
- Measurement error
- Data entry error
- Contamination of data from different sources
- Rare event
Outlier - Application
- Data Cleaning
- Security & Fraud detection
- Detecting natural disasters
- Astronomy
- Genetics
Local vs Global vs Collective Outlier
- Local = value within the range of the entire dataset, but unusually high or low relative to surrounding points
- Global = very high or very low value relative to all values in the dataset
- Collective (outlier group) = a collection of points that deviates significantly from the entire dataset, although the individual points are neither global nor local outliers
Outlier - Detection Methods
- Graphical
- Cluster based
- Model based
- Distance based
- Supervised learning
Outlier - Detection Methods - Graphical
- Plot data & look for weird points (a human decides whether a value = outlier)
- e.g. Box Plot (1 variable at a time) & Scatterplot (2 variables)
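A box plot usually draws its whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles, and points outside the whiskers are the ones a human would inspect. A minimal sketch of that rule, assuming the common 1.5 factor and a made-up sample (neither comes from the cards):

```python
import statistics

def iqr_outliers(values, whisker=1.5):
    """Return the values that fall outside the box-plot whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # the three quartiles
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]

data = [10, 12, 11, 13, 12, 11, 14, 40]  # made-up sample
print(iqr_outliers(data))
```

The rule only suggests candidates; whether 40 is a real outlier is still a judgment call.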
Outlier - Detection Methods - Cluster based
- Cluster data & find points without cluster
- K-Means: Points far away from any mean, or clusters with a small number of points
- Density-based: Points not assigned to any cluster
- Hierarchical: Points that take longer to join other groups (they merge late in the hierarchy)
Outlier - Detection Methods - Model-based
- Fit probabilistic model -> outliers = examples with low probability (e.g. z-score = nr of sd away from mean)
- Con: mean & variance are sensitive to outliers (-> solution: use quantiles, or sequentially remove outliers)
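The z-score idea and its weakness can be sketched as follows. The sample, the threshold of 2 and the quantile-based score (distance from the median in units of the IQR) are illustrative assumptions, not part of the cards:

```python
import statistics

def z_scores(values):
    """Number of standard deviations each value lies from the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def robust_scores(values):
    """Quantile-based variant: distance from the median in units of the IQR."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    return [(v - median) / (q3 - q1) for v in values]

data = [10, 12, 11, 13, 12, 11, 14, 40]  # made-up sample
threshold = 2.0                           # assumed cut-off
flagged = [v for v, z in zip(data, z_scores(data)) if abs(z) > threshold]
print(flagged)
print(max(z_scores(data)), max(robust_scores(data)))
```

The value 40 inflates the mean and the standard deviation, so its own z-score stays below 3. The quantile-based score is barely affected by it and separates it much more clearly.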
Outlier - Detection Methods - Distance-based
- Measure distances between points & flag points that are far from the others as outliers
Outlier - Detection Methods - Distance-based - Global Outliers
- KNN:
- compute average distance for each point to its KNN
- flag the points that are farthest from their KNNs
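The two KNN steps can be sketched like this, assuming Euclidean distance, k = 2 and a made-up set of 2-D points:

```python
import math

def knn_avg_distances(points, k):
    """Average distance from each point to its k nearest neighbours."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(sum(dists[:k]) / k)
    return result

points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]  # made-up data
scores = knn_avg_distances(points, k=2)

# Rank points by score; the largest average distance is the strongest candidate.
ranked = sorted(zip(scores, points), reverse=True)
most_outlying = ranked[0][1]
print(most_outlying)
```

How many of the top-ranked points count as outliers is a choice left to the user.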
Outlier - Detection Methods - Distance-based - Local Outliers
- x = (Avg. distance of point “i” to its KNNs) / (Avg. distance of neighbors of “i” to their KNNs)
- x > 1 = point is farther away from its neighbors than they are from theirs on average -> outlier
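A small sketch of the ratio x, assuming Euclidean distance, k = 2 and made-up 1-D data with a dense cluster, a sparse cluster and one point sitting just outside the dense cluster:

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of point i (excluding i itself)."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]

def avg_knn_distance(points, i, k):
    return sum(math.dist(points[i], points[j]) for j in knn(points, i, k)) / k

def local_outlier_ratios(points, k):
    """x_i = avg KNN distance of i / mean avg KNN distance of i's neighbours."""
    avg = [avg_knn_distance(points, i, k) for i in range(len(points))]
    ratios = []
    for i in range(len(points)):
        neighbours = knn(points, i, k)
        ratios.append(avg[i] / (sum(avg[j] for j in neighbours) / k))
    return ratios

# Dense cluster near 0, sparse cluster from 10 to 16, and the point 1.5.
points = [(0.0,), (0.1,), (0.2,), (0.3,), (1.5,),
          (10.0,), (12.0,), (14.0,), (16.0,)]
ratios = local_outlier_ratios(points, k=2)
print(points[ratios.index(max(ratios))])
```

The point at 1.5 has a smaller average KNN distance than the points of the sparse cluster, so a global ranking would not put it first, but its ratio is by far the largest. Points on the edge of a cluster also get ratios slightly above 1, so in practice a cut-off somewhat larger than 1 may work better (an observation on this toy data, not a rule from the cards).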
Feature Selection - Goal
- selecting features that are relevant to predict yi from xi
- tradeoff: don't lose info, but gain speed / save memory space
Feature Selection Approaches
- Association
- Regression Weight
- Search & Score Methods
- Forward Selection
Feature Selection Approaches - Association
- simple hypothesis test per feature to decide whether to select it
- For each feature j: Compute correlation between feature values xj and y
- Select j if correlation > 0.9 or < -0.9
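The association approach can be sketched with a hand-written Pearson correlation. The feature names `x1`, `x2` and all values are made up for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

features = {            # hypothetical feature columns
    "x1": [1, 2, 3, 4, 5],
    "x2": [5, 1, 4, 2, 3],
}
y = [2, 4, 6, 8, 10.5]  # hypothetical target

# Keep a feature only if its correlation with y is above 0.9 or below -0.9.
selected = [name for name, col in features.items()
            if abs(pearson(col, y)) > 0.9]
print(selected)
```

Each feature is tested on its own, so a feature that only matters in combination with another one would not be selected by this procedure.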
Feature Selection Approaches - Association - Con
- ignores variable interactions
- -> includes irrelevant variables (e.g. Taco Tuesday -> sick)
- -> excludes relevant variables (e.g. Coke + Peanuts -> sick)
Feature Selection Approaches - Search & Score Methods
- Define score function f(S) that measures quality of set of features “S”
- sequentially compute score of different features (combinations)
- select best feature(s) (combination)
- Compute cost of each combination including an L0-norm penalty: score = cost + lambda*||w||_0 (||w||_0 = number of non-zero elements in weight vector w) -> penalty for many features
Feature Selection Approaches - Search & Score Methods - Con
- large number of variable sets (2^d sets of variables for d variables) -> hard to find best “S”
- prone to false positives (overfitting)
- solution: use complexity penalties (score = f(S) + lambda*k, with k = number of selected features; L0 norm for linear models)
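A sketch of exhaustive search & score with a complexity penalty. The `toy_error` function is a hypothetical stand-in for a real training error: it pretends that features 0 and 2 are relevant and that each irrelevant feature still lowers the training error slightly, which is how false positives arise.

```python
from itertools import combinations

def best_subset(d, error, lam):
    """Try all 2^d feature subsets; score = error(S) + lam * |S| (lower is better)."""
    best_S, best_score = None, float("inf")
    for k in range(d + 1):
        for S in combinations(range(d), k):
            score = error(S) + lam * k
            if score < best_score:
                best_S, best_score = S, score
    return best_S, best_score

def toy_error(S):
    """Hypothetical training error, made up for illustration."""
    relevant = {0, 2}
    err = 10.0
    for j in S:
        err -= 4.0 if j in relevant else 0.1
    return err

print(best_subset(4, toy_error, lam=0.0))  # no penalty
print(best_subset(4, toy_error, lam=1.0))  # with penalty
```

Without the penalty the search keeps every feature, because each one reduces the training error a little. With lambda = 1 only the two relevant features are worth their cost. The loop visits all 2^d subsets, so it is only feasible for small d.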
Feature Selection Approaches - Forward Selection (greedy search procedure)
- 1. Start with empty set of features S
- 2. For each possible feature j: compute score of the features in S combined with j
- 3. If no j improves the score: stop; otherwise: add the best j to S & go to 2. again
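A sketch of forward selection for a linear model, assuming the score is the least-squares training error plus lambda times the number of selected features. The tiny solver, the data and lambda = 1 are illustrative assumptions. The toy columns are orthogonal, and y depends strongly on features 0 and 2 and only very weakly on 1 and 3.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

def rss(X, y, S):
    """Squared training error of least squares using only the columns in S."""
    if not S:
        return sum(yi * yi for yi in y)
    cols = [[row[j] for row in X] for j in S]
    A = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    b = [sum(a * yi for a, yi in zip(ci, y)) for ci in cols]
    w = solve(A, b)
    preds = [sum(wj * col[i] for wj, col in zip(w, cols)) for i in range(len(y))]
    return sum((p - yi) ** 2 for p, yi in zip(preds, y))

def forward_selection(X, y, lam):
    d = len(X[0])
    S, best = [], rss(X, y, [])            # step 1: empty set
    while len(S) < d:
        # step 2: score every feature not yet in S when added to S
        candidates = [(rss(X, y, S + [j]) + lam * (len(S) + 1), j)
                      for j in range(d) if j not in S]
        score, j = min(candidates)
        if score >= best:                  # step 3: no improvement -> stop
            break
        S.append(j)
        best = score
    return S, best

X = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
y = [5.15, 4.85, 1.05, 0.95]  # = 3*x0 + 0.1*x1 + 2*x2 + 0.05*x3
selected, score = forward_selection(X, y, lam=1.0)
print(selected)
```

Each pass evaluates at most d candidates and there are at most d passes, which gives the O(d^2) score evaluations instead of 2^d.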
Feature Selection Approaches - Forward Selection (greedy search procedure) - Pro
- cheaper (O(d^2) score evaluations instead of 2^d)
- overfits less
- fewer false positives (better than naïve methods)
Feature Selection Approaches - Backward Selection & Recursive feature elimination (RFE)
- Start with all features
- remove the feature whose removal most decreases the cost score (stagewise)
- RFE (type of backward selection): fit parameters of regression model, delete the feature with the smallest absolute regression weight & repeat
Feature Selection Approaches - Regression Weight Approach
- regression model (e.g. linear) trained using all features & relevant features selected by looking at the weights (only use features where absolute value of weight > threshold)
Feature Selection Approaches - Univariate Approach
- considers the relationship btw each feature and the target variable independently
- based on statistical significance (chi-squared test or ANOVA)
Recommender System - Goal
- recommend items to users based on user ratings of sold items
Recommender System - Scenarios
- Recommend items given item (e.g. Amazon)
- Recommend items given user (e.g. Amazon / Netflix)
- combination (personalized item-based recommendation)
User-Product Matrix
- each row = one user / customer
- each column = 1 product
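A minimal sketch of building such a matrix from rating records. The users, products and ratings are made up, and `None` marks a product the user has not rated:

```python
# (user, product, rating) triples, hypothetical data
ratings = [("laura", "A", 5), ("laura", "B", 4), ("mike", "A", 5), ("sam", "C", 2)]

users = sorted({u for u, _, _ in ratings})      # one row per user
products = sorted({p for _, p, _ in ratings})   # one column per product

matrix = [[None] * len(products) for _ in users]
for user, product, rating in ratings:
    matrix[users.index(user)][products.index(product)] = rating

for user, row in zip(users, matrix):
    print(user, row)
```

Most entries are usually missing, since each user rates only a few products.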
Recommender System - Issues
- Diversity: How different are recommendations?
- Persistence: How long should recommendations last?
- Trust: Tell user why a recommendation was made
- Freshness: people tend to get more excited about new / surprising things
Recommender System - Types
- Content Filtering
- Collaborative Filtering
Recommender System - Types - Content Filtering
- Assumes access to side info about items
- e.g. Pandora
- Supervised: extract features xi of user & item -> build model that predicts rating yi given xi & apply it
Recommender System - Types - Collaborative Filtering
- Doesn't assume access to side info about items (e.g. Netflix)
- Unsupervised: have labels yij (rating of user i for movie j) but not features
- assumes personal tastes correlate (e.g. Laura likes movie A & B, Mike likes A -> should also like B)
Recommender System - Types - Collaborative Filtering- Types
- Neighborhood: Find neighbors based on similar movie preferences -> recommend their movies
- Latent Factor: assume movies & users live in a low-dimensional space describing their properties -> recommend movie based on its proximity to user
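The neighborhood idea can be sketched with the Laura / Mike example. The ratings, the third user `sam`, the use of cosine similarity over co-rated movies and the "rated 4 or higher" rule are all illustrative assumptions:

```python
import math

ratings = {  # hypothetical ratings: user -> {movie: rating}
    "laura": {"A": 5, "B": 5, "C": 1},
    "mike": {"A": 5, "C": 1},
    "sam": {"A": 1, "C": 5, "D": 5},
}

def similarity(u, v):
    """Cosine similarity over the movies that both users rated."""
    common = ratings[u].keys() & ratings[v].keys()
    if not common:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in common)
    nu = math.sqrt(sum(ratings[u][m] ** 2 for m in common))
    nv = math.sqrt(sum(ratings[v][m] ** 2 for m in common))
    return dot / (nu * nv)

def recommend(user, min_rating=4):
    """Recommend movies the most similar user liked and `user` has not rated."""
    neighbour = max((v for v in ratings if v != user),
                    key=lambda v: similarity(user, v))
    return sorted(m for m, r in ratings[neighbour].items()
                  if r >= min_rating and m not in ratings[user])

print(recommend("mike"))
```

Mike agrees with Laura on the movies both rated, so Laura becomes his neighbor and her well-rated movie B is recommended to him.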
Recommender System - Types - Collaborative Filtering- Types - Matrix Factorization
- way to define model + objective function (optimized with stochastic gradient descent)
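A rough sketch of unconstrained matrix factorization trained with stochastic gradient descent. The model predicts a rating as the dot product of a user vector and an item vector, and the objective is the squared error on the known ratings plus a small L2 penalty. The data, the number of factors `k`, the learning rate, the penalty weight and the epoch count are all assumed values for illustration:

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.01, epochs=500, seed=0):
    """Learn user factors U and item factors V so that U[i] . V[j] ~ rating."""
    rng = random.Random(seed)
    U = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(0.0, 0.5) for _ in range(k)] for _ in range(n_items)]
    ratings = list(ratings)
    for _ in range(epochs):
        rng.shuffle(ratings)               # visit known ratings in random order
        for i, j, r in ratings:
            err = r - sum(a * b for a, b in zip(U[i], V[j]))
            for f in range(k):
                u, v = U[i][f], V[j][f]
                U[i][f] += lr * (err * v - reg * u)   # gradient step on user factor
                V[j][f] += lr * (err * u - reg * v)   # gradient step on item factor
    return U, V

def squared_error(ratings, U, V):
    return sum((r - sum(a * b for a, b in zip(U[i], V[j]))) ** 2
               for i, j, r in ratings)

# (user index, item index, rating), hypothetical data
ratings = [(0, 0, 5), (0, 1, 4), (0, 3, 1), (1, 0, 5), (1, 2, 1),
           (2, 1, 1), (2, 2, 5), (2, 3, 4), (3, 0, 1), (3, 3, 5)]

U0, V0 = factorize(ratings, 4, 4, epochs=0)   # untrained starting point
U, V = factorize(ratings, 4, 4)
print(squared_error(ratings, U0, V0), squared_error(ratings, U, V))
```

After training, a missing entry such as user 1 / item 1 can be predicted as the dot product of `U[1]` and `V[1]`.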
Recommender System - Types - Collaborative Filtering- Types - Matrix Factorization - Classes
- Unconstrained MF
- SVD
- Non-negative MF (e.g. Fb has “like” but no “Dislike” button)
Feature Selection Approaches - Regression Weight Approach - Con
- collinearity problem: if 2 variables are similar, it's hard to say which one to keep