Lecture 7 - Outlier Detection, Feature Selection, Similar Items, Recommender Systems, Naive Bayes Classifiers, Class Imbalance Flashcards

1
Q

What is an outlier?

A

An outlier is a data object that deviates significantly from normal objects as if it were generated by a different mechanism.

2
Q

What can cause outliers?

A
  • Measurement errors
  • Data entry errors
  • Contamination of data from different sources
  • Rare events
3
Q

True or false: You should always try to remove outliers, since it makes the machine learning algorithm better

A

FALSE: If the number of outliers is small, then it’s generally okay to remove them.
However, if the number is large, you have to consider whether the outliers mean something before deciding whether it is okay to remove them

4
Q

What are some different methods to detect outliers?

A
  • Model-based
  • Graphical Approaches
  • Cluster-based
  • Distance-based
  • Supervised learning
5
Q

In model-based outlier detection, in broad terms, how do we detect outliers?

A
  1. We fit a probabilistic model
  2. Outliers are cases with low probability

Example:

  • Assume the data follow a normal distribution
  • The z-score for 1D data is given by zᵢ = (xᵢ − μ) / σ, where μ is the mean and σ the standard deviation; points with a large |zᵢ| are flagged as outliers (see the sketch below)
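A minimal Python sketch of this (the toy data and the |z| > 3 cutoff are illustrative choices, not fixed by the lecture):

import numpy as np

def zscore_outliers(x, threshold=3.0):
    # Model-based detection under a normality assumption:
    # flag points whose |z-score| exceeds the threshold.
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

x = np.array([10.2, 9.8, 10.0, 10.1, 9.9, 10.3, 9.7, 10.0, 10.1, 9.9, 50.0])
print(zscore_outliers(x))  # only the last point (50.0) is flagged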
6
Q

What is the difference between a global and a local outlier?

A

A data point is a global outlier when it is out of the normal data range (i.e., most points are in a data range, and this data point is far out).

But suppose we have two clusters of points, and in between these clusters there is a single standing point. We call this a local outlier because it is in the normal data range (there are more data points to its right and to its left), but it cannot necessarily be assigned to either of the clusters

7
Q

Explain the approach of the Graphical Outlier Detection

A

We plot the data and look for weird points

A human decides whether a data point is an outlier

8
Q

Can outliers be represented by groups?

A

Yes, they can. But remember: if the group has a relatively large number of data points, they might not be outliers; they might be describing an “unusual” event that is worth capturing in the data (so, do not remove them)

9
Q

What are some graphical representations that we can use to detect outliers by eye?

A

Boxplot - plots one variable at a time (look at outliers as single standing points, or analyze the interquartile range)

Scatterplot - plots two variables at a time (able to capture more complex patterns)

10
Q

How does cluster-based outlier detection work?

What are the main algorithms that can help in doing that?

A

Cluster the data and then find points that do not belong to any of the clusters.

  1. K-means: find points that are far away from every cluster mean (even though they were assigned to a cluster), or find clusters that have a small number of data points
  2. Density-based clustering: outliers are the points that were not assigned to any cluster (see the sketch below)
  3. Hierarchical clustering: outliers take longer to be merged into a group
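A minimal sketch of the density-based case, assuming scikit-learn: DBSCAN labels points it cannot assign to any cluster as -1. The eps/min_samples values and the toy data are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),   # cluster 1
               rng.normal(5, 0.3, size=(50, 2)),   # cluster 2
               [[2.5, 2.5]]])                      # a single point between the clusters

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(np.where(labels == -1)[0])  # the in-between point (index 100) is labeled noise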
11
Q

How does distance-based outlier detection (KNN outlier detection) work?

A

For each data point, compute the average distance to its k nearest neighbors.
Sort the resulting N average distances.
Flag the points with the largest values as outliers (see the sketch below).

Note: KNN-based detection is known to be particularly effective at finding global outliers
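A minimal sketch, assuming scikit-learn (k and the top-5% cutoff are illustrative):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # inliers
               [[8.0, 8.0]]])                     # a global outlier

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
dist, _ = nn.kneighbors(X)
scores = dist[:, 1:].mean(axis=1)                 # drop the self-distance, average the rest

outliers = np.argsort(scores)[-int(0.05 * len(X)):]  # points with the top 5% of scores
print(outliers)  # the appended point (index 100) appears here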

12
Q

How does supervised outlier detection work?

A

You train a classifier on a dataset that has a column saying whether each point is an outlier or not,
and then use that classifier to detect outliers in new data.

13
Q

When you want to find outliers graphically, you can analyze the IQR (interquartile range) in a box plot.

What is special about this IQR when the dataset has outliers?

A

Even when a dataset has outliers, the interquartile range is still able to summarize the variability in the data, because it depends only on the middle 50% of the values (it is robust to outliers).

IQR = Q3 - Q1

Points below Q1 - 1.5·IQR or above Q3 + 1.5·IQR are commonly flagged as outliers (this is the boxplot whisker rule; see the sketch below).
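A minimal Python sketch of this rule (the toy data are illustrative):

import numpy as np

x = np.array([3, 4, 5, 5, 6, 6, 7, 8, 30])
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the usual boxplot fences
print(x[(x < lo) | (x > hi)])             # -> [30]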

14
Q

What are the advantages and disadvantages to supervised outlier detection?

A

Advantage:
- Can find very complicated outlier patterns

Disadvantages:

  • Is supervised, i.e., it needs a column labeled “outlier”
  • Cannot detect new “types” of outliers
15
Q

How can we define the process of “Feature Selection”?

A

Feature selection is the process of selecting the features that are “relevant” for predicting the target variable (i.e., features that have a strong relationship with the target)

16
Q

Name the different approaches to feature selection.

A
Association approach
Regression weight approach
Search and Score methods
Forward Selection
Backward Selection
Recursive Feature Elimination
17
Q

Explain how the Association approach of feature selection works.

A
  1. For each feature, compute the correlation between the feature values and the target value y
  2. Declare the feature relevant if the correlation is above or below some threshold (0.9 and -0.9, for example); a sketch follows below
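A minimal sketch, assuming pandas (the column names, toy data, and 0.9 threshold are illustrative):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"f1": rng.normal(size=200),
                   "f2": rng.normal(size=200),
                   "f3": rng.normal(size=200)})
y = 2 * df["f1"] + 0.1 * rng.normal(size=200)   # y depends (almost) only on f1

corr = df.corrwith(y)                           # correlation of each feature with y
relevant = corr[corr.abs() > 0.9].index.tolist()
print(relevant)  # -> ['f1']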
18
Q

True or False. The Association approach to feature selection is basically a sequential “hypothesis testing” process on the correlations between the variables.

A

True

19
Q

What are some downsides of the Association approach in Feature Selection?

A

It can give bad results as it ignores variable interactions

“Taco Tuesday” and “Coke and Mentos” example

20
Q

Explain how the “Regression Weight” approach of Feature Selection works.

A
  1. Fit regression weights based on all features (for example, with the least-squares method)
  2. Keep all the features whose weight (in absolute value) is higher than some threshold (see the sketch below)
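A minimal sketch, assuming scikit-learn (the toy data and the 0.5 threshold are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.normal(size=200)

w = LinearRegression().fit(X, y).coef_   # least-squares weights on all features
print(np.where(np.abs(w) > 0.5)[0])      # kept features -> [0 2]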
21
Q

Name one advantage and one disadvantage of using the Regression Weight approach in Feature Selection

A

Advantage: can solve the issue of variable interactions that is present in the Association approach

Disadvantage: has issues with collinearity

22
Q

Explain the process of the “Search and Score” method in Feature Selection.

A
  1. Define a score function that measures the quality of a set of features
  2. Search for the set with the best score

The candidate sets are all possible subsets of the features in the dataset (including the empty set, i.e., no features); a brute-force version is sketched below
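A brute-force sketch, assuming scikit-learn; the linear model and cross-validated R² score are illustrative choices, and this exhaustive search is only feasible for small d:

import numpy as np
from itertools import chain, combinations
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 1] + 0.1 * rng.normal(size=100)

d = X.shape[1]
all_subsets = chain.from_iterable(combinations(range(d), r) for r in range(d + 1))

def score(S):
    if not S:                # empty set: no features, define its score as 0
        return 0.0
    return cross_val_score(LinearRegression(), X[:, list(S)], y, cv=5).mean()

best = max(all_subsets, key=score)
print(best)  # most likely (1,), the only truly relevant feature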

23
Q

How do you choose the score function in the “Search and Score” method in Feature Selection?

A

Using the training error as the score is a poor choice, because it always improves as more features are added (the full feature set would always win). Common choices are a validation or cross-validation score, or a complexity-penalized score, e.g. the training error plus a penalty on the number of features (as in AIC/BIC).

24
Q

How does Forward Selection (in Feature Selection) work?

A
  1. Compute the score with no features
  2. Add each candidate feature (one at a time) and compute the score
  3. Keep the single feature that gives the best score
  4. Now try adding each of the remaining features to the current set (one at a time)
  5. Compute the score and keep the best addition
    and so on…
    STOP when no single feature addition improves the score (see the sketch below)
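A minimal sketch, assuming scikit-learn (the linear model and cross-validated R² score are illustrative choices):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y):
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # score the model obtained by adding each remaining feature
        scores = {j: cross_val_score(LinearRegression(),
                                     X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:   # no single addition improves the score
            break
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=200)
print(forward_selection(X, y))  # most likely [0, 3]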
25
Q

What is the downside of Forward Selection, and why should we use it despite this downside?

A

It might not find the best set of variables

But it is cheaper, overfits less, and has fewer false positives

26
Q

How does Backward Selection differ from Forward Selection?

A

It starts by computing the score of the set of all features, and then takes out one variable at a time to see whether the score improves (it removes the variable whose removal yields the largest score improvement)

27
Q

How is user-product matrix used in Recommender Systems?

A

Columns - products
Rows - users

An entry is 1 if the user has purchased that product, and 0 otherwise

28
Q

What algorithm can be used to find similar products (in terms of the users that purchased them) in a user-product matrix?

A

KNN
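A minimal sketch, assuming scikit-learn: each column of the user-product matrix is a product’s purchase vector, so we run KNN over the columns. The tiny matrix, k, and cosine distance are illustrative:

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[1, 1, 0, 0],    # rows: users; columns: products
              [1, 1, 0, 1],
              [0, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])

products = X.T                                   # one row per product
nn = NearestNeighbors(n_neighbors=3, metric='cosine').fit(products)
_, idx = nn.kneighbors(products[0:1])            # neighbors of product 0
print(idx[0][1:])  # most similar products to product 0, excluding itself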

29
Q

What is the “closest point” problem (which appears in KNN, K-means, density-based clustering, and also in recommender systems)?

A

“closest point” problem: suppose there are new products that would be much better to recommend, but they look like outliers because few people have bought them so far

How does this relate to KNN, K-means, and density-based clustering?

30
Q

What are the two types of Recommender Systems? What is the difference between them?

A

Content filtering: assumes access to side information about items (supervised learning)
e.g., Gmail’s “important messages”

Collaborative filtering: does not assume access to side information about items (unsupervised learning)
e.g., Netflix: if Alice likes A and B, and Bob likes A, then Bob is likely to like B as well

(Does not really work for new users, since we don’t know their preferences yet)

31
Q

What are the two types of Recommender Systems?

A

Content filtering and collaborative filtering

32
Q

What are the two methods used in Collaborative Filtering Recommender Systems? Explain them.

A

Neighborhood method: find neighbors based on similarity of movie preferences, then recommend movies that those neighbors watched

Latent Factor method: assume that both users and items live in a low-dimensional space describing their properties, then recommend items based on their proximity to the user in the latent space

33
Q

True or False. Matrix Factorization is a way to define the model of Collaborative Filtering.

A

True
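A minimal sketch of matrix factorization for collaborative filtering: approximate the ratings matrix R by U @ V.T, learning U (users × k) and V (items × k) by gradient descent on the observed entries only. The rank k, learning rate, and toy ratings are illustrative:

import numpy as np

R = np.array([[5, 4, 0, 1],
              [4, 5, 0, 0],
              [0, 0, 5, 4],
              [1, 0, 4, 5]], dtype=float)
mask = R > 0                      # which entries were actually observed

k, lr, lam = 2, 0.01, 0.1
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(R.shape[0], k))
V = rng.normal(scale=0.1, size=(R.shape[1], k))

for _ in range(5000):
    E = mask * (R - U @ V.T)      # error on observed entries only
    U += lr * (E @ V - lam * U)   # gradient step with L2 regularization
    V += lr * (E.T @ U - lam * V)

print(np.round(U @ V.T, 1))       # also fills in the unobserved (0) entries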

34
Q

Bayes Classifier is built on … that is based on …

A

…Bayes’ theorem… …conditional probability…

35
Q

What kind of model is Bayes Classifier?

A

Bayes Classifier is a supervised probabilistic model that makes the most probable predictions for new examples

36
Q

What other model is used to solve the issue that the Bayes classifier assumes the input variables depend on each other?

A

Naive Bayes classifier,

which assumes conditional independence of the input variables (given the class)

37
Q

What question does Naive Bayes classifier answer?

A

What is the most probable classification of a new instance given the training data?

38
Q

Explain how Naive Bayes Classifier can be used in Spam Filtering.

A

Collect many emails and get users to label them (spam / not spam)

Extract features of the emails (like bag of words) and create a column for each word:

1 if the word appears in the email, 0 if it does not

Use the GaussianNB function in Python to fit the model (see notes in Notion for more details)
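A minimal sketch of that pipeline on a made-up toy dataset (the word columns and labels are invented; the card mentions GaussianNB, though BernoulliNB is arguably a more natural fit for 0/1 word features):

import numpy as np
from sklearn.naive_bayes import GaussianNB

#                free  money  meeting  report
X = np.array([[1, 1, 0, 0],    # spam
              [1, 1, 0, 1],    # spam
              [0, 0, 1, 1],    # not spam
              [0, 1, 1, 1],    # not spam
              [1, 0, 0, 0]])   # spam
y = np.array([1, 1, 0, 0, 1])  # 1 = spam, 0 = not spam

model = GaussianNB().fit(X, y)
print(model.predict([[1, 1, 1, 0]]))  # classify a new email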

39
Q

Name two techniques that deal with oversampling and one technique that deals with undersampling

A

Oversampling: SMOTE, ADASYN
Undersampling: cluster-based undersampling

40
Q

What is the purpose of conducting over/undersampling?

A

To fix class imbalance, i.e., to bring the ratio between the classes closer to 1:1

41
Q

What requirement do we have to satisfy before oversampling (synthetically generating data) for the process to work well?

A

The data used to generate the synthetic samples should be clean: little noise, few outliers, and few missing values, so that the model can form a good sense of the data it should add

42
Q

How does the SMOTE algorithm work?

A
  1. For each minority-class sample, find its k nearest neighbors (within the minority class)
  2. Select one of those neighbors at random
  3. Create the new sample as: new = original + gap × (neighbor − original), where gap is drawn uniformly from (0, 1)
  4. Add the new sample to the minority class (see the sketch below)

A new, more balanced dataset is created
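A minimal by-hand sketch, assuming scikit-learn for the neighbor search (in practice the imbalanced-learn library’s SMOTE class does this for you; k and the sample counts are illustrative):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # a random minority sample
        j = idx[i, rng.integers(1, k + 1)]  # one of its k minority neighbors
        gap = rng.random()                  # uniform in (0, 1)
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(new)

X_min = np.random.default_rng(1).normal(size=(20, 2))  # the minority class
print(smote(X_min, n_new=30).shape)  # (30, 2) synthetic minority samples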

43
Q

What is the complexity of a Forward Selection algorithm?

A

O(d^2) - quadratic

FS starts by choosing one variable and testing the performance of the model with that one variable only. In this process of choosing, it needs to compute a score for the model containing each variable, one at a time (d times, if there are d features). Assume it chooses one variable.

Then it needs to choose another one. This means it tests again by adding each remaining variable to the model ((d − 1) times, because it already chose one and cannot choose the same one again). Assume it chooses another variable, and so on.

After this whole process, it means that, in the worst case, the model was tested

d + (d − 1) + (d − 2) + … + 2 + 1 times. This equals d(d + 1)/2, which equals (d^2 + d)/2.

This means that the complexity of the algorithm is O(d^2)
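A quick check of the fit count in Python (d = 10 is an arbitrary example):

d = 10
fits = sum(d - i for i in range(d))  # d + (d-1) + ... + 1
print(fits, d * (d + 1) // 2)        # both print 55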

44
Q

What is the complexity of a Search and Score algorithm?

A

O(2^d) - exponential

Because Search and Score analyzes all possible subsets that can be formed from the variables.

Let’s assume we have the following variables: {a, b, c}
The algorithm tests the performance of the model with each of the following sets:
{ }, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}
8 = 2^3

2^d is the total number of subsets you can build from a variable set containing d elements.

Therefore, if we want to use Search and Score to select features, the algorithm tests the model 2^d times.

So, the complexity of Search and Score is O(2^d)
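A quick check of the 2^d count, enumerating every subset of {a, b, c}:

from itertools import chain, combinations

features = ["a", "b", "c"]
subsets = list(chain.from_iterable(combinations(features, r)
                                   for r in range(len(features) + 1)))
print(subsets)       # (), ('a',), ('b',), ('c',), ('a', 'b'), ..., ('a', 'b', 'c')
print(len(subsets))  # 8 == 2**3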
