Lecture 7 - Outlier Detection, Feature Selection, Similar Items, Recommender Systems, Naive Bayes Classifiers, Class Imbalance Flashcards

1
Q

What is an outlier?

A

An outlier is a data object that deviates significantly from normal objects as if it were generated by a different mechanism.

2
Q

What can cause outliers?

A
  • Measurement errors
  • Data entry errors
  • Contamination of data from different sources
  • Rare events
3
Q

True or false: You should always try to remove outliers, since it makes the machine learning algorithm better

A

FALSE: If the number of outliers is small, then it’s generally okay to remove them.
However, if the number is large, you have to consider whether the outliers mean something before deciding whether it is okay to remove them

4
Q

What are some different methods to detect outliers?

A
  • Model-based
  • Graphical Approaches
  • Cluster-based
  • Distance-based
  • Supervised learning
5
Q

In model-based outlier detection, in broad terms, how do we detect outliers?

A
  1. We fit a probabilistic model
  2. Outliers are cases with low probability

Example:

  • Assume the data follow a normal distribution
  • The z-score for 1D data is given by zᵢ = (xᵢ − μ) / σ, where μ is the mean and σ the standard deviation; points with a large |zᵢ| are flagged as outliers (see the sketch below)
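A minimal Python sketch of this (the toy data and the |z| > 3 cutoff are illustrative choices, not fixed by the lecture):

import numpy as np

def zscore_outliers(x, threshold=3.0):
    # Model-based detection under a normality assumption:
    # flag points whose |z-score| exceeds the threshold.
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

x = np.array([10.2, 9.8, 10.0, 10.1, 9.9, 10.3, 9.7, 10.0, 10.1, 9.9, 50.0])
print(zscore_outliers(x))  # only the last point (50.0) is flagged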
6
Q

What is the difference between a global and a local outlier?

A

A data point is a global outlier when it is out of the normal data range (i.e., most points are in a data range, and this data point is far out).

But suppose we have two clusters of points, and in between these clusters there is a single standing point. We call this a local outlier because it is in the normal data range (there are more data points to its right and to its left), but it cannot necessarily be assigned to either of the clusters

7
Q

Explain the approach of the Graphical Outlier Detection

A

We plot the data and look for weird points

A human decides whether a data point is an outlier

8
Q

Can outliers be represented by groups?

A

Yes, they can. But remember: if the group has a relatively large number of data points, they might not be outliers; they might be describing an “unusual” event that is worth capturing in the data (so, do not remove them)

9
Q

What are some graphical representations that we can use to detect outliers by eye?

A

Boxplot - plots one variable at a time (look at outliers as single standing points, or analyze the interquartile range)

Scatterplot - plots two variables at a time (able to capture more complex patterns)

10
Q

How does cluster-based outlier detection work?

What are the main algorithms that can help in doing that?

A

Cluster the data and then find points that do not belong to any of the clusters.

  1. K-means: find points that are far away from every cluster mean (even though they were assigned to a cluster), or find clusters that have a small number of data points
  2. Density-based clustering: outliers are the points that were not assigned to any cluster (see the sketch below)
  3. Hierarchical clustering: outliers take longer to be merged into a group
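A minimal sketch of the density-based case, assuming scikit-learn: DBSCAN labels points it cannot assign to any cluster as -1. The eps/min_samples values and the toy data are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),   # cluster 1
               rng.normal(5, 0.3, size=(50, 2)),   # cluster 2
               [[2.5, 2.5]]])                      # a single point between the clusters

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(np.where(labels == -1)[0])  # the in-between point (index 100) is labeled noise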
11
Q

How does distance-based outlier detection (KNN outlier detection) work?

A

For each data point, compute the average distance to its k nearest neighbors.
Sort the resulting N average distances.
Flag the points with the largest values as outliers (see the sketch below).

Note: KNN-based detection is known to be particularly effective at finding global outliers
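A minimal sketch, assuming scikit-learn (k and the top-5% cutoff are illustrative):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # inliers
               [[8.0, 8.0]]])                     # a global outlier

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because each point is its own neighbor
dist, _ = nn.kneighbors(X)
scores = dist[:, 1:].mean(axis=1)                 # drop the self-distance, average the rest

outliers = np.argsort(scores)[-int(0.05 * len(X)):]  # points with the top 5% of scores
print(outliers)  # the appended point (index 100) appears here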

12
Q

How does supervised outlier detection work?

A

You train a classifier on a dataset that has a column saying whether each point is an outlier or not,
and then use that classifier to detect outliers in new data.

13
Q

When you want to find outliers graphically, you can analyze the IQR (interquartile range) in a box plot.

What is special about this IQR when the dataset has outliers?

A

Even when a dataset has outliers, the interquartile range is still able to summarize the variability in the data, because it depends only on the middle 50% of the values (it is robust to outliers).

IQR = Q3 - Q1

Points below Q1 - 1.5·IQR or above Q3 + 1.5·IQR are commonly flagged as outliers (this is the boxplot whisker rule; see the sketch below).
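A minimal Python sketch of this rule (the toy data are illustrative):

import numpy as np

x = np.array([3, 4, 5, 5, 6, 6, 7, 8, 30])
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the usual boxplot fences
print(x[(x < lo) | (x > hi)])             # -> [30]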

14
Q

What are the advantages and disadvantages to supervised outlier detection?

A

Advantage:
- Can find very complicated outlier patterns

Disadvantages:

  • Is supervised, i.e., it needs a column labeled “outlier”
  • Cannot detect new “types” of outliers
15
Q

How can we define the process of “Feature Selection”?

A

Feature selection is the process of selecting the features that are “relevant” for predicting the target variable (i.e., features that have a strong relationship with the target)

16
Q

Name the different approaches to feature selection.

A
Association approach
Regression weight approach
Search and Score methods
Forward Selection
Backward Selection
Recursive Feature Elimination
17
Q

Explain how the Association approach of feature selection works.

A
  1. For each feature, compute the correlation between the feature values and the target value y
  2. Declare the feature relevant if the correlation is above or below some threshold (0.9 and -0.9, for example); a sketch follows below
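A minimal sketch, assuming pandas (the column names, toy data, and 0.9 threshold are illustrative):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"f1": rng.normal(size=200),
                   "f2": rng.normal(size=200),
                   "f3": rng.normal(size=200)})
y = 2 * df["f1"] + 0.1 * rng.normal(size=200)   # y depends (almost) only on f1

corr = df.corrwith(y)                           # correlation of each feature with y
relevant = corr[corr.abs() > 0.9].index.tolist()
print(relevant)  # -> ['f1']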
18
Q

True or False. The Association approach to feature selection is basically a sequential “hypothesis testing” process on the correlations between the variables.

A

True

19
Q

What are some downsides of the Association approach in Feature Selection?

A

It can give bad results as it ignores variable interactions

“Taco Tuesday” and “Coke and Mentos” example

20
Q

Explain how the “Regression Weight” approach of Feature Selection works.

A
  1. Fit regression weights based on all features (for example, with the least-squares method)
  2. Keep all the features whose weight (in absolute value) is higher than some threshold (see the sketch below)
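A minimal sketch, assuming scikit-learn (the toy data and the 0.5 threshold are illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.normal(size=200)

w = LinearRegression().fit(X, y).coef_   # least-squares weights on all features
print(np.where(np.abs(w) > 0.5)[0])      # kept features -> [0 2]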
21
Q

Name one advantage and one disadvantage of using the Regression Weight approach in Feature Selection

A

Advantage: can solve the issue of variable interactions that is present in the Association approach

Disadvantage: has issues with collinearity

22
Q

Explain the process of the “Search and Score” method in Feature Selection.

A
  1. Define a score function that measures the quality of a set of features
  2. Search for the set with the best score

The candidate sets are all possible subsets of the features in the dataset (including the empty set, i.e., no features); a brute-force version is sketched below
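A brute-force sketch, assuming scikit-learn; the linear model and cross-validated R² score are illustrative choices, and this exhaustive search is only feasible for small d:

import numpy as np
from itertools import chain, combinations
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 1] + 0.1 * rng.normal(size=100)

d = X.shape[1]
all_subsets = chain.from_iterable(combinations(range(d), r) for r in range(d + 1))

def score(S):
    if not S:                # empty set: no features, define its score as 0
        return 0.0
    return cross_val_score(LinearRegression(), X[:, list(S)], y, cv=5).mean()

best = max(all_subsets, key=score)
print(best)  # most likely (1,), the only truly relevant feature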

23
Q

How do you choose the score function in the “Search and Score” method in Feature Selection?

A

Using the training error as the score is a poor choice, because it always improves as more features are added (the full feature set would always win). Common choices are a validation or cross-validation score, or a complexity-penalized score, e.g. the training error plus a penalty on the number of features (as in AIC/BIC).

24
Q

How does Forward Selection (in Feature Selection) work?

A
  1. Compute the score with no features
  2. Add each candidate feature (one at a time) and compute the score
  3. Keep the single feature that gives the best score
  4. Now try adding each of the remaining features to the current set (one at a time)
  5. Compute the score and keep the best addition
    and so on…
    STOP when no single feature addition improves the score (see the sketch below)
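A minimal sketch, assuming scikit-learn (the linear model and cross-validated R² score are illustrative choices):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y):
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # score the model obtained by adding each remaining feature
        scores = {j: cross_val_score(LinearRegression(),
                                     X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:   # no single addition improves the score
            break
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=200)
print(forward_selection(X, y))  # most likely [0, 3]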
25
Q

What is the downside of Forward Selection, and why should we use it despite this downside?

A

It might not find the best set of variables

But it is cheaper, overfits less, and has fewer false positives

26
Q

How does Backward Selection differ from Forward Selection?

A

It starts by computing the score of the set of all features, and then takes out one variable at a time to see whether the score improves (it removes the variable whose removal yields the largest score improvement)

27
Q

How is user-product matrix used in Recommender Systems?

A

Columns - products
Rows - users

An entry is 1 if the user has purchased that product, and 0 otherwise

28
Q

What algorithm can be used to find similar products (in terms of the users that purchased them) in a user-product matrix?

A

KNN
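A minimal sketch, assuming scikit-learn: each column of the user-product matrix is a product’s purchase vector, so we run KNN over the columns. The tiny matrix, k, and cosine distance are illustrative:

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[1, 1, 0, 0],    # rows: users; columns: products
              [1, 1, 0, 1],
              [0, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]])

products = X.T                                   # one row per product
nn = NearestNeighbors(n_neighbors=3, metric='cosine').fit(products)
_, idx = nn.kneighbors(products[0:1])            # neighbors of product 0
print(idx[0][1:])  # most similar products to product 0, excluding itself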

29
Q

What is the “closest point” problem (which appears in KNN, K-means, density-based clustering, and also in recommender systems)?

A

“closest point” problem: suppose there are new products that would be much better to recommend, but they look like outliers because few people have bought them so far

How does this relate to KNN, K-means, and density-based clustering?

30
Q

What are the two types of Recommender Systems? What is the difference between them?

A

Content filtering: assumes access to side information about items (supervised learning)
e.g., Gmail’s “important messages”

Collaborative filtering: does not assume access to side information about items (unsupervised learning)
e.g., Netflix: if Alice likes A and B, and Bob likes A, then Bob is likely to like B as well

(Does not really work for new users, since we don’t know their preferences yet)

31
Q

What are the two types of Recommender Systems?

A

Content filtering and collaborative filtering

32
Q

What are the two methods used in Collaborative Filtering Recommender Systems? Explain them.

A

Neighborhood method: find neighbors based on similarity of movie preferences, then recommend movies that those neighbors watched

Latent Factor method: assume that both users and items live in a low-dimensional space describing their properties, then recommend items based on their proximity to the user in the latent space

33
Q

True or False. Matrix Factorization is a way to define the model of Collaborative Filtering.

A

True
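A minimal sketch of matrix factorization for collaborative filtering: approximate the ratings matrix R by U @ V.T, learning U (users × k) and V (items × k) by gradient descent on the observed entries only. The rank k, learning rate, and toy ratings are illustrative:

import numpy as np

R = np.array([[5, 4, 0, 1],
              [4, 5, 0, 0],
              [0, 0, 5, 4],
              [1, 0, 4, 5]], dtype=float)
mask = R > 0                      # which entries were actually observed

k, lr, lam = 2, 0.01, 0.1
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(R.shape[0], k))
V = rng.normal(scale=0.1, size=(R.shape[1], k))

for _ in range(5000):
    E = mask * (R - U @ V.T)      # error on observed entries only
    U += lr * (E @ V - lam * U)   # gradient step with L2 regularization
    V += lr * (E.T @ U - lam * V)

print(np.round(U @ V.T, 1))       # also fills in the unobserved (0) entries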

34
Q

Bayes Classifier is built on … that is based on …

A

…Bayes’ theorem… …conditional probability…

35
Q

What kind of model is Bayes Classifier?

A

Bayes Classifier is a supervised probabilistic model that makes the most probable predictions for new examples

36
Q

What other model is used to solve the issue that the Bayes classifier assumes the input variables depend on each other?

A

Naive Bayes classifier,

which assumes conditional independence of the input variables (given the class)

37
Q

What question does Naive Bayes classifier answer?

A

What is the most probable classification of a new instance given the training data?

38
Q

Explain how Naive Bayes Classifier can be used in Spam Filtering.

A

Collect many emails and get users to label them (spam / not spam)

Extract features of the emails (like bag of words) and create a column for each word:

1 if the word appears in the email, 0 if it does not

Use the GaussianNB function in Python to fit the model (see notes in Notion for more details)
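A minimal sketch of that pipeline on a made-up toy dataset (the word columns and labels are invented; the card mentions GaussianNB, though BernoulliNB is arguably a more natural fit for 0/1 word features):

import numpy as np
from sklearn.naive_bayes import GaussianNB

#                free  money  meeting  report
X = np.array([[1, 1, 0, 0],    # spam
              [1, 1, 0, 1],    # spam
              [0, 0, 1, 1],    # not spam
              [0, 1, 1, 1],    # not spam
              [1, 0, 0, 0]])   # spam
y = np.array([1, 1, 0, 0, 1])  # 1 = spam, 0 = not spam

model = GaussianNB().fit(X, y)
print(model.predict([[1, 1, 1, 0]]))  # classify a new email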

39
Q

Name two techniques that deal with oversampling and one technique that deals with undersampling

A

Oversampling: SMOTE, ADASYN
Undersampling: cluster-based undersampling

40
Q

What is the purpose of conducting over/undersampling?

A

To fix class imbalance, i.e., to bring the ratio between the classes closer to 1:1

41
Q

What requirement do we have to satisfy before oversampling (synthetically generating data) for the process to work well?

A

The data used to generate the synthetic samples should be clean: little noise, few outliers, and few missing values, so that the model can form a good sense of the data it should add

42
Q

How does the SMOTE algorithm work?

A
  1. For each minority-class sample, find its k nearest neighbors (within the minority class)
  2. Select one of those neighbors at random
  3. Create the new sample as: new = original + gap × (neighbor − original), where gap is drawn uniformly from (0, 1)
  4. Add the new sample to the minority class (see the sketch below)

A new, more balanced dataset is created
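A minimal by-hand sketch, assuming scikit-learn for the neighbor search (in practice the imbalanced-learn library’s SMOTE class does this for you; k and the sample counts are illustrative):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # a random minority sample
        j = idx[i, rng.integers(1, k + 1)]  # one of its k minority neighbors
        gap = rng.random()                  # uniform in (0, 1)
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(new)

X_min = np.random.default_rng(1).normal(size=(20, 2))  # the minority class
print(smote(X_min, n_new=30).shape)  # (30, 2) synthetic minority samples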

43
Q

What is the complexity of a Forward Selection algorithm?

A

O(d^2) - quadratic

FS starts by choosing one variable and testing the performance of the model with that one variable only. In this process of choosing, it needs to compute a score for the model containing each variable, one at a time (d times, if there are d features). Assume it chooses one variable.

Then it needs to choose another one. This means it tests again by adding each remaining variable to the model ((d − 1) times, because it already chose one and cannot choose the same one again). Assume it chooses another variable, and so on.

After this whole process, it means that, in the worst case, the model was tested

d + (d − 1) + (d − 2) + … + 2 + 1 times. This equals d(d + 1)/2, which equals (d^2 + d)/2.

This means that the complexity of the algorithm is O(d^2)
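A quick check of the fit count in Python (d = 10 is an arbitrary example):

d = 10
fits = sum(d - i for i in range(d))  # d + (d-1) + ... + 1
print(fits, d * (d + 1) // 2)        # both print 55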

44
Q

What is the complexity of a Search and Score algorithm?

A

O(2^d) - exponential

Because Search and Score analyzes all possible subsets that can be formed from the variables.

Let’s assume we have the following variables: {a, b, c}
The algorithm tests the performance of the model with each of the following sets:
{ }, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}
8 = 2^3

2^d is the total number of subsets you can build from a variable set containing d elements.

Therefore, if we want to use Search and Score to select features, the algorithm tests the model 2^d times.

So, the complexity of Search and Score is O(2^d)
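A quick check of the 2^d count, enumerating every subset of {a, b, c}:

from itertools import chain, combinations

features = ["a", "b", "c"]
subsets = list(chain.from_iterable(combinations(features, r)
                                   for r in range(len(features) + 1)))
print(subsets)       # (), ('a',), ('b',), ('c',), ('a', 'b'), ..., ('a', 'b', 'c')
print(len(subsets))  # 8 == 2**3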
