General ML Flashcards

1
Q

Bias vs Variance Tradeoffs

A

Bias is error introduced by approximating a real-world problem, which may be complex, with a simplified model.
- Leads to underfitting.

Variance is error introduced by the model's sensitivity to small fluctuations in the training dataset, causing it to model noise rather than the intended signal.
- Leads to overfitting.

2
Q

Supervised vs Unsupervised

A

Supervised learning uses labeled data, while unsupervised learning uses algorithms to find patterns in data with no explicit right answer.

3
Q

Likelihood vs Probability

A
  • Probability is the chance of an event happening under certain conditions, while likelihood is a measure of how well observed data supports a model or particular parameter values. Probability is the more general notion, while likelihood is used in statistical models and inference.
  • Probability refers to the possibility of something happening; likelihood refers to determining the data distribution (or parameter values) that best explains observed data. When calculating the probability of a given outcome, you assume the model's parameters are fixed and reliable.
4
Q

KNN vs K-means

A

K-means -> unsupervised clustering algorithm
- Takes a set of unlabeled points (and a chosen number of clusters) and learns how to cluster them by mean distance.

KNN -> supervised classification algorithm
- Given labeled data, classifies an unlabeled point by its nearest labeled neighbors.
- Example: deciding whether a post should or shouldn't be monetized based on other factors of the post.

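A minimal scikit-learn sketch of the contrast; the toy points, labels, and cluster count are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])

# K-means: unsupervised; no labels, just learn clusters by mean distance.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                  # two clusters discovered from the data

# KNN: supervised; labels required to classify a new, unlabeled point.
y = np.array([0, 0, 1, 1])         # labels for the same points
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[4.9, 5.2]]))   # -> [1]
```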
5
Q

Type I error vs Type II error

A

Type I error is a false positive —> claiming something happened when it didn't —> telling a man he is pregnant.
Type II error is a false negative —> claiming something didn't happen when it did —> telling a pregnant woman she isn't carrying a baby.

6
Q

Sensitivity vs Specificity

A

Sensitivity focuses on identifying positive instances correctly, so it is critical when you want to minimize false negatives -> medical tests. TP / (TP + FN)

Specificity focuses on identifying negative instances correctly, so it is critical when you care about minimizing false positives -> spam filters. TN / (TN + FP)

•	Sensitivity: “Sensitive to catching Positives.”
•	Specificity: “Specific to excluding Negatives.”
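
A minimal sketch of both formulas as code, with hypothetical counts for a medical test:

```python
# Minimal sketch: sensitivity and specificity from raw confusion-matrix counts.
def sensitivity(tp: int, fn: int) -> float:
    return tp / (tp + fn)   # TP / (TP + FN): share of actual positives caught

def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp)   # TN / (TN + FP): share of actual negatives excluded

# Hypothetical medical-test counts
print(sensitivity(tp=90, fn=10))  # 0.9
print(specificity(tn=70, fp=30))  # 0.7
```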
7
Q

Accuracy

A

General metric used to measure how well a classification model performs across all cases (both positive and negative). Proportion of correct predictions (both true positives and true negatives) out of the total number of predictions.

Accuracy = (True Positives + True Negatives) / Total predictions

Useful: When class distribution is balanced, where positives and negatives are roughly equal, and when BOTH false positives and false negatives have similar consequences, and you care about the overall performance of the model.

When it is misleading:
- Imbalanced classes (e.g., 95% / 5%): the model can always predict the majority class and have high accuracy, but it won't be useful for the minority class (e.g., disease detection where only a small % of cases are positive).
- Different costs for FP and FN.

Use cases: image detection, sentiment analysis for customer reviews.

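A small sketch of the imbalanced-class pitfall described above; the counts are made up:

```python
# A model that always predicts the majority class scores high accuracy
# while being useless for the minority class. Counts are hypothetical.
n_majority, n_minority = 950, 50          # 95% / 5% class split
correct = n_majority                       # "always predict majority" model
accuracy = correct / (n_majority + n_minority)
print(accuracy)                            # 0.95, yet 0 minority cases caught
```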
8
Q

Precision

A

Metric used to evaluate the performance of a classification model, especially in the context of binary classification. Measures how many of the positive predictions made by the model are actually correct.

Precision = True Positives / (True Positives + False Positives)
- Note: the denominator includes actual positives AND negatives that were classified as positive, meaning it only evaluates the model's ability to classify the positive class, without regard to its ability on the negative class.

Particularly useful when the cost of false positives is high, meaning it's important to minimize the number of incorrect positive predictions. For example, if a model predicting whether emails are spam identifies 10 emails as spam, but only 7 of them are actually spam, then precision is 70%: 7 / (7 + 3).

Use Cases: Spam detection, product recommendation, Ad-click prediction (showing irrelevant ads waste ad spend).

9
Q

Recall - TPR - Sensitivity

A

Proportion of actual positives that are correctly identified by the classifier.
TPR = True Positives / (True Positives + False Negatives)
TP / (TP + FN)

Note: all values in the equation relate only to the actual positives (correctly found or missed), so it is a good metric to use when we don't really care about false positives.

Say you said there were 10 apples and 5 oranges in a case of only 10 apples.
- Recall = 100%: there were 10 and you said there were 10.
- Precision = 10 / (10 + 5) = 66.7%, because out of the 15 predictions you made, only 10 are correct.

Use case: Cancer predictions, Fraud detection, Search and rescue, Customer Churn.

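The apples/oranges example above, checked with scikit-learn's metrics:

```python
from sklearn.metrics import precision_score, recall_score

# 10 real apples (positive class), 5 non-apples; we predicted "apple" 15 times.
y_true = [1] * 10 + [0] * 5
y_pred = [1] * 15

print(recall_score(y_true, y_pred))      # 1.0   -> all 10 apples found
print(precision_score(y_true, y_pred))   # 0.667 -> 10 of 15 calls correct
```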
10
Q

FPR

A

FPR = False Positives / (False Positives + True Negatives)
FP / (FP + TN)

Note: all values in the equation relate only to the actual negatives (correctly rejected or misclassified as positive).

11
Q

ROC

A

Measuring Sensitivity vs Fallout - Used for Binary Classification

  • Graphical representation of the contrast between TPRs and FPRs at various thresholds.
  • Proxy for the trade-off between the sensitivity of the model (true positives) vs the fall-out or the probability it will trigger a false alarm (false positives).
  • It helps visualize how well the model discriminates between the positive and negative classes, and lets the engineer choose an operating threshold (for a logistic regression model, for example).
12
Q

F1 Score

A
  • Harmonic mean of the precision and recall of a model: F1 = 2 * (precision * recall) / (precision + recall), with results tending to 1 being the best. You can use it when doing classification where true negatives don't matter much.
13
Q

AUC

A

The AUC is the area under the ROC curve; it provides a single value summarizing the overall performance of the classifier, ranging from 0 to 1.
- AUC = 1 -> perfect model
- AUC = 0.5 -> model performs no better than random guessing
- AUC < 0.5 -> model that is worse than random guessing.

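A minimal sketch of both the ROC card and this one with scikit-learn; the labels and scores are made up:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]   # model's positive-class scores

# ROC: the (FPR, TPR) points traced out as the threshold varies.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))

# AUC: single-number summary of the curve, between 0 and 1.
print(roc_auc_score(y_true, y_score))
```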
14
Q

Bayes Theorem and why the classifier is called Naive Bayes.

A

P(A|B) = P(B|A) * P(A) / P(B)
- P(bought|click) = P(click|bought) * P(bought) / P(click)
- Allows us to find the probability of a cause given its effect.
- Why is it called Naive Bayes?
- It assumes absolute independence of features (probably never met in real life), which certainly isn't the case in this scenario.
- Independence: B happening has no effect on A happening.

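A numeric sketch of the card's click/bought example; all three rates are hypothetical:

```python
# P(bought | click) = P(click | bought) * P(bought) / P(click)
p_click_given_bought = 0.9   # hypothetical: buyers almost always clicked
p_bought = 0.02              # hypothetical base rate of buying
p_click = 0.10               # hypothetical base rate of clicking

p_bought_given_click = p_click_given_bought * p_bought / p_click
print(p_bought_given_click)  # 0.18: probability of the cause given the effect
```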
15
Q

L1 vs L2 Regularization + Elastic Net

A
  • L1 (Lasso)
    • Encourages sparsity in the model by adding the absolute values of the coefficients to the loss function
    • Effect: Drives some coefficients to exactly zero, effectively performing feature selection. This can be particularly useful when dealing with a large number of features, as it helps in identifying the most important predictors
  • L2 (Ridge)
    • Penalizes large coefficients by adding the squared values of the coefficients to the loss function
    • Effect: Shrinks coefficients towards zero but does not set them to zero. This helps in handling multicollinearity and improving the stability and generalization of the model.

Elastic Net does both, adding both the absolute and squared values to the loss function, and so inherits the benefits of both.

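A minimal scikit-learn sketch of the three penalties; the data is random noise just to show the API, and the alpha values are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 10)), rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)       # L1: drives some coefficients to 0
ridge = Ridge(alpha=0.1).fit(X, y)       # L2: shrinks coefficients, none zeroed
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # both penalties mixed

print((lasso.coef_ == 0).sum())          # number of features L1 selected away
```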
16
Q

Multicollinearity

A

Refers to the situation in which two or more independent variables in a regression model are highly correlated, meaning they have a strong linear relationship, making it hard to distinguish the effects of these variables individually on the dependent variable.

You can detect this using a correlation matrix.

Feature 1: Adding credit card info
Feature 2: Buying an item
These can have a direct correlation.

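A small numpy sketch of the detection step; the two features are constructed to be nearly collinear, mirroring the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
added_card_info = rng.normal(size=200)                           # Feature 1
bought_item = added_card_info + rng.normal(scale=0.1, size=200)  # Feature 2

# Correlation matrix: off-diagonal entries near 1 flag multicollinearity.
corr = np.corrcoef(added_card_info, bought_item)
print(corr[0, 1])                                                # close to 1.0
```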
17
Q

Fourier Transform

A
  • Analogy -> given a smoothie, a Fourier transform finds the recipe.
  • Finds the set of cycle speeds (frequencies), amplitudes, and phases that combine to match any time signal.
  • Converts a signal from the time domain to the frequency domain; can be used to extract data from audio signals or other time-series data from sensors (maybe like the athlete data you used to find cadence; see the sketch below).
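
A small numpy sketch of the time-to-frequency conversion: recovering the dominant frequency (say, a cadence) from a synthetic 3 Hz signal:

```python
import numpy as np

fs = 100                              # sample rate in Hz
t = np.arange(0, 10, 1 / fs)          # 10 seconds of samples
signal = np.sin(2 * np.pi * 3 * t)    # hidden 3 Hz cycle

spectrum = np.abs(np.fft.rfft(signal))          # amplitude per frequency
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)  # the matching frequency axis
print(freqs[np.argmax(spectrum)])               # -> 3.0
```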
18
Q

What cross-validation technique would you use for time series data?

A
  • Forward chaining (a time-series split), since time series data is ordered chronologically and cannot be randomly shuffled the way standard K-fold cross-validation does.
  • Fold 1: training [1], test [2]
  • Fold 2: training [1, 2], test [3]
  • Etc…
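
A sketch of this forward-chaining scheme with scikit-learn's TimeSeriesSplit:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)      # chronologically ordered samples
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    # Each fold trains on everything before the test window.
    print(train_idx, "->", test_idx)
```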
19
Q

Accuracy vs Performance Tradeoff

A
  • If you wanted to detect fraud in a massive dataset with millions of samples, the most accurate model would likely predict no fraud at all, but that's useless for a predictive model.
20
Q

Imbalanced Dataset?

A
      • Collect more data
      • Re-sample (over-sample the minority class or under-sample the majority class)
      • Try a different algorithm.
21
Q

Overfitting?

A
  • Keep model simpler: reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise
  • Use cross validation techniques such as k-folds
  • Use regularization techniques such as LASSO that penalize certain model params if they’re likely to cause overfitting.
22
Q

Kernel trick?

A

Replacing an inner product of features with a kernel function corresponds to using a large or infinite set of basis functions. The kernel trick can provide highly accurate results on difficult datasets that have non-linear decision boundaries. It allows us to implicitly transform data that are not linearly separable into another space where they are better fit for linear models like SVMs.

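A minimal sketch of the idea with an RBF-kernel SVM: a circular boundary that no linear model in the original space could fit. The data is synthetic:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) < 1).astype(int)   # label: inside unit circle?

clf = SVC(kernel="rbf").fit(X, y)   # kernel stands in for an explicit mapping
print(clf.score(X, y))              # near 1.0 despite the non-linear boundary
```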
23
Q

What’s the difference between a generative and discriminative model?

A
  • Generative learns the distribution of the categories themselves, while discriminative learns the distinctions between categories.
    • K-means / KNN are discriminative.
    • Naive Bayes, often used for NLP classification, is generative.
24
Q

Stratified Cross-Validation

A

Cross-validation is a technique for dividing data between training and validation sets. In typical cross-validation this split is done randomly, but in stratified cross-validation the split preserves the ratio of the categories in both the training and validation datasets.
For example, if we have a dataset with 10% of category A and 90% of category B, and we use stratified cross-validation, we will have the same proportions in training and validation. In contrast, if we use simple cross-validation, in the worst case we may find that there are no samples of category A in the validation set.
Stratified cross-validation may be applied in the following scenarios:
- On a dataset with multiple categories. The smaller the dataset and the more imbalanced the categories, the more important it will be to use stratified cross-validation.

- On a dataset with data of different distributions. For example, in a dataset for autonomous driving, we may have images taken during the day and at night. If we do not ensure that both types are present in training and validation, we will have generalization problems.

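A sketch of the 10% / 90% example above with scikit-learn's StratifiedKFold; the labels are made up:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((100, 1))               # features don't matter for the split
y = np.array([0] * 10 + [1] * 90)    # 10% category A, 90% category B

for _, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    print(np.bincount(y[val_idx]))   # [2 18] each fold: ratio preserved
```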
25
Q

What is Overfitting?

A

Overfitting occurs when a model performs well on training data but poorly on unseen data, usually due to learning noise and overly complex patterns

26
Q

What is Underfitting?

A

Underfitting occurs when a model is too simple and cannot capture the underlying patterns in the data, leading to poor performance on both training and test sets.

27
Q

What is cross validation?

A

Technique for assessing how a model will generalize to an independent dataset by partitioning the data into training and validation sets multiple times.

28
Q

What is regularization?

A

Regularization is a technique used to prevent overfitting by adding a penalty to the loss function.

29
Q

What is gradient descent?

A

An optimization algorithm that minimizes the cost function by iteratively moving in the direction of the steepest descent. (minimize error).

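A minimal sketch minimizing a one-parameter cost function f(w) = (w - 3)^2; the learning rate and step count are arbitrary:

```python
w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)   # derivative of the cost at the current w
    w -= lr * grad       # step in the direction of steepest descent
print(w)                 # converges toward the minimum at 3.0
```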
30
Q

What are hyperparameters?

A

Hyperparameters are configuration values that must be set before training a model, such as learning rate, batch size, and regularization strength

31
Q

What is a confusion matrix?

A

A table used to evaluate the performance of a classification model, showing the actual vs predicted classifications

32
Q

What is data normalization?

A

The process of scaling input data so that it has a mean of 0 and a standard deviation of 1, ensuring features have similar ranges. (This puts features on the scale of a standard normal distribution, which many ML algorithms work well with.)

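A sketch with scikit-learn's StandardScaler (strictly, this rescaling is standardization):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_scaled = StandardScaler().fit_transform(X)

# Each column now has mean ~0 and standard deviation ~1.
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```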
33
Q

What is feature engineering?

A

The process of selecting, modifying, or creating input features that improve model performance.

34
Q

What is a learning curve?

A

A plot that shows how a model’s performance on both training and validation data changes as more training examples are used.

35
Q

What is the difference between parametric and non-parametric models?

A

Parametric models assume a specific form for the function (e.g. linear), while non-parametric models do not make assumptions about the function's form (e.g. KNN).

36
Q

What is the curse of dimensionality?

A

As the number of features grows, the amount of data needed to generalize accurately increases exponentially.

37
Q

What is resampling?

A

Techniques like bootstrapping or cross-validation to estimate the performance of a model by using different subsets of data.

38
Q

What are over-sampling and under-sampling?

A

Over-sampling increases the minority class samples, and under-sampling reduces the majority class samples in an imbalanced dataset.

39
Q

What is feature selection?

A

Process of selecting a subset of relevant features for training a model, often using techniques like Recursive Feature Elimination (RFE) or mutual information.

40
Q

What is the No Free Lunch Theorem?

A

There is no universally best model; the effectiveness of a model depends on the specific problem and data.

41
Q

Matrix Factorization

A

It's the process of filling in sparse data with "predictions" based on latent factors (derived from things like ratings of other movies). An example of this is collaborative filtering, which recommends items to users based on their past preferences.
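
A toy numpy sketch of the idea: factor a sparse ratings matrix into two low-rank factors whose product fills in the unrated entries. The data, rank, and hyperparameters are all made up:

```python
import numpy as np

R = np.array([[5, 3, 0],
              [4, 0, 1],
              [0, 1, 5]], dtype=float)   # 0 = unrated
mask = R > 0
k, lr, steps = 2, 0.01, 5000             # latent rank and arbitrary settings

rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(3, k))   # user latent factors
V = rng.normal(scale=0.1, size=(3, k))   # item latent factors

for _ in range(steps):
    err = (R - U @ V.T) * mask           # error only on observed ratings
    U += lr * err @ V                    # gradient steps on both factors
    V += lr * err.T @ U

print(np.round(U @ V.T, 1))              # predictions fill the unrated slots
```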