ML Interview Prep Flashcards

1
Q

What is bias?

A

The average distance of a model's predictions from the actual targets.

Low bias: predictions are close to the targets (combined with high variance, this is the overfitting regime).

High bias: predictions are systematically far from the targets (underfitting).

2
Q

What is variance?

A

The variability of a prediction for a given data point, i.e. how much the model's output spreads around its mean. High variance: the model pays too much attention to the training data and doesn't generalize to unseen data. Such models perform very well on training data but poorly on test data.

Other definition: variance is the amount by which the estimate of the target function would change if different training data were used.

3
Q

Explain the Bias-Variance Tradeoff.

A

Predictive models have a tradeoff between bias (how well the model fits the data) and variance (how much the learned model changes when trained on different data).

Simpler models are stable (low variance) but they don’t get close to the truth (high bias).

More complex models are more prone to being overfit (high variance) but they are expressive enough to get close to the truth (low bias).

The best model for a given problem usually lies somewhere in the middle.
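
A minimal sketch of the tradeoff (toy data; the degrees and noise level are arbitrary choices): a degree-1 polynomial underfits noisy sine samples (high bias), while a degree-15 polynomial overfits them (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 30))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 30)
x_test = np.sort(rng.uniform(0, 1, 30))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 30)

for degree in (1, 4, 15):
    # Fit a polynomial of the given degree to the training data.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1: both errors high (bias); degree 15: train low, test high (variance).
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```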

4
Q

What is the difference between Stochastic Gradient Descent and Gradient Descent?

A

Gradient descent is an optimization algorithm used when training a machine learning model. It iteratively tweaks the model's parameters in the direction of the negative gradient to minimize a given loss function toward a (local) minimum.

GD: evaluates all training samples to compute each parameter update.

SGD: evaluates one training sample (or a small batch) per parameter update, making updates cheaper but noisier.
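
A minimal sketch contrasting the two update rules on least-squares linear regression (synthetic data; the learning rates and iteration counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

# Batch gradient descent: ALL samples contribute to each parameter update.
w = np.zeros(3)
for _ in range(100):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= 0.1 * grad
print("GD: ", np.round(w, 3))

# Stochastic gradient descent: one sample per parameter update.
w = np.zeros(3)
for _ in range(10):
    for i in rng.permutation(len(y)):
        grad = 2 * X[i] * (X[i] @ w - y[i])
        w -= 0.01 * grad
print("SGD:", np.round(w, 3))
```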

5
Q

Explain the difference between supervised and unsupervised machine learning.

A

Supervised learning requires labeled data and uses a ground truth, meaning we have existing knowledge of our outputs and samples. The goal is to learn a function that approximates the relationship between inputs and outputs.

Unsupervised learning does not use labeled outputs. The goal here is to infer the natural structure in a dataset.
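
A minimal scikit-learn illustration on the bundled iris dataset: the supervised model is fit on inputs and labels, while the unsupervised model only ever sees the inputs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: fit() receives both the inputs X and the labels y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))

# Unsupervised: fit() receives only X; cluster structure is inferred.
km = KMeans(n_clusters=3, n_init=10).fit(X)
print("cluster sizes:", [int((km.labels_ == k).sum()) for k in range(3)])
```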

6
Q

Give some of the most common algorithms for supervised learning and unsupervised learning.

A

Supervised learning algorithms:

Linear regression
Logistic regression
Decision trees
Random forests
Naive Bayes

Examples of unsupervised algorithms:

k-Means
Visualization and dimensionality reduction:
  Principal component analysis (PCA)
  t-distributed stochastic neighbor embedding (t-SNE)
Association rule learning (Apriori)

7
Q

What is Bayes’ Theorem and why do we use it?

A

Bayes’ Theorem is how we find a probability when we know other, related probabilities (the posterior probability of an event given prior knowledge). It is a way of calculating conditional probabilities.

In ML, Bayes’ theorem is used in probabilistic frameworks that fit a model to a training dataset and in classification models (e.g. Naive Bayes, the Bayes optimal classifier).
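
In symbols, P(A|B) = P(B|A) P(A) / P(B). A worked example with made-up numbers (a disease test with 1% prevalence, 95% sensitivity, and a 5% false-positive rate):

```python
p_d = 0.01            # P(disease), illustrative prior
p_pos_given_d = 0.95  # P(+ | disease), sensitivity
p_pos_given_h = 0.05  # P(+ | healthy), false-positive rate

# Law of total probability: P(+) over both ways a positive can occur.
p_pos = p_pos_given_d * p_d + p_pos_given_h * (1 - p_d)

# Bayes' theorem: posterior P(disease | +).
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(f"P(disease | +) = {p_d_given_pos:.3f}")  # about 0.161
```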

8
Q

What are Naive Bayes’ Classifiers?

A

Naive Bayes classifiers assume that the occurrence or absence of a feature does not influence the presence or absence of another feature.

When the independence assumption holds (at least approximately), they are easy to implement, fast to train, and can perform surprisingly well compared with more sophisticated predictors. They are used in spam filtering, text analysis, and recommendation systems.
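
A minimal scikit-learn sketch on a toy spam-filtering task (the example sentences are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win cash now", "meeting at noon", "free prize win", "lunch tomorrow"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Bag-of-words counts; Naive Bayes treats each word as independent given the class.
vec = CountVectorizer()
X = vec.fit_transform(texts)

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["free cash prize"])))  # expected: [1]
```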

9
Q

What is a Discriminative model?

A

Discriminative models are a class of models used for classification or regression. They learn decision boundaries directly from observed data (pass/fail, win/lose, healthy/sick) rather than modeling how the data was generated.

10
Q

Segmentation

A

A dense prediction task: pixel-wise classification, assigning a class label to every pixel of the image.

11
Q

FCN (Fully Convolutional Network)

A

Works by fine-tuning an image classification CNN and training it pixel-wise.

  1. Compresses information using multiple layers of convolutions and pooling (the encoder).
  2. Up-samples the resulting feature maps to predict each pixel's class from the compressed information (the decoder).
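
A minimal PyTorch sketch of the shape of the idea (layer sizes are arbitrary; a real FCN fine-tunes a pretrained classification backbone and adds skip connections):

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        # 1. Encoder: convolutions + pooling compress the spatial information.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 2. Decoder: transposed convolutions up-sample back to input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, n_classes, 2, stride=2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))  # per-pixel class logits

logits = TinyFCN(n_classes=5)(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 5, 64, 64])
```
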
12
Q

Tokenization

A

The process of splitting raw text into smaller units (tokens) such as words, subwords, or characters, which are then mapped to integer IDs that a model can process.

13
Q

Embeddings

A

A low-dimensional space into which we translate high-dimensional vectors. Semantically similar inputs are placed close together in the embedding space.
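
A toy illustration with hand-written 3-dimensional vectors (real embeddings are learned and have hundreds of dimensions):

```python
import numpy as np

# Hypothetical embeddings: similar meanings get similar vectors.
emb = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.9, 0.8]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["cat"], emb["dog"]))  # high: semantically close
print(cosine(emb["cat"], emb["car"]))  # low: semantically far
```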

14
Q

CNN

A

Convolutional Neural Network: a neural network that applies learned convolutional filters across the input (e.g. an image), exploiting local spatial structure and weight sharing. Typically stacks convolutional layers, non-linearities, and pooling layers, followed by fully connected layers (see the convolutional/pooling layer card below).

15
Q

Semantic Search

A

Search that matches queries to results by meaning rather than by exact keyword overlap. Typically implemented by embedding queries and documents in the same vector space and retrieving the nearest neighbors of the query embedding.

16
Q

Feed Forward Neural Network

A

A network whose connections between nodes do not form a cycle; information flows in one direction from input to output.

The multilayer perceptron is the classic example.

17
Q

Time Series Model

A

A model where data points are indexed in time order. Time series forecasting aims to predict future values based on previously observed values.

18
Q

What do Convolutional layers and pooling layers do?

A

Convolutional layers summarize the presence of features in an input region and output feature maps.
These feature maps are sensitive to the location of features in the input image.
Pooling layers down-sample feature maps by summarizing the presence of features in patches of the feature map.
Two common pooling methods are average pooling and max pooling, which summarize the average presence of a feature and the most activated presence of a feature, respectively.
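
A minimal numpy sketch of 2x2 pooling on a 4x4 feature map:

```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 2],
                 [2, 2, 3, 4]], dtype=float)

# Rearrange into non-overlapping 2x2 patches, then reduce each patch.
patches = fmap.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)
print(patches.max(axis=-1))   # max pooling: most activated presence
print(patches.mean(axis=-1))  # average pooling: average presence
```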

19
Q

Cost Function

A

A function that measures how far a model's predictions are from the actual targets, aggregated over the dataset; training minimizes it. Examples: mean squared error for regression, cross-entropy for classification.

20
Q

Self Attention Mechanism

A

A mechanism that lets every position in a sequence attend to every other position, producing context-aware representations: each token's embedding is reweighted by its relevance to the other tokens (see the self-attention cards below for the steps).

21
Q

Encoder/Decoder

A

Time steps in an input sequence are encoded into a fixed-length vector (the context vector).
The decoder then reads the context vector and generates the output sequence one step at a time (e.g. for machine translation or summarization).

22
Q

Correlation

A

The linear relationship between two random variables (the extent to which they change together at a constant rate).

23
Q

Pearson’s Correlation Coefficient

A

Quantifies the linear relationship between two variables. It is the covariance divided by the product of the standard deviations of the samples.
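
In symbols, r = cov(X, Y) / (s_X s_Y). A quick numpy check (made-up sample values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Sample covariance divided by the product of the sample standard deviations.
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r, np.corrcoef(x, y)[0, 1])  # the two values agree (close to 1 here)
```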

24
Q

Covariance

A

A measure of the joint variability of two random variables: cov(X, Y) = E[(X - E[X])(Y - E[Y])]. Positive when the variables tend to increase together, negative when one tends to increase as the other decreases. Unlike correlation, it is not normalized, so its magnitude depends on the variables' scales.

25
Q

Transformer Architecture

A

Contains an encoder stack and a decoder stack, each built from layers combining multi-head self-attention, position-wise feed-forward networks, residual connections, and layer normalization. Positional encodings are added to the input embeddings, since the architecture itself has no recurrence or convolution.

26
Q

Word Embeddings with Self-Attention

A

The attention mechanism adds contextual information to word embeddings so the model can derive each word's meaning in context.
E.g. in “Bank of a river,” self-attention identifies the correlation between the terms “bank” and “river.”
Each word in a sentence is compared with every other word, and its embedding is reweighted by the relevance of each word to its own meaning.

27
Q

Self Attention block steps.

A

1. Dot-product similarity between embeddings to get alignment scores.
2. Normalization of the scores (e.g. softmax) to get attention weights.
3. Reweighting of the original embeddings with those weights.
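
A minimal numpy sketch of these three steps (simplified: no learned query/key/value projections, which a real Transformer adds):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# 4 tokens with 3-dimensional embeddings (random stand-ins for word vectors).
E = np.random.default_rng(0).normal(size=(4, 3))

scores = E @ E.T           # 1. dot-product similarity: alignment scores
weights = softmax(scores)  # 2. normalize each row into attention weights
context = weights @ E      # 3. reweight the embeddings (weighted sum)
print(context.shape)       # (4, 3): one context-aware vector per token
```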

28
Q

Masked Language Modeling (MLM)

A

A few words are masked in the input, and the model predicts them based on the bi-directional context.

29
Q

Next Sentence Prediction (NSP)

A

Two sentences A and B are given. The model has to predict whether B follows A in the corpus or is just a random sentence.

30
Q

What are Activation functions and types?

A

Activation functions apply a non-linearity to a neuron's output, which lets the network model non-linear relationships. Common types: sigmoid, tanh, ReLU (and variants such as Leaky ReLU), and softmax for turning output-layer scores into class probabilities.

31
Q

Fourier Transform

A

The Fourier transform is a method of breaking a signal down into its frequency components. Unlike the Fourier series, it is applicable to non-periodic signals (such as a delta function) and enables such signals to be described in terms of frequency instead of time. The Fourier transform is useful when you are working with a system whose transfer function is known.
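
A minimal numpy sketch: the discrete Fourier transform (via the FFT) recovers the frequency components of a sampled signal.

```python
import numpy as np

fs = 100                     # sampling rate (Hz)
t = np.arange(0, 1, 1 / fs)  # one second of samples
# Signal containing 5 Hz and 12 Hz components.
sig = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

spectrum = np.abs(np.fft.rfft(sig))
freqs = np.fft.rfftfreq(len(sig), d=1 / fs)
# The two largest peaks sit at the component frequencies.
print(freqs[np.argsort(spectrum)[-2:]])  # [12., 5.]
```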

32
Q

t-SNE (t-distributed Stochastic Neighbor Embedding)

A

Non-linear dimensionality reduction algorithm used for exploring high dimensional data.
Maps high-dimensional data to a low-dimensional space (e.g. 2-3 dimensions).
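
A minimal scikit-learn sketch on the bundled digits dataset (perplexity is a tunable hyperparameter; 30 is the default):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 64-dimensional digit images

# Non-linear map from 64 dimensions down to 2 for visualization.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)  # (1797, 2)
```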

33
Q

L1 and L2 Regularization

A

REGULARIZATION - Used to reduce model complexity (and overfitting) when you have a large number of features in the data.

The difference between the two algorithms is the penalty term:
L1 (Lasso): adds the absolute value of the coefficient magnitudes to the loss function.
L2 (Ridge): adds the squared magnitude of the coefficients as the penalty term in the loss function.

Difference in practice:
Lasso can shrink less important feature coefficients exactly to zero, so some features disappear altogether. Good for feature selection when the data has a huge number of features. Ridge shrinks coefficients toward zero but rarely makes them exactly zero.
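
A minimal scikit-learn comparison on synthetic data where only 2 of 10 features matter (the alpha penalty strengths are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 200)  # 8 features are noise

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso zeroes out the irrelevant coefficients; Ridge only shrinks them.
print("lasso:", np.round(lasso.coef_, 2))
print("ridge:", np.round(ridge.coef_, 2))
```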

34
Q

Cross Validation

A

1. Shuffle the dataset.
2. Split the dataset into k groups.
3. For each unique group:
   a. Take the group as the hold-out (test) data set.
   b. Take the remaining groups as the training data set.
   c. Fit a model on the training set and evaluate it on the test set.
   d. Retain the evaluation score and discard the model.
4. Summarize the skill of the model using the sample of model evaluation scores.
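
A minimal scikit-learn sketch of exactly these steps (k = 5 on the bundled iris dataset):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
scores = []

# Shuffle, split into k = 5 groups; each fold serves as the hold-out once.
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # retain the score

# Summarize model skill with the sample of evaluation scores.
print(scores, "mean:", sum(scores) / len(scores))
```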