ML Interview Prep Flashcards

1
Q

What is bias?

A

The average distance of a model's predictions from the actual targets.

Low bias: predictions are close to the targets (combined with high variance, this is the overfitting regime).

High bias: predictions are systematically far from the targets (underfitting).

2
Q

What is variance?

A

The variability of a prediction for a given data point, i.e. how much the model's output spreads around its mean. High variance: the model pays too much attention to the training data and doesn't generalize to unseen data. Such models perform very well on training data but poorly on test data.

Other definition: variance is the amount by which the estimate of the target function would change if different training data were used.

3
Q

Explain the Bias-Variance Tradeoff.

A

Predictive models have a tradeoff between bias (how well the model fits the data) and variance (how much the learned model changes when trained on different data).

Simpler models are stable (low variance) but they don’t get close to the truth (high bias).

More complex models are more prone to being overfit (high variance) but they are expressive enough to get close to the truth (low bias).

The best model for a given problem usually lies somewhere in the middle.
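
A minimal sketch of the tradeoff (toy data; the degrees and noise level are arbitrary choices): a degree-1 polynomial underfits noisy sine samples (high bias), while a degree-15 polynomial overfits them (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 30))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 30)
x_test = np.sort(rng.uniform(0, 1, 30))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 30)

for degree in (1, 4, 15):
    # Fit a polynomial of the given degree to the training data.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1: both errors high (bias); degree 15: train low, test high (variance).
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```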

4
Q

What is the difference between Stochastic Gradient Descent and Gradient Descent?

A

Gradient descent is an optimization algorithm used when training a machine learning model. It iteratively tweaks the model's parameters in the direction of the negative gradient to minimize a given loss function toward a (local) minimum.

GD: evaluates all training samples to compute each parameter update.

SGD: evaluates one training sample (or a small batch) per parameter update, making updates cheaper but noisier.
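
A minimal sketch contrasting the two update rules on least-squares linear regression (synthetic data; the learning rates and iteration counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

# Batch gradient descent: ALL samples contribute to each parameter update.
w = np.zeros(3)
for _ in range(100):
    grad = 2 * X.T @ (X @ w - y) / len(y)
    w -= 0.1 * grad
print("GD: ", np.round(w, 3))

# Stochastic gradient descent: one sample per parameter update.
w = np.zeros(3)
for _ in range(10):
    for i in rng.permutation(len(y)):
        grad = 2 * X[i] * (X[i] @ w - y[i])
        w -= 0.01 * grad
print("SGD:", np.round(w, 3))
```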

5
Q

Explain the difference between supervised and unsupervised machine learning.

A

Supervised learning requires labeled data and uses a ground truth, meaning we have existing knowledge of our outputs and samples. The goal is to learn a function that approximates the relationship between inputs and outputs.

Unsupervised learning does not use labeled outputs. The goal here is to infer the natural structure in a dataset.
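
A minimal scikit-learn illustration on the bundled iris dataset: the supervised model is fit on inputs and labels, while the unsupervised model only ever sees the inputs.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: fit() receives both the inputs X and the labels y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))

# Unsupervised: fit() receives only X; cluster structure is inferred.
km = KMeans(n_clusters=3, n_init=10).fit(X)
print("cluster sizes:", [int((km.labels_ == k).sum()) for k in range(3)])
```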

6
Q

Give some of the most common algorithms for supervised learning and unsupervised learning.

A

Supervised learning algorithms:

Linear regression
Logistic regression
Decision trees
Random forests
Naive Bayes

Examples of unsupervised algorithms:

k-Means
Visualization and dimensionality reduction:
  Principal component analysis (PCA)
  t-distributed stochastic neighbor embedding (t-SNE)
Association rule learning (Apriori)

7
Q

What is Bayes’ Theorem and why do we use it?

A

Bayes’ Theorem is how we find a probability when we know other, related probabilities (the posterior probability of an event given prior knowledge). It is a way of calculating conditional probabilities.

In ML, Bayes’ theorem is used in probabilistic frameworks that fit a model to a training dataset and in classification models (e.g. Naive Bayes, the Bayes optimal classifier).
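
In symbols, P(A|B) = P(B|A) P(A) / P(B). A worked example with made-up numbers (a disease test with 1% prevalence, 95% sensitivity, and a 5% false-positive rate):

```python
p_d = 0.01            # P(disease), illustrative prior
p_pos_given_d = 0.95  # P(+ | disease), sensitivity
p_pos_given_h = 0.05  # P(+ | healthy), false-positive rate

# Law of total probability: P(+) over both ways a positive can occur.
p_pos = p_pos_given_d * p_d + p_pos_given_h * (1 - p_d)

# Bayes' theorem: posterior P(disease | +).
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(f"P(disease | +) = {p_d_given_pos:.3f}")  # about 0.161
```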

8
Q

What are Naive Bayes’ Classifiers?

A

Naive Bayes classifiers assume that the occurrence or absence of a feature does not influence the presence or absence of another feature.

When the independence assumption holds (at least approximately), they are easy to implement, fast to train, and can perform surprisingly well compared with more sophisticated predictors. They are used in spam filtering, text analysis, and recommendation systems.
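
A minimal scikit-learn sketch on a toy spam-filtering task (the example sentences are made up):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win cash now", "meeting at noon", "free prize win", "lunch tomorrow"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham

# Bag-of-words counts; Naive Bayes treats each word as independent given the class.
vec = CountVectorizer()
X = vec.fit_transform(texts)

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["free cash prize"])))  # expected: [1]
```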

9
Q

What is a Discriminative model?

A

Discriminative models are a class of models used for classification or regression. They learn decision boundaries directly from observed data (pass/fail, win/lose, healthy/sick) rather than modeling how the data was generated.

10
Q

Segmentation

A

A dense prediction task: pixel-wise classification, assigning a class label to every pixel of the image.

11
Q

FCN (Fully Convolutional Network)

A

Works by fine-tuning an image classification CNN and training it pixel-wise.

  1. Compresses information using multiple layers of convolutions and pooling (the encoder).
  2. Up-samples the resulting feature maps to predict each pixel's class from the compressed information (the decoder).
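
A minimal PyTorch sketch of the shape of the idea (layer sizes are arbitrary; a real FCN fine-tunes a pretrained classification backbone and adds skip connections):

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        # 1. Encoder: convolutions + pooling compress the spatial information.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # 2. Decoder: transposed convolutions up-sample back to input resolution.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, n_classes, 2, stride=2),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))  # per-pixel class logits

logits = TinyFCN(n_classes=5)(torch.randn(1, 3, 64, 64))
print(logits.shape)  # torch.Size([1, 5, 64, 64])
```
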
12
Q

Tokenization

A

The process of splitting raw text into smaller units (tokens) such as words, subwords, or characters, which are then mapped to integer IDs that a model can process.

13
Q

Embeddings

A

A low-dimensional space into which we translate high-dimensional vectors. Semantically similar inputs are placed close together in the embedding space.
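
A toy illustration with hand-written 3-dimensional vectors (real embeddings are learned and have hundreds of dimensions):

```python
import numpy as np

# Hypothetical embeddings: similar meanings get similar vectors.
emb = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.9, 0.8]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["cat"], emb["dog"]))  # high: semantically close
print(cosine(emb["cat"], emb["car"]))  # low: semantically far
```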

14
Q

CNN

A

Convolutional Neural Network: a neural network that applies learned convolutional filters across the input (e.g. an image), exploiting local spatial structure and weight sharing. Typically stacks convolutional layers, non-linearities, and pooling layers, followed by fully connected layers (see the convolutional/pooling layer card below).

15
Q

Semantic Search

A

Search that matches queries to results by meaning rather than by exact keyword overlap. Typically implemented by embedding queries and documents in the same vector space and retrieving the nearest neighbors of the query embedding.

16
Q

Feed Forward Neural Network

A

A network whose connections between nodes do not form a cycle; information flows in one direction from input to output.

The multilayer perceptron is the classic example.

17
Q

Time Series Model

A

A model where data points are indexed in time order. Time series forecasting aims to predict future values based on previously observed values.

18
Q

What do Convolutional layers and pooling layers do?

A

Convolutional layers summarize the presence of features in an input region and output feature maps.
These feature maps are sensitive to the location of features in the input image.
Pooling layers down-sample feature maps by summarizing the presence of features in patches of the feature map.
Two common pooling methods are average pooling and max pooling, which summarize the average presence of a feature and the most activated presence of a feature, respectively.
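
A minimal numpy sketch of 2x2 pooling on a 4x4 feature map:

```python
import numpy as np

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 2],
                 [2, 2, 3, 4]], dtype=float)

# Rearrange into non-overlapping 2x2 patches, then reduce each patch.
patches = fmap.reshape(2, 2, 2, 2).transpose(0, 2, 1, 3).reshape(2, 2, 4)
print(patches.max(axis=-1))   # max pooling: most activated presence
print(patches.mean(axis=-1))  # average pooling: average presence
```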

19
Q

Cost Function

A

A function that measures how far a model's predictions are from the actual targets, aggregated over the dataset; training minimizes it. Examples: mean squared error for regression, cross-entropy for classification.

20
Q

Self Attention Mechanism

A

A mechanism that lets every position in a sequence attend to every other position, producing context-aware representations: each token's embedding is reweighted by its relevance to the other tokens (see the self-attention cards below for the steps).

21
Q

Encoder/Decoder

A

Time steps in an input sequence are encoded into a fixed-length vector (the context vector).
The decoder then reads the context vector and generates the output sequence one step at a time (e.g. for machine translation or summarization).

22
Q

Correlation

A

The linear relationship between two random variables (the extent to which they change together at a constant rate).

23
Q

Pearson’s Correlation Coefficient

A

Quantifies the linear relationship between two variables. It is the covariance divided by the product of the standard deviations of the samples.
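
In symbols, r = cov(X, Y) / (s_X s_Y). A quick numpy check (made-up sample values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

# Sample covariance divided by the product of the sample standard deviations.
r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r, np.corrcoef(x, y)[0, 1])  # the two values agree (close to 1 here)
```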

24
Q

Covariance

A

A measure of the joint variability of two random variables: cov(X, Y) = E[(X - E[X])(Y - E[Y])]. Positive when the variables tend to increase together, negative when one tends to increase as the other decreases. Unlike correlation, it is not normalized, so its magnitude depends on the variables' scales.

25
Q

Transformer Architecture

A

Contains an encoder stack and a decoder stack, each built from layers combining multi-head self-attention, position-wise feed-forward networks, residual connections, and layer normalization. Positional encodings are added to the input embeddings, since the architecture itself has no recurrence or convolution.

26
Q

Word Embeddings with Self-Attention

A

The attention mechanism adds contextual information to word embeddings so the model can derive each word's meaning in context.
E.g. in “Bank of a river,” self-attention identifies the correlation between the terms “bank” and “river.”
Each word in a sentence is compared with every other word, and its embedding is reweighted by the relevance of each word to its own meaning.

27
Q

Self Attention block steps.

A

1. Dot-product similarity between embeddings to get alignment scores.
2. Normalization of the scores (e.g. softmax) to get attention weights.
3. Reweighting of the original embeddings with those weights.
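
A minimal numpy sketch of these three steps (simplified: no learned query/key/value projections, which a real Transformer adds):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# 4 tokens with 3-dimensional embeddings (random stand-ins for word vectors).
E = np.random.default_rng(0).normal(size=(4, 3))

scores = E @ E.T           # 1. dot-product similarity: alignment scores
weights = softmax(scores)  # 2. normalize each row into attention weights
context = weights @ E      # 3. reweight the embeddings (weighted sum)
print(context.shape)       # (4, 3): one context-aware vector per token
```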

28
Q

Masked Language Modeling (MLM)

A

A few words are masked in the input, and the model predicts them based on the bi-directional context.

29
Q

Next Sentence Prediction (NSP)

A

Two sentences A and B are given. The model has to predict whether B follows A in the corpus or is just a random sentence.

30
Q

What are Activation functions and types?

A

Activation functions apply a non-linearity to a neuron's output, which lets the network model non-linear relationships. Common types: sigmoid, tanh, ReLU (and variants such as Leaky ReLU), and softmax for turning output-layer scores into class probabilities.

31
Q

Fourier Transform

A

The Fourier transform is a method of breaking a signal down into its frequency components. Unlike the Fourier series, it is applicable to non-periodic signals (such as a delta function) and enables such signals to be described in terms of frequency instead of time. The Fourier transform is useful when you are working with a system whose transfer function is known.
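
A minimal numpy sketch: the discrete Fourier transform (via the FFT) recovers the frequency components of a sampled signal.

```python
import numpy as np

fs = 100                     # sampling rate (Hz)
t = np.arange(0, 1, 1 / fs)  # one second of samples
# Signal containing 5 Hz and 12 Hz components.
sig = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

spectrum = np.abs(np.fft.rfft(sig))
freqs = np.fft.rfftfreq(len(sig), d=1 / fs)
# The two largest peaks sit at the component frequencies.
print(freqs[np.argsort(spectrum)[-2:]])  # [12., 5.]
```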

32
Q

t-SNE (t-distributed Stochastic Neighbor Embedding)

A

Non-linear dimensionality reduction algorithm used for exploring high dimensional data.
Maps high-dimensional data to a low-dimensional space (e.g. 2-3 dimensions).
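
A minimal scikit-learn sketch on the bundled digits dataset (perplexity is a tunable hyperparameter; 30 is the default):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)  # 64-dimensional digit images

# Non-linear map from 64 dimensions down to 2 for visualization.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)  # (1797, 2)
```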

33
Q

L1 and L2 Regularization

A

REGULARIZATION - Used to reduce model complexity (and overfitting) when you have a large number of features in the data.

The difference between the two algorithms is the penalty term:
L1 (Lasso): adds the absolute value of the coefficient magnitudes to the loss function.
L2 (Ridge): adds the squared magnitude of the coefficients as the penalty term in the loss function.

Difference in practice:
Lasso can shrink less important feature coefficients exactly to zero, so some features disappear altogether. Good for feature selection when the data has a huge number of features. Ridge shrinks coefficients toward zero but rarely makes them exactly zero.
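
A minimal scikit-learn comparison on synthetic data where only 2 of 10 features matter (the alpha penalty strengths are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 200)  # 8 features are noise

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso zeroes out the irrelevant coefficients; Ridge only shrinks them.
print("lasso:", np.round(lasso.coef_, 2))
print("ridge:", np.round(ridge.coef_, 2))
```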

34
Q

Cross Validation

A

1. Shuffle the dataset.
2. Split the dataset into k groups.
3. For each unique group:
   a. Take the group as the hold-out (test) data set.
   b. Take the remaining groups as the training data set.
   c. Fit a model on the training set and evaluate it on the test set.
   d. Retain the evaluation score and discard the model.
4. Summarize the skill of the model using the sample of model evaluation scores.
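
A minimal scikit-learn sketch of exactly these steps (k = 5 on the bundled iris dataset):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
scores = []

# Shuffle, split into k = 5 groups; each fold serves as the hold-out once.
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # retain the score

# Summarize model skill with the sample of evaluation scores.
print(scores, "mean:", sum(scores) / len(scores))
```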