Machine Learning Flashcards
Explain how SVM works
Finds the hyperplane that best separates the classes. It does so by maximizing the margin around the support vectors (the points closest to the decision boundary, i.e. the hardest to classify).
Hard Margin - requires the data to be linearly separable
Soft Margin - allows for some misclassification
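A minimal scikit-learn sketch (the toy dataset and C values are illustrative); C controls how soft the margin is, with a large C approximating a hard margin:

```python
# Toy linear SVM: large C ~ hard margin, small C tolerates misclassification.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
hard_ish = SVC(kernel="linear", C=1e3).fit(X, y)   # near-hard margin
soft = SVC(kernel="linear", C=0.1).fit(X, y)       # soft margin
print(hard_ish.support_vectors_.shape, soft.support_vectors_.shape)
```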
What is the typical ML workflow?
Define the problem + metric
Collect data
EDA
Data Cleaning / Transformation
Define train/valid/test splits
Build baseline model
Model development
Model deployment
Model monitoring
Iterate
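A minimal sketch of the split and baseline steps above, assuming a generic tabular classification problem (the synthetic data and accuracy metric are placeholders):

```python
# Split the data, fit a trivial baseline, and score it so later models have a reference point.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
```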
What is Pyspark? How have you used this?
PySpark is the Python API for Apache Spark, a distributed data processing engine. It also has machine learning capability (MLlib), streaming, and a module for working with graph data.
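A minimal PySpark sketch (the file path and column names are hypothetical):

```python
# Read a CSV into a distributed DataFrame and compute a per-user aggregate.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.csv("events.csv", header=True, inferSchema=True)
agg = df.groupBy("user_id").agg(F.count("*").alias("n_events"))
agg.show(5)
```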
What are the services on Azure/GCP/AWS that are most commonly used?
Compute - Virtual Machines / Compute Engine / EC2
Storage - Blob Storage / Cloud Storage / S3
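A minimal sketch of using the AWS storage service via boto3 (the bucket and key names are hypothetical, and credentials are assumed to be configured):

```python
# Upload a local file to S3.
import boto3

s3 = boto3.client("s3")
s3.upload_file("train.csv", "my-ml-bucket", "data/train.csv")
```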
Explain backpropagation
Backpropagation is how a neural network updates its weights. It computes the partial derivatives of a loss function with respect to each parameter in the network by applying the chain rule.
This is done with a forward pass, a backward pass, and a weight update (gradient descent).
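A minimal PyTorch sketch of the three steps on a one-layer network (the data and learning rate are arbitrary):

```python
import torch

x = torch.randn(8, 3)                  # batch of 8 examples, 3 features
y = torch.randn(8, 1)
W = torch.randn(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

pred = x @ W + b                       # forward pass
loss = ((pred - y) ** 2).mean()        # MSE loss
loss.backward()                        # backward pass: computes dloss/dW, dloss/db

with torch.no_grad():                  # weight update (gradient descent step)
    W -= 0.1 * W.grad
    b -= 0.1 * b.grad
```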
What are some different optimizers? Explain a few of them?
Adam, SGD, SGD w/ Momentum, Adan, Lion.
SGD - take a step against the gradient computed on the current (mini-)batch
Momentum - accumulate past gradients into a velocity term, like a ball rolling down a hill
AdaGrad - scales the learning rate per parameter based on the accumulated squared gradients
Adam - scales the learning rate per parameter using running estimates of the first and second moments of the gradients (effectively momentum + adaptive scaling)
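A rough NumPy sketch of two of the update rules (simplified; real implementations add details like weight decay and schedules):

```python
import numpy as np

def sgd_momentum(w, grad, v, lr=0.01, beta=0.9):
    v = beta * v + grad                    # velocity: "ball rolling down a hill"
    return w - lr * v, v

def adam(w, grad, m, s, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # first moment (running mean of gradients)
    s = b2 * s + (1 - b2) * grad ** 2      # second moment (running mean of squared gradients)
    m_hat = m / (1 - b1 ** t)              # bias correction
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s   # per-parameter scaled step
```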
What are some ways you can normalize your data?
StandardScaler, MinMax scaling, log scaling, power transformation, quantile transformation; one-hot encoding for categorical features (an encoding step rather than normalization).
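A minimal scikit-learn sketch of a few of these on a skewed toy feature matrix:

```python
import numpy as np
from sklearn.preprocessing import (MinMaxScaler, PowerTransformer,
                                   QuantileTransformer, StandardScaler)

X = np.random.lognormal(size=(100, 2))                  # skewed, positive features
X_std = StandardScaler().fit_transform(X)               # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X)              # rescale to [0, 1]
X_power = PowerTransformer().fit_transform(X)           # make the data more Gaussian-like
X_quant = QuantileTransformer(n_quantiles=100).fit_transform(X)  # map to uniform quantiles
```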
How do you handle missing values?
Depends.
- Fill w/ Mode/Mean/Median
- Drop row/column
- Manually input true value based on other columns or rows
- Use model-assisted imputation (e.g. KNN or k-means based imputers)
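A minimal sketch of a few options with pandas and scikit-learn (the toy columns are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50_000, 60_000, np.nan, 52_000]})

filled = df.fillna(df.median(numeric_only=True))           # fill with column medians
dropped = df.dropna()                                      # or drop incomplete rows
knn_filled = KNNImputer(n_neighbors=2).fit_transform(df)   # model-assisted imputation
```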
What are the assumptions required for linear regression?
Linearity: X - Y relationship is linear
Homoscedasticity: the variance of the residuals is constant across X
Independence: each observation is independent of the others
Normality: Y (equivalently, the residuals) is normally distributed for any value of X
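A minimal sketch of checking these via the residuals of a fitted model (synthetic data for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=200)    # linear relationship + Gaussian noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)
# If the assumptions hold, residuals look like mean-zero, constant-variance noise across X.
print(residuals.mean(), residuals.std())
```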
What are some feature selection methods?
L1 regularization (look at the magnitude of the coefficients; uninformative ones shrink to zero), remove highly correlated features, choose a model that does it for you (GBDTs, NNs), greedy methods (forward selection: start with 0 features and add; backward elimination: start with all features and remove).
Feature importance: SHAP values, permutation importance, and the built-in importance methods in LightGBM/Optuna.
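A minimal scikit-learn sketch of two of these (L1 coefficients and permutation importance) on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)

# L1 penalty drives uninformative coefficients toward zero
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print("nonzero coefficients:", (l1.coef_ != 0).sum())

# Permutation importance: drop in score when each feature is shuffled
result = permutation_importance(l1, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```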
How do you avoid overfitting?
Stratified k-fold cross-validation or a train/val/test split. Make sure the split is valid (account for data shift, watch for data leaks, e.g. the same patient_id appearing in both train and test).
- Like to use lightweight models
- Data augmentations
- Like to add light L2 penalty to NNs
- Like to train on lots of data
- Like to ensemble different models + scalers
- Like SVMs (harder to overfit than GBDTs)
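A minimal sketch of the cross-validation point, assuming a generic classifier (here logistic regression with a light L2 penalty):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = LogisticRegression(C=1.0)            # C is the inverse L2 penalty strength
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean(), scores.std())           # a large spread across folds hints at instability
```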
What is dimensionality reduction?
Reducing the dimensionality (number of features) of your data. Can help with reducing noise and the computational requirements of model training.
One example is a CNN trained on ImageNet used as a feature extractor: the model turns a 3x256x256 image into a ~1K-2K dimensional embedding vector. The same can be said for BERT-style language models.
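A minimal sketch of the CNN-as-feature-extractor idea with torchvision (ResNet-18 with untrained weights and a random image, just to show the shapes):

```python
import torch
import torchvision

model = torchvision.models.resnet18(weights=None)
model.fc = torch.nn.Identity()              # drop the classification head, keep the embedding
model.eval()

image = torch.randn(1, 3, 256, 256)         # stand-in for a preprocessed image
with torch.no_grad():
    embedding = model(image)
print(embedding.shape)                      # torch.Size([1, 512])
```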
What is A/B testing?
Comparing two versions of a model/application by splitting traffic between them and measuring which performs better on a predefined metric.
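A minimal sketch comparing two variants' conversion rates with a two-sample t-test (the counts are made up; in practice you would also size the experiment up front):

```python
import numpy as np
from scipy import stats

a = np.concatenate([np.ones(120), np.zeros(880)])    # variant A: 12.0% conversion
b = np.concatenate([np.ones(150), np.zeros(850)])    # variant B: 15.0% conversion
t_stat, p_value = stats.ttest_ind(a, b)
print(p_value)   # a small p-value suggests the difference is unlikely to be chance
```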
What are some data wrangling and data cleaning steps?
Remove outliers
Data cleaning (regex matching, missing value imputation, removing duplicates)
Encoding/anonymizing sensitive fields (e.g. PatientID, SSN)
Formatting data for loading into a SQL-like DB
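A minimal pandas sketch of a few of these steps (the columns and values are hypothetical):

```python
import hashlib
import pandas as pd

df = pd.DataFrame({"patient_id": ["A1", "A1", "B2"],
                   "phone": ["555-1234", "555-1234", "(555) 9876"],
                   "cost": [100.0, 100.0, None]})

df = df.drop_duplicates()                                      # remove duplicate rows
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)   # regex cleanup of formats
df["cost"] = df["cost"].fillna(df["cost"].median())            # impute missing values
df["patient_id"] = df["patient_id"].map(                       # hash the sensitive field
    lambda s: hashlib.sha256(s.encode()).hexdigest()[:10])
```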
Can you provide an example of a data set with a non-Gaussian distribution?
- Number of coin flips until you get heads (geometric distribution)
- Distribution of income (heavily right-skewed, roughly log-normal)
- Peak restaurant hours (bimodal, e.g. lunch and dinner peaks)
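A minimal NumPy sketch sampling the first two of these:

```python
import numpy as np

rng = np.random.default_rng(0)
flips_until_heads = rng.geometric(p=0.5, size=10_000)     # geometric distribution
incomes = rng.lognormal(mean=10, sigma=1, size=10_000)    # heavy right tail, income-like
# Peak restaurant hours would be multimodal (e.g., lunch and dinner peaks).
```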