Machine Learning Flashcards

1
Q

Explain how SVM works

A

Finds the hyperplane that best separates classes of data. It does this by maximizing the margin between the support vectors (the hardest points to classify).

Hard Margin - Requires linearly separable data
Soft Margin - Allows for misclassification

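A minimal scikit-learn sketch (synthetic data; the C value is arbitrary):

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    # C controls how soft the margin is: small C tolerates more
    # misclassification, large C approaches a hard margin.
    clf = SVC(kernel="linear", C=1.0).fit(X, y)
    print(clf.support_vectors_.shape)  # the points that define the margin
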
2
Q

What is the typical ML workflow?

A

Define the problem + metric
Collect data
EDA
Data Cleaning / Transformation
Define train/valid/test splits
Build baseline model
Model development
Model deployment
Model monitoring
Iterate

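A minimal sketch of the split + baseline steps, assuming tabular data (synthetic here) and scikit-learn; any real model should have to beat the baseline's score:

    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)
    # Baseline: always predict the majority class.
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    print(baseline.score(X_test, y_test))
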
3
Q

What is PySpark? How have you used it?

A

The Python API for Apache Spark, a distributed data processing engine. It also has machine learning capability (MLlib), streaming, and a module for working with graph data.

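A minimal PySpark sketch, assuming pyspark is installed and run locally:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("demo").getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])
    # Transformations are lazy; work is distributed across executors and
    # only runs when an action like show() is called.
    df.groupBy("key").agg(F.sum("value").alias("total")).show()
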
4
Q

What are the services on Azure/GCP/AWS that are most commonly used?

A

Compute - Virtual Machines (Azure) / Compute Engine (GCP) / EC2 (AWS)
Storage - Blob Storage (Azure) / Cloud Storage (GCP) / S3 (AWS)

5
Q

Explain backpropagation

A

Backpropagation is how a neural network updates its weights. It computes the partial derivative of the loss function with respect to each parameter in the network via the chain rule.

One training step consists of a forward pass, a backward pass, and a weight update (gradient descent).

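A minimal sketch of one such step for a single linear neuron with squared error loss (plain Python; frameworks apply the same chain rule automatically):

    x, y_true, w, b, lr = 2.0, 1.0, 0.5, 0.0, 0.1
    # Forward pass
    y_pred = w * x + b
    loss = (y_pred - y_true) ** 2
    # Backward pass (chain rule)
    grad_w = 2 * (y_pred - y_true) * x   # dL/dw
    grad_b = 2 * (y_pred - y_true)       # dL/db
    # Weight update (gradient descent)
    w -= lr * grad_w
    b -= lr * grad_b
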
6
Q

What are some different optimizers? Explain a few of them?

A

Adam, SGD, SGD w/ Momentum, Adan, Lion.

SGD - Step opposite the gradient computed on the current (mini-)batch
Momentum - Accumulates past gradients like a ball rolling down a hill, smoothing noisy updates
AdaGrad - Scales the LR per parameter by the accumulated squared gradients
Adam - Combines momentum (first moment) with AdaGrad-style scaling by second moments; adjusts the LR for each parameter individually

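A sketch of the SGD and momentum update rules, where grad(w) is a hypothetical stand-in for whatever computes the gradient:

    import numpy as np

    w = np.zeros(3)
    v = np.zeros_like(w)      # velocity for momentum
    lr, beta = 0.01, 0.9

    def grad(w):              # hypothetical gradient of some loss
        return 2 * w + 1

    for _ in range(100):
        g = grad(w)
        # Plain SGD would be: w -= lr * g
        v = beta * v - lr * g   # momentum accumulates past gradients
        w += v
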
7
Q

What are some ways you can normalize your data?

A

StandardScaler, min-max scaling, log scaling, power transformation, quantile transformation. (One-hot encoding is an encoding for categorical features rather than normalization, though it often sits in the same preprocessing step.)

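A minimal scikit-learn sketch of a few of these on skewed synthetic data:

    import numpy as np
    from sklearn.preprocessing import (MinMaxScaler, PowerTransformer,
                                       QuantileTransformer, StandardScaler)

    X = np.random.lognormal(size=(100, 2))             # skewed data
    X_std = StandardScaler().fit_transform(X)          # zero mean, unit variance
    X_mm = MinMaxScaler().fit_transform(X)             # squashed into [0, 1]
    X_log = np.log1p(X)                                # log scaling for skew
    X_pow = PowerTransformer().fit_transform(X)        # Yeo-Johnson by default
    X_q = QuantileTransformer(n_quantiles=100).fit_transform(X)
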
8
Q

How do you handle missing values?

A

Depends on the data. Some options (a short sketch follows the list):

  • Fill w/ Mode/Mean/Median
  • Drop row/column
  • Manually input true value based on other columns or rows
  • Use model-assisted imputation (e.g. k-means, KNN)
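
A minimal pandas/scikit-learn sketch of a few of these options (hypothetical toy columns):

    import numpy as np
    import pandas as pd
    from sklearn.impute import KNNImputer

    df = pd.DataFrame({"age": [25, np.nan, 40], "income": [50.0, 60.0, np.nan]})
    filled = df.fillna(df.median())   # fill with the median
    dropped = df.dropna()             # or drop incomplete rows
    knn = KNNImputer(n_neighbors=2).fit_transform(df)   # model-assisted
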
9
Q

What are the assumptions required for linear regression?

A

Linearity: the X-Y relationship is linear
Homoscedasticity: the variance of the residuals is constant across X
Independence: each observation is independent
Normality: Y (equivalently, the residuals) is normally distributed for any value of X

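A sketch of checking the assumptions from the residuals (synthetic data; with real data you would also plot residuals against fitted values for linearity/homoscedasticity):

    import numpy as np
    from scipy import stats
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 1))
    y = 3 * X[:, 0] + rng.normal(size=200)
    model = LinearRegression().fit(X, y)
    residuals = y - model.predict(X)
    print(stats.shapiro(residuals))   # normality test on the residuals
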
10
Q

What are some feature selection methods?

A

L1 regularization (look at the magnitude of the coefficients), removing highly correlated features, choosing a model that does it for you (GBDTs, NNs), greedy methods (forward selection: start w/ 0 features and add; backward elimination: start with all features and remove).

Feature importance: SHAP values, permutation importance, Optuna/LGBM built-in methods.

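A minimal sketch of two of these (L1 coefficients and permutation importance) on synthetic data:

    from sklearn.datasets import make_regression
    from sklearn.inspection import permutation_importance
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)
    lasso = Lasso(alpha=1.0).fit(X, y)
    print(lasso.coef_)   # L1 drives uninformative features to exactly 0
    result = permutation_importance(lasso, X, y, n_repeats=10, random_state=0)
    print(result.importances_mean)   # score drop when a feature is shuffled
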
11
Q

How do you avoid overfitting?

A

Stratified k-fold cross validation or a train/val/test split. Make sure the split is sound: account for data shift and watch for data leaks (e.g. the same patient_id in both train and validation). A k-fold sketch follows the list below.

  • Like to use lightweight models
  • Data augmentations
  • Like to add light L2 penalty to NNs
  • Like to train on lots of data
  • Like to ensemble different models + scalers
  • Like SVMs (harder to overfit than GBDT)
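
A minimal sketch of stratified k-fold validation with a light L2 penalty (synthetic data):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    X, y = make_classification(n_samples=500, random_state=0)
    model = LogisticRegression(C=1.0)   # C is the inverse of the L2 strength
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    print(cross_val_score(model, X, y, cv=cv).mean())
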
12
Q

What is dimensionality reduction?

A

Reducing the number of features (dimensions) in your data while keeping as much information as possible. Helps reduce the computational requirements of model training.

One example of this is a CNN trained on ImageNet. The model essentially creates a ~1K-2K dimensional vector from a 3x256x256 dimensional image. The same can be said for BERT-style language models.

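A minimal PCA sketch: project the 64-dimensional digits dataset onto 2 components:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X, _ = load_digits(return_X_y=True)
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)                  # (1797, 64) -> (1797, 2)
    print(pca.explained_variance_ratio_.sum())   # variance kept by 2 components
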
13
Q

What is A/B testing?

A

Comparing two versions of a model/application by randomly splitting users between them and measuring which performs better on some predefined metric.

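A minimal sketch of a two-proportion z-test on conversion rates (the counts are made up):

    import numpy as np
    from scipy.stats import norm

    conv_a, n_a = 200, 5000   # hypothetical conversions / visitors for A
    conv_b, n_b = 240, 5000   # hypothetical conversions / visitors for B
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    print(2 * (1 - norm.cdf(abs(z))))   # two-sided p-value
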
14
Q

What are some data wrangling and data cleaning steps?

A

Remove outliers
Data cleaning (regex matching, missing value imputation, removing duplicates)
Encoding/hashing sensitive fields (e.g. PatientID, SSN)
Formatting data for input into a SQL-like DB

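A minimal pandas sketch of a few of these steps (hypothetical toy columns):

    import hashlib
    import pandas as pd

    df = pd.DataFrame({"patient_id": ["p1", "p1", "p2"],
                       "bp": ["120/80", "120/80", "bad"]})
    df = df.drop_duplicates()
    # Regex matching: keep only plausible blood-pressure strings.
    df = df[df["bp"].str.match(r"^\d{2,3}/\d{2,3}$")]
    # Hash the sensitive identifier instead of storing it in the clear.
    df["patient_id"] = df["patient_id"].map(
        lambda s: hashlib.sha256(s.encode()).hexdigest())
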
15
Q

Can you provide an example of a data set with a non-Gaussian distribution?

A
  • Coin flips until you get heads (geometric)
  • Distribution of income (heavy-tailed / log-normal)
  • Peak restaurant hours (bimodal across the day)
16
Q

What is the difference between KNN + K-means?

A

KNN
- Supervised, but "lazy": no real training phase, it just stores the labelled data
- Hyperparameters: n_neighbors
- Slower during inference

K-Means
- Unsupervised: needs training to find the cluster centers
- Hyperparameters: n_clusters
- Faster during inference

Both use distance as a measure of similarity.
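
A minimal scikit-learn sketch contrasting the two on the same blobs:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_blobs(n_samples=300, centers=3, random_state=0)
    # KNN: supervised, needs labels y; "training" just stores the data.
    knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
    print(knn.predict(X[:3]))
    # K-means: unsupervised, ignores y; training learns cluster centers.
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.labels_[:3])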

17
Q

What is the 80/20 rule? How important is it to model validation?

A
  • 80% training, 20% validation
  • Very important. Gives a measure of how well the model will generalize to unseen data.
  • If not computationally expensive you can do k-fold cross validation across the entire dataset.
18
Q

Please explain the difference between L1 and L2 regularization methods?

A

L1 (Lasso)
- Penalizes the absolute value of the weights
- Introduces sparsity: weights can be driven exactly to 0 (e.g. usable for feature selection)
- Can have multiple solutions

L2 (Ridge)
- Penalizes the squared value of the weights
- More sensitive to outliers, since the penalty grows quadratically
- Shrinks weights toward 0 but never exactly to 0, so no sparsity in the weight matrix
- One unique solution
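
A minimal sketch showing the sparsity difference on the same synthetic data:

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge

    X, y = make_regression(n_samples=100, n_features=20,
                           n_informative=5, random_state=0)
    lasso = Lasso(alpha=1.0).fit(X, y)
    ridge = Ridge(alpha=1.0).fit(X, y)
    print((lasso.coef_ == 0).sum())   # many coefficients exactly 0
    print((ridge.coef_ == 0).sum())   # typically 0: shrunk, never exactly 0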

19
Q

Can you explain what an FP and a FN are? What would you say is better to have: too many false positives or too many false negatives?

A
  • FP (Type I error) - Predict positive when actually negative
  • FN (Type II error) - Predict negative when actually positive
  • Which is worse depends on the cost of each error: for fraudulent transactions or medical diagnosis a missed case (FN) is usually far costlier, so too many FPs is the better failure mode. For lower-stakes predictions (getting to the bus stop on time, forecasting ice-cream tubs sold in a day) the trade-off matters less.
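
A minimal sketch: the confusion matrix makes the FPs and FNs explicit (toy labels):

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(fp, fn)   # one of each in this toy example
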
20
Q

In your opinion, what is better: An ensemble of 50 small decision trees or a large one?

A
  • It depends. Would want to benchmark model accuracy and performance to say for sure
  • 50 small (get the benefits of ensemble, different input scaling, different features)
  • 1 large (more capacity, picks up more complex relationships between features)
21
Q

Best practices in Data Science?

A
  • Get a bit of domain knowledge before diving in
  • EDA (get a sense of the data)
  • Define a good validation scheme to evaluate model performance
  • Code + model versioning
  • Communication, creativity
  • Have a bit of fun
22
Q

What is cross entropy?

A

Cross entropy is the typical loss function optimized by neural networks in classification problems.

It measures the mismatch between the predicted probability distribution and the true one. Lower = less uncertain / closer to the true labels.
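
A minimal numpy sketch for a single sample (true class 0, predicted probabilities q):

    import numpy as np

    p = np.array([1.0, 0.0, 0.0])   # one-hot true distribution
    q = np.array([0.7, 0.2, 0.1])   # model's predicted probabilities
    print(-np.sum(p * np.log(q)))   # = -log q[true class] ≈ 0.357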

23
Q

Explain variants of gradient descent (stochastic, batch, mini-batch)?

A

stochastic - one data point per update (noisy, takes a long time to converge)
batch - the entire dataset per update (stable but computationally expensive)
mini-batch - a subset of the dataset per update (the benefits of both of the above)
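
A sketch of the three loop shapes, where step() is a hypothetical stand-in that computes a gradient on its batch and updates the weights:

    import numpy as np

    N, batch_size = 1000, 32
    X, y = np.random.randn(N, 5), np.random.randn(N)

    def step(bx, by):   # hypothetical: gradient + weight update on this batch
        pass

    step(X, y)                                 # batch: the whole dataset
    for i in range(N):
        step(X[i:i + 1], y[i:i + 1])           # stochastic: one example
    for i in range(0, N, batch_size):
        step(X[i:i + batch_size], y[i:i + batch_size])   # mini-batch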

24
Q

What is NLP?

A

The area of ML that focuses on building models that understand language the way humans do.

25
Q

What are some of the SOTA NLP models?

A

GPT-3.5, GPT-4, Claude 3, Gemini, Llama, Mixtral 8x7B, Phi, DeBERTa-V3

26
Q

What pre-trained language model have you used?

A

BERT, DeBERTa-V3 (small-XL), Llama 7B-70B variants, Mistral-7B

e5-base, mpnet (sentence transformers)

pix2struct (chart derendering)

27
Q

What are some NLP tasks?

A

Text classification, NER (named entity recognition), POS (part-of-speech) tagging, RAG (retrieval-augmented generation), translation, summarization, text generation, text-to-image, text-to-speech.

28
Q

What is the difference between BERT and GPT style models?

A

BERT (Bidirectional Encoder Representations from Transformers): encoder-only, trained on masked language modelling and next sentence prediction.

GPT (Generative Pre-trained Transformer): decoder-only, trained on causal language modelling (predict the next token).
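
A minimal sketch of the difference in use, assuming the Hugging Face transformers library and downloadable checkpoints:

    from transformers import pipeline

    # BERT-style: fill a masked token using context from both directions.
    fill = pipeline("fill-mask", model="bert-base-uncased")
    print(fill("Paris is the [MASK] of France.")[0]["token_str"])

    # GPT-style: generate the next tokens left to right.
    gen = pipeline("text-generation", model="gpt2")
    print(gen("Paris is the", max_new_tokens=5)[0]["generated_text"])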

29
Q

Can you explain some of the parts of a transformer architecture?

A

Token embeddings plus positional encodings, then a stack of blocks combining multi-head self-attention, feed-forward layers, residual connections, and layer normalization, finished with a task head (e.g. a softmax over the vocabulary).
30
Q

What is the Key, Query, Value?

A

In self-attention each token is projected into a query, a key, and a value vector. A token's query is compared against every key to produce attention scores, which (after softmax) weight a sum of the values. A sketch follows.
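
A minimal numpy sketch of scaled dot-product attention for one head (random toy matrices):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    seq_len, d_k = 4, 8
    Q = np.random.randn(seq_len, d_k)   # queries: what each token looks for
    K = np.random.randn(seq_len, d_k)   # keys: what each token offers
    V = np.random.randn(seq_len, d_k)   # values: the content that gets mixed
    scores = Q @ K.T / np.sqrt(d_k)     # every query against every key
    out = softmax(scores) @ V           # weighted average of the values
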
31
Q

What is lemmatization and stemming?

A

Lemmatization: converting words to their dictionary base form (lemma) using vocabulary and morphology, e.g. better -> good, ran -> run.

Stemming: chopping off prefixes/suffixes by rule, e.g. running -> run.

32
Q

What is tokenization in NLP? Can you explain common tokenizers?

A

The process of splitting text into smaller pieces (tokens). Word-level, character-level, subword-level (typically subword).

BPE: byte-pair encoding. Splits words into characters, then repeatedly merges the most frequent pair until the vocab size is reached.

WordPiece: like BPE, but merges the pair that most increases the likelihood of the training data.

SentencePiece: like BPE (or a unigram LM), but works on raw text with no pre-splitting into words; spaces are treated as ordinary symbols.
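
A minimal sketch, assuming the Hugging Face transformers library (BERT's tokenizer is WordPiece):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    print(tok.tokenize("tokenization"))   # e.g. ['token', '##ization']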

33
Q

What is TF-IDF?

A

Term Frequency-Inverse Document Frequency. Weights a word by how often it appears in a document (TF), discounted by how common the word is across all documents (IDF). Words frequent in one document but rare elsewhere score highest.
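
A minimal scikit-learn sketch on three toy documents:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat", "the dog sat", "the cat ran"]
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)          # sparse (3 docs x vocab) matrix
    print(vec.get_feature_names_out())   # "the" gets a low weight everywhere
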
34
Q

What is an Hidden Markov Model?

A

A model that represents a process as a sequence of hidden states which emit observable outputs.

Transitions satisfy the Markov property: the probability of the next state depends only on the current state, not on how you got there.

35
Q

What is an RNN?

A

Recurrent neural network: processes a sequence one step at a time while carrying a hidden state. GRU and LSTM are common variants.

Pros: shares parameters across every step in the sequence, works with any sequence length.

Cons: vanishing gradients, sometimes loses information from earlier in the sequence, cannot be processed in parallel across time steps.
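
A minimal numpy sketch of a vanilla RNN step: the same weights are reused at every position, and each step depends on the previous hidden state:

    import numpy as np

    d_in, d_h = 3, 5
    Wx = np.random.randn(d_h, d_in)
    Wh = np.random.randn(d_h, d_h)
    b = np.zeros(d_h)
    h = np.zeros(d_h)
    for x_t in np.random.randn(10, d_in):    # any sequence length works
        h = np.tanh(Wx @ x_t + Wh @ h + b)   # sequential over time steps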

36
Q

What are some of the pros + cons of transformers?

A

Pros: trained in parallel, self-attention sees the entire sequence, lots of pre-trained variants.

Cons: computationally expensive (GPUs, training data, inference time). Due to these constraints there is a high environmental impact.
