Machine Learning Flashcards
Explain how SVM works
Finds the hyperplane that best separates the classes. It does so by maximizing the margin around the support vectors (the points closest to the decision boundary, i.e. the hardest to classify).
Hard Margin - requires the data to be linearly separable
Soft Margin - allows for some misclassification
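A minimal scikit-learn sketch (the toy dataset and C values are illustrative); C controls how soft the margin is, with a large C approximating a hard margin:

```python
# Toy linear SVM: large C ~ hard margin, small C tolerates misclassification.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
hard_ish = SVC(kernel="linear", C=1e3).fit(X, y)   # near-hard margin
soft = SVC(kernel="linear", C=0.1).fit(X, y)       # soft margin
print(hard_ish.support_vectors_.shape, soft.support_vectors_.shape)
```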
What is the typical ML workflow?
Define the problem + metric
Collect data
EDA
Data Cleaning / Transformation
Define train/valid/test splits
Build baseline model
Model development
Model deployment
Model monitoring
Iterate
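A minimal sketch of the split and baseline steps above, assuming a generic tabular classification problem (the synthetic data and accuracy metric are placeholders):

```python
# Split the data, fit a trivial baseline, and score it so later models have a reference point.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
```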
What is Pyspark? How have you used this?
PySpark is the Python API for Apache Spark, a distributed data processing engine. It also has machine learning capability (MLlib), streaming, and a module for working with graph data.
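A minimal PySpark sketch (the file path and column names are hypothetical):

```python
# Read a CSV into a distributed DataFrame and compute a per-user aggregate.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.csv("events.csv", header=True, inferSchema=True)
agg = df.groupBy("user_id").agg(F.count("*").alias("n_events"))
agg.show(5)
```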
What are the services on Azure/GCP/AWS that are most commonly used?
Compute - Virtual Machines / Compute Engine / EC2
Storage - Blob Storage / Cloud Storage / S3
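A minimal sketch of using the AWS storage service via boto3 (the bucket and key names are hypothetical, and credentials are assumed to be configured):

```python
# Upload a local file to S3.
import boto3

s3 = boto3.client("s3")
s3.upload_file("train.csv", "my-ml-bucket", "data/train.csv")
```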
Explain backpropagation
Backpropagation is how a neural network updates its weights. It computes the partial derivatives of a loss function with respect to each parameter in the network by applying the chain rule.
This is done with a forward pass, a backward pass, and a weight update (gradient descent).
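A minimal PyTorch sketch of the three steps on a one-layer network (the data and learning rate are arbitrary):

```python
import torch

x = torch.randn(8, 3)                  # batch of 8 examples, 3 features
y = torch.randn(8, 1)
W = torch.randn(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

pred = x @ W + b                       # forward pass
loss = ((pred - y) ** 2).mean()        # MSE loss
loss.backward()                        # backward pass: computes dloss/dW, dloss/db

with torch.no_grad():                  # weight update (gradient descent step)
    W -= 0.1 * W.grad
    b -= 0.1 * b.grad
```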
What are some different optimizers? Explain a few of them?
Adam, SGD, SGD w/ Momentum, Adan, Lion.
SGD - take a step against the gradient computed on the current (mini-)batch
Momentum - accumulate past gradients into a velocity term, like a ball rolling down a hill
AdaGrad - scales the learning rate per parameter based on the accumulated squared gradients
Adam - scales the learning rate per parameter using running estimates of the first and second moments of the gradients (effectively momentum + adaptive scaling)
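A rough NumPy sketch of two of the update rules (simplified; real implementations add details like weight decay and schedules):

```python
import numpy as np

def sgd_momentum(w, grad, v, lr=0.01, beta=0.9):
    v = beta * v + grad                    # velocity: "ball rolling down a hill"
    return w - lr * v, v

def adam(w, grad, m, s, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # first moment (running mean of gradients)
    s = b2 * s + (1 - b2) * grad ** 2      # second moment (running mean of squared gradients)
    m_hat = m / (1 - b1 ** t)              # bias correction
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s   # per-parameter scaled step
```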
What are some ways you can normalize your data?
StandardScaler, MinMax scaling, log scaling, power transformation, quantile transformation; one-hot encoding for categorical features (an encoding step rather than normalization).
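A minimal scikit-learn sketch of a few of these on a skewed toy feature matrix:

```python
import numpy as np
from sklearn.preprocessing import (MinMaxScaler, PowerTransformer,
                                   QuantileTransformer, StandardScaler)

X = np.random.lognormal(size=(100, 2))                  # skewed, positive features
X_std = StandardScaler().fit_transform(X)               # zero mean, unit variance
X_minmax = MinMaxScaler().fit_transform(X)              # rescale to [0, 1]
X_power = PowerTransformer().fit_transform(X)           # make the data more Gaussian-like
X_quant = QuantileTransformer(n_quantiles=100).fit_transform(X)  # map to uniform quantiles
```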
How do you handle missing values?
Depends.
- Fill w/ Mode/Mean/Median
- Drop row/column
- Manually input true value based on other columns or rows
- Use model-assisted imputation (e.g. KNN or k-means based imputers)
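A minimal sketch of a few options with pandas and scikit-learn (the toy columns are made up):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50_000, 60_000, np.nan, 52_000]})

filled = df.fillna(df.median(numeric_only=True))           # fill with column medians
dropped = df.dropna()                                      # or drop incomplete rows
knn_filled = KNNImputer(n_neighbors=2).fit_transform(df)   # model-assisted imputation
```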
What are the assumptions required for linear regression?
Linearity: X - Y relationship is linear
Homoscedasticity: the variance of the residuals is constant across X
Independence: each observation is independent of the others
Normality: Y (equivalently, the residuals) is normally distributed for any value of X
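A minimal sketch of checking these via the residuals of a fitted model (synthetic data for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=200)    # linear relationship + Gaussian noise

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)
# If the assumptions hold, residuals look like mean-zero, constant-variance noise across X.
print(residuals.mean(), residuals.std())
```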
What are some feature selection methods?
L1 regularization (look at the magnitude of the coefficients; uninformative ones shrink to zero), remove highly correlated features, choose a model that does it for you (GBDTs, NNs), greedy methods (forward selection: start with 0 features and add; backward elimination: start with all features and remove).
Feature importance: SHAP values, permutation importance, and the built-in importance methods in LightGBM/Optuna.
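A minimal scikit-learn sketch of two of these (L1 coefficients and permutation importance) on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, n_informative=3, random_state=0)

# L1 penalty drives uninformative coefficients toward zero
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
print("nonzero coefficients:", (l1.coef_ != 0).sum())

# Permutation importance: drop in score when each feature is shuffled
result = permutation_importance(l1, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```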
How do you avoid overfitting?
Stratified k-fold cross-validation or a train/val/test split. Make sure the split is valid (account for data shift, watch for data leaks, e.g. the same patient_id appearing in both train and test).
- Like to use lightweight models
- Data augmentations
- Like to add light L2 penalty to NNs
- Like to train on lots of data
- Like to ensemble different models + scalers
- Like SVMs (harder to overfit than GBDTs)
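A minimal sketch of the cross-validation point, assuming a generic classifier (here logistic regression with a light L2 penalty):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = LogisticRegression(C=1.0)            # C is the inverse L2 penalty strength
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean(), scores.std())           # a large spread across folds hints at instability
```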
What is dimensionality reduction?
Reducing the dimensionality (number of features) of your data. Can help with reducing noise and the computational requirements of model training.
One example is a CNN trained on ImageNet used as a feature extractor: the model turns a 3x256x256 image into a ~1K-2K dimensional embedding vector. The same can be said for BERT-style language models.
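A minimal sketch of the CNN-as-feature-extractor idea with torchvision (ResNet-18 with untrained weights and a random image, just to show the shapes):

```python
import torch
import torchvision

model = torchvision.models.resnet18(weights=None)
model.fc = torch.nn.Identity()              # drop the classification head, keep the embedding
model.eval()

image = torch.randn(1, 3, 256, 256)         # stand-in for a preprocessed image
with torch.no_grad():
    embedding = model(image)
print(embedding.shape)                      # torch.Size([1, 512])
```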
What is A/B testing?
Comparing two versions of a model/application by splitting traffic between them and measuring which performs better on a predefined metric.
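A minimal sketch comparing two variants' conversion rates with a two-sample t-test (the counts are made up; in practice you would also size the experiment up front):

```python
import numpy as np
from scipy import stats

a = np.concatenate([np.ones(120), np.zeros(880)])    # variant A: 12.0% conversion
b = np.concatenate([np.ones(150), np.zeros(850)])    # variant B: 15.0% conversion
t_stat, p_value = stats.ttest_ind(a, b)
print(p_value)   # a small p-value suggests the difference is unlikely to be chance
```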
What are some data wrangling and data cleaning steps?
Remove outliers
Data cleaning (regex matching, missing value imputation, removing duplicates)
Encoding/anonymizing sensitive fields (e.g. PatientID, SSN)
Formatting data for loading into a SQL-like DB
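A minimal pandas sketch of a few of these steps (the columns and values are hypothetical):

```python
import hashlib
import pandas as pd

df = pd.DataFrame({"patient_id": ["A1", "A1", "B2"],
                   "phone": ["555-1234", "555-1234", "(555) 9876"],
                   "cost": [100.0, 100.0, None]})

df = df.drop_duplicates()                                      # remove duplicate rows
df["phone"] = df["phone"].str.replace(r"\D", "", regex=True)   # regex cleanup of formats
df["cost"] = df["cost"].fillna(df["cost"].median())            # impute missing values
df["patient_id"] = df["patient_id"].map(                       # hash the sensitive field
    lambda s: hashlib.sha256(s.encode()).hexdigest()[:10])
```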
Can you provide an example of a data set with a non-Gaussian distribution?
- Number of coin flips until you get heads (geometric distribution)
- Distribution of income (heavily right-skewed, roughly log-normal)
- Peak restaurant hours (bimodal, e.g. lunch and dinner peaks)
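A minimal NumPy sketch sampling the first two of these:

```python
import numpy as np

rng = np.random.default_rng(0)
flips_until_heads = rng.geometric(p=0.5, size=10_000)     # geometric distribution
incomes = rng.lognormal(mean=10, sigma=1, size=10_000)    # heavy right tail, income-like
# Peak restaurant hours would be multimodal (e.g., lunch and dinner peaks).
```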