Launching into Machine Learning Flashcards
Can you use categorical features for ML model training?
You cannot use them directly; you have to convert them to a numerical representation (ex. one-hot or multi-hot encoding)
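A minimal sketch of converting a categorical feature to a numerical representation, in plain Python. The helper names (`build_vocab`, `one_hot`, `multi_hot`) and the `colors` column are hypothetical and only for illustration; in practice a library encoder would likely be used instead.

```python
def build_vocab(values):
    """Map each distinct category to a column index (sorted for determinism)."""
    return {category: index for index, category in enumerate(sorted(set(values)))}


def one_hot(value, vocab):
    """Encode a single category as a vector with exactly one 1."""
    vector = [0] * len(vocab)
    vector[vocab[value]] = 1
    return vector


def multi_hot(values, vocab):
    """Encode several categories at once: a 1 for every category present."""
    vector = [0] * len(vocab)
    for value in values:
        vector[vocab[value]] = 1
    return vector


colors = ["red", "green", "blue", "green"]  # hypothetical feature column
vocab = build_vocab(colors)                 # {'blue': 0, 'green': 1, 'red': 2}
encoded_green = one_hot("green", vocab)                # [0, 1, 0]
encoded_red_blue = multi_hot(["red", "blue"], vocab)   # [1, 0, 1]
```

One-hot fits a feature that takes exactly one value per example; multi-hot fits a feature that can take several values at once.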
What are the 2 common types of EDA?
Univariate – explore a single feature and the distribution of its values
Multivariate – compare multiple features and the relationships between them
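The two kinds of EDA can be sketched with the standard library alone. The data (`heights_cm`, `weights_kg`) and the helpers `summarize` and `pearson` are made up for illustration: the first looks at one feature in isolation, the second at how two features move together.

```python
from statistics import mean, pstdev

# Hypothetical data set with two numeric features.
heights_cm = [150.0, 160.0, 170.0, 180.0]
weights_kg = [50.0, 60.0, 70.0, 80.0]


def summarize(values):
    """Univariate: describe the distribution of a single feature."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "std": pstdev(values),
    }


def pearson(xs, ys):
    """Multivariate: Pearson correlation between two features."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / (var_x * var_y) ** 0.5


height_summary = summarize(heights_cm)
height_weight_corr = pearson(heights_cm, weights_kg)
```

A correlation close to 1 or -1 suggests a strong linear relationship between the two features; a value near 0 suggests none.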
What types of graphs are used during EDA?
Histograms, scatter plots and heat maps
List some examples of Data Quality
What are 3 basic steps in EDA?
- Understand the data
- Clean the data
- Analyze the relationships between variables
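What are MAE, MSE and RMSE, and how do they differ?
MAE – mean absolute error; the average of the absolute differences between predictions and actual values
MSE – mean squared error; penalizes large errors more heavily than MAE because each error is squared
RMSE – root mean squared error (the square root of MSE); easier to interpret than MSE because the error is in the same unit as the prediction (ex. a deviation of 4 basketball points instead of 16 with MSE)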
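A small sketch of the three metrics in plain Python. The basketball numbers are made up to reproduce the 4-versus-16 example above, and the second pair of inputs shows how squaring punishes a single large error.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: MSE brought back to the original unit."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical basketball scores: every prediction is off by 4 points.
actual_points = [100, 100, 100, 100]
predicted_points = [104, 96, 104, 96]
print(mae(actual_points, predicted_points))   # 4.0 points
print(mse(actual_points, predicted_points))   # 16.0 squared points
print(rmse(actual_points, predicted_points))  # 4.0 points

# Same total absolute error, spread evenly vs. concentrated in one prediction.
truth = [0, 0, 0, 0]
even_errors = [2, 2, 2, 2]
one_large_error = [0, 0, 0, 8]
```

Both `even_errors` and `one_large_error` have the same MAE against `truth`, but the MSE of `one_large_error` is four times larger.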
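Why do we need regularization in logistic regression?
- It helps avoid vanishing or exploding gradients caused by weights growing very large
- It keeps the logits from growing so large that the sigmoid output sits near its asymptotes (0 and 1), where gradients shrink and training can stall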
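To see why the asymptotes matter, the sketch below evaluates the gradient of the sigmoid at a small and a large logit, and shows an L2 penalty as one common form of regularization. The function names and the `lam` strength are illustrative assumptions, not taken from any particular library.

```python
import math


def sigmoid(z):
    """Logistic function: squashes a logit into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_gradient(z):
    """Derivative of the sigmoid with respect to the logit z."""
    s = sigmoid(z)
    return s * (1.0 - s)


def l2_penalty(weights, lam):
    """L2 regularization term added to the loss; grows with weight magnitude."""
    return lam * sum(w * w for w in weights)


gradient_at_center = sigmoid_gradient(0.0)      # largest possible gradient
gradient_near_asymptote = sigmoid_gradient(10.0)  # several thousand times smaller

# Large weights produce large logits; the penalty makes them costly.
small_weights_cost = l2_penalty([0.3, -0.4], lam=0.1)
large_weights_cost = l2_penalty([30.0, -40.0], lam=0.1)
```

Because the penalty grows with the square of the weights, the optimizer is discouraged from pushing logits far out toward the flat regions of the sigmoid.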
How do you prevent overfitting?
Regularization and early stopping
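Early stopping can be sketched as a loop over validation losses that gives up after a fixed number of epochs without improvement. The function name, the `patience` parameter and the loss values are hypothetical; real training frameworks ship their own variants of this logic.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return the index of the best epoch seen before training was stopped.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch = 0
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training; later epochs are never run
    return best_epoch


# Hypothetical validation losses: improvement stalls after epoch 2.
losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.50]
kept_epoch = early_stopping_epoch(losses, patience=2)
```

With `patience=2` the loop stops after epoch 4 and keeps epoch 2, so the later value 0.50 is never reached; a larger patience trades extra training time for the chance to find it.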
Is a confusion matrix available in AutoML?
Yes, for classification models, but only when the target column has 10 or fewer distinct values
Which types of data does AutoML support?
Tabular, Image, Video, Text
What is the maximum number of time steps that AutoML Forecast supports?
3000
What is data leakage?
Data leakage is when you train on a feature that is highly correlated with the target you are trying to predict but that will not be available at inference time (ex. predicting whether a customer will sign up while training on their sign-up payment transaction). The model will perform well during testing but will most likely perform worse once it is deployed.
What is training-serving skew?
Training-serving skew is when the input features used during training differ from the features available during model serving (ex. the model is trained on hourly data but only weekly data is available during serving)
If precision and recall are good at a certain threshold for all classes except one (in a multi-class classification problem), how would you resolve this?
You can change the threshold only for that one class.
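A sketch of a per-class threshold override, assuming each class gets its own independent score (one-vs-rest style). The `predict_labels` helper, the class names and the scores are all made up for illustration.

```python
def predict_labels(scores, thresholds, default_threshold=0.5):
    """Return every class whose score reaches its own threshold.

    `thresholds` only needs entries for classes that deviate from the default.
    """
    return [
        label
        for label, score in scores.items()
        if score >= thresholds.get(label, default_threshold)
    ]


# Hypothetical per-class scores for one example.
scores = {"cat": 0.80, "dog": 0.55, "bird": 0.35}

# Same threshold for every class: "bird" is missed.
with_default = predict_labels(scores, thresholds={})

# Lower the threshold only for the problematic class.
with_override = predict_labels(scores, thresholds={"bird": 0.30})
```

Lowering the threshold for one class raises its recall at the cost of precision for that class only; the other classes keep their behavior.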
For AutoML image and video, what is the minimum number of videos/images for one class relative to the other classes?
The class with the fewest videos should have at least 10% as many training examples as the class with the most videos.
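The 10% guideline can be checked before uploading a data set. This is only a sketch: the `underrepresented_classes` helper and the label counts are hypothetical, and `min_ratio` simply encodes the 10% figure from the card above.

```python
from collections import Counter


def underrepresented_classes(labels, min_ratio=0.10):
    """Return classes with fewer than `min_ratio` times the largest class's examples."""
    counts = Counter(labels)
    largest = max(counts.values())
    return sorted(
        label for label, count in counts.items() if count < min_ratio * largest
    )


# Hypothetical labels: 100 "ball", 40 "player" and only 5 "referee" examples.
labels = ["ball"] * 100 + ["player"] * 40 + ["referee"] * 5
too_small = underrepresented_classes(labels)
```

Here "referee" has 5 examples against 100 for the largest class, so it falls under the 10% mark and would need more examples.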
What do you need to do when preparing video data?
You need to assign bounding boxes (if needed) and classes (ex. select a ball and assign the label “ball”)
What parameters are used in AutoML video?
Frame rate – important for motion changes (ex. slow walking with low FPS can look like running)
Resolution – important for object tracking, recommended resolution is 256p
Prediction type – what you are trying to predict
What types of problems can you solve with Tabular AutoML?
Binary classification, multi-class classification (predict one out of three or more classes), regression, forecasting