Launching into Machine Learning Flashcards

1
Q

Can you use categorical features for ML model training?

A

You cannot use them directly; you have to convert them to a numerical representation (ex. one-hot or multi-hot encoding).
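
A minimal sketch of one-hot and multi-hot encoding in plain Python. The helper names `one_hot` and `multi_hot` and the color vocabulary are made up for illustration; a real pipeline would more likely use a library encoder.

```python
def one_hot(value, vocabulary):
    """Return a vector with a single 1 at the position of `value`."""
    return [1 if value == item else 0 for item in vocabulary]


def multi_hot(values, vocabulary):
    """Return a vector with a 1 for every vocabulary item present in `values`."""
    present = set(values)
    return [1 if item in present else 0 for item in vocabulary]


# Hypothetical vocabulary of a categorical feature.
colors = ["red", "green", "blue"]

encoded_single = one_hot("green", colors)          # one category per row
encoded_multi = multi_hot(["red", "blue"], colors)  # several categories per row
```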

2
Q

What are the 2 common types of EDA?

A

Univariate – explore a single feature and the distribution of its values
Multivariate – compare multiple features and the relationships between them

3
Q

What type of graphs are used during EDA?

A

Histograms, scatter plots and heat maps

4
Q

List some examples of Data Quality

A
5
Q

What are 3 basic steps in EDA?

A
  1. Understand the data
  2. Clean the data
  3. Analyze the relationships between variables
6
Q

What are MAE, MSE and RMSE, and how do they differ?

A

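MAE – mean absolute error: the average of the absolute differences between predictions and actual values
MSE – mean squared error: the average of the squared differences; it punishes larger errors more heavily than MAE
RMSE – root mean squared error: the square root of MSE; easier to interpret than MSE because the error is expressed in the unit of the predicted value (ex. a deviation of 4 basketball points instead of 16 in MSE)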
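
The three metrics can be sketched in plain Python as follows. The score lists are invented to mirror the basketball example: in the first case every prediction is off by 4 points, in the second case a single prediction is off by 16.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))


actual = [100, 90, 110, 95]

# Every prediction is off by exactly 4 points.
even_errors = [104, 86, 106, 99]
# Three perfect predictions and one that is off by 16 points.
one_big_error = [100, 90, 110, 111]

even_metrics = (mae(actual, even_errors), mse(actual, even_errors), rmse(actual, even_errors))
big_metrics = (mae(actual, one_big_error), mse(actual, one_big_error), rmse(actual, one_big_error))
```

Both cases have the same MAE, but MSE and RMSE are higher in the second case because the single large error is squared.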

7
Q

Why do we have regularization in logistic regression?

A
  • To avoid the vanishing or exploding gradient problem
  • To keep logits away from the asymptotes of the sigmoid, where training can stall
8
Q

How can you prevent overfitting?

A

Regularization and early stopping
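
Early stopping can be sketched as a loop over validation losses that stops once the loss has not improved for a number of epochs. The function name, the `patience` parameter and the loss values are assumptions chosen for illustration, not part of any specific framework.

```python
def early_stopping_epoch(validation_losses, patience=2):
    """Return (best_epoch, best_loss) seen before training would be stopped.

    Training stops once `patience` consecutive epochs pass without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_loss


# Hypothetical validation losses, one per epoch.
losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.5]
stopped_at = early_stopping_epoch(losses, patience=2)
```

With `patience=2` the loop stops at epoch 4 and never sees the later value 0.5, which shows the trade-off when choosing the patience.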

9
Q

Is a confusion matrix available in AutoML?

A

It is available for classification models, but only when the target column has 10 or fewer values.

10
Q

Which types of data does AutoML support?

A

Tabular, Image, Video, Text

11
Q

What is the maximum number of time steps that AutoML Forecast supports?

A

3000

12
Q

What is data leakage?

A

Data leakage is when you train on a feature that is highly correlated with the target you are trying to predict but is not available during inference (ex. predicting whether a customer will sign up while using their sign-up payment transaction for training). The model will perform well during testing but will most probably not perform as well once it is deployed.

13
Q

What is training-serving skew?

A

Training-serving skew is when the input features used during training differ from the features available during model serving (ex. the model is trained with hourly data but only weekly data is available during serving).

14
Q

If precision and recall are good at a certain threshold for all classes except one (in a multi-class classification problem), how would you resolve this?

A

You can change the threshold only for that one class.
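
A minimal sketch of a per-class threshold override. The function `predict_labels`, the class names and the confidence scores are hypothetical; they only illustrate that one class can get its own threshold while the others keep the default.

```python
def predict_labels(scores, thresholds, default_threshold=0.5):
    """Return the classes whose score reaches that class's threshold."""
    return [
        label
        for label, score in scores.items()
        if score >= thresholds.get(label, default_threshold)
    ]


# Hypothetical per-class confidence scores for one example.
scores = {"cat": 0.62, "dog": 0.55, "bird": 0.35}

default_only = predict_labels(scores, {})         # same threshold for all classes
tuned = predict_labels(scores, {"bird": 0.3})     # lower threshold only for "bird"
```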

15
Q

For AutoML image and video, what is the minimum number of videos/images for one class relative to the number of videos/images in other classes?

A

The class with the fewest videos/images should have at least 10% as many training examples as the class with the most videos/images.

16
Q

What do you need to do when preparing video data?

A

You need to assign bounding boxes (if needed) and classes (ex. select a ball and assign a label “ball”)

17
Q

What parameters are used in AutoML video?

A

Frame rate – important for motion changes (ex. slow walking with low FPS can look like running)
Resolution – important for object tracking, recommended resolution is 256p
Prediction type – what you are trying to predict

18
Q

What types of problems can you solve with Tabular AutoML?

A

Binary classification, multi-class classification (predict one class out of three or more), regression, forecasting

19
Q

What solutions are available for forecasting problems?

A

AutoML forecasting, BigQuery ML forecasting with ARIMA_PLUS, forecasting with Prophet

20
Q

What models are available within AutoML forecasting?

A
  • Temporal Fusion Transformer (TFT) – attention-based DNN model
  • Time Series Dense Encoder (TiDE) – based on an encoder-decoder model. Fast training and inference, especially for long contexts and horizons
  • AutoML (L2L) – good for a wide variety of use cases
  • Seq2Seq+ – good for initial experimentation as it has a simple architecture. Works well with shorter time budgets and data sizes up to 1 GB.
21
Q

When should you use BigQuery ML ARIMA_PLUS?

A

When you need fast training with many quick iterations, or an inexpensive baseline model. It uses ARIMA, a univariate forecasting model, in the background.

22
Q

What is the difference between Prophet and BQ ML ARIMA_PLUS?

A

Both attempt to decompose a time series into trend, seasonality and holiday components and then combine the predictions of these components. Prophet fits a curve using a linear or logistic model. Its benefit compared to ARIMA_PLUS is that you can select the hardware used for training.

23
Q

Is Prophet a univariate or multivariate model?

A

It is a multivariate model, but GCP offers only a univariate version of it.

24
Q

What types of predictions are available for forecasting models?

A

Only batch prediction.

25
Q

What are Tabular Workflows?

A

They are pre-built MLOps pipelines within Vertex AI Pipelines (Kubeflow templates) that can be used for:
- Feature Engineering (detect the most important features, rank them, and transform features to ensure consistent input for training and inference)
- AutoML – pipelines for using AutoML models
- TabNet – pipeline for the TabNet architecture (uses sequential attention)
- Wide & Deep – pipeline for the Wide & Deep architecture (jointly trains wide linear models and deep neural networks)

26
Q

When should you use CSV and when BigQuery as a data source for model training?

A

It depends on where your data resides, but in general, if the data doesn't require nested fields or more complex structures, a CSV file is enough.

27
Q

How can you inform Vertex AI that one feature is more important than another?

A

By default, all features are equally important during training. You can manually add a weight, a number between 0 and 10,000, to rank the importance of features. If you add weights manually, you must do it for all features.

28
Q

Which AutoML tabular models can you export and deploy on on-premises infrastructure?

A

Both regression and classification but not forecast

29
Q

Is Vertex Explainable AI supported for exported tabular models?

A

No, you have to serve them within Vertex AI

30
Q

What is a context window in time series?

A

How far back from the current data point in the time series the model looks in order to create a prediction.

31
Q

What is the forecast horizon in time series?

A

How far into the future the model predicts.
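
The context window and the forecast horizon can be pictured as a sliding window over the series: the window supplies the inputs and the horizon supplies the values to predict. The sketch below assumes a hypothetical helper `make_windows` and invented daily sales numbers.

```python
def make_windows(series, context_window, forecast_horizon):
    """Slice a series into (past values, future values) training examples."""
    examples = []
    last_start = len(series) - context_window - forecast_horizon
    for start in range(last_start + 1):
        split = start + context_window
        past = series[start:split]                       # what the model looks back on
        future = series[split:split + forecast_horizon]  # what the model must predict
        examples.append((past, future))
    return examples


# Hypothetical daily sales.
daily_sales = [10, 12, 13, 15, 18, 21, 25]
examples = make_windows(daily_sales, context_window=3, forecast_horizon=2)
```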

32
Q

How can you automatically generate Transformations in AutoML forecast?

A

Run Generate statistics.

33
Q

What is the difference between an attribute and a covariate in forecasting?

A

An attribute is a static value that doesn't change over time. It can be treated like metadata that gives additional information about the data point itself (ex. in stock price prediction for a company, attributes are the company location, sector, name and owner).

A covariate that is available at forecast time is a leading indicator. Its values have to be provided for each point in the forecast horizon. Covariates are highly correlated with the target value (ex. suppose you want to forecast daily electricity consumption for a city; the electricity price, temperature and day of the week can be considered covariate features).

34
Q

What is hierarchical forecasting and its purpose?

A

Let's say you want to forecast product sales, and the products are organized into a category hierarchy. The forecasts on the category level and on the product level should add up. The purpose of hierarchical forecasting is to reduce bias. There are three types of bias: total, temporal and group-level bias.

35
Q

What types of data splits are available?

A

Random split: define percentages that add up to 100% (default is 80/10/10 for training/validation/test)

Manual split: add a column where every row contains one of the following case-sensitive values: TRAIN, VALIDATE, TEST

Chronological split: split based on a time column, where the earliest rows are used for training, the next rows for validation and the latest rows for testing
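
A minimal sketch of a chronological split, with the earliest rows going to training and the latest rows to testing. The helper name, the `day` column and the 80/10/10 fractions are assumptions for illustration.

```python
def chronological_split(rows, time_key, train_frac=0.8, validation_frac=0.1):
    """Sort rows by time and cut them into train, validation and test parts."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    train_end = int(len(ordered) * train_frac)
    validation_end = train_end + int(len(ordered) * validation_frac)
    return ordered[:train_end], ordered[train_end:validation_end], ordered[validation_end:]


# Hypothetical rows, deliberately stored newest first.
rows = [{"day": day, "sales": 100 + day} for day in range(10, 0, -1)]
train, validation, test = chronological_split(rows, time_key="day")
```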

36
Q

What is a model evaluation slice?

A

The model evaluation for a specific class or label.

37
Q

How can you use your AutoML models on Android and iOS devices?

A

You need to export your model to a TensorFlow Lite (Android) or Core ML (iOS) format and integrate it within your application.

38
Q

What are 4 basic commands in BQ ML?

A
  • CREATE OR REPLACE MODEL
  • OPTIONS()
  • ML.EVALUATE()
  • ML.PREDICT()
39
Q

What models and features are supported in BQ ML?

A
40
Q

What options are available to use Explainable AI in BQ ML?

A

ML.EXPLAIN_PREDICT() – for each prediction, which features contributed the most
ML.GLOBAL_EXPLAIN() – feature attributions for the whole model (you have to set the enable_global_explain option to TRUE when training the model)

41
Q

When creating a matrix factorization model for a recommender system in BQ ML, what kinds of feedback can you use?

A

Explicit (ex. a rating) or implicit (ex. how long the user stayed on a product detail page)

42
Q

Which option must you set to use hyperparameter tuning in BQ ML?

A

num_trials – how many models will be created (each model has a different combination of hyperparameters)

43
Q

Explain how hyperparameter tuning works.

A

For each hyperparameter that you want to tune, you specify a range (from-to values). Each trial trains a new model with a different combination of hyperparameter values from those ranges. The number of trained models depends on the num_trials option.
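
The idea of trials over from-to ranges can be sketched with a simple random search. This is an illustrative stand-in, not the search algorithm BQ ML actually uses; `tune`, `toy_loss`, the range values and the hyperparameter names are all assumptions.

```python
import random


def tune(search_space, objective, num_trials, seed=0):
    """Try `num_trials` random combinations and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(num_trials):
        # Draw one value per hyperparameter from its (low, high) range.
        params = {name: rng.uniform(low, high) for name, (low, high) in search_space.items()}
        score = objective(params)  # stands in for "train a model and evaluate it"
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_loss(params):
    """Hypothetical loss with its minimum at learning_rate=0.1, l2_reg=1.0."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["l2_reg"] - 1.0) ** 2


space = {"learning_rate": (0.001, 0.5), "l2_reg": (0.0, 5.0)}
best_params, best_score = tune(space, toy_loss, num_trials=20)
```

More trials cost more training time but can only keep or improve the best result found so far.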

44
Q

What is a prerequisite for creating matrix factorization model in BQ ML?

A

You need to define slot reservations as this algorithm might be computationally intensive.

45
Q

What approach can you use when there is not much training data?

A

Every data point is important, so use cross-validation: split the data into training and validation sets multiple times.
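
One common way to split the data multiple times is k-fold cross-validation, sketched below on row indices. The helper `k_fold_indices` is hypothetical; each row ends up in the validation set exactly once.

```python
def k_fold_indices(n_samples, k):
    """Return k (training indices, validation indices) pairs."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        training = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((training, validation))
        start = stop
    return folds


folds = k_fold_indices(n_samples=6, k=3)
```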

46
Q

How do you do a train, validation and test split in BQ?

A

You have to generate a unique id for each row, for example by hashing the row data, an id column or a combination of columns with the FARM_FINGERPRINT() function. FARM_FINGERPRINT() returns an INT64 number that is always the same for the same input value. Taking the modulo by a number like 10 (on the absolute value, since the hash can be negative) splits the dataset into 10 buckets, which you can then assign to the training, validation and test sets.
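
The same bucketing idea can be sketched in Python. FARM_FINGERPRINT() is not part of the Python standard library, so this sketch swaps in SHA-256 from `hashlib` as the deterministic hash; the 8/1/1 bucket assignment and the row keys are assumptions for illustration.

```python
import hashlib


def bucket(key, num_buckets=10):
    """Map a row key to a stable bucket number in [0, num_buckets)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then take the modulo.
    return int.from_bytes(digest[:8], "big") % num_buckets


def assign_split(key):
    """Buckets 0-7 go to training, 8 to validation, 9 to test."""
    b = bucket(key)
    if b < 8:
        return "TRAIN"
    if b == 8:
        return "VALIDATE"
    return "TEST"


# Hypothetical row keys, e.g. a date combined with an id column.
keys = [f"2024-01-{day:02d}|customer-{day}" for day in range(1, 31)]
splits = {key: assign_split(key) for key in keys}
```

Because the hash depends only on the key, a row always lands in the same split, no matter how often the query or script is rerun.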