Launching into Machine Learning Flashcards

Question

What are Tabular Workflows?

Answer 1

They are pre-built MLOps pipelines within Vertex AI Pipelines (Kubeflow templates) that can be used to solve problems like:
- Feature Engineering: detects the most important features, ranks them, and transforms features to ensure consistent input for training and inference
- AutoML: pipelines for using AutoML models
- TabNet: pipeline for the TabNet architecture (uses sequential attention)
- Wide & Deep: pipeline for the Wide & Deep architecture (jointly trains wide linear models and deep neural networks)

Answer 2

It depends on where your data resides, but in general, if the data doesn’t require nested fields or more complex structures, a CSV file is enough.

Answer 3

By default, all rows are weighted equally during training. You can manually add a weight column containing a number between 0 and 10,000 to raise or lower the importance of individual rows. If you add weights manually, you must provide one for every row.

Answer 4

Both regression and classification, but not forecasting.

Answer 5

No, you have to serve them within Vertex AI

Answer 6

How far back in the past, counted from the current data point in the time series, the model looks in order to create a prediction.

Answer 7

How far into the future the model predicts.
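
The look-back length from Answer 6 (commonly called the context window) and the prediction length from Answer 7 (the forecast horizon) can be illustrated with a minimal sketch that slices a series into history/future pairs. The function name `make_windows` and the sample data are hypothetical and chosen only for illustration; this is not how Vertex AI builds its training examples internally.

```python
def make_windows(series, context_window, forecast_horizon):
    """Slice a time series into (history, future) pairs."""
    pairs = []
    last_start = len(series) - context_window - forecast_horizon
    for start in range(last_start + 1):
        split = start + context_window
        history = series[start:split]                    # what the model looks back on
        future = series[split:split + forecast_horizon]  # what the model must predict
        pairs.append((history, future))
    return pairs


# Hypothetical daily sales figures.
daily_sales = [10, 12, 13, 15, 14, 18, 20]
pairs = make_windows(daily_sales, context_window=3, forecast_horizon=2)
```

With seven points, a context window of 3, and a horizon of 2, the series yields three pairs; a longer context window gives the model more history per example but fewer examples.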

Answer 8

Run Generate statistics.

Answer 9

An attribute is a static value that doesn’t change over time. It can be treated as metadata that gives additional information about the data point itself (e.g., in stock price prediction for a company, the attributes are the company’s location, sector, name, and owner). A covariate available at forecast time is a leading indicator: its values have to be provided for each point in the forecast horizon. These covariates are highly correlated with the target value (e.g., if you want to forecast daily electricity consumption for a city, the electricity price, temperature, and day of the week can be treated as covariate features).

Answer 10

Let’s say you want to forecast product sales. Products are organized into a category hierarchy. The forecasts at the category level and at the product level should add up. The purpose of this is to reduce bias. There are three types of bias: total, temporal, and group-level bias.
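
To make "should add up" concrete, here is a minimal sketch that measures how far category-level forecasts drift from the sum of their product-level forecasts. All names and numbers are hypothetical; this only illustrates the consistency check and is not the hierarchical forecasting procedure itself.

```python
from collections import defaultdict


def category_totals(product_forecasts, product_to_category):
    """Sum product-level forecasts within each category."""
    totals = defaultdict(float)
    for product, value in product_forecasts.items():
        totals[product_to_category[product]] += value
    return dict(totals)


def group_level_gap(product_forecasts, category_forecasts, product_to_category):
    """Category forecast minus the sum of its products (0 means consistent)."""
    totals = category_totals(product_forecasts, product_to_category)
    return {
        category: forecast - totals.get(category, 0.0)
        for category, forecast in category_forecasts.items()
    }


# Hypothetical forecasts for one day.
product_to_category = {"shirt": "clothing", "jeans": "clothing", "laptop": "electronics"}
product_forecasts = {"shirt": 40, "jeans": 60, "laptop": 25}
category_forecasts = {"clothing": 110, "electronics": 25}

gaps = group_level_gap(product_forecasts, category_forecasts, product_to_category)
```

Here the clothing forecast exceeds the sum of its products by 10, while electronics is consistent; a non-zero gap is the kind of group-level disagreement the answer refers to.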

Answer 11

**Random split**: define percentages that add up to 100% (default is 80/10/10 for training/validation/test)
**Manual split**: add a column where every row contains one of the following case-sensitive values: TRAIN, VALIDATE, TEST
**Chronological split**: split based on a time column, where the earliest rows are used for training, the next rows for validation, and the latest rows for testing
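
A chronological split can be sketched in a few lines, assuming the usual convention that the earliest rows go to training and the latest rows to testing. The function name, the `day` column, and the sample rows are hypothetical, and the 80/10/10 fractions simply mirror the default mentioned above.

```python
def chronological_split(rows, time_key, train_frac=0.8, validation_frac=0.1):
    """Sort rows by time, then cut them into train/validation/test in order."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    train_end = int(len(ordered) * train_frac)
    validation_end = train_end + int(len(ordered) * validation_frac)
    train = ordered[:train_end]                    # earliest rows
    validation = ordered[train_end:validation_end]  # next rows
    test = ordered[validation_end:]                # latest rows
    return train, validation, test


# Hypothetical rows, deliberately out of order.
rows = [{"day": d, "sales": d * 10} for d in (3, 1, 10, 7, 2, 9, 5, 4, 8, 6)]
train, validation, test = chronological_split(rows, time_key="day")
```

Because the test set contains only the most recent rows, the evaluation resembles how the model will be used: predicting data that comes after everything it was trained on.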

Answer 12

Model evaluations of a specific class or a label.

Answer 13

You need to export your model to a TensorFlow Lite (Android) or Core ML (iOS) format and integrate it within your application.

Answer 14

* CREATE OR REPLACE MODEL
* OPTIONS()
* ML.EVALUATE()
* ML.PREDICT()

Answer 15

**ML.EXPLAIN_PREDICT()**: for each prediction, shows which features contributed the most
**ML.GLOBAL_EXPLAIN()**: returns feature attributions for the model as a whole (you have to set the enable_global_explain option to TRUE when training the model)

Answer 16

Explicit (e.g., a rating) or implicit (e.g., how long the user stayed on a product detail page).

Answer 17

**num_trials**: how many models will be trained (each model has a different combination of hyperparameters)

Answer 18

For each hyperparameter that you want to tune, you specify a range of values (from-to). A new model is trained for each combination of hyperparameter values that is tried. The number of trained models depends on the num_trials parameter.
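
The relationship between ranges and num_trials can be sketched with plain random sampling. This is a stand-in for illustration only: the actual tuning service may pick values with a smarter search strategy, and the function name and the two hyperparameter ranges below are assumptions.

```python
import random


def sample_trials(search_space, num_trials, seed=0):
    """Draw one value per hyperparameter range for each trial."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    trials = []
    for _ in range(num_trials):
        trial = {
            name: rng.uniform(low, high)
            for name, (low, high) in search_space.items()
        }
        trials.append(trial)
    return trials


# Hypothetical from-to ranges for two hyperparameters.
search_space = {"learn_rate": (0.001, 0.1), "l2_reg": (0.0, 1.0)}
trials = sample_trials(search_space, num_trials=5)
```

Each entry in `trials` corresponds to one model that would be trained, so the length of the list equals num_trials regardless of how wide the ranges are.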

Answer 19

You need to define slot reservations as this algorithm might be computationally intensive.

Answer 20

Every data point is important, so use cross-validation: split the data into training and validation sets multiple times.
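
Splitting multiple times can be sketched as k-fold cross-validation, where every row lands in the validation set exactly once. The function name and the choice of 10 rows with 3 folds are hypothetical.

```python
def k_fold_indices(num_rows, k):
    """Return k (training, validation) index pairs covering every row once."""
    indices = list(range(num_rows))
    fold_size, remainder = divmod(num_rows, k)
    folds = []
    start = 0
    for fold in range(k):
        # The first `remainder` folds take one extra row each.
        size = fold_size + (1 if fold < remainder else 0)
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        folds.append((training, validation))
        start += size
    return folds


folds = k_fold_indices(num_rows=10, k=3)
```

A model is trained once per fold and the validation scores are averaged, so no data point is permanently held out of training.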

Answer 21

You have to generate a unique id for each row, for example by hashing the row data, an id column, or a combination of columns with the FARM_FINGERPRINT() function. FARM_FINGERPRINT() generates an INT64 hash that is always the same for the same input value. By taking the hash modulo a number such as 10, you split the dataset into 10 buckets, which you can narrow down further.
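
The same idea can be sketched outside SQL. The example below uses SHA-256 from the Python standard library as a stand-in for FARM_FINGERPRINT() (it is not the same hash and will not produce the same numbers); the function names, the row key, and the 8/1/1 bucket assignment are assumptions for illustration.

```python
import hashlib


def bucket(row_key, num_buckets=10):
    """Map a row key to a stable bucket number in [0, num_buckets)."""
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    fingerprint = int.from_bytes(digest[:8], "big")  # unsigned 64-bit stand-in hash
    return fingerprint % num_buckets


def assign_split(row_key):
    """Buckets 0-7 go to TRAIN, bucket 8 to VALIDATE, bucket 9 to TEST."""
    b = bucket(row_key)
    if b < 8:
        return "TRAIN"
    if b == 8:
        return "VALIDATE"
    return "TEST"


# Hypothetical row keys built from an id column.
splits = {key: assign_split(key) for key in ("order-1", "order-2", "order-3")}
```

Because the bucket depends only on the row key, rerunning the split always puts the same row in the same set. Note that FARM_FINGERPRINT() returns a signed INT64, so in SQL the modulo result is usually wrapped in ABS() to avoid negative buckets.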

(46 cards)