All Flashcards

1
Q

List stages of experimentation and prototyping.

A
  • problem refinement,
  • data selection,
  • data exploration,
  • feature engineering,
  • model prototyping, which covers algorithm selection, model training, hyperparameter tuning, and model evaluation.
2
Q

What solutions are available for experimentation?

A

a low-code or no-code solution

3
Q

What if this is a case of one-off development with no need to develop a retraining pipeline?

A

The validated model and its associated metadata and artifacts are registered with the model registry.

4
Q

What is referred to as training operationalization?

A

If the model needs to be retrained repeatedly in the future, an automated training pipeline is also developed; building this pipeline is what training operationalization refers to.

5
Q

What happens when the model is deployed to its target environment as a service?

A

It serves predictions to various consumers in the following forms:
- online inference, which is real time, served as a REST API;
- streaming inference, which is near real time, such as in an event-processing pipeline;
- batch inference, which is offline and usually integrated with your ETL processes;
- embedded inference, which runs on an embedded system.

6
Q

From which sources can Data Catalog catalog data assets?

A
  • BigQuery datasets, tables, and views;
  • Pub/Sub topics;
  • Dataproc Metastore services, databases, and tables;
  • non-GCP data assets: Hive, Oracle, SQL Server, Teradata, Redshift, MySQL, PostgreSQL, Looker, Tableau.
7
Q

What is Dataplex for?

A

Dataplex’s intelligent data fabric enables organizations to centrally manage, monitor, and govern their data across data lakes, data warehouses, and data marts with consistent controls, thus providing access to trusted data and empowering analytics at scale.

8
Q

List advantages of Dataplex.

A
  • Gives you the freedom to store your data wherever you want at the right price and performance.
  • Lets you choose the best analytics tools for the job, including Google Cloud and open-source analytics technologies such as Apache Spark and Presto.
  • Lets you enforce consistent controls across your data to ensure unified security and governance.
  • Provides built-in data intelligence that uses Google’s best-in-class AI/ML capabilities to automate much of the manual toil around data management and give you access to higher-quality data.
9
Q

What is one of the core tenets of Dataplex?

A

Letting you organize and manage your data in a way that makes sense for your business without data movement or duplication.

10
Q

Dataplex provides built-in one-click templates for common data-management tasks.

A

True

11
Q

What is one of the biggest differentiators for Dataplex?

A

Its data-intelligence capabilities, which use Google’s best-in-class AI/ML technologies.

12
Q

What is Analytics Hub for?

A

Analytics Hub lets you exchange data analytics assets across organizations to address challenges of data reliability and cost. You can exchange data, ML models, or other analytics assets and easily publish or subscribe to shared datasets in an open, secure, and privacy-safe environment.

It is a convenient way to build a data ecosystem.

13
Q

List roles in Analytics Hub.

A
  • a data publisher,
  • an exchange administrator,
  • a data subscriber.
14
Q

List Analytics Hub components.

A
  • a publisher project,
  • a subscriber project,
  • the exchange.
15
Q

What are data exchanges?

A

Exchanges are collections of data and analytics assets designed for sharing.

16
Q

What are BigQuery shared datasets?

A

Shared datasets are collections of tables and views in BigQuery that are defined by a data publisher and make up the unit of cross-project or cross-organizational sharing.

17
Q

What is recommended to use with large volumes of unstructured data?

A
  • With large volumes of unstructured data, consider using Dataflow, which uses the Apache Beam programming model.
  • You can use Dataflow to convert the unstructured data into binary data formats like TFRecord, which can improve the performance of data ingestion during training (see the sketch below).
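
A minimal sketch of that conversion, assuming a hypothetical bucket path and a one-feature schema in which each input line becomes one tf.train.Example:

```python
import apache_beam as beam
import tensorflow as tf

def to_tf_example(line: str) -> bytes:
    # Assumed schema: a single bytes feature named "text" per record.
    feature = {
        "text": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[line.encode("utf-8")])
        )
    }
    return tf.train.Example(
        features=tf.train.Features(feature=feature)
    ).SerializeToString()

# Pass DataflowRunner pipeline options to run this on Dataflow instead of locally.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/raw/*.txt")
        | "ToExample" >> beam.Map(to_tf_example)
        | "Write" >> beam.io.WriteToTFRecord("gs://my-bucket/tfrecords/data")
    )
```
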
18
Q

When is Dataproc recommended?

A
  • if your organization has an investment in an Apache Spark code base and skills,
  • if it has existing implementations that use Hadoop with Spark to perform ETL.
19
Q

Use one-off Python scripts for smaller datasets that fit into memory.

A
  • True
  • If you need to perform transformations that are not expressible in Cloud SQL or that are for streaming, you can use a combination of Dataflow and the pandas library.
20
Q

Autoscaling is supported in both Dataflow and Dataproc.

A

True

21
Q

Dataprep can also be considered a data preprocessing option: data can be cleaned, structured, enriched, and validated with Dataprep by Trifacta.

A

True

22
Q

List two important types of TensorFlow parameters.

A
  • learning rate controls the size of the step in the weight space,
  • batch size controls the number of samples that the gradient is calculated on.

Model performance is very sensitive to learning rate and batch size (a minimal sketch of setting both follows).
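
A minimal sketch, assuming a toy regression model and random data, of where the two parameters are set in Keras:

```python
import numpy as np
import tensorflow as tf

# Toy data (assumption): 1024 samples with 8 features each.
x_train = np.random.rand(1024, 8).astype("float32")
y_train = np.random.rand(1024, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),  # step size in weight space
    loss="mse",
)
model.fit(x_train, y_train, batch_size=64, epochs=5)  # samples per gradient step
```
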

23
Q

What is the most common architecture for distributed training?

A

Data parallelism.

24
Q

Describe Data parallelism.

A
  • In data parallelism, you run the same model and computation on every device,
  • but you train each of them using different training samples.
  • Each device computes loss and gradients based on the training samples it sees.
  • The model’s parameters are then updated using these gradients.
  • The updated model is used in the next round of computation.
25
Q

What approaches are used to update the model using gradients from various devices?

A
  • the asynchronous parameter server approach,
  • the synchronous allreduce approach (see the sketch below).
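
A minimal sketch of the synchronous all-reduce approach in TensorFlow, assuming a toy model and dataset; tf.distribute.MirroredStrategy replicates the model across the local devices and all-reduces gradients each step:

```python
import numpy as np
import tensorflow as tf

# Toy dataset (assumption); each replica sees a different slice of every global batch.
dataset = tf.data.Dataset.from_tensor_slices(
    (np.random.rand(1024, 8).astype("float32"),
     np.random.rand(1024, 1).astype("float32"))
).batch(64)

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():  # variables created here are mirrored across devices
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="sgd", loss="mse")

# Gradients are all-reduced across replicas before the shared weights are updated.
model.fit(dataset, epochs=5)
```
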
26
Q

List advantages of using a training service rather than training directly in your notebook instance.

A
  • Instead of training your model directly within your notebook instance, you can submit a training job from your notebook. The training job automatically provisions computing resources and deprovisions them when the job is complete.
  • The training service can help to modularize your architecture: put your training code into a container so that it operates as a portable unit.
  • The training code can export the trained model file, enabling other AI services to work with it in a decoupled manner.
  • The training service also supports reproducibility: each training job is tracked with its inputs, outputs, and the container image used.
  • The training service also supports distributed training, which means that you can train models across multiple nodes in parallel (a job-submission sketch follows).
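
A minimal sketch of submitting such a job with the Vertex AI SDK; the project, bucket, script path, and container image are assumptions (check the available prebuilt training images for your framework version):

```python
from google.cloud import aiplatform

aiplatform.init(
    project="my-project",             # assumption
    location="us-central1",
    staging_bucket="gs://my-bucket",  # assumption
)

# The service provisions the machines, runs the containerized training code,
# and deprovisions everything when the job completes.
job = aiplatform.CustomTrainingJob(
    display_name="demo-training",
    script_path="task.py",  # your training code (assumption)
    container_uri="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-12:latest",
)
job.run(replica_count=1, machine_type="n1-standard-4")
```
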
27
Q

What if a learning rate is too large?

A

A large learning rate value may result in the model learning a sub-optimal set of weights too fast or an unstable training process.

28
Q

What can happen if the learning rate is too small?

A

Training may take a long time.

29
Q

Larger batch sizes require smaller learning rates.

A

True

30
Q

List hyperparameter tuning approaches that Vertex Vizier offers.

A
  • grid search
  • random search
  • Bayesian optimization (the default; see the sketch below)
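
A minimal sketch of a tuning job with the Vertex AI SDK, assuming a training container that reports a metric named `accuracy`; Vizier's Bayesian optimization is used when no other search algorithm is specified:

```python
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-bucket")  # assumptions

# One trial = one run of this custom job (the container image is an assumption).
worker_pool_specs = [{
    "machine_spec": {"machine_type": "n1-standard-4"},
    "replica_count": 1,
    "container_spec": {"image_uri": "gcr.io/my-project/trainer:latest"},
}]
custom_job = aiplatform.CustomJob(
    display_name="trial-job", worker_pool_specs=worker_pool_specs
)

hp_job = aiplatform.HyperparameterTuningJob(
    display_name="demo-tuning",
    custom_job=custom_job,
    metric_spec={"accuracy": "maximize"},  # the training code must report this metric
    parameter_spec={
        "learning_rate": hpt.DoubleParameterSpec(min=1e-4, max=1e-1, scale="log"),
        "batch_size": hpt.DiscreteParameterSpec(values=[32, 64, 128], scale=None),
    },
    max_trial_count=20,
    parallel_trial_count=4,
)
hp_job.run()
```
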
31
Q

What are the advantages of the bayesian optimization hyperparameter tuning method?

A
  • takes into account past evaluations when choosing which hyperparameter set to evaluate next.
  • typically requires fewer iterations to get the optimal set of hyperparameter values
  • limits the number of times a model needs to be trained
32
Q

What is Vertex Vizier?

A

Vertex Vizier is a black-box optimization service that helps you tune hyperparameters in complex machine learning models.

33
Q

Which algorithm is useful if you want to specify a quantity of trials that is greater than the number of points in the feasible space?

A

Grid Search

34
Q

What is a specific characteristic of batch prediction?

A

Batch prediction is asynchronous, which means that the model will wait until it processes all of the prediction requests before returning a CSV file or a BigQuery table with prediction values.
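
A minimal sketch with the Vertex AI SDK; the model resource name and Cloud Storage paths are assumptions:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

model = aiplatform.Model(
    "projects/123/locations/us-central1/models/456"  # assumption
)
job = model.batch_predict(
    job_display_name="demo-batch",
    gcs_source="gs://my-bucket/instances.jsonl",
    gcs_destination_prefix="gs://my-bucket/predictions/",
    sync=False,  # the job runs asynchronously in the background
)
job.wait()  # results land under the destination prefix once all requests are processed
```
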

35
Q

What are specifics of online prediction?

A
  • useful if your model is part of an application and parts of your system depend on a quick prediction turnaround;
  • synchronous (real time), which means that it quickly returns a prediction but accepts only one prediction request per API call (see the sketch below).
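
A minimal sketch of a synchronous online call with the Vertex AI SDK, assuming an already-deployed endpoint and a hypothetical feature payload:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

endpoint = aiplatform.Endpoint(
    "projects/123/locations/us-central1/endpoints/789"  # assumption
)
# One request per call; the response comes back synchronously.
response = endpoint.predict(instances=[{"feature_a": 1.0, "feature_b": "x"}])
print(response.predictions)
```
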
36
Q

What do you need in order to use a custom container to serve predictions from a custom-trained model?

A

You must provide Vertex AI with a Docker container image that runs an HTTP server.
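
A minimal sketch of such a server, assuming Flask; Vertex AI injects the port and route paths through the AIP_* environment variables, and the echo logic stands in for real model inference:

```python
import os
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route(os.environ.get("AIP_HEALTH_ROUTE", "/health"))
def health():
    # Vertex AI polls this route to decide whether the container is ready.
    return "ok", 200

@app.route(os.environ.get("AIP_PREDICT_ROUTE", "/predict"), methods=["POST"])
def predict():
    instances = request.get_json()["instances"]
    # Placeholder: echo the inputs back; replace with real model inference.
    return jsonify({"predictions": instances})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("AIP_HTTP_PORT", "8080")))
```
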

37
Q

List Vertex AI BigQuery data source requirements.

A
  • BigQuery data source tables cannot be larger than 100 gigabytes.
  • you must use a multi-regional BigQuery dataset in the US or EU locations.
  • If the table is in a different project, you must provide the BigQuery Data Editor role to the Vertex AI service account in that project.
38
Q

List Vertex AI data source requirements for CSV files

A
  • The first line of the data source must contain the names of the columns.
  • Each data source file cannot be larger than 10 gigabytes. You can include multiple files, up to a maximum combined size of 100 gigabytes.
  • If the Cloud Storage bucket is in a different project than the one where you use Vertex AI, you must provide the Storage Object Creator role to the Vertex AI service account in that project.
39
Q

What is Vertex AI model monitoring?

A

Vertex AI model monitoring is a service that helps you manage the performance of your models:
- lets you detect drift in data quality,
- identify skew in training versus serving data,
- monitor feature attribution,
- use the UI to visualize monitoring metrics.

40
Q

What is the baseline for skew detection?

A

the statistical distribution of the feature’s values in the training data.

41
Q

What is the baseline for drift detection?

A

the statistical distribution of the feature’s values seen in production in the recent past.

42
Q

What happens when a feature is monitored for training-serving skew or prediction drift?

A
  • model monitoring computes the statistical distribution of the latest feature values seen in production.
  • this statistical distribution is then compared against a baseline distribution by computing a distance score that determines how similar the production feature values are to the baseline.
  • when the distance score between the two statistical distributions exceeds a certain threshold, model monitoring identifies that as skew or drift (an illustrative sketch follows).
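
An illustrative sketch of the idea (not the Vertex AI implementation itself), using Jensen-Shannon divergence, which Vertex AI model monitoring uses for numerical features, on synthetic data:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def distance_score(baseline, production, bins=20):
    # Histogram both samples over a shared range, then compare the distributions.
    lo = min(baseline.min(), production.min())
    hi = max(baseline.max(), production.max())
    p, _ = np.histogram(baseline, bins=bins, range=(lo, hi))
    q, _ = np.histogram(production, bins=bins, range=(lo, hi))
    return jensenshannon(p, q)  # normalizes the histograms internally

baseline = np.random.normal(0.0, 1.0, 10_000)    # training distribution (synthetic)
production = np.random.normal(0.5, 1.2, 10_000)  # shifted serving distribution

THRESHOLD = 0.3  # choosing this is part of "fine-tuning alert thresholds"
if distance_score(baseline, production) > THRESHOLD:
    print("Alert: possible skew or drift for this feature")
```
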
43
Q

What are Vertex AI pipelines?

A

Vertex AI pipelines are portable and scalable ML workflows that are based on containers and Google Cloud services.

44
Q

What is recommended when you use TensorFlow in an ML workflow that processes terabytes of structured data or text data?

A

We recommend that you build your pipeline using TFX.

For other use cases, we recommend that you build your pipeline using the Kubeflow Pipelines SDK (a minimal sketch follows).
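
A minimal sketch of a Kubeflow Pipelines (v2) pipeline compiled for Vertex AI Pipelines; the component body and names are assumptions:

```python
from kfp import compiler, dsl

@dsl.component
def say_hello(name: str) -> str:
    # Each component runs as its own container step in the pipeline.
    return f"Hello, {name}!"

@dsl.pipeline(name="demo-pipeline")
def pipeline(name: str = "Vertex"):
    say_hello(name=name)

# Produces a job spec that Vertex AI Pipelines can run, e.g. via
# aiplatform.PipelineJob(display_name="demo", template_path="pipeline.json").run()
compiler.Compiler().compile(pipeline, "pipeline.json")
```
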

45
Q

What are the best practices for model monitoring?

A
  • skew detection,
  • fine-tuning alert thresholds,
  • using feature attributions to detect data drift or skew
  • tracking outliers.
46
Q

What are the specifics of model monitoring?

A

It works for structured data, such as numerical and categorical features, but not for unstructured data, such as images.