Google ML Engineer Flashcards

1
Q

What kinds of problems benefit from ML?

A
  • Identification (Classification)
  • Prediction
  • Grouping (Clustering)
2
Q

What problems does ML solve?

A
  • Identification (Classification)
  • Prediction
  • Grouping (Clustering)
3
Q

What are the technical success metrics?

A
  • Accuracy (correct predictions / total predictions)
  • Precision (true positives / (true positives + false positives))
  • Recall (true positives / (true positives + false negatives))
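
A minimal sketch of the three metrics above, computed from confusion-matrix counts. The function name and the example counts are hypothetical and only illustrate the formulas.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    # Accuracy: share of all predictions that were correct.
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: share of predicted positives that were really positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of real positives that the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts: 8 true positives, 2 false positives,
# 85 true negatives, 5 false negatives.
accuracy, precision, recall = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```

With these counts, accuracy is 0.93 while recall is only about 0.62, which is why accuracy alone can mislead on imbalanced data.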
4
Q

What are the different types of models?

A
  • Decision Tree
  • Deep Learning
  • Regression
5
Q

What are the steps in an MLOps pipeline?

A
  • Data extraction
  • Data validation
  • Data preparation
  • Model training
  • Model evaluation
  • Model validation
6
Q

Which of the following comes first in a machine-learning pipeline?
- Model evaluation
- Data extraction
- Data preparation
- Model training

A

Data extraction

7
Q

Which of the following is not a kind of data preparation task?
- Addressing missing data
- Removing unwanted data
- Integrating data
- Evaluating model performance

A

Evaluating model performance

8
Q

Which one isn’t a kind of feature in a labeled dataset?
- Attributes of instances
- Structured tables
- Images
- Random data

A

Random data

9
Q

What are the target values in a labeled dataset?

A
  • Class or category
  • Value to predict
10
Q

What are the classification algorithms?

A
  • Logistic Regression
  • Decision Trees (the leaf nodes are the classifications), e.g. random forests
  • Naive Bayes (uses statistics and probability; classifies based on how often feature values co-occur with each class)
  • Neural Networks (deep learning networks)
  • Nearest Neighbor (classifies by how near or far points are in space)
  • Ensemble methods (combine the predictions of several different models)
11
Q

Is logistic regression a classification algorithm or a regression algorithm?

A

Classification algorithm

12
Q

Give an example of a decision tree algorithm

A

Random forest (an ensemble of decision trees)

13
Q

Which algorithm is based on how near or far points are in space, and what type of algorithm is it?

A
  • Nearest Neighbor
  • Classification algorithm
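
The "near or far in space" idea can be sketched as a small k-nearest-neighbor classifier. This is an illustrative sketch, not a production implementation; `knn_predict`, the points, and the labels are all hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
labels = ["small", "small", "small", "large", "large", "large"]
prediction = knn_predict(points, labels, query=(2.0, 2.0), k=3)
```

The query point sits next to the first group, so its three nearest neighbors all carry the label "small".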
14
Q

What are three regression algorithms?

A
  • Linear Regression (learns the formula of a straight line from a series of values and predicts from that formula)
  • Decision Tree Regression (uses the structure of decision trees)
  • Polynomial Regression (also learns a function, but the fit can be a curve instead of a straight line)
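
As a rough illustration of "learning the formula of a straight line", here is ordinary least squares for a single feature. The helper `fit_line` and the car data are made up for this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: car age in years vs. sale price in thousands.
ages = [1, 2, 3, 4, 5]
prices = [28.0, 26.0, 24.0, 22.0, 20.0]
slope, intercept = fit_line(ages, prices)

# Predict the price of a 6-year-old car from the learned formula.
predicted = slope * 6 + intercept
```

Polynomial regression follows the same idea but adds terms such as x squared, so the fitted function can bend.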
15
Q

You have data on the price of cars sold over the past two years. You have data on the sale price, age of the car, mileage, interior features, gas mileage, and several other features. You want to use this data to predict the sale prices of other cars. What kind of ML problem is this?
- Classification
- Regression
- Reinforcement learning
- Unsupervised learning

A

This is a regression problem because we are trying to predict a continuous value.

16
Q

Which of the following is a process risk to successfully deploying a machine learning model? (One choice)
- Insufficiently agreed upon objectives
- Insufficient data
- Biased data
- High F1 score

A

Insufficiently agreed upon objectives

17
Q

What are the three types of unsupervised learning algorithms?

A
  • Unsupervised learning does not use labeled datasets
  • Clustering
    • K-means clustering
  • Association rules
    • Apriori algorithm
  • Dimensionality reduction
    • Principal component analysis
    • Autoencoders (compression of data)
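
To make clustering concrete, below is a tiny one-dimensional k-means sketch. It assumes fixed starting centroids so the result is repeatable; `kmeans_1d` and the values are hypothetical. Note that no labels are involved.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Tiny 1-D k-means: assign values to the nearest centroid, then re-average."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Index of the centroid closest to this value.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled values with two obvious groups.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centroids, clusters = kmeans_1d(values, centroids=[0.0, 5.0])
```

The algorithm discovers the two groups on its own; the centroids settle near 1.0 and 9.67.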
18
Q

Which of the following isn’t a use case for unsupervised learning? (One choice)
- Grouping and segmentation
- Data compression
- Working in game-like scenarios
- Anomaly detection

A

Working in game-like scenarios. This is a use case for reinforcement learning, not unsupervised learning.

19
Q

What are the characteristics of reinforcement learning? (Multiple choice)
- Agent makes a series of choices in an environment
- Environment provides positive or negative feedback
- Trial and error, learn from feedback
- Use in Dimensionality reduction

A
  • Agent makes a series of choices in an environment
  • Environment provides positive or negative feedback
  • Trial and error, learn from feedback

Use in dimensionality reduction is an unsupervised learning task, not a characteristic of reinforcement learning.

20
Q

Which one is a security risk in ML model development? (One choice)
- Insufficient data
- Data quality issues (Data exploration, mis categorizing data, missing data)
- Biased data
- Data poisoning

A

Data poisoning. It is a security risk in which an attacker deliberately feeds biased or incomplete data into training to manipulate the output of the ML algorithm.

21
Q

What Google Cloud service would you use to store 3 TB of raw data files in Parquet format that will be processed and then used for machine learning training?
- Cloud Storage
- Cloud SQL
- Bigtable
- Cloud Dataproc

A

Cloud Storage (storing objects such as raw data files.)

22
Q

What Vertex AI feature supports managed and user managed Jupyter Notebooks?
- Vertex AI Training
- Vertex AI FeatureStore
- Vertex AI Workbench
- Vertex AI Labeling

A

Vertex AI Workbench (support for Jupyter Notebooks.)

23
Q

You are building a deep learning network and need to perform large volumes of low precision calculations. What accelerator would you choose?
- TPU
- GPU
- Edge devices
- Kubernetes Pod

A

TPU

24
Q

Which of the following is not an example of sensitive data?
- Faces in an image
- Government issued ID number
- Notes in electronic patient records
- Tracking number for an e-commerce shipment

A

Tracking number for an e-commerce shipment

25
Q

What is a statistical object that describes central tendency and spread of values called?
- Mean
- Distribution
- Variance
- Mode

A

Distribution

26
Q

What are valid sources of data for use with Vertex AI Datasets?
- CSV files and BigQuery tables/views only
- CSV files and Parquet files only
- BigQuery tables/views only
- CSV files, BigQuery, and Bigtable

A

CSV files and BigQuery tables/views only

27
Q

When is deleting rows with missing data not a reasonable option for handling missing data?
- When there are many rows with no missing data
- When a significant portion of the dataset has missing values for some feature
- When the data is stored in Parquet format
- When the data is not in a relational database

A

When a significant portion of the dataset has missing values for some feature
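
A small sketch of why deletion hurts here: dropping incomplete rows throws away half of this hypothetical dataset, while mean imputation (one common alternative, used here purely for illustration) keeps every row. The helper names and data are made up.

```python
def drop_incomplete(rows):
    """Keep only rows where no feature is missing (None)."""
    return [row for row in rows if None not in row.values()]


def impute_mean(rows, feature):
    """Replace missing values of `feature` with the mean of the known values."""
    known = [row[feature] for row in rows if row[feature] is not None]
    mean = sum(known) / len(known)
    return [
        {**row, feature: mean if row[feature] is None else row[feature]}
        for row in rows
    ]


# Hypothetical dataset: half of the rows are missing "mileage".
rows = [
    {"age": 3, "mileage": 30000},
    {"age": 5, "mileage": None},
    {"age": 2, "mileage": None},
    {"age": 7, "mileage": 70000},
]
kept = drop_incomplete(rows)            # only 2 of 4 rows survive
imputed = impute_mean(rows, "mileage")  # all 4 rows survive
```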

28
Q

What is the role of feature attributions in explaining predictions?
- Attribution prevents overfitting
- Attributions are a way of handling missing data
- Attributions are a measure of how much a feature contributes to a prediction
- Attributions are a type of data augmentation

A

Attributions are a measure of how much a feature contributes to a prediction

29
Q

Which of the following are types of data augmentation used with images?
- Crop and flip
- Crop and synonym substitution
- Feature attribution
- Imputing data and Crop

A

Crop and flip
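
Crop and flip can be sketched on a toy image stored as a nested list of pixel values. Real pipelines would typically use an image library; the helpers below are hypothetical and only show what the two operations do.

```python
def crop(image, top, left, height, width):
    """Return the height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


# Hypothetical 3x3 grayscale image.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
cropped = crop(image, top=0, left=1, height=2, width=2)
flipped = flip_horizontal(image)
```

Each transformed copy is a new training example with the same label as the original image.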

30
Q

What type of hyperparameter tuning algorithms use sequential processing and use the results from prior evaluations to inform evaluations of new hyperparameter values?
- Bayesian Search
- Random Search
- Grid Search
- Data Augmentation

A

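Bayesian search. It uses the results of earlier evaluations to update its priors (the information known at the start of each evaluation) before choosing the next hyperparameter values.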

31
Q

When using distributed training in Vertex AI, what kind of node is responsible for communicating gradients between nodes?
- Worker Nodes
- Primary Replicas
- Reduction Server
- Backup Nodes

A

The purpose of the reduction server is to increase throughput by communicating gradients among worker nodes.

32
Q

You want to run TensorFlow models most efficiently in Google Cloud. What serving option would you choose?
- KubeFlow
- TensorFlow open source
- Optimized TensorFlow Runtime
- XGBoost

A

Optimized TensorFlow Runtime contains optimizations used internally at Google Cloud.

33
Q

You want to use Vertex AI monitoring to detect when newly arriving data is significantly different than recent production data. What metric would you use?
- Skew
- Drift
- Precision
- F1 Score

A

Drift measures difference in distribution with recent production data.
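
One way to picture a drift check on a categorical feature is to compare the category frequencies of recent production data with those of newly arriving data. The sketch below uses the L-infinity distance (the largest per-category difference) and a threshold of 0.2; both are illustrative assumptions and not a description of how Vertex AI computes drift internally.

```python
from collections import Counter


def category_distribution(values):
    """Map each category to its share of the values."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def l_infinity_distance(baseline, current):
    """Largest absolute difference in category share between two samples."""
    p = category_distribution(baseline)
    q = category_distribution(current)
    categories = set(p) | set(q)
    return max(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical feature values: recent production traffic vs. newly arriving data.
recent_production = ["sedan"] * 6 + ["suv"] * 4
new_data = ["sedan"] * 3 + ["suv"] * 7

drift_score = l_infinity_distance(recent_production, new_data)
drift_detected = drift_score > 0.2  # assumed alert threshold
```

Skew would use the same kind of comparison, but against the training data rather than recent production data.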