GCP MLE Flashcards
You are building an ML model to detect anomalies in real-time sensor data. You will use Pub/Sub to handle incoming requests. You want to store the results for analytics and visualization. How should you configure the pipeline?
A. 1 = Dataflow, 2 = AI Platform, 3 = BigQuery
B. 1 = DataProc, 2 = AutoML, 3 = Cloud Bigtable
C. 1 = BigQuery, 2 = AutoML, 3 = Cloud Functions
D. 1 = BigQuery, 2 = AI Platform, 3 = Cloud Storage
A is correct. Data from the sensors is ingested into a Pub/Sub topic, pre-processed by a Dataflow streaming job, and scored by the model served on AI Platform; the results are then stored in BigQuery for analysis and visualization with Data Studio or AI Platform Notebooks.
B is incorrect. The Apache Beam SDK used by Dataflow integrates directly with Pub/Sub streaming and is recommended over Dataproc for this ingestion pattern, and BigQuery is a better choice than Bigtable for analytics and visualization.
C is incorrect as Cloud Functions can’t be used for result analysis or visualization.
D is incorrect as Cloud Storage can’t be used for result analysis or visualization.
Note:
You can read JSON-formatted messages from a Pub/Sub topic and write them straight to a BigQuery table, but the prediction results also need to be stored in BigQuery for analysis, and BigQuery cannot make API calls to a model for predictions; that step has to happen in a Dataflow job. Hence C and D cannot be correct.
Links:
Similar problem statement: https://cloud.google.com/architecture/detecting-anomalies-in-financial-transactions
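For reference, a minimal Apache Beam (Dataflow) streaming sketch of this pipeline might look like the following. The subscription name, the model-invocation helper, and the BigQuery table are assumptions for illustration, not part of the question.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def detect_anomaly(record):
    """Placeholder for calling the AI Platform model; attaches an anomaly score."""
    record["anomaly_score"] = 0.0  # replace with a real prediction call
    return record


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           subscription="projects/my-project/subscriptions/sensor-sub")  # hypothetical
     | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "Score" >> beam.Map(detect_anomaly)
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-project:sensors.anomaly_results",  # hypothetical table
           schema="sensor_id:STRING,value:FLOAT,anomaly_score:FLOAT",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```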
Your team is building an application for a global bank that will be used by millions of customers. You built a forecasting model that predicts customers’ account balances 3 days in the future. Your team will use the results in a new feature that will notify users when their account balance is likely to drop below $25. How should you serve your predictions?
A
1. Create a Pub/Sub topic for each user.
2. Deploy a Cloud Function that sends a notification when your model predicts that a user’s account balance will drop below the $25 threshold.
B
1. Create a Pub/Sub topic for each user.
2. Deploy an application on the App Engine standard environment that sends a notification when your model predicts that a user’s account balance will drop below the $25 threshold.
C
1. Build a notification system on Firebase.
2. Register each user with a user ID on the Firebase Cloud Messaging server, which sends a notification when the average of all account balance predictions drops below the $25 threshold.
D
1. Build a notification system on Firebase.
2. Register each user with a user ID on the Firebase Cloud Messaging server, which sends a notification when your model predicts that a user’s account balance will drop below the $25 threshold.
A is incorrect. This is a viable solution, but sending the notifications through Cloud Functions would incur usage charges.
B is incorrect. App Engine is costlier than Cloud Functions, so this is not a cost-effective solution.
C is incorrect, as model prediction results are not involved in this solution.
D is correct. Each user can be registered with a user ID on the Firebase Cloud Messaging (FCM) server, which sends a notification when the model predicts that the user’s account balance will drop below the $25 threshold. FCM lets you send these messages at no cost.
Note:
Firebase Cloud Messaging (FCM) is a cross-platform messaging solution that lets you reliably send messages at no cost. Hence Firebase is the right fit for this application.
Links:
Firebase Cloud Messaging (FCM): https://firebase.google.com/docs/cloud-messaging
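As a rough sketch of the FCM piece, assuming the firebase-admin Python SDK and a hypothetical device token obtained when the user registers:

```python
import firebase_admin
from firebase_admin import messaging

# Initializes the app with default credentials (e.g. when running on Google Cloud).
firebase_admin.initialize_app()


def notify_low_balance(device_token: str, predicted_balance: float) -> None:
    """Sends an FCM notification when the predicted balance is below $25."""
    if predicted_balance >= 25:
        return
    message = messaging.Message(
        notification=messaging.Notification(
            title="Low balance warning",
            body=f"Your balance is predicted to drop to ${predicted_balance:.2f}.",
        ),
        token=device_token,  # token registered by the client app
    )
    messaging.send(message)
```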
You work for an advertising company and want to understand the effectiveness of your company’s latest advertising campaign. You have streamed 500 MB of campaign data into BigQuery. You want to query the table, and then manipulate the results of that query with a pandas dataframe in an AI Platform notebook. What should you do?
A
Use AI Platform Notebooks’ BigQuery cell magic to query the data, and ingest the results as a pandas dataframe.
B
Export your table as a CSV file from BigQuery to Google Drive, and use the Google Drive API to ingest the file into your notebook instance.
C
Download your table from BigQuery as a local CSV file, and upload it to your AI Platform notebook instance. Use pandas.read_csv to ingest the file as a pandas dataframe.
D
From a bash cell in your AI Platform notebook, use the bq extract command to export the table as a CSV file to Cloud Storage, and then use gsutil cp to copy the data into the notebook. Use pandas.read_csv to ingest the file as a pandas dataframe.
A is correct. AI Platform Notebooks’ BigQuery cell magic can run the query and ingest the results directly into a pandas dataframe (refer to the link).
B is incorrect. This adds redundant steps, and you only want to manipulate the results of the query, not export the entire table.
C is incorrect. Again, you only want to manipulate the results of the query and not the entire table, so this solution is not suitable.
D is incorrect. This is another redundant, multi-step method.
Note:
Using the BigQuery Python client library to load the query results into a dataframe would be the preferred solution, but it is not among the options.
Links:
BigQuery cell magic to query the data:
https://cloud.google.com/bigquery/docs/visualize-jupyter
Pandas dataframe using bigquery python client library:
https://cloud.google.com/bigquery/docs/visualize-jupyter#querying-and-visualizing-bigquery-data-using-pandas-dataframes
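A minimal example of the cell magic in an AI Platform notebook; the project, dataset, and table names are hypothetical:

```python
# Cell 1: load the BigQuery cell magic (the client library is pre-installed on AI Platform Notebooks).
%load_ext google.cloud.bigquery

# Cell 2: run the query and store the results in the pandas dataframe `df`.
%%bigquery df
SELECT campaign_id, SUM(clicks) AS total_clicks
FROM `my-project.marketing.campaign_events`
GROUP BY campaign_id

# Cell 3: `df` is now a regular pandas dataframe you can manipulate.
df.head()
```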
You are an ML engineer at a global car manufacturer. You need to build an ML model to predict car sales in different cities around the world. Which features or feature crosses should you use to train city-specific relationships between car type and the number of sales?
A
Three individual features: binned latitude, binned longitude, and one-hot encoded car type.
B
One feature obtained as an element-wise product between latitude, longitude, and car type.
C
One feature obtained as an element-wise product between binned latitude, binned longitude, and one-hot encoded car type.
D
Two feature crosses as an element-wise product: the first between binned latitude and one-hot encoded car type, and the second between binned longitude and one-hot encoded car type.
Note:
Kindly watch the Feature cross video before going through the solution.
A is incorrect, as here we won’t be creating features that represent regional car types, all three would be independent features.
B is incorrect, as would be using binned latitude and longitudes to capture city-specific features.
C is correct, crossing all three binned latitude, binned longitude and car types is necessary to capture region-specific car type information.
D is incorrect, creating different features with latitude and longitude will not be able to capture regional information.
Links:
Feature cross: https://developers.google.com/machine-learning/crash-course/feature-crosses/video-lecture
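For illustration, here is roughly how such a cross could be built with TensorFlow feature columns; the bin boundaries and the car-type vocabulary are made-up values:

```python
import tensorflow as tf

# Bucketize latitude and longitude so nearby locations share a bin.
latitude = tf.feature_column.numeric_column("latitude")
longitude = tf.feature_column.numeric_column("longitude")
binned_lat = tf.feature_column.bucketized_column(latitude, boundaries=list(range(-90, 91, 5)))
binned_lon = tf.feature_column.bucketized_column(longitude, boundaries=list(range(-180, 181, 5)))

# Categorical car type (vocabulary is hypothetical).
car_type = tf.feature_column.categorical_column_with_vocabulary_list(
    "car_type", ["sedan", "suv", "hatchback", "pickup"])

# Cross all three so the model can learn city-specific car-type effects.
lat_lon_type_cross = tf.feature_column.crossed_column(
    [binned_lat, binned_lon, car_type], hash_bucket_size=10000)
cross_feature = tf.feature_column.indicator_column(lat_lon_type_cross)
```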
You work for a large technology company that wants to modernize their contact center. You have been asked to develop a solution to classify incoming calls by product so that requests can be more quickly routed to the correct support team. You have already transcribed the calls using the Speech-to-Text API. You want to minimize data preprocessing and development time. How should you build the model?
A
Use the AI Platform Training built-in algorithms to create a custom model.
B
Use AutoML Natural Language to extract custom entities for classification.
C
Use the Cloud Natural Language API to extract custom entities for classification.
D
Build a custom model to identify the product keywords from the transcribed calls, and then run the keywords through a classification algorithm.
A is incorrect. AI Platform Training does not yet offer built-in algorithms for NLP (only BERT is available). This would otherwise be a viable solution; had the question required something like TensorFlow, it could have been the correct answer.
B is correct. An AutoML Natural Language model can be trained to recognize custom entities, and incoming calls can then be classified with a predefined lookup that maps those entities to products. (Real automated IVRs are more sophisticated, but no such details are given in the question, and the approach described here is a usable baseline.)
C is incorrect, as the Natural Language API returns only generic entities that are not specific to the call center.
D is incorrect, this would need a lot of development time.
Links:
AutoML Natural Language AI: https://cloud.google.com/natural-language/automl/docs
AI Platform built-in algorithms (no NLP algorithms available yet):
https://cloud.google.com/ai-platform/training/docs/algorithms
Automated IVRs (kindly research more on these lines if interested):
http://www.smartcustomerservice.com/Columns/Vendor-Views/Building-a-More-Intelligent-IVR-Through-Machine-Learning-130467.aspx
You are training a TensorFlow model on a structured dataset with 100 billion records stored in several CSV files. You need to improve the input/output execution performance. What should you do?
A
Load the data into BigQuery, and read the data from BigQuery.
B
Load the data into Cloud Bigtable, and read the data from Bigtable.
C
Convert the CSV files into shards of TFRecords, and store the data in Cloud Storage.
D
Convert the CSV files into shards of TFRecords, and store the data in the Hadoop Distributed File System (HDFS).
A is correct. BigQuery is recommended by Google for storing and manipulating structured data at the scale of 100 billion rows, and the model can be trained with TensorFlow using the BigQuery TensorFlow reader. Hence this is the best solution. (Refer to the links.)
B is incorrect. Bigtable is a NoSQL database (storing data as key-value pairs) and is not suited to this structured, analytical workload.
C is incorrect. Converting the data to TFRecords would speed up the input pipeline, but BigQuery is the better option here for querying the structured dataset rather than loading the files with the tf.data API.
D is incorrect. HDFS is not needed when a Cloud Storage bucket is available, and the reasoning about TFRecords is given under option C.
Links:
Anatomy of a BigQuery Query:
https://cloud.google.com/blog/products/bigquery/anatomy-of-a-bigquery-query
End to end example for BigQuery TensorFlow reader:
https://www.tensorflow.org/io/tutorials/bigquery
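A rough sketch of the BigQuery TensorFlow reader pattern from the linked TensorFlow I/O tutorial; the project, dataset, table, and column names are placeholders, and the exact signature may vary across tensorflow-io versions:

```python
import tensorflow as tf
from tensorflow_io.bigquery import BigQueryClient

PROJECT_ID, DATASET_ID, TABLE_ID = "my-project", "sales", "transactions"  # hypothetical

client = BigQueryClient()
read_session = client.read_session(
    "projects/" + PROJECT_ID,
    PROJECT_ID, TABLE_ID, DATASET_ID,
    ["feature_a", "feature_b", "label"],   # selected fields (hypothetical columns)
    [tf.float64, tf.float64, tf.int64],    # matching output types
    requested_streams=2)

# parallel_read_rows yields a tf.data.Dataset of row dicts keyed by column name.
dataset = (read_session.parallel_read_rows()
           .map(lambda row: ((row["feature_a"], row["feature_b"]), row["label"]))
           .batch(1024))
```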
As the lead ML Engineer for your company, you are responsible for building ML models to digitize scanned customer forms. You have developed a TensorFlow model that converts the scanned images into text and stores them in Cloud Storage. You need to use your ML model on the aggregated data collected at the end of each day with minimal manual intervention. What should you do?
A
Use the batch prediction functionality of AI Platform.
B
Create a serving pipeline in Compute Engine for prediction.
C
Use Cloud Functions for prediction each time a new data point is ingested.
D
Deploy the model on AI Platform and create a version of it for online inference.
A is correct. With AI Platform batch prediction you specify an output directory, and all results are written there. No real-time requirement is mentioned in the question, so a batch prediction job can be run at the end of each day to infer results in a scalable way and aggregate them at the provided output path.
B is incorrect; this is the most inefficient approach and does not scale.
C is incorrect. No real-time requirement is mentioned in the question, so Cloud Functions are not needed here.
D is incorrect. With online prediction you would have to write code to aggregate the results to a given location, something batch prediction does automatically.
Note:
The batch prediction job’s outputPath is the Cloud Storage location where you want the prediction service to save your results.
Links:
AI Platform Batch Predictions (refer to outputPath): https://cloud.google.com/ai-platform/prediction/docs/batch-predict
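A hedged sketch of submitting such a daily batch prediction job through the AI Platform REST API via the Google API Python client; the project, model, bucket paths, and data format are illustrative assumptions:

```python
from googleapiclient import discovery

PROJECT = "my-project"  # hypothetical

ml = discovery.build("ml", "v1")
body = {
    "jobId": "digitize_forms_20240101",
    "predictionInput": {
        "modelName": f"projects/{PROJECT}/models/form_digitizer",  # deployed model
        "dataFormat": "TEXT",                                      # newline-delimited JSON instances
        "inputPaths": ["gs://my-bucket/daily_batches/2024-01-01/*"],
        "outputPath": "gs://my-bucket/predictions/2024-01-01/",    # results aggregated here
        "region": "us-central1",
    },
}
response = ml.projects().jobs().create(parent=f"projects/{PROJECT}", body=body).execute()
```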
You recently joined an enterprise-scale company that has thousands of datasets. You know that there are accurate descriptions for each table in BigQuery, and you are searching for the proper BigQuery table to use for a model you are building on AI Platform. How should you find the data that you need?
A
Use Data Catalog to search the BigQuery datasets by using keywords in the table description.
B
Tag each of your model and version resources on AI Platform with the name of the BigQuery table that was used for training.
C
Maintain a lookup table in BigQuery that maps the table descriptions to the table ID. Query the lookup table to find the correct table ID for the data that you need.
D
Execute a query in BigQuery to retrieve all the existing table names in your project using the INFORMATION_SCHEMA metadata tables that are native to BigQuery. Use the results to find the table that you need.
A is correct. Data Catalog offers powerful, structured search capabilities and predicate-based filtering over both the technical and business metadata from BigQuery, and it provides APIs in several languages to search the datasets. (Refer to the link.)
B is incorrect. To tag a BigQuery table on an AI Platform model or version resource, you would first need to find that table; the search step is not mentioned in this option.
C is incorrect. Again, the tables would need to be searched before a lookup table could be built, and that process is not described here.
D is incorrect. This would return metadata for all tables, but how the results from over a thousand datasets would be used to find the required table is not explained. (Refer to the link for details about INFORMATION_SCHEMA.)
Links:
Data Catalog overview: https://cloud.google.com/data-catalog/docs/concepts/overview
How to search using Data Catalog: https://cloud.google.com/data-catalog/docs/how-to/search
Getting table metadata using INFORMATION_SCHEMA:
https://cloud.google.com/bigquery/docs/information-schema-tables
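A rough example of a keyword search over table descriptions with the Data Catalog Python client; the project ID and query string are hypothetical:

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Limit the search scope to your project(s).
scope = datacatalog_v1.SearchCatalogRequest.Scope()
scope.include_project_ids.append("my-project")  # hypothetical

# Search BigQuery tables whose description mentions "customer churn".
results = client.search_catalog(request={
    "scope": scope,
    "query": "type=table system=bigquery description:customer churn",
})
for result in results:
    print(result.relative_resource_name)
```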
You started working on a classification problem with time-series data and achieved an area under the receiver operating characteristic curve (AUC ROC) value of 99% for training data after just a few experiments. You haven’t explored using any sophisticated algorithms or spent any time on hyperparameter tuning. What should your next step be to identify and fix the problem?
A
Address the model overfitting by using a less complex algorithm.
B
Address data leakage by applying nested cross-validation during model training.
C
Address data leakage by removing features highly correlated with the target value.
D
Address the model overfitting by tuning the hyperparameters to reduce the AUC ROC value.
This is not a data leakage problem: the question gives no details about how the data was split. If time-series data were split randomly rather than by time, that would cause leakage, but here we assume the data was split correctly. Since the 99% ROC AUC was measured on the training data rather than the validation data, this is an overfitting issue.
A is correct. Since this is an overfitting issue, the first step is to reduce model complexity, either by using a less complex model or by applying L1 regularization or dropout.
B is incorrect; as mentioned in the note, this is not a data leakage problem.
C is incorrect; as mentioned in the note, this is not a data leakage problem.
D is incorrect. Tuning hyperparameters could help, but since the 99% ROC AUC was obtained after only a few experiments, the first thing to do is reduce model complexity.
Links:
Overfitting explained: https://elitedatascience.com/overfitting-in-machine-learning
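For example, a minimal Keras sketch of reducing model complexity with L1 regularization and dropout; the layer sizes are arbitrary:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l1(1e-4)),  # L1 penalty on the weights
    tf.keras.layers.Dropout(0.3),                            # randomly drops units during training
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
```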
You work for an online travel agency that also sells advertising placements on its website to other companies. You have been asked to predict the most relevant web banner that a user should see next. Security is important to your company. The model latency requirements are 300ms, the inventory is thousands of web banners, and your exploratory analysis has shown that navigation context is a good predictor. You want to Implement the simplest solution. How should you configure the prediction pipeline?
A.
Embed the client on the website, and then deploy the model on AI Platform Prediction.
B
Embed the client on the website, deploy the gateway on App Engine, and then deploy the model on AI Platform Prediction.
C
Embed the client on the website, deploy the gateway on App Engine, deploy the database on Cloud Bigtable for writing and for reading the user’s navigation context, and then deploy the model on AI Platform Prediction.
D
Embed the client on the website, deploy the gateway on App Engine, deploy the database on Memorystore for writing and for reading the user’s navigation context, and then deploy the model on Google Kubernetes Engine.
The question asks for the simplest solution. The client embedded in the website can call AI Platform online prediction directly, because the deployed model exposes a REST API, and the fewer network hops in the prediction path, the lower the latency. Among the options, A is the simplest and most viable solution with the lowest latency: it involves just one network call, and the current navigation context is already available on the client.
A is correct, as mentioned in the note, this is the simplest and most viable solution for low latency. The only latency component in this solution is the response time for Online Prediction.
B is incorrect; it adds more moving parts than necessary, and option A is much simpler.
C is incorrect; it adds more moving parts than necessary, and option A is much simpler. (The database is not required, since only the next banner has to be shown to the user.)
D is incorrect; option A is much simpler. This would otherwise be a workable design: Memorystore could cache the user’s current navigation context if that could not be done on the client, but serving the model on GKE is a more complex step, so it is discarded.
Links:
AI Platform online prediction (refer to the REST API): https://cloud.google.com/ai-platform/prediction/docs/online-predict
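A minimal sketch of the single REST call the embedded client would make, using the Google API Python client; the project, model, and instance fields are hypothetical:

```python
from googleapiclient import discovery

service = discovery.build("ml", "v1")
name = "projects/my-project/models/banner_ranker"  # optionally append /versions/<version>

response = service.projects().predict(
    name=name,
    body={"instances": [{"navigation_context": ["home", "flights", "paris"]}]},
).execute()

predictions = response["predictions"]  # banner scores returned by the deployed model
```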
Your team is building a convolutional neural network (CNN)-based architecture from scratch. The preliminary experiments running on your on-premises CPU-only infrastructure were encouraging, but have slow convergence. You have been asked to speed up model training to reduce time-to-market. You want to experiment with virtual machines (VMs) on Google Cloud to leverage more powerful hardware. Your code does not include any manual device placement and has not been wrapped in Estimator model-level abstraction. Which environment should you train your model on?
A
A VM on Compute Engine and 1 TPU with all dependencies installed manually.
B
A VM on Compute Engine and 8 GPUs with all dependencies installed manually.
C
A Deep Learning VM with an n1-standard-2 machine and 1 GPU with all libraries pre-installed.
D
A Deep Learning VM with more powerful CPU e2-highcpu-16 machines with all libraries pre-installed.
Since the code contains no manual device placement and is not wrapped in an Estimator or Keras model-level abstraction, we have to go with faster CPUs; GPU- and TPU-based solutions would not be viable.
A is incorrect, as it is a TPU-based solution.
B is incorrect, as it is a GPU-based solution.
C is incorrect, as it is a GPU-based solution.
D is correct, as it is a CPU-based solution and the Deep Learning VM has all the required ML libraries pre-installed.
Links:
GPU manual device placement:
https://www.tensorflow.org/guide/gpu#manual_device_placement
TPU manual device placement:
https://www.tensorflow.org/guide/tpu#manual_device_placement
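For context, "manual device placement" refers to explicitly pinning operations to a device in the code, along the lines of the following sketch:

```python
import tensorflow as tf

# Without explicit placement like this (and without a distribution strategy or an
# Estimator/Keras model-level abstraction), the code does not automatically exploit
# multiple accelerators.
with tf.device("/GPU:0"):
    a = tf.random.normal([1000, 1000])
    b = tf.random.normal([1000, 1000])
    c = tf.matmul(a, b)  # this matmul is pinned to the first GPU
```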
You work on a growing team of more than 50 data scientists who all use AI Platform. You are designing a strategy to organize your jobs, models, and versions in a clean and scalable way. Which strategy should you choose?
A
Set up restrictive IAM permissions on the AI Platform notebooks so that only a single user or group can access a given instance.
B
Separate each data scientist’s work into a different project to ensure that the jobs, models, and versions created by each data scientist are accessible only to that user.
C
Use labels to organize resources into descriptive categories.
Apply a label to each created resource so that users can filter the results by label when viewing or monitoring the resources.
D
Set up a BigQuery sink for Cloud Logging logs that is appropriately filtered to capture information about AI Platform resource usage. In BigQuery, create a SQL view that maps users to the resources they are using
A is incorrect. The question asks for a strategy to organize jobs, models, and versions in a clean and scalable way; restricting access to notebooks does not address that.
B is incorrect. If data scientists need to collaborate on a model, splitting their work across separate projects would slow down development.
C is correct. AI Platform provides labels to organize and filter your resources: you can label jobs by team or user and by development phase (prod or test), then filter jobs by team and phase.
D is incorrect. This is a viable approach: you can export all AI Platform logs to BigQuery and write custom queries that map users to the resources they use, which gives an organized view of AI Platform resources. However, option C is better because it provides built-in filtering and needs no custom table views.
Links:
Labelling resources on AI Platform: https://cloud.google.com/ai-platform/training/docs/resource-labels
BigQuery logging sink: https://cloud.google.com/logging/docs/export/bigquery
You are training a deep learning model for semantic image segmentation with reduced training time. While using a Deep Learning VM Image, you receive the following error:
The resource ‘projects/deeplearning-platforn/zones/europe-west4-c/acceleratorTypes/nvidia-tesla-k80’ was not found
What should you do?
A
Ensure that you have GPU quota in the selected region.
B
Ensure that the required GPU is available in the selected region.
C
Ensure that you have preemptible GPU quota in the selected region.
D
Ensure that the selected GPU has enough GPU memory for the workload.
Go through the troubleshooting link first.
A is incorrect, as this is a resource-not-found issue, not a quota-exceeded issue.
B is correct, as this is a resource-not-found issue; you should determine which region (or zone) offers the required GPU.
C is incorrect, as this is not a preemptible GPU quota issue.
D is incorrect; that is the remedy for ResourceExhaustedError, which generally occurs when the batch size is too large for the given machine.
Links:
Troubleshooting Deep Learning VMs: https://cloud.google.com/deep-learning-vm/docs/troubleshooting
Your team is working on an NLP research project to predict the political affiliation of authors based on articles they have written. You have a large training dataset that is structured like this:
AuthorA:Political Party A
TextA1: [SentenceA11, SentenceA12, …]
TextA2: [SentenceA21, SentenceA22, …]
You want to predict the political affiliation of authors based on the articles they have written, i.e. classify each author’s political party from their texts, so splitting the dataset by author makes the most sense (option B). If you split by text instead (option A), TextA1 might land in training and TextA2 in validation; the model could then classify TextA2 as affiliated with Political Party A based on the author’s writing style rather than the actual content of the text. The same issue arises when splitting medical datasets, which is why medical data is always split by patient. Note that the question’s framing ignores this issue and gives more weight to the split ratio, so the official answer may differ from this solution.
A is incorrect, as per the note.
B is correct, as per the note.
C is incorrect, as it would also introduce author-specific bias. (It would be the correct answer if the split ratio were considered more important than the author-specific leakage during evaluation.)
D is incorrect; the question gives no special significance to paragraphs of text.
Links:
Splitting medical dataset: https://www.coursera.org/lecture/ai-for-medical-diagnosis/splitting-data-by-patient-cQr8S
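A small sketch of an author-level split with scikit-learn; the column names and toy rows are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Each row is one text; `author` identifies who wrote it.
df = pd.DataFrame({
    "author": ["AuthorA", "AuthorA", "AuthorB", "AuthorB", "AuthorC", "AuthorC"],
    "text":   ["TextA1", "TextA2", "TextB1", "TextB2", "TextC1", "TextC2"],
    "party":  ["A", "A", "B", "B", "A", "A"],
})

# Split by author so no author's texts appear in both train and validation.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, val_idx = next(splitter.split(df, groups=df["author"]))
train_df, val_df = df.iloc[train_idx], df.iloc[val_idx]
```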
Your team has been tasked with creating an ML solution in Google Cloud to classify support requests for one of your platforms. You analyzed the requirements and decided to use TensorFlow to build the classifier so that you have full control of the model’s code, serving, and deployment. You will use Kubeflow pipelines for the ML platform. To save time, you want to build on existing resources and use managed services instead of building a completely new model. How should you build the classifier?
A
Use the Natural Language API to classify support requests.
B
Use AutoML Natural Language to build the support requests classifier.
C
Use an established text classification model on AI Platform to perform transfer learning.
D
Use an established text classification model on AI Platform as-is to classify support requests.
A is incorrect. The team needs control over the model’s code, serving, and deployment, which is not possible with the Natural Language API, a pre-built API that returns generic entities.
B is incorrect, as TensorFlow is specified in the question and control over the model’s code, serving, and deployment is not possible with AutoML.
C is incorrect. No need for transfer learning is mentioned in the question; if it had stated, for example, that only limited data was available, transfer learning would have been the right choice.
D is correct. This is the most viable option when the team wants to use TensorFlow and needs control over the model’s code, serving, and deployment. Note that AI Platform does not yet support many NLP models as built-in algorithms; BERT is available (see the link).
Links:
AI Platform BERT: https://cloud.google.com/ai-platform/training/docs/algorithms/bert
You recently joined a machine learning team that will soon release a new project. As a lead on the project, you are asked to determine the production readiness of the ML components. The team has already tested features and data, model development, and infrastructure. Which additional readiness check should you recommend to the team?
A
Ensure that training is reproducible.
B
Ensure that all hyperparameters are tuned.
C
Ensure that model performance is monitored.
D
Ensure that feature expectations are captured in the schema.
A is incorrect. Reproducible training does matter, because the model will eventually need retraining to cope with data drift, but it is already covered by the infrastructure tests.
B is incorrect, this is also important but included in the test for model development.
C is correct. An ML system that works correctly at launch must keep working correctly over time, so monitoring in production is the additional readiness check needed alongside the existing tests for features and data, model development, and infrastructure.
D is incorrect, as features are already tested by the ML team.
Links:
A Rubric for ML Production Readiness and Technical Debt Reduction (paper; refer to section V: Monitoring): https://storage.googleapis.com/pub-tools-public-publication-data/pdf/aad9f93b86b7addfea4c419b9100c6cdd26cacea.pdf
You work for a credit card company and have been asked to create a custom fraud detection model based on historical data using AutoML Tables. You need to prioritize the detection of fraudulent transactions while minimizing false positives. Which optimization objective should you use when training the model?
A
An optimization objective that minimizes Log loss
B
An optimization objective that maximizes the Precision at a Recall value of 0.50
C
An optimization objective that maximizes the area under the precision-recall curve (AUC PR) value
D
An optimization objective that maximizes the area under the receiver operating characteristic curve (AUC ROC) value
The question asks us to maximize true positives while minimizing false positives; nothing is said about false negatives. Hence we need to increase precision, TP / (TP + FP).
A is incorrect. Minimizing log loss (binary cross-entropy) is a reasonable objective for a binary classification problem, but it does not specifically drive up precision.
B is correct. Because of the precision-recall trade-off, fixing recall at 0.50 and maximizing precision at that operating point improves precision, and false negatives are not a concern here.
C is incorrect. False negatives are not a priority in the question, so this option is discarded, even though maximizing AUC PR would otherwise be an ideal (if harder to achieve) objective.
D is incorrect. AUC ROC plots the true positive rate (recall) against the false positive rate; it is the objective to use when minimizing false negatives is the priority.
Links:
Precision-recall: https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
ROC-AUC: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
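For intuition, precision at a fixed recall of 0.50 can be read off a precision-recall curve, for example with scikit-learn; the labels and scores below are dummy values:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                      # 1 = fraudulent
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.05, 0.6])   # model scores

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Highest precision achievable while keeping recall at or above 0.50.
precision_at_recall_050 = precision[recall >= 0.5].max()
print(precision_at_recall_050)
```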
Your company manages a video sharing website where users can watch and upload videos. You need to create an ML model to predict which newly uploaded videos will be the most popular so that those videos can be prioritized on your company’s website. Which result should you use to determine whether the model is successful?
A
The model predicts videos as popular if the user who uploads them has over 10,000 likes.
B
The model predicts 97.5% of the most popular clickbait videos measured by number of clicks.
C
The model predicts 95% of the most popular videos measured by watch time within 30 days of being uploaded.
D
The Pearson correlation coefficient between the log-transformed number of views after 7 days and 30 days after publication is equal to 0.
A is incorrect, all videos don’t receive likes uniformly, so this isn’t a good metric.
B is incorrect, as actual watch time is a more important factor than the video being just clicked to be watched.
C is correct, the videos which are watched for the most time for 30 days can be considered as popular videos.
D is incorrect. A Pearson correlation coefficient of 0 between 7-day and 30-day views only means there is no linear relationship between early and later view counts; it says nothing about whether the model identified popular videos, so it is not a useful success criterion.
Note:
YouTube itself uses a metric called view velocity, which measures how many subscribers watch a video right after it is published; the higher the view velocity, the higher the video ranks. That option, however, is not offered here.
You are working on a Neural Network-based project. The dataset provided to you has columns with different ranges. While preparing the data for model training, you discover that gradient optimization is having difficulty moving weights to a good solution. What should you do?
A
Use feature construction to combine the strongest features.
B
Use the representation transformation (normalization) technique.
C
Improve the data cleaning step by removing features with missing values.
D
Change the partitioning step to reduce the dimension of the test set and have a larger training set.
Since the dataset’s columns have very different ranges, all features must be scaled to a common range before training; this is known as feature scaling.
A is incorrect. Nothing indicates that the model is struggling because of an excessive number of features; that situation usually shows up as overfitting, in which case feature selection and dropout would be the remedies.
B is correct. Normalization scales every feature to a fixed range, which lets gradient optimization converge much more easily on this data.
C is incorrect, as nothing is mentioned about the missing values in the question.
D is incorrect, again nothing is mentioned about data splitting.
Links:
Scaling for neural networks: https://machinelearningmastery.com/how-to-improve-neural-network-stability-and-modeling-performance-with-data-scaling/
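A minimal Keras example of this representation transformation; the feature matrix is synthetic and only illustrates two columns with very different ranges:

```python
import numpy as np
import tensorflow as tf

# Two columns with very different ranges (e.g. age in years, income in dollars).
X = np.array([[25, 40_000], [38, 120_000], [52, 75_000], [29, 56_000]], dtype="float32")

normalizer = tf.keras.layers.Normalization()
normalizer.adapt(X)  # learns per-feature mean and variance from the data

model = tf.keras.Sequential([
    normalizer,                              # inputs are scaled before the dense layers
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
```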
Your data science team needs to rapidly experiment with various features, model architectures, and hyperparameters. They need to track the accuracy metrics for various experiments and use an API to query the metrics over time. What should they use to track and report their experiments while minimizing manual effort?
A
Use Kubeflow Pipelines to execute the experiments. Export the metrics file, and query the results using the Kubeflow Pipelines API.
B
Use AI Platform Training to execute the experiments. Write the accuracy metrics to BigQuery, and query the results using the BigQuery API.
C
Use AI Platform Training to execute the experiments. Write the accuracy metrics to Cloud Monitoring, and query the results using the Monitoring API.
D
Use AI Platform Notebooks to execute the experiments. Collect the results in a shared Google Sheets file, and query the results using the Google Sheets API.
A is correct. Metrics exported from a pipeline run appear as a visualization on the Runs page for an experiment in the Kubeflow Pipelines UI, and they can be queried programmatically through the Kubeflow Pipelines API.
B is incorrect. Writing accuracy metrics to BigQuery and comparing runs through the BigQuery API is a viable solution, but it is ruled out only because the Kubeflow option involves less manual effort.
C is incorrect. Cloud Monitoring is meant for monitoring infrastructure and model performance over time, not for comparing experiment results.
D is incorrect. This could also work, but only if the results were written to the Google Sheet programmatically through the API rather than collected and entered manually.
Links:
Kubeflow metrics:
https://www.kubeflow.org/docs/components/pipelines/sdk/pipelines-metrics/
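A hedged sketch of exporting a metric from a Kubeflow Pipelines lightweight component so it shows up on the Runs page and is queryable via the API; this follows the KFP v1 SDK convention described in the linked docs, and the exact output-parameter naming can differ between SDK versions:

```python
from kfp.components import OutputPath, create_component_from_func


def train_and_report(mlpipeline_metrics_path: OutputPath("Metrics")):
    """Trains a (dummy) model and writes its accuracy in the KFP metrics format."""
    import json  # imports must live inside lightweight component functions

    accuracy = 0.92  # placeholder for a real evaluation result
    metrics = {"metrics": [{"name": "accuracy", "numberValue": accuracy, "format": "RAW"}]}
    with open(mlpipeline_metrics_path, "w") as f:
        json.dump(metrics, f)


train_and_report_op = create_component_from_func(train_and_report)
```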
You work for a bank and are building a random forest model for fraud detection. You have a dataset that includes transactions, of which 1% are identified as fraudulent. Which data transformation strategy would likely improve the performance of your classifier?
A. Write your data in TFRecords.
B. Z-normalize all the numeric features.
C. Oversample the fraudulent transaction 10 times.
D. Use one-hot encoding on all categorical features.
Note:
You have a dataset in which only 1% of transactions are identified as fraudulent, so this is a class imbalance problem. Common techniques for handling imbalance include undersampling, oversampling, ensemble modeling, augmentation, and probabilistically dropping majority-class examples during minibatch training.
A is incorrect, tfrecords are used for input pipeline speed optimization in Tensorflow, not for data imbalance.
B is incorrect, z-normalization is used to handle feature scaling requirements, not for data imbalance.
C is correct; oversampling the fraudulent class reduces the data imbalance.
D is incorrect; one-hot encoding is an encoding technique and has nothing to do with data imbalance.
Links:
Credit card fraud detection with Data imbalance:
https://towardsdatascience.com/how-to-build-a-machine-learning-model-to-identify-credit-card-fraud-in-5-stepsa-hands-on-modeling-5140b3bd19f1
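A simple pandas sketch of oversampling the minority class roughly 10x; the column name is illustrative:

```python
import pandas as pd


def oversample_fraud(df: pd.DataFrame, factor: int = 10) -> pd.DataFrame:
    """Duplicates fraudulent rows `factor` times to reduce class imbalance."""
    fraud = df[df["is_fraud"] == 1]
    non_fraud = df[df["is_fraud"] == 0]
    oversampled = pd.concat([non_fraud] + [fraud] * factor, ignore_index=True)
    return oversampled.sample(frac=1.0, random_state=42)  # shuffle the rows
```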
Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use the AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?
A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. MirroredStrategy; Custom tier with a single master node and four v100 GPUs.
A is not correct because Default Strategy does not distribute training across multiple devices.
B is not correct because the One Device Strategy does not distribute training across multiple devices.
C is not correct because the One Device Strategy does not distribute training across multiple devices.
D is correct because MirroredStrategy is the only option listed that distributes training, replicating the model across the multiple GPUs on the single master node.
https://www.tensorflow.org/guide/distributed_training
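A minimal MirroredStrategy sketch matching the single-node, multi-GPU configuration in option D; the model below is a placeholder, not Inception-v3:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all GPUs visible on the single node
print("Number of replicas:", strategy.num_replicas_in_sync)

# Scale the global batch size with the number of GPU replicas.
global_batch_size = 64 * strategy.num_replicas_in_sync

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.GlobalAveragePooling2D(input_shape=(299, 299, 3)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```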
You work for a manufacturing company that owns a high-value machine that has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data is stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?
A. Data preparation: Daily max value feature engineering; Model training: AutoML classification with BQML
B. Data preparation: Daily min value feature engineering; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
C. Data preparation: Rolling average feature engineering; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False
D. Data preparation: Rolling average feature engineering; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
A is not correct because a rolling average is a better feature engineering technique, as it will smooth out the noise and fluctuation in the data to demonstrate whether there is a trend. Using the max value could be an artifact of some noise and may not capture the trend accurately.
B is not correct because a rolling average is a better feature engineering technique, as it will smooth out the noise and fluctuation in the data to demonstrate whether there is a trend. Using the min value could be an artifact of some noise and may not capture the trend accurately.
C is not correct because the model training does not balance class labels for an imbalanced dataset.
D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
https://cloud.google.com/dataprep/docs/html/ROLLINGAVERAGE- Function_57344753
https://cloud.google.com/dataprep/docs/html/AVERAGE-Function_57344661
https://cloud.google.com/bigquery-ml/docs/reference/standard- sql/bigqueryml-syntax-create
https://en.wikipedia.org/wiki/Precision_and_recall
https://en.wikipedia.org/wiki/Sensitivity_and_specificity
https://en.wikipedia.org/wiki/Moving_average
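A rough sketch of the winning setup submitted through the BigQuery Python client; the dataset, table, rolling-average columns, and label name are hypothetical, while model_type and auto_class_weights are real BigQuery ML options:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

create_model_sql = """
CREATE OR REPLACE MODEL `my-project.maintenance.failure_model`
OPTIONS(
  model_type = 'logistic_reg',
  auto_class_weights = TRUE,          -- balances the rare failure class
  input_label_cols = ['failed_within_3_days']
) AS
SELECT
  rolling_avg_temperature,            -- precomputed rolling-average features
  rolling_avg_vibration,
  failed_within_3_days
FROM `my-project.maintenance.sensor_features`
"""

client.query(create_model_sql).result()  # waits for training to finish
```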
You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labeled. You need to label these pictures, and then train and deploy the model. What should you do?
A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.
B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.
C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.
D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real-time object detection model.
A is correct as this will allow you to easily create a request for a labeling task and deploy a high-performance model.
B is not correct because Vision API is not guaranteed to work with any company logos, and in the statement, it explicitly mentions a small startup, which will further decrease the chance of success.
C is not correct because the task of manually labelling the data is time consuming and should be avoided if possible.
D is not correct because labeling object detection data is very tedious, and real-time object detection is designed for detecting objects in videos rather than in images.
https://cloud.google.com/ai-platform/data-labeling/docs
You are developing an application on Google Cloud that will automatically generate subject labels for users’ blog posts. You are under competitive pressure to add this feature quickly, and you have no additional developer resources. No one on your team has experience with machine learning. What should you do?
A. Call the Cloud Natural Language API from your application. Process the generated Entity Analysis as labels.
B. Call the Cloud Natural Language API from your application. Process the generated Sentiment Analysis as labels.
C. Build and train a text classification model using TensorFlow. Deploy the model using AI Platform Prediction. Call the model from your application and process the results as labels.
D. Build and train a text classification model using TensorFlow. Deploy the model using a Kubernetes Engine cluster. Call the model from your application and process the results as labels.
A is correct because it provides a managed service and a fully trained model, and the user is pulling the entities, which is the right label.
B is not correct because sentiment is the incorrect label for this use case.
C is not correct because this requires experience with machine learning.
D is not correct because this requires experience with machine learning.
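A small example of pulling entities to use as labels with the Cloud Natural Language Python client; the blog-post text is a made-up sample:

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

blog_post = "Our new trail-running shoes were tested in the Alps last summer."
document = language_v1.Document(
    content=blog_post, type_=language_v1.Document.Type.PLAIN_TEXT)

response = client.analyze_entities(request={"document": document})
labels = [entity.name for entity in response.entities]  # e.g. used as subject labels
print(labels)
```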
You are developing an application on Google Cloud that will label famous landmarks in users’ photos. You are under competitive pressure to develop a predictive model quickly. You need to keep service costs low. What should you do?
A. Build an application that calls the Cloud Vision API. Inspect the generated MID values to supply the image labels.
B. Build an application that calls the Cloud Vision API. Pass landmark location as base64-encoded strings.
C. Build and train a classification model with TensorFlow. Deploy the model using AI Platform Prediction. Pass client image locations as base64-encoded strings.
D. Build and train a classification model with TensorFlow. Deploy the model using AI Platform Prediction. Inspect the generated MID values to supply the image labels.
B is correct because of the requirement to quickly develop a model that generates landmark labels from photos. This is supported in the Cloud Vision API; see the link below.
A is not correct because you should not inspect the generated MID values; instead, you should simply pass the image locations to the API and use the labels, which are output.
C, D are not correct because you should not build a custom classification TF model for this scenario.
https://cloud.google.com/vision/docs/labels
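For reference, landmark detection with the Cloud Vision Python client looks roughly like this; the image URI is a placeholder:

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image()
image.source.image_uri = "gs://my-bucket/user_photos/photo1.jpg"  # hypothetical

response = client.landmark_detection(image=image)
for landmark in response.landmark_annotations:
    print(landmark.description, landmark.score)  # e.g. "Eiffel Tower", 0.97
```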
Your organization’s marketing team wants to send biweekly scheduled emails to customers that are expected to spend above a variable threshold. This is the first ML use case for the marketing team, and you have been tasked with the implementation. After setting up a new Google Cloud project, you use Vertex AI Workbench to develop model training and batch inference with an XGBoost model on the transactional data stored in Cloud Storage. You want to automate the end-to-end pipeline that will securely provide the predictions to the marketing team, while minimizing cost and code maintenance. What should you do?
A. Create a scheduled pipeline on Vertex AI Pipelines that accesses the data from Cloud Storage, uses Vertex AI to perform training and batch prediction, and outputs a file in a Cloud Storage bucket that contains a list of all customer emails and expected spending.
B. Create a scheduled pipeline on Cloud Composer that accesses the data from Cloud Storage, copies the data to BigQuery, uses BigQuery ML to perform training and batch prediction, and outputs a table in BigQuery with customer emails and expected spending.
C. Create a scheduled notebook on Vertex AI Workbench that accesses the data from Cloud Storage, performs training and batch prediction on the managed notebook instance, and outputs a file in a Cloud Storage bucket that contains a list of all customer emails and expected spending.
D. Create a scheduled pipeline on Cloud Composer that accesses the data from Cloud Storage, uses Vertex AI to perform training and batch prediction, and sends an email to the marketing team’s Gmail group email with an attachment that contains an encrypted list of all customer emails and expected spending.
A is correct because Vertex AI Pipelines and Cloud Storage are cost-effective and secure solutions. The solution requires the least number of code interactions because the marketing team can update the pipeline and schedule parameters from the Google Cloud console.
B is not correct. Cloud Composer is not a cost-efficient solution for one pipeline because its environment is always active. In addition, using BigQuery is not the most cost-effective solution.
C is not correct because the marketing team would have to enter the Vertex AI Workbench instance to update a pipeline parameter, which does not minimize code interactions.
D is not correct. Cloud Composer is not a cost-efficient solution for one pipeline because its environment is always active. Also, using email to send personally identifiable information (PII) is not a recommended approach.
https://cloud.google.com/storage/docs/encryption
https://cloud.google.com/vertex-ai/docs/pipelines/run-pipeline
https://cloud.google.com/vertex-ai/docs/workbench/managed/schedule-managed-notebooks-run-quickstart
https://cloud.google.com/arc
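A hedged sketch of submitting such a pipeline with the Vertex AI SDK; the compiled template path, pipeline root, and parameter names are placeholders, and the biweekly schedule would be configured on top of this run (for example from the Google Cloud console or the pipeline scheduler):

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical values

job = aiplatform.PipelineJob(
    display_name="spend-prediction-pipeline",
    template_path="gs://my-bucket/pipelines/spend_prediction.json",  # compiled pipeline spec
    pipeline_root="gs://my-bucket/pipeline-root/",
    parameter_values={
        "input_data_uri": "gs://my-bucket/transactions/",
        "output_uri": "gs://my-bucket/predictions/expected_spend.csv",
    },
)
job.submit()  # runs the training and batch prediction steps defined in the template
```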
You have developed a very large network in TensorFlow Keras that is expected to train for multiple days. The model uses only built-in TensorFlow operations to perform training with high-precision arithmetic. You want to update the code to run distributed training using tf.distribute.Strategy and configure a corresponding machine instance in Compute Engine to minimize training time. What should you do?
A. Select an instance with an attached GPU, and gradually scale up the machine type until the optimal execution time is reached. Add MirroredStrategy to the code, and create the model in the strategy’s scope with batch size dependent on the number of replicas.
B. Create an instance group with one instance with attached GPU, and gradually scale up the machine type until the optimal execution time is reached. Add TF_CONFIG and MultiWorkerMirroredStrategy to the code, create the model in the strategy’s scope, and set up data autosharding.
C. Create a TPU virtual machine, and gradually scale up the machine type until the optimal execution time is reached. Add TPU initialization at the start of the program, define a distributed TPUStrategy, and create the model in the strategy’s scope with batch size and training steps dependent on the number of TPUs.
D. Create a TPU node, and gradually scale up the machine type until the optimal execution time is reached. Add TPU initialization at the start of the program, define a distributed TPUStrategy, and create the model in the strategy’s scope with batch size and training steps dependent on the number of TPUs.
A is not correct because it is suboptimal in minimizing execution time for model training. MirroredStrategy only supports multiple GPUs on one instance, which may not be as performant as running on multiple instances.
B is correct because GPUs are the correct hardware for deep learning training with high-precision training, and distributing training with multiple instances will allow maximum flexibility in fine-tuning the accelerator selection to minimize execution time. Note that one worker could still be the best setting if the overhead of synchronizing the gradients across machines is too high, in which case this approach will be equivalent to MirroredStrategy.
C is not correct because TPUs are not recommended for workloads that require high-precision arithmetic, and are recommended for models that train for weeks or months.
D is not correct because TPUs are not recommended for workloads that require high-precision arithmetic, and are recommended for models that train for weeks or months. Also, TPU nodes are not recommended unless required by the application.
https://cloud.google.com/tpu/docs/intro-to-tpu#when_to_use_tpus
https://www.tensorflow.org/guide/distributed_training
https://www.tensorflow.org/tutorials/distribute/multi_worker_with_ctl
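A rough sketch of option B's setup: TF_CONFIG describes the worker cluster, MultiWorkerMirroredStrategy wraps model creation, and tf.data auto-sharding splits the input across workers. The host addresses are placeholders, and the tiny model and random data are only for illustration:

```python
import json
import os

import tensorflow as tf

# Each instance in the group gets the same cluster spec but its own task index.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:2222", "10.0.0.2:2222"]},  # hypothetical hosts
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

# Enable automatic data sharding across workers.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 8]), tf.random.normal([1024, 1]))).batch(64).with_options(options)

model.fit(dataset, epochs=1)
```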
You developed a tree model based on an extensive feature set of user behavioral data. The model has been in production for 6 months. New regulations were just introduced that require anonymizing personally identifiable information (PII), which you have identified in your feature set using the Cloud Data Loss Prevention API. You want to update your model pipeline to adhere to the new regulations while minimizing a reduction in model performance. What should you do?
A. Redact the features containing PII data, and train the model from scratch.
B. Mask the features containing PII data, and tune the model from the last checkpoint.
C. Use key-based hashes to tokenize the features containing PII data, and train the model from scratch.
D. Use deterministic encryption to tokenize the features containing PII data, and tune the model from the last checkpoint.
A is not correct because removing features from the model does not keep referential integrity by maintaining the original relationship between records, and is likely to cause a drop in performance.
B is not correct because masking does not enforce referential integrity, and a drop in model performance may happen. Also, tuning the existing model is not recommended because the model training on the original dataset may have memorized sensitive information.
C is correct because hashing is an irreversible transformation that ensures anonymization and does not lead to an expected drop in model performance because you keep the same feature set while enforcing referential integrity.
D is not correct because deterministic encryption is reversible, and anonymization requires irreversibility. Also, tuning the existing model is not recommended because the model training on the original dataset may have memorized sensitive information.
https://cloud.google.com/dlp/docs/transformations-reference#transformation_methods
https://cloud.google.com/dlp/docs/deidentify-sensitive-data
https://cloud.google.com/blog/products/identity-security/next-onair20-security-week-session-guide
https://cloud.google.com/dlp/docs/creating-job-triggers
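For intuition, key-based hashing can be sketched with Python's standard library; in practice the secret key would come from a key-management system rather than being hard-coded:

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-key-from-your-kms"  # placeholder, never hard-code in practice


def tokenize_pii(value: str) -> str:
    """Returns an irreversible, deterministic token for a PII value.

    The same input always maps to the same token, preserving referential
    integrity across records, but the original value cannot be recovered.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Example: the same email always produces the same token.
assert tokenize_pii("user@example.com") == tokenize_pii("user@example.com")
```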
You need to train an object detection model to identify bounding boxes around Post-it Notes® in an image. Post-it Notes can have a variety of background colors and shapes. You have a dataset with 1000 images with a maximum size of 1.4MB and a CSV file containing annotations stored in Cloud Storage. You want to select a training method that reliably detects Post-it Notes of any relative size in the image and that minimizes the time to train a model. What should you do?
A. Use the Cloud Vision API in Vertex AI with OBJECT_LOCALIZATION type, and filter the detected objects that match the Post-it Note category only.
B. Upload your dataset into Vertex AI. Use Vertex AI AutoML Vision Object Detection with accuracy as the optimization metric, early stopping enabled, and no training budget specified.
C. Write a Python training application that trains a custom vision model on the training set. Autopackage the application, and configure a custom training job in Vertex AI.
D. Write a Python training application that performs transfer learning on a pre-trained neural network. Autopackage the application, and configure a custom training job in Vertex AI.
A is not correct because the object detection capability of the Cloud Vision API confidently detects large objects within the image and is not the best option to reliably detect sticky notes of any relative size in the image.
B is correct because AutoML is a codeless solution that minimizes time to train and develop the model, and it is capable of detecting bounding boxes up to one percent the length of a side of an image.
C is not correct because creating a custom training job requires more development time than using AutoML does. The extra flexibility of custom training is not required because AutoML achieves state-of-the-art performance even on tiny objects (8-32 pixels). Additionally, training a model from scratch is not expected to be as performant as transfer learning.
D is not correct because creating a custom training job requires more development time than using AutoML does. The extra flexibility of custom training is not required because AutoML achieves state-of-the-art performance even on tiny objects (8-32 pixels).
https://cloud.google.com/vertex-ai/docs/start/training-methods
https://cloud.google.com/vision/automl/docs/beginners-guide#is_the_vision_api_or_automl_the_right_tool_for_me
https://cloud.google.com/vertex-ai/docs/datasets/prepare-image
https://cloud.google.com/vision-ai/docs