GCP MLE Flashcards
You are building an ML model to detect anomalies in real-time sensor data. You will use Pub/Sub to handle incoming requests. You want to store the results for analytics and visualization. How should you configure the pipeline?
A. 1 = Dataflow, 2 = AI Platform, 3 = BigQuery
B. 1 = DataProc, 2 = AutoML, 3 = Cloud Bigtable
C. 1 = BigQuery, 2 = AutoML, 3 = Cloud Functions
D. 1 = BigQuery, 2 = AI Platform, 3 = Cloud Storage
A is correct. Data from the sensors is ingested into a Pub/Sub topic, pre-processed by a Dataflow streaming job, and scored by the model served on AI Platform; the results are then stored in BigQuery for analysis and visualization with Data Studio or AI Platform Notebooks.
B is incorrect. The Apache Beam SDK used by Dataflow integrates directly with Pub/Sub streaming and is recommended over Dataproc for this ingestion pattern, and BigQuery is a better choice than Bigtable for analytics and visualization.
C is incorrect as Cloud Functions can’t be used for result analysis or visualization.
D is incorrect as Cloud Storage can’t be used for result analysis or visualization.
Note:
You can read JSON-formatted messages from a Pub/Sub topic and write them straight to a BigQuery table, but the prediction results also need to be stored in BigQuery for analysis, and BigQuery cannot make API calls to a model for predictions; that step has to happen in a Dataflow job. Hence C and D cannot be correct.
Links:
Similar problem statement: https://cloud.google.com/architecture/detecting-anomalies-in-financial-transactions
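For reference, a minimal Apache Beam (Dataflow) streaming sketch of this pipeline might look like the following. The subscription name, the model-invocation helper, and the BigQuery table are assumptions for illustration, not part of the question.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def detect_anomaly(record):
    """Placeholder for calling the AI Platform model; attaches an anomaly score."""
    record["anomaly_score"] = 0.0  # replace with a real prediction call
    return record


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    (p
     | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
           subscription="projects/my-project/subscriptions/sensor-sub")  # hypothetical
     | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
     | "Score" >> beam.Map(detect_anomaly)
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-project:sensors.anomaly_results",  # hypothetical table
           schema="sensor_id:STRING,value:FLOAT,anomaly_score:FLOAT",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```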
Your team is building an application for a global bank that will be used by millions of customers. You built a forecasting model that predicts customers’ account balances 3 days in the future. Your team will use the results in a new feature that will notify users when their account balance is likely to drop below $25. How should you serve your predictions?
A
1. Create a Pub/Sub topic for each user.
2. Deploy a Cloud Function that sends a notification when your model predicts that a user’s account balance will drop below the $25 threshold.
B
1. Create a Pub/Sub topic for each user.
2. Deploy an application on the App Engine standard environment that sends a notification when your model predicts that a user’s account balance will drop below the $25 threshold.
C
1. Build a notification system on Firebase.
2. Register each user with a user ID on the Firebase Cloud Messaging server, which sends a notification when the average of all account balance predictions drops below the $25 threshold.
D
1. Build a notification system on Firebase.
2. Register each user with a user ID on the Firebase Cloud Messaging server, which sends a notification when your model predicts that a user’s account balance will drop below the $25 threshold.
A is incorrect. This is a viable solution, but sending the notifications through Cloud Functions would incur usage charges.
B is incorrect. App Engine is costlier than Cloud Functions, so this is not a cost-effective solution.
C is incorrect, as model prediction results are not involved in this solution.
D is correct. Each user can be registered with a user ID on the Firebase Cloud Messaging (FCM) server, which sends a notification when the model predicts that the user’s account balance will drop below the $25 threshold. FCM lets you send these messages at no cost.
Note:
Firebase Cloud Messaging (FCM) is a cross-platform messaging solution that lets you reliably send messages at no cost. Hence Firebase is the right fit for this application.
Links:
Firebase Cloud Messaging (FCM): https://firebase.google.com/docs/cloud-messaging
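As a rough sketch of the FCM piece, assuming the firebase-admin Python SDK and a hypothetical device token obtained when the user registers:

```python
import firebase_admin
from firebase_admin import messaging

# Initializes the app with default credentials (e.g. when running on Google Cloud).
firebase_admin.initialize_app()


def notify_low_balance(device_token: str, predicted_balance: float) -> None:
    """Sends an FCM notification when the predicted balance is below $25."""
    if predicted_balance >= 25:
        return
    message = messaging.Message(
        notification=messaging.Notification(
            title="Low balance warning",
            body=f"Your balance is predicted to drop to ${predicted_balance:.2f}.",
        ),
        token=device_token,  # token registered by the client app
    )
    messaging.send(message)
```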
You work for an advertising company and want to understand the effectiveness of your company’s latest advertising campaign. You have streamed 500 MB of campaign data into BigQuery. You want to query the table, and then manipulate the results of that query with a pandas dataframe in an AI Platform notebook. What should you do?
A
Use AI Platform Notebooks’ BigQuery cell magic to query the data, and ingest the results as a pandas dataframe.
B
Export your table as a CSV file from BigQuery to Google Drive, and use the Google Drive API to ingest the file into your notebook instance.
C
Download your table from BigQuery as a local CSV file, and upload it to your AI Platform notebook instance. Use pandas.read_csv to ingest the file as a pandas dataframe.
D
From a bash cell in your AI Platform notebook, use the bq extract command to export the table as a CSV file to Cloud Storage, and then use gsutil cp to copy the data into the notebook. Use pandas.read_csv to ingest the file as a pandas dataframe.
A is correct. AI Platform Notebooks’ BigQuery cell magic can run the query and ingest the results directly into a pandas dataframe (refer to the link).
B is incorrect. This adds redundant steps, and you only want to manipulate the results of the query, not export the entire table.
C is incorrect. Again, you only want to manipulate the results of the query and not the entire table, so this solution is not suitable.
D is incorrect. This is another redundant, multi-step method.
Note:
Using the BigQuery Python client library to load the query results into a dataframe would be the preferred solution, but it is not among the options.
Links:
BigQuery cell magic to query the data:
https://cloud.google.com/bigquery/docs/visualize-jupyter
Pandas dataframe using bigquery python client library:
https://cloud.google.com/bigquery/docs/visualize-jupyter#querying-and-visualizing-bigquery-data-using-pandas-dataframes
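A minimal example of the cell magic in an AI Platform notebook; the project, dataset, and table names are hypothetical:

```python
# Cell 1: load the BigQuery cell magic (the client library is pre-installed on AI Platform Notebooks).
%load_ext google.cloud.bigquery

# Cell 2: run the query and store the results in the pandas dataframe `df`.
%%bigquery df
SELECT campaign_id, SUM(clicks) AS total_clicks
FROM `my-project.marketing.campaign_events`
GROUP BY campaign_id

# Cell 3: `df` is now a regular pandas dataframe you can manipulate.
df.head()
```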
You are an ML engineer at a global car manufacturer. You need to build an ML model to predict car sales in different cities around the world. Which features or feature crosses should you use to train city-specific relationships between car type and the number of sales?
A
Three individual features: binned latitude, binned longitude, and one-hot encoded car type.
B
One feature obtained as an element-wise product between latitude, longitude, and car type.
C
One feature obtained as an element-wise product between binned latitude, binned longitude, and one-hot encoded car type.
D
Two feature crosses as an element-wise product: the first between binned latitude and one-hot encoded car type, and the second between binned longitude and one-hot encoded car type.
Note:
Kindly watch the Feature cross video before going through the solution.
A is incorrect, as here we won’t be creating features that represent regional car types, all three would be independent features.
B is incorrect, as would be using binned latitude and longitudes to capture city-specific features.
C is correct, crossing all three binned latitude, binned longitude and car types is necessary to capture region-specific car type information.
D is incorrect, creating different features with latitude and longitude will not be able to capture regional information.
Links:
Feature cross: https://developers.google.com/machine-learning/crash-course/feature-crosses/video-lecture
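For illustration, here is roughly how such a cross could be built with TensorFlow feature columns; the bin boundaries and the car-type vocabulary are made-up values:

```python
import tensorflow as tf

# Bucketize latitude and longitude so nearby locations share a bin.
latitude = tf.feature_column.numeric_column("latitude")
longitude = tf.feature_column.numeric_column("longitude")
binned_lat = tf.feature_column.bucketized_column(latitude, boundaries=list(range(-90, 91, 5)))
binned_lon = tf.feature_column.bucketized_column(longitude, boundaries=list(range(-180, 181, 5)))

# Categorical car type (vocabulary is hypothetical).
car_type = tf.feature_column.categorical_column_with_vocabulary_list(
    "car_type", ["sedan", "suv", "hatchback", "pickup"])

# Cross all three so the model can learn city-specific car-type effects.
lat_lon_type_cross = tf.feature_column.crossed_column(
    [binned_lat, binned_lon, car_type], hash_bucket_size=10000)
cross_feature = tf.feature_column.indicator_column(lat_lon_type_cross)
```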
You work for a large technology company that wants to modernize their contact center. You have been asked to develop a solution to classify incoming calls by product so that requests can be more quickly routed to the correct support team. You have already transcribed the calls using the Speech-to-Text API. You want to minimize data preprocessing and development time. How should you build the model?
A
Use the AI Platform Training built-in algorithms to create a custom model.
B
Use AutoML Natural Language to extract custom entities for classification.
C
Use the Cloud Natural Language API to extract custom entities for classification.
D
Build a custom model to identify the product keywords from the transcribed calls, and then run the keywords through a classification algorithm.
A is incorrect. AI Platform Training does not yet offer built-in algorithms for NLP (only BERT is available). This would otherwise be a viable solution; had the question required something like TensorFlow, it could have been the correct answer.
B is correct. An AutoML Natural Language model can be trained to recognize custom entities, and incoming calls can then be classified with a predefined lookup that maps those entities to products. (Real automated IVRs are more sophisticated, but no such details are given in the question, and the approach described here is a usable baseline.)
C is incorrect, as the Natural Language API returns only generic entities that are not specific to the call center.
D is incorrect, this would need a lot of development time.
Links:
AutoML Natural Language AI: https://cloud.google.com/natural-language/automl/docs
AI Platform built-in algorithms (no NLP algorithms available yet):
https://cloud.google.com/ai-platform/training/docs/algorithms
Automated IVRs (kindly research more on these lines if interested):
http://www.smartcustomerservice.com/Columns/Vendor-Views/Building-a-More-Intelligent-IVR-Through-Machine-Learning-130467.aspx
You are training a TensorFlow model on a structured dataset with 100 billion records stored in several CSV files. You need to improve the input/output execution performance. What should you do?
A
Load the data into BigQuery, and read the data from BigQuery.
B
Load the data into Cloud Bigtable, and read the data from Bigtable.
C
Convert the CSV files into shards of TFRecords, and store the data in Cloud Storage.
D
Convert the CSV files into shards of TFRecords, and store the data in the Hadoop Distributed File System (HDFS).
A is correct. BigQuery is recommended by Google for storing and manipulating structured data at the scale of 100 billion rows, and the model can be trained with TensorFlow using the BigQuery TensorFlow reader. Hence this is the best solution. (Refer to the links.)
B is incorrect. Bigtable is a NoSQL database (storing data as key-value pairs) and is not suited to this structured, analytical workload.
C is incorrect. Converting the data to TFRecords would speed up the input pipeline, but BigQuery is the better option here for querying the structured dataset rather than loading the files with the tf.data API.
D is incorrect. HDFS is not needed when a Cloud Storage bucket is available, and the reasoning about TFRecords is given under option C.
Links:
Anatomy of a BigQuery Query:
https://cloud.google.com/blog/products/bigquery/anatomy-of-a-bigquery-query
End to end example for BigQuery TensorFlow reader:
https://www.tensorflow.org/io/tutorials/bigquery
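A rough sketch of the BigQuery TensorFlow reader pattern from the linked TensorFlow I/O tutorial; the project, dataset, table, and column names are placeholders, and the exact signature may vary across tensorflow-io versions:

```python
import tensorflow as tf
from tensorflow_io.bigquery import BigQueryClient

PROJECT_ID, DATASET_ID, TABLE_ID = "my-project", "sales", "transactions"  # hypothetical

client = BigQueryClient()
read_session = client.read_session(
    "projects/" + PROJECT_ID,
    PROJECT_ID, TABLE_ID, DATASET_ID,
    ["feature_a", "feature_b", "label"],   # selected fields (hypothetical columns)
    [tf.float64, tf.float64, tf.int64],    # matching output types
    requested_streams=2)

# parallel_read_rows yields a tf.data.Dataset of row dicts keyed by column name.
dataset = (read_session.parallel_read_rows()
           .map(lambda row: ((row["feature_a"], row["feature_b"]), row["label"]))
           .batch(1024))
```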
As the lead ML Engineer for your company, you are responsible for building ML models to digitize scanned customer forms. You have developed a TensorFlow model that converts the scanned images into text and stores them in Cloud Storage. You need to use your ML model on the aggregated data collected at the end of each day with minimal manual intervention. What should you do?
A
Use the batch prediction functionality of AI Platform.
B
Create a serving pipeline in Compute Engine for prediction.
C
Use Cloud Functions for prediction each time a new data point is ingested.
D
Deploy the model on AI Platform and create a version of it for online inference.
A is correct. With AI Platform batch prediction you specify an output directory, and all results are written there. No real-time requirement is mentioned in the question, so a batch prediction job can be run at the end of each day to infer results in a scalable way and aggregate them at the provided output path.
B is incorrect; this is the most inefficient approach and does not scale.
C is incorrect. No real-time requirement is mentioned in the question, so Cloud Functions are not needed here.
D is incorrect. With online prediction you would have to write code to aggregate the results to a given location, something batch prediction does automatically.
Note:
The batch prediction job’s outputPath is the Cloud Storage location where you want the prediction service to save your results.
Links:
AI Platform Batch Predictions (refer to outputPath): https://cloud.google.com/ai-platform/prediction/docs/batch-predict
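A hedged sketch of submitting such a daily batch prediction job through the AI Platform REST API via the Google API Python client; the project, model, bucket paths, and data format are illustrative assumptions:

```python
from googleapiclient import discovery

PROJECT = "my-project"  # hypothetical

ml = discovery.build("ml", "v1")
body = {
    "jobId": "digitize_forms_20240101",
    "predictionInput": {
        "modelName": f"projects/{PROJECT}/models/form_digitizer",  # deployed model
        "dataFormat": "TEXT",                                      # newline-delimited JSON instances
        "inputPaths": ["gs://my-bucket/daily_batches/2024-01-01/*"],
        "outputPath": "gs://my-bucket/predictions/2024-01-01/",    # results aggregated here
        "region": "us-central1",
    },
}
response = ml.projects().jobs().create(parent=f"projects/{PROJECT}", body=body).execute()
```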
You recently joined an enterprise-scale company that has thousands of datasets. You know that there are accurate descriptions for each table in BigQuery, and you are searching for the proper BigQuery table to use for a model you are building on AI Platform. How should you find the data that you need?
A
Use Data Catalog to search the BigQuery datasets by using keywords in the table description.
B
Tag each of your model and version resources on AI Platform with the name of the BigQuery table that was used for training.
C
Maintain a lookup table in BigQuery that maps the table descriptions to the table ID. Query the lookup table to find the correct table ID for the data that you need.
D
Execute a query in BigQuery to retrieve all the existing table names in your project using the INFORMATION_SCHEMA metadata tables that are native to BigQuery. Use the results to find the table that you need.
A is correct. Data Catalog offers powerful, structured search capabilities and predicate-based filtering over both the technical and business metadata from BigQuery, and it provides APIs in several languages to search the datasets. (Refer to the link.)
B is incorrect. To tag a BigQuery table on an AI Platform model or version resource, you would first need to find that table; the search step is not mentioned in this option.
C is incorrect. Again, the tables would need to be searched before a lookup table could be built, and that process is not described here.
D is incorrect. This would return metadata for all tables, but how the results from over a thousand datasets would be used to find the required table is not explained. (Refer to the link for details about INFORMATION_SCHEMA.)
Links:
Data Catalog overview: https://cloud.google.com/data-catalog/docs/concepts/overview
How to search using Data Catalog: https://cloud.google.com/data-catalog/docs/how-to/search
Getting table metadata using INFORMATION_SCHEMA:
https://cloud.google.com/bigquery/docs/information-schema-tables
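A rough example of a keyword search over table descriptions with the Data Catalog Python client; the project ID and query string are hypothetical:

```python
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

# Limit the search scope to your project(s).
scope = datacatalog_v1.SearchCatalogRequest.Scope()
scope.include_project_ids.append("my-project")  # hypothetical

# Search BigQuery tables whose description mentions "customer churn".
results = client.search_catalog(request={
    "scope": scope,
    "query": "type=table system=bigquery description:customer churn",
})
for result in results:
    print(result.relative_resource_name)
```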
You started working on a classification problem with time-series data and achieved an area under the receiver operating characteristic curve (AUC ROC) value of 99% for training data after just a few experiments. You haven’t explored using any sophisticated algorithms or spent any time on hyperparameter tuning. What should your next step be to identify and fix the problem?
A
Address the model overfitting by using a less complex algorithm.
B
Address data leakage by applying nested cross-validation during model training.
C
Address data leakage by removing features highly correlated with the target value.
D
Address the model overfitting by tuning the hyperparameters to reduce the AUC ROC value.
This is not a data leakage problem: the question gives no details about how the data was split. If time-series data were split randomly rather than by time, that would cause leakage, but here we assume the data was split correctly. Since the 99% ROC AUC was measured on the training data rather than the validation data, this is an overfitting issue.
A is correct. Since this is an overfitting issue, the first step is to reduce model complexity, either by using a less complex model or by applying L1 regularization or dropout.
B is incorrect; as mentioned in the note, this is not a data leakage problem.
C is incorrect; as mentioned in the note, this is not a data leakage problem.
D is incorrect. Tuning hyperparameters could help, but since the 99% ROC AUC was obtained after only a few experiments, the first thing to do is reduce model complexity.
Links:
Overfitting explained: https://elitedatascience.com/overfitting-in-machine-learning
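For example, a minimal Keras sketch of reducing model complexity with L1 regularization and dropout; the layer sizes are arbitrary:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l1(1e-4)),  # L1 penalty on the weights
    tf.keras.layers.Dropout(0.3),                            # randomly drops units during training
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auc")])
```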
You work for an online travel agency that also sells advertising placements on its website to other companies. You have been asked to predict the most relevant web banner that a user should see next. Security is important to your company. The model latency requirements are 300ms, the inventory is thousands of web banners, and your exploratory analysis has shown that navigation context is a good predictor. You want to Implement the simplest solution. How should you configure the prediction pipeline?
A.
Embed the client on the website, and then deploy the model on AI Platform Prediction.
B
Embed the client on the website, deploy the gateway on App Engine, and then deploy the model on AI Platform Prediction.
C
Embed the client on the website, deploy the gateway on App Engine, deploy the database on Cloud Bigtable for writing and for reading the user’s navigation context, and then deploy the model on AI Platform Prediction.
D
Embed the client on the website, deploy the gateway on App Engine, deploy the database on Memorystore for writing and for reading the user’s navigation context, and then deploy the model on Google Kubernetes Engine.
The question asks for the simplest solution. The client embedded in the website can call AI Platform online prediction directly, because the deployed model exposes a REST API, and the fewer network hops in the prediction path, the lower the latency. Among the options, A is the simplest and most viable solution with the lowest latency: it involves just one network call, and the current navigation context is already available on the client.
A is correct, as mentioned in the note, this is the simplest and most viable solution for low latency. The only latency component in this solution is the response time for Online Prediction.
B is incorrect; it adds more moving parts than necessary, and option A is much simpler.
C is incorrect; it adds more moving parts than necessary, and option A is much simpler. (The database is not required, since only the next banner has to be shown to the user.)
D is incorrect; option A is much simpler. This would otherwise be a workable design: Memorystore could cache the user’s current navigation context if that could not be done on the client, but serving the model on GKE is a more complex step, so it is discarded.
Links:
AI Platform online prediction (refer to the REST API): https://cloud.google.com/ai-platform/prediction/docs/online-predict
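A minimal sketch of the single REST call the embedded client would make, using the Google API Python client; the project, model, and instance fields are hypothetical:

```python
from googleapiclient import discovery

service = discovery.build("ml", "v1")
name = "projects/my-project/models/banner_ranker"  # optionally append /versions/<version>

response = service.projects().predict(
    name=name,
    body={"instances": [{"navigation_context": ["home", "flights", "paris"]}]},
).execute()

predictions = response["predictions"]  # banner scores returned by the deployed model
```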
Your team is building a convolutional neural network (CNN)-based architecture from scratch. The preliminary experiments running on your on-premises CPU-only infrastructure were encouraging, but have slow convergence. You have been asked to speed up model training to reduce time-to-market. You want to experiment with virtual machines (VMs) on Google Cloud to leverage more powerful hardware. Your code does not include any manual device placement and has not been wrapped in Estimator model-level abstraction. Which environment should you train your model on?
A
A VM on Compute Engine and 1 TPU with all dependencies installed manually.
B
A VM on Compute Engine and 8 GPUs with all dependencies installed manually.
C
A Deep Learning VM with an n1-standard-2 machine and 1 GPU with all libraries pre-installed.
D
A Deep Learning VM with more powerful CPU e2-highcpu-16 machines with all libraries pre-installed.
Since the code contains no manual device placement and is not wrapped in an Estimator or Keras model-level abstraction, we have to go with faster CPUs; GPU- and TPU-based solutions would not be viable.
A is incorrect, as it is a TPU-based solution.
B is incorrect, as it is a GPU-based solution.
C is incorrect, as it is a GPU-based solution.
D is correct, as it is a CPU-based solution and the Deep Learning VM has all the required ML libraries pre-installed.
Links:
GPU manual device placement:
https://www.tensorflow.org/guide/gpu#manual_device_placement
TPU manual device placement:
https://www.tensorflow.org/guide/tpu#manual_device_placement
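For context, "manual device placement" refers to explicitly pinning operations to a device in the code, along the lines of the following sketch:

```python
import tensorflow as tf

# Without explicit placement like this (and without a distribution strategy or an
# Estimator/Keras model-level abstraction), the code does not automatically exploit
# multiple accelerators.
with tf.device("/GPU:0"):
    a = tf.random.normal([1000, 1000])
    b = tf.random.normal([1000, 1000])
    c = tf.matmul(a, b)  # this matmul is pinned to the first GPU
```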
You work on a growing team of more than 50 data scientists who all use AI Platform. You are designing a strategy to organize your jobs, models, and versions in a clean and scalable way. Which strategy should you choose?
A
Set up restrictive IAM permissions on the AI Platform notebooks so that only a single user or group can access a given instance.
B
Separate each data scientist’s work into a different project to ensure that the jobs, models, and versions created by each data scientist are accessible only to that user.
C
Use labels to organize resources into descriptive categories.
Apply a label to each created resource so that users can filter the results by label when viewing or monitoring the resources.
D
Set up a BigQuery sink for Cloud Logging logs that is appropriately filtered to capture information about AI Platform resource usage. In BigQuery, create a SQL view that maps users to the resources they are using
A is incorrect. The question asks for a strategy to organize jobs, models, and versions in a clean and scalable way; restricting access to notebooks does not address that.
B is incorrect. If data scientists need to collaborate on a model, splitting their work across separate projects would slow down development.
C is correct. AI Platform provides labels to organize and filter your resources: you can label jobs by team or user and by development phase (prod or test), then filter jobs by team and phase.
D is incorrect. This is a viable approach: you can export all AI Platform logs to BigQuery and write custom queries that map users to the resources they use, which gives an organized view of AI Platform resources. However, option C is better because it provides built-in filtering and needs no custom table views.
Links:
Labelling resources on AI Platform: https://cloud.google.com/ai-platform/training/docs/resource-labels
BigQuery logging sink: https://cloud.google.com/logging/docs/export/bigquery
You are training a deep learning model for semantic image segmentation with reduced training time. While using a Deep Learning VM Image, you receive the following error:
The resource ‘projects/deeplearning-platforn/zones/europe-west4-c/acceleratorTypes/nvidia-tesla-k80’ was not found
What should you do?
A
Ensure that you have GPU quota in the selected region.
B
Ensure that the required GPU is available in the selected region.
C
Ensure that you have preemptible GPU quota in the selected region.
D
Ensure that the selected GPU has enough GPU memory for the workload.
Go through the troubleshooting link first.
A is incorrect, as this is a resource-not-found issue, not a quota-exceeded issue.
B is correct, as this is a resource-not-found issue; you should determine which region (or zone) offers the required GPU.
C is incorrect, as this is not a preemptible GPU quota issue.
D is incorrect; that is the remedy for ResourceExhaustedError, which generally occurs when the batch size is too large for the given machine.
Links:
Troubleshooting Deep Learning VMs: https://cloud.google.com/deep-learning-vm/docs/troubleshooting
Your team is working on an NLP research project to predict the political affiliation of authors based on articles they have written. You have a large training dataset that is structured like this:
AuthorA:Political Party A
TextA1: [SentenceA11, SentenceA12, …]
TextA2: [SentenceA21, SentenceA22, …]
You want to predict the political affiliation of authors based on the articles they have written, i.e. classify each author’s political party from their texts, so splitting the dataset by author makes the most sense (option B). If you split by text instead (option A), TextA1 might land in training and TextA2 in validation; the model could then classify TextA2 as affiliated with Political Party A based on the author’s writing style rather than the actual content of the text. The same issue arises when splitting medical datasets, which is why medical data is always split by patient. Note that the question’s framing ignores this issue and gives more weight to the split ratio, so the official answer may differ from this solution.
A is incorrect, as per the note.
B is correct, as per the note.
C is incorrect, as it would also introduce author-specific bias. (It would be the correct answer if the split ratio were considered more important than the author-specific leakage during evaluation.)
D is incorrect; the question gives no special significance to paragraphs of text.
Links:
Splitting medical dataset: https://www.coursera.org/lecture/ai-for-medical-diagnosis/splitting-data-by-patient-cQr8S
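A small sketch of an author-level split with scikit-learn; the column names and toy rows are hypothetical:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Each row is one text; `author` identifies who wrote it.
df = pd.DataFrame({
    "author": ["AuthorA", "AuthorA", "AuthorB", "AuthorB", "AuthorC", "AuthorC"],
    "text":   ["TextA1", "TextA2", "TextB1", "TextB2", "TextC1", "TextC2"],
    "party":  ["A", "A", "B", "B", "A", "A"],
})

# Split by author so no author's texts appear in both train and validation.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, val_idx = next(splitter.split(df, groups=df["author"]))
train_df, val_df = df.iloc[train_idx], df.iloc[val_idx]
```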
Your team has been tasked with creating an ML solution in Google Cloud to classify support requests for one of your platforms. You analyzed the requirements and decided to use TensorFlow to build the classifier so that you have full control of the model’s code, serving, and deployment. You will use Kubeflow pipelines for the ML platform. To save time, you want to build on existing resources and use managed services instead of building a completely new model. How should you build the classifier?
A
Use the Natural Language API to classify support requests.
B
Use AutoML Natural Language to build the support requests classifier.
C
Use an established text classification model on AI Platform to perform transfer learning.
D
Use an established text classification model on AI Platform as-is to classify support requests.
A is incorrect. The team needs control over the model’s code, serving, and deployment, which is not possible with the Natural Language API, a pre-built API that returns generic entities.
B is incorrect, as TensorFlow is specified in the question and control over the model’s code, serving, and deployment is not possible with AutoML.
C is incorrect. No need for transfer learning is mentioned in the question; if it had stated, for example, that only limited data was available, transfer learning would have been the right choice.
D is correct. This is the most viable option when the team wants to use TensorFlow and needs control over the model’s code, serving, and deployment. Note that AI Platform does not yet support many NLP models as built-in algorithms; BERT is available (see the link).
Links:
AI Platform BERT: https://cloud.google.com/ai-platform/training/docs/algorithms/bert
You recently joined a machine learning team that will soon release a new project. As a lead on the project, you are asked to determine the production readiness of the ML components. The team has already tested features and data, model development, and infrastructure. Which additional readiness check should you recommend to the team?
A
Ensure that training is reproducible.
B
Ensure that all hyperparameters are tuned.
C
Ensure that model performance is monitored.
D
Ensure that feature expectations are captured in the schema.
A is incorrect. Reproducible training does matter, because the model will eventually need retraining to cope with data drift, but it is already covered by the infrastructure tests.
B is incorrect, this is also important but included in the test for model development.
C is correct. An ML system that works correctly at launch must keep working correctly over time, so monitoring in production is the additional readiness check needed alongside the existing tests for features and data, model development, and infrastructure.
D is incorrect, as features are already tested by the ML team.
Links:
A Rubric for ML Production Readiness and Technical Debt Reduction (paper; refer to section V: Monitoring): https://storage.googleapis.com/pub-tools-public-publication-data/pdf/aad9f93b86b7addfea4c419b9100c6cdd26cacea.pdf
You work for a credit card company and have been asked to create a custom fraud detection model based on historical data using AutoML Tables. You need to prioritize the detection of fraudulent transactions while minimizing false positives. Which optimization objective should you use when training the model?
A
An optimization objective that minimizes Log loss
B
An optimization objective that maximizes the Precision at a Recall value of 0.50
C
An optimization objective that maximizes the area under the precision-recall curve (AUC PR) value
D
An optimization objective that maximizes the area under the receiver operating characteristic curve (AUC ROC) value
The question asks us to maximize true positives while minimizing false positives; nothing is said about false negatives. Hence we need to increase precision, TP / (TP + FP).
A is incorrect. Minimizing log loss (binary cross-entropy) is a reasonable objective for a binary classification problem, but it does not specifically drive up precision.
B is correct. Because of the precision-recall trade-off, fixing recall at 0.50 and maximizing precision at that operating point improves precision, and false negatives are not a concern here.
C is incorrect. False negatives are not a priority in the question, so this option is discarded, even though maximizing AUC PR would otherwise be an ideal (if harder to achieve) objective.
D is incorrect. AUC ROC plots the true positive rate (recall) against the false positive rate; it is the objective to use when minimizing false negatives is the priority.
Links:
Precision-recall: https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
ROC-AUC: https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
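For intuition, precision at a fixed recall of 0.50 can be read off a precision-recall curve, for example with scikit-learn; the labels and scores below are dummy values:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])                      # 1 = fraudulent
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.05, 0.6])   # model scores

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Highest precision achievable while keeping recall at or above 0.50.
precision_at_recall_050 = precision[recall >= 0.5].max()
print(precision_at_recall_050)
```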
Your company manages a video sharing website where users can watch and upload videos. You need to create an ML model to predict which newly uploaded videos will be the most popular so that those videos can be prioritized on your company’s website. Which result should you use to determine whether the model is successful?
A
The model predicts videos as popular if the user who uploads them has over 10,000 likes.
B
The model predicts 97.5% of the most popular clickbait videos measured by number of clicks.
C
The model predicts 95% of the most popular videos measured by watch time within 30 days of being uploaded.
D
The Pearson correlation coefficient between the log-transformed number of views after 7 days and 30 days after publication is equal to 0.
A is incorrect, all videos don’t receive likes uniformly, so this isn’t a good metric.
B is incorrect, as actual watch time is a more important factor than the video being just clicked to be watched.
C is correct, the videos which are watched for the most time for 30 days can be considered as popular videos.
D is incorrect. A Pearson correlation coefficient of 0 between 7-day and 30-day views only means there is no linear relationship between early and later view counts; it says nothing about whether the model identified popular videos, so it is not a useful success criterion.
Note:
YouTube itself uses a metric called view velocity, which measures how many subscribers watch a video right after it is published; the higher the view velocity, the higher the video ranks. That option, however, is not offered here.
You are working on a Neural Network-based project. The dataset provided to you has columns with different ranges. While preparing the data for model training, you discover that gradient optimization is having difficulty moving weights to a good solution. What should you do?
A
Use feature construction to combine the strongest features.
B
Use the representation transformation (normalization) technique.
C
Improve the data cleaning step by removing features with missing values.
D
Change the partitioning step to reduce the dimension of the test set and have a larger training set.
Since the dataset’s columns have very different ranges, all features must be scaled to a common range before training; this is known as feature scaling.
A is incorrect. Nothing indicates that the model is struggling because of an excessive number of features; that situation usually shows up as overfitting, in which case feature selection and dropout would be the remedies.
B is correct. Normalization scales every feature to a fixed range, which lets gradient optimization converge much more easily on this data.
C is incorrect, as nothing is mentioned about the missing values in the question.
D is incorrect, again nothing is mentioned about data splitting.
Links:
Scaling for neural networks: https://machinelearningmastery.com/how-to-improve-neural-network-stability-and-modeling-performance-with-data-scaling/
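A minimal Keras example of this representation transformation; the feature matrix is synthetic and only illustrates two columns with very different ranges:

```python
import numpy as np
import tensorflow as tf

# Two columns with very different ranges (e.g. age in years, income in dollars).
X = np.array([[25, 40_000], [38, 120_000], [52, 75_000], [29, 56_000]], dtype="float32")

normalizer = tf.keras.layers.Normalization()
normalizer.adapt(X)  # learns per-feature mean and variance from the data

model = tf.keras.Sequential([
    normalizer,                              # inputs are scaled before the dense layers
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
```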
Your data science team needs to rapidly experiment with various features, model architectures, and hyperparameters. They need to track the accuracy metrics for various experiments and use an API to query the metrics over time. What should they use to track and report their experiments while minimizing manual effort?
A
Use Kubeflow Pipelines to execute the experiments. Export the metrics file, and query the results using the Kubeflow Pipelines API.
B
Use AI Platform Training to execute the experiments. Write the accuracy metrics to BigQuery, and query the results using the BigQuery API.
C
Use AI Platform Training to execute the experiments. Write the accuracy metrics to Cloud Monitoring, and query the results using the Monitoring API.
D
Use AI Platform Notebooks to execute the experiments. Collect the results in a shared Google Sheets file, and query the results using the Google Sheets API.
A is correct. Metrics exported from a pipeline run appear as a visualization on the Runs page for an experiment in the Kubeflow Pipelines UI, and they can be queried programmatically through the Kubeflow Pipelines API.
B is incorrect. Writing accuracy metrics to BigQuery and comparing runs through the BigQuery API is a viable solution, but it is ruled out only because the Kubeflow option involves less manual effort.
C is incorrect. Cloud Monitoring is meant for monitoring infrastructure and model performance over time, not for comparing experiment results.
D is incorrect. This could also work, but only if the results were written to the Google Sheet programmatically through the API rather than collected and entered manually.
Links:
Kubeflow metrics:
https://www.kubeflow.org/docs/components/pipelines/sdk/pipelines-metrics/
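A hedged sketch of exporting a metric from a Kubeflow Pipelines lightweight component so it shows up on the Runs page and is queryable via the API; this follows the KFP v1 SDK convention described in the linked docs, and the exact output-parameter naming can differ between SDK versions:

```python
from kfp.components import OutputPath, create_component_from_func


def train_and_report(mlpipeline_metrics_path: OutputPath("Metrics")):
    """Trains a (dummy) model and writes its accuracy in the KFP metrics format."""
    import json  # imports must live inside lightweight component functions

    accuracy = 0.92  # placeholder for a real evaluation result
    metrics = {"metrics": [{"name": "accuracy", "numberValue": accuracy, "format": "RAW"}]}
    with open(mlpipeline_metrics_path, "w") as f:
        json.dump(metrics, f)


train_and_report_op = create_component_from_func(train_and_report)
```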
You work for a bank and are building a random forest model for fraud detection. You have a dataset that includes transactions, of which 1% are identified as fraudulent. Which data transformation strategy would likely improve the performance of your classifier?
A. Write your data in TFRecords.
B. Z-normalize all the numeric features.
C. Oversample the fraudulent transaction 10 times.
D. Use one-hot encoding on all categorical features.
Note:
You have a dataset in which only 1% of transactions are identified as fraudulent, so this is a class imbalance problem. Common techniques for handling imbalance include undersampling, oversampling, ensemble modeling, augmentation, and probabilistically dropping majority-class examples during minibatch training.
A is incorrect, tfrecords are used for input pipeline speed optimization in Tensorflow, not for data imbalance.
B is incorrect, z-normalization is used to handle feature scaling requirements, not for data imbalance.
C is correct; oversampling the fraudulent class reduces the data imbalance.
D is incorrect; one-hot encoding is an encoding technique and has nothing to do with data imbalance.
Links:
Credit card fraud detection with Data imbalance:
https://towardsdatascience.com/how-to-build-a-machine-learning-model-to-identify-credit-card-fraud-in-5-stepsa-hands-on-modeling-5140b3bd19f1
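A simple pandas sketch of oversampling the minority class roughly 10x; the column name is illustrative:

```python
import pandas as pd


def oversample_fraud(df: pd.DataFrame, factor: int = 10) -> pd.DataFrame:
    """Duplicates fraudulent rows `factor` times to reduce class imbalance."""
    fraud = df[df["is_fraud"] == 1]
    non_fraud = df[df["is_fraud"] == 0]
    oversampled = pd.concat([non_fraud] + [fraud] * factor, ignore_index=True)
    return oversampled.sample(frac=1.0, random_state=42)  # shuffle the rows
```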
Your team is using a TensorFlow Inception-v3 CNN model pretrained on ImageNet for an image classification prediction challenge on 10,000 images. You will use the AI Platform to perform the model training. What TensorFlow distribution strategy and AI Platform training job configuration should you use to train the model and optimize for wall-clock time?
A. Default Strategy; Custom tier with a single master node and four v100 GPUs.
B. One Device Strategy; Custom tier with a single master node and four v100 GPUs.
C. One Device Strategy; Custom tier with a single master node and eight v100 GPUs.
D. MirroredStrategy; Custom tier with a single master node and four v100 GPUs.
A is not correct because Default Strategy does not distribute training across multiple devices.
B is not correct because the One Device Strategy does not distribute training across multiple devices.
C is not correct because the One Device Strategy does not distribute training across multiple devices.
D is correct because MirroredStrategy is the only option listed that distributes training, replicating the model across the multiple GPUs on the single master node.
https://www.tensorflow.org/guide/distributed_training
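A minimal MirroredStrategy sketch matching the single-node, multi-GPU configuration in option D; the model below is a placeholder, not Inception-v3:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all GPUs visible on the single node
print("Number of replicas:", strategy.num_replicas_in_sync)

# Scale the global batch size with the number of GPU replicas.
global_batch_size = 64 * strategy.num_replicas_in_sync

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.GlobalAveragePooling2D(input_shape=(299, 299, 3)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```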
You work for a manufacturing company that owns a high-value machine that has several machine settings and multiple sensors. A history of the machine’s hourly sensor readings and known failure event data is stored in BigQuery. You need to predict if the machine will fail within the next 3 days in order to schedule maintenance before the machine fails. Which data preparation and model training steps should you take?
A. Data preparation: Daily max value feature engineering; Model training: AutoML classification with BQML
B. Data preparation: Daily min value feature engineering; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
C. Data preparation: Rolling average feature engineering; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to False
D. Data preparation: Rolling average feature engineering; Model training: Logistic regression with BQML and AUTO_CLASS_WEIGHTS set to True
A is not correct because a rolling average is a better feature engineering technique, as it will smooth out the noise and fluctuation in the data to demonstrate whether there is a trend. Using the max value could be an artifact of some noise and may not capture the trend accurately.
B is not correct because a rolling average is a better feature engineering technique, as it will smooth out the noise and fluctuation in the data to demonstrate whether there is a trend. Using the min value could be an artifact of some noise and may not capture the trend accurately.
C is not correct because the model training does not balance class labels for an imbalanced dataset.
D is correct because it uses the rolling average of the sensor data and balances the weights using the BQML auto class weight balance parameter.
https://cloud.google.com/dataprep/docs/html/ROLLINGAVERAGE- Function_57344753
https://cloud.google.com/dataprep/docs/html/AVERAGE-Function_57344661
https://cloud.google.com/bigquery-ml/docs/reference/standard- sql/bigqueryml-syntax-create
https://en.wikipedia.org/wiki/Precision_and_recall
https://en.wikipedia.org/wiki/Sensitivity_and_specificity
https://en.wikipedia.org/wiki/Moving_average
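A rough sketch of the winning setup submitted through the BigQuery Python client; the dataset, table, rolling-average columns, and label name are hypothetical, while model_type and auto_class_weights are real BigQuery ML options:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

create_model_sql = """
CREATE OR REPLACE MODEL `my-project.maintenance.failure_model`
OPTIONS(
  model_type = 'logistic_reg',
  auto_class_weights = TRUE,          -- balances the rare failure class
  input_label_cols = ['failed_within_3_days']
) AS
SELECT
  rolling_avg_temperature,            -- precomputed rolling-average features
  rolling_avg_vibration,
  failed_within_3_days
FROM `my-project.maintenance.sensor_features`
"""

client.query(create_model_sql).result()  # waits for training to finish
```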
You need to build an object detection model for a small startup company to identify if and where the company’s logo appears in an image. You were given a large repository of images, some with logos and some without. These images are not yet labeled. You need to label these pictures, and then train and deploy the model. What should you do?
A. Use Google Cloud’s Data Labelling Service to label your data. Use AutoML Object Detection to train and deploy the model.
B. Use Vision API to detect and identify logos in pictures and use it as a label. Use AI Platform to build and train a convolutional neural network.
C. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a convolutional neural network.
D. Create two folders: one where the logo appears and one where it doesn’t. Manually place images in each folder. Use AI Platform to build and train a real-time object detection model.
A is correct as this will allow you to easily create a request for a labeling task and deploy a high-performance model.
B is not correct because Vision API is not guaranteed to work with any company logos, and in the statement, it explicitly mentions a small startup, which will further decrease the chance of success.
C is not correct because the task of manually labelling the data is time consuming and should be avoided if possible.
D is not correct because labeling object detection data is very tedious, and real-time object detection is designed for detecting objects in videos rather than in images.
https://cloud.google.com/ai-platform/data-labeling/docs
You are developing an application on Google Cloud that will automatically generate subject labels for users’ blog posts. You are under competitive pressure to add this feature quickly, and you have no additional developer resources. No one on your team has experience with machine learning. What should you do?
A. Call the Cloud Natural Language API from your application. Process the generated Entity Analysis as labels.
B. Call the Cloud Natural Language API from your application. Process the generated Sentiment Analysis as labels.
C. Build and train a text classification model using TensorFlow. Deploy the model using AI Platform Prediction. Call the model from your application and process the results as labels.
D. Build and train a text classification model using TensorFlow. Deploy the model using a Kubernetes Engine cluster. Call the model from your application and process the results as labels.
A is correct because it provides a managed service and a fully trained model, and the user is pulling the entities, which is the right label.
B is not correct because sentiment is the incorrect label for this use case.
C is not correct because this requires experience with machine learning.
D is not correct because this requires experience with machine learning.
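A small example of pulling entities to use as labels with the Cloud Natural Language Python client; the blog-post text is a made-up sample:

```python
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

blog_post = "Our new trail-running shoes were tested in the Alps last summer."
document = language_v1.Document(
    content=blog_post, type_=language_v1.Document.Type.PLAIN_TEXT)

response = client.analyze_entities(request={"document": document})
labels = [entity.name for entity in response.entities]  # e.g. used as subject labels
print(labels)
```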
You are developing an application on Google Cloud that will label famous landmarks in users’ photos. You are under competitive pressure to develop a predictive model quickly. You need to keep service costs low. What should you do?
A. Build an application that calls the Cloud Vision API. Inspect the generated MID values to supply the image labels.
B. Build an application that calls the Cloud Vision API. Pass landmark location as base64-encoded strings.
C. Build and train a classification model with TensorFlow. Deploy the model using AI Platform Prediction. Pass client image locations as base64-encoded strings.
D. Build and train a classification model with TensorFlow. Deploy the model using AI Platform Prediction. Inspect the generated MID values to supply the image labels.
B is correct because of the requirement to quickly develop a model that generates landmark labels from photos. This is supported in the Cloud Vision API; see the link below.
A is not correct because you should not inspect the generated MID values; instead, you should simply pass the image locations to the API and use the labels, which are output.
C, D are not correct because you should not build a custom classification TF model for this scenario.
https://cloud.google.com/vision/docs/labels
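For reference, landmark detection with the Cloud Vision Python client looks roughly like this; the image URI is a placeholder:

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()
image = vision.Image()
image.source.image_uri = "gs://my-bucket/user_photos/photo1.jpg"  # hypothetical

response = client.landmark_detection(image=image)
for landmark in response.landmark_annotations:
    print(landmark.description, landmark.score)  # e.g. "Eiffel Tower", 0.97
```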
Your organization’s marketing team wants to send biweekly scheduled emails to customers that are expected to spend above a variable threshold. This is the first ML use case for the marketing team, and you have been tasked with the implementation. After setting up a new Google Cloud project, you use Vertex AI Workbench to develop model training and batch inference with an XGBoost model on the transactional data stored in Cloud Storage. You want to automate the end-to-end pipeline that will securely provide the predictions to the marketing team, while minimizing cost and code maintenance. What should you do?
A. Create a scheduled pipeline on Vertex AI Pipelines that accesses the data from Cloud Storage, uses Vertex AI to perform training and batch prediction, and outputs a file in a Cloud Storage bucket that contains a list of all customer emails and expected spending.
B. Create a scheduled pipeline on Cloud Composer that accesses the data from Cloud Storage, copies the data to BigQuery, uses BigQuery ML to perform training and batch prediction, and outputs a table in BigQuery with customer emails and expected spending.
C. Create a scheduled notebook on Vertex AI Workbench that accesses the data from Cloud Storage, performs training and batch prediction on the managed notebook instance, and outputs a file in a Cloud Storage bucket that contains a list of all customer emails and expected spending.
D. Create a scheduled pipeline on Cloud Composer that accesses the data from Cloud Storage, uses Vertex AI to perform training and batch prediction, and sends an email to the marketing team’s Gmail group email with an attachment that contains an encrypted list of all customer emails and expected spending.
A is correct because Vertex AI Pipelines and Cloud Storage are cost-effective and secure solutions. The solution requires the least number of code interactions because the marketing team can update the pipeline and schedule parameters from the Google Cloud console.
B is not correct. Cloud Composer is not a cost-efficient solution for one pipeline because its environment is always active. In addition, using BigQuery is not the most cost-effective solution.
C is not correct because the marketing team would have to enter the Vertex AI Workbench instance to update a pipeline parameter, which does not minimize code interactions.
D is not correct. Cloud Composer is not a cost-efficient solution for one pipeline because its environment is always active. Also, using email to send personally identifiable information (PII) is not a recommended approach.
https://cloud.google.com/storage/docs/encryption
https://cloud.google.com/vertex-ai/docs/pipelines/run-pipeline
https://cloud.google.com/vertex-ai/docs/workbench/managed/schedule-managed-notebooks-run-quickstart
https://cloud.google.com/arc
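A hedged sketch of submitting such a pipeline with the Vertex AI SDK; the compiled template path, pipeline root, and parameter names are placeholders, and the biweekly schedule would be configured on top of this run (for example from the Google Cloud console or the pipeline scheduler):

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical values

job = aiplatform.PipelineJob(
    display_name="spend-prediction-pipeline",
    template_path="gs://my-bucket/pipelines/spend_prediction.json",  # compiled pipeline spec
    pipeline_root="gs://my-bucket/pipeline-root/",
    parameter_values={
        "input_data_uri": "gs://my-bucket/transactions/",
        "output_uri": "gs://my-bucket/predictions/expected_spend.csv",
    },
)
job.submit()  # runs the training and batch prediction steps defined in the template
```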
You have developed a very large network in TensorFlow Keras that is expected to train for multiple days. The model uses only built-in TensorFlow operations to perform training with high-precision arithmetic. You want to update the code to run distributed training using tf.distribute.Strategy and configure a corresponding machine instance in Compute Engine to minimize training time. What should you do?
A. Select an instance with an attached GPU, and gradually scale up the machine type until the optimal execution time is reached. Add MirroredStrategy to the code, and create the model in the strategy’s scope with batch size dependent on the number of replicas.
B. Create an instance group with one instance with attached GPU, and gradually scale up the machine type until the optimal execution time is reached. Add TF_CONFIG and MultiWorkerMirroredStrategy to the code, create the model in the strategy’s scope, and set up data autosharding.
C. Create a TPU virtual machine, and gradually scale up the machine type until the optimal execution time is reached. Add TPU initialization at the start of the program, define a distributed TPUStrategy, and create the model in the strategy’s scope with batch size and training steps dependent on the number of TPUs.
D. Create a TPU node, and gradually scale up the machine type until the optimal execution time is reached. Add TPU initialization at the start of the program, define a distributed TPUStrategy, and create the model in the strategy’s scope with batch size and training steps dependent on the number of TPUs.
A is not correct because it is suboptimal in minimizing execution time for model training. MirroredStrategy only supports multiple GPUs on one instance, which may not be as performant as running on multiple instances.
B is correct because GPUs are the correct hardware for deep learning training with high-precision training, and distributing training with multiple instances will allow maximum flexibility in fine-tuning the accelerator selection to minimize execution time. Note that one worker could still be the best setting if the overhead of synchronizing the gradients across machines is too high, in which case this approach will be equivalent to MirroredStrategy.
C is not correct because TPUs are not recommended for workloads that require high-precision arithmetic, and are recommended for models that train for weeks or months.
D is not correct because TPUs are not recommended for workloads that require high-precision arithmetic, and are recommended for models that train for weeks or months. Also, TPU nodes are not recommended unless required by the application.
https://cloud.google.com/tpu/docs/intro-to-tpu#when_to_use_tpus
https://www.tensorflow.org/guide/distributed_training
https://www.tensorflow.org/tutorials/distribute/multi_worker_with_ctl
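A rough sketch of option B's setup: TF_CONFIG describes the worker cluster, MultiWorkerMirroredStrategy wraps model creation, and tf.data auto-sharding splits the input across workers. The host addresses are placeholders, and the tiny model and random data are only for illustration:

```python
import json
import os

import tensorflow as tf

# Each instance in the group gets the same cluster spec but its own task index.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:2222", "10.0.0.2:2222"]},  # hypothetical hosts
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

# Enable automatic data sharding across workers.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 8]), tf.random.normal([1024, 1]))).batch(64).with_options(options)

model.fit(dataset, epochs=1)
```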
You developed a tree model based on an extensive feature set of user behavioral data. The model has been in production for 6 months. New regulations were just introduced that require anonymizing personally identifiable information (PII), which you have identified in your feature set using the Cloud Data Loss Prevention API. You want to update your model pipeline to adhere to the new regulations while minimizing a reduction in model performance. What should you do?
A. Redact the features containing PII data, and train the model from scratch.
B. Mask the features containing PII data, and tune the model from the last checkpoint.
C. Use key-based hashes to tokenize the features containing PII data, and train the model from scratch.
D. Use deterministic encryption to tokenize the features containing PII data, and tune the model from the last checkpoint.
A is not correct because removing features from the model does not keep referential integrity by maintaining the original relationship between records, and is likely to cause a drop in performance.
B is not correct because masking does not enforce referential integrity, and a drop in model performance may happen. Also, tuning the existing model is not recommended because the model training on the original dataset may have memorized sensitive information.
C is correct because hashing is an irreversible transformation that ensures anonymization and does not lead to an expected drop in model performance because you keep the same feature set while enforcing referential integrity.
D is not correct because deterministic encryption is reversible, and anonymization requires irreversibility. Also, tuning the existing model is not recommended because the model training on the original dataset may have memorized sensitive information.
https://cloud.google.com/dlp/docs/transformations-reference#transformation_methods
https://cloud.google.com/dlp/docs/deidentify-sensitive-data
https://cloud.google.com/blog/products/identity-security/next-onair20-security-week-session-guide
https://cloud.google.com/dlp/docs/creating-job-triggers
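For intuition, key-based hashing can be sketched with Python's standard library; in practice the secret key would come from a key-management system rather than being hard-coded:

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-key-from-your-kms"  # placeholder, never hard-code in practice


def tokenize_pii(value: str) -> str:
    """Returns an irreversible, deterministic token for a PII value.

    The same input always maps to the same token, preserving referential
    integrity across records, but the original value cannot be recovered.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Example: the same email always produces the same token.
assert tokenize_pii("user@example.com") == tokenize_pii("user@example.com")
```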
You need to train an object detection model to identify bounding boxes around Post-it Notes® in an image. Post-it Notes can have a variety of background colors and shapes. You have a dataset with 1000 images with a maximum size of 1.4MB and a CSV file containing annotations stored in Cloud Storage. You want to select a training method that reliably detects Post-it Notes of any relative size in the image and that minimizes the time to train a model. What should you do?
A. Use the Cloud Vision API in Vertex AI with OBJECT_LOCALIZATION type, and filter the detected objects that match the Post-it Note category only.
B. Upload your dataset into Vertex AI. Use Vertex AI AutoML Vision Object Detection with accuracy as the optimization metric, early stopping enabled, and no training budget specified.
C. Write a Python training application that trains a custom vision model on the training set. Autopackage the application, and configure a custom training job in Vertex AI.
D. Write a Python training application that performs transfer learning on a pre-trained neural network. Autopackage the application, and configure a custom training job in Vertex AI.
A is not correct because the object detection capability of the Cloud Vision API confidently detects large objects within the image and is not the best option to reliably detect sticky notes of any relative size in the image.
B is correct because AutoML is a codeless solution that minimizes time to train and develop the model, and it is capable of detecting bounding boxes up to one percent the length of a side of an image.
C is not correct because creating a custom training job requires more development time than using AutoML does. The extra flexibility of custom training is not required because AutoML achieves state-of-the-art performance even on tiny objects (8-32 pixels). Additionally, training a model from scratch is not expected to be as performant as transfer learning.
D is not correct because creating a custom training job requires more development time than using AutoML does. The extra flexibility of custom training is not required because AutoML achieves state-of-the-art performance even on tiny objects (8-32 pixels).
https://cloud.google.com/vertex-ai/docs/start/training-methods
https://cloud.google.com/vision/automl/docs/beginners-guide#is_the_vision_api_or_automl_the_right_tool_for_me
https://cloud.google.com/vertex-ai/docs/datasets/prepare-image
https://cloud.google.com/vision-ai/docs