LINUX ACADEMY Google Cloud Data Engineer - Final Exam Flashcards
As part of your backup plan, you create regular boot-disk snapshots of Compute Engine instances that are running. You want to be able to restore these snapshots using the fewest possible steps for replacement instances. What should you do?
A
Export the snapshots to Cloud Storage. Create images from the exported snapshot files.
B
Use the snapshots to create replacement disks. Use the disks to create instances as needed.
C
Use the snapshots to create replacement instances as needed.
D
Export the snapshots to Cloud Storage. Create disks from the exported snapshot files. Create images from the new disks.
Correct Answer: C
Why is this correct?
You can create a replacement instance directly from a boot-disk snapshot, which requires the fewest possible steps.
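If it helps to see this in code, below is a minimal sketch using the google-cloud-compute Python client to create an instance whose boot disk is initialized from a snapshot. The project, zone, snapshot, instance, and machine type names are placeholders.

```python
from google.cloud import compute_v1

# Hypothetical names for illustration.
project, zone = "my-project", "us-central1-a"

boot_disk = compute_v1.AttachedDisk(
    boot=True,
    auto_delete=True,
    initialize_params=compute_v1.AttachedDiskInitializeParams(
        source_snapshot=f"projects/{project}/global/snapshots/web-boot-snapshot"
    ),
)

instance = compute_v1.Instance(
    name="web-restored",
    machine_type=f"zones/{zone}/machineTypes/e2-medium",
    disks=[boot_disk],
    network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
)

operation = compute_v1.InstancesClient().insert(
    project=project, zone=zone, instance_resource=instance
)
operation.result()  # wait for the replacement instance to be created
```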
You currently have a Bigtable instance you’ve been using for development, running a development instance type and using HDDs for storage. You are ready to upgrade your development instance to a production instance for increased performance. You also want to upgrade your storage to SSDs, as you need maximum performance for your instance. What should you do?
A
Export your Bigtable data into a new instance, and configure the new instance type as production with SSDs
B
Upgrade your development instance to a production instance, and switch your storage type from HDD to SSD.
C
Run parallel instances where one instance is using HDD and the other is using SSD.
D
Use the Bigtable instance sync tool in order to automatically synchronize two different instances, with one having the new storage configuration.
Correct Answer:
A
Export your Bigtable data into a new instance, and configure the new instance type as production with SSDs
Why is this correct?
Since you cannot change the disk type on an existing Bigtable instance, you will need to export/import your Bigtable data into a new instance with the desired storage type. You will need to export to Cloud Storage and then import back into the new Bigtable instance. https://linuxacademy.com/cp/courses/lesson/course/2111/lesson/2/module/208
Incorrect Answer:
D
Use the Bigtable instance sync tool in order to automatically synchronize two different instances, with one having the new storage configuration.
Why is this incorrect?
This is not an option that exists in Bigtable. https://linuxacademy.com/cp/courses/lesson/course/2111/lesson/2/module/208
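In practice the export/import is usually done with the Cloud Storage-based Dataflow templates, but as a rough illustration of moving data between a development (HDD) and production (SSD) instance, here is a naive row-copy sketch with the google-cloud-bigtable Python client. The project, instance, and table names are hypothetical, and the target instance and table are assumed to already exist.

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=True)

# Hypothetical instance and table names; the SSD production instance already exists.
source_table = client.instance("dev-instance").table("events")
target_table = client.instance("prod-ssd-instance").table("events")

batch = []
for source_row in source_table.read_rows():
    new_row = target_table.direct_row(source_row.row_key)
    for family, columns in source_row.cells.items():
        for qualifier, cells in columns.items():
            for cell in cells:
                new_row.set_cell(family, qualifier, cell.value, timestamp=cell.timestamp)
    batch.append(new_row)
    if len(batch) >= 500:          # flush mutations in batches
        target_table.mutate_rows(batch)
        batch = []
if batch:
    target_table.mutate_rows(batch)
```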
Which of these statements do not apply to preemptible worker nodes on Cloud Dataproc? Choose two answers.
A
You must have a max of 2:1 ratio of preemptible to standard workers.
B
Preemptible workers only function as processing nodes.
C
Your cluster can be created with only preemptible workers
D
Preemptible workers can be added after the cluster is created.
Correct Answer:
A
You must have a max of 2:1 ratio of preemptible to standard workers.
Why is this correct?
There is no ratio requirement. However, be aware that preemptible workers can be reclaimed at any time, so you will want some standard workers that are always available.
Video for reference: Configure Dataproc Cluster and Submit Job – Part 1
Correct Answer:
C
Your cluster can be created with only preemptible workers
Why is this correct?
You must have at least one standard worker in a cluster.
Video for reference: Configure Dataproc Cluster and Submit Job – Part 1
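To make the standard-versus-preemptible split concrete, here is a hedged sketch of creating a cluster with both worker types using the google-cloud-dataproc Python client. The project, region, cluster name, machine types, and worker counts are placeholders, and the exact field used to mark secondary workers as preemptible may vary by client version.

```python
from google.cloud import dataproc_v1

project, region = "my-project", "us-central1"

cluster = {
    "project_id": project,
    "cluster_name": "etl-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        # Standard (primary) workers persist for the life of the cluster and host HDFS.
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Preemptible (secondary) workers only process data and can be reclaimed at any time.
        "secondary_worker_config": {"num_instances": 4, "preemptibility": "PREEMPTIBLE"},
    },
}

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)
client.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster}
).result()
```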
You are building storage for files for a data pipeline on Google Cloud. You want to support JSON files. The schema of these files will occasionally change. Your analyst teams will run aggregate ANSI SQL queries on this data. What should you do?
A
Use Cloud Storage for storage. Link data as permanent tables in BigQuery and turn on the Automatically detect option in the Schema section of BigQuery.
B
Use BigQuery for storage. Provide format files for data load. Update the format files as needed.
C
Use BigQuery for storage. Select Automatically detect in the Schema section.
D
Use Cloud Storage for storage. Link data as temporary tables in BigQuery and turn on the Automatically detect option in the Schema section of BigQuery.
Correct Answer:
C
Use BigQuery for storage. Select Automatically detect in the Schema section.
Why is this correct?
This is correct because of the requirement to support occasionally (schema) changing JSON files and aggregate ANSI SQL queries; you need to use BigQuery, and it is quickest to use Automatically detect for schema changes. https://linuxacademy.com/cp/courses/lesson/course/2238/lesson/3/module/208
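As a minimal sketch of what this looks like with the BigQuery Python client, the load job below enables schema auto-detection for newline-delimited JSON and allows new fields to be added when the schema changes. The bucket path and destination table are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # the "Automatically detect" schema option
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # Let occasional schema changes add new fields to the table.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.json",    # hypothetical source files
    "my-project.analytics.events",     # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # analysts can then run standard (ANSI) SQL against the table
```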
Your online shopping company needs to know when a user has not interacted with the site in 30 minutes. They need the website to alert the user once they have been idle for too long. You use Cloud Dataflow to process the interaction events and decide if an alert should be sent. How should you design the pipeline?
A
Implement a session window with a gap time duration of 30 minutes.
B
Implement a fixed-time window with a duration of 30 minutes.
C
Implement a global window with a time-based trigger with a delay of 30 minutes.
D
Implement a sliding time window with a duration of 30 minutes.
Correct Answer:
A
Implement a session window with a gap time duration of 30 minutes.
Why is this correct?
You need a window to be based around the last activity event, which a session window provides.
Incorrect Answer:
D
Why is this incorrect?
A sliding window is evaluated at fixed intervals regardless of user activity. You need a window based around the last activity event, which a session window provides.
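Below is a minimal Apache Beam (Python SDK) sketch of a session window with a 30-minute gap, which is what Dataflow runs. The topic name and the helper functions are hypothetical stand-ins for your own parsing and alerting logic.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window


def extract_user(message: bytes) -> str:
    # Hypothetical: the user ID is the first comma-separated field of the payload.
    return message.decode("utf-8").split(",")[0]


def emit_alert(keyed_session):
    user, events = keyed_session
    print(f"User {user} idle; session closed after {len(list(events))} events")


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/interactions")
        | "KeyByUser" >> beam.Map(lambda msg: (extract_user(msg), msg))
        | "SessionWindow" >> beam.WindowInto(window.Sessions(30 * 60))  # 30-minute gap duration
        | "GroupPerSession" >> beam.GroupByKey()
        | "AlertWhenSessionCloses" >> beam.Map(emit_alert)
    )
```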
What is the purpose of hyperparameters in a machine learning training model?
A
Form the basis of labels on your training data.
B
Hyperparameters adjust the training process itself.
C
Train for a regression machine learning problem.
D
They help your model learn from the training data.
Correct Answer:
B
Hyperparameters adjust the training process itself.
Why is this correct?
Hyperparameters such as the learning rate and the number of hidden layers are variables that adjust the training process itself; they are not learned from the training data. https://linuxacademy.com/cp/courses/lesson/course/2246/lesson/2/module/208
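A short Keras sketch makes the distinction visible: the values set up front are hyperparameters, while the layer weights are the parameters learned from data. The specific values and layer sizes are illustrative only.

```python
import tensorflow as tf

# Hyperparameters: chosen before training and tuned by you; they shape the
# training process itself and are not learned from the data.
LEARNING_RATE = 0.001
HIDDEN_UNITS = [64, 32]
BATCH_SIZE = 128
EPOCHS = 20

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(units, activation="relu") for units in HIDDEN_UNITS]
    + [tf.keras.layers.Dense(1)]
)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
    loss="mse",
)
# Parameters (the layer weights) are what the model learns from the training data:
# model.fit(features, labels, epochs=EPOCHS, batch_size=BATCH_SIZE)
```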
You are developing an application that will only recognize and tag specific business-to-business product logos in images. You do not have an extensive background working with machine learning models but need to get your application working. What is the current best method to accomplish this task?
A
Create a custom machine learning model to recognize specific logos in photos, then train it on AI Platform.
B
Use the Cloud Vision API to recognize logos in the images.
C
Use the AutoML Vision service to train a custom model using the Vision API
D
Use the Cloud Vision API to recognize all logos in images, then use the Cloud Natural Language API to recognize specific logos by name.
Correct Answer:
C
Use the AutoML Vision service to train a custom model using the Vision API
Why is this correct?
The newly added AutoML services allow you to train custom image (and other) models using Google’s pre-trained APIs as a base. Training a custom model on AI Platform would also work, but the AutoML route requires less manual model overhead. https://linuxacademy.com/cp/courses/lesson/course/2248/lesson/1/module/208
Incorrect Answer:
B
Use the Cloud Vision API to recognize logos in the images.
Why is this incorrect?
Cloud Vision API can recognize logos; however, to narrow it down to specific logos by business category, you would need to train a custom machine learning model. https://linuxacademy.com/cp/courses/lesson/course/2248/lesson/1/module/208
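For contrast, here is what the generic logo detection call looks like with the google-cloud-vision Python client; it only returns logos the pre-trained model already knows, which is why a custom AutoML Vision model is needed for niche business-to-business logos. The image filename is hypothetical.

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("storefront.jpg", "rb") as image_file:  # hypothetical local image
    image = vision.Image(content=image_file.read())

response = client.logo_detection(image=image)
for logo in response.logo_annotations:
    # Only well-known, pre-trained logos are returned here.
    print(logo.description, logo.score)
```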
You are setting up multiple MySQL databases on Compute Engine. You need to collect logs from your MySQL applications for audit purposes. How should you approach this?
A
Configure Cloud Composer to monitor and report on instance performance metrics.
B
Install the Stackdriver Logging agent on your database instances and configure the fluentd plugin to read and export your MySQL logs into Stackdriver Logging.
C
Install the Stackdriver Monitoring agent on your instances, configure the MySQL plugin, and export logs to Stackdriver Monitoring.
D
Configure Stackdriver Logging to natively monitor application logs, which will appear in Stackdriver Logging.
Correct Answer:
B
Install the Stackdriver Logging agent on your database instances and configure the fluentd plugin to read and export your MySQL logs into Stackdriver Logging.
Why is this correct?
The Stackdriver Logging agent requires the fluentd plugin to be configured to read logs from your database application.
Incorrect Answer:
D
Configure Stackdriver Logging to natively monitor application logs, which will appear in Stackdriver Logging.
Why is this incorrect?
Stackdriver Logging requires the logging agent to be installed and configured to read application logs.
You are building a data pipeline on Google Cloud. You need to select services that will host a deep neural network machine learning model also hosted on Google Cloud. You also need to monitor and run jobs that could occasionally fail. What should you do?
A
Use the Cloud Machine Learning Engine to host your model. Monitor the status of the Jobs object for ‘failed’ job states.
B
Use the Cloud Machine Learning Engine to host your model. Monitor the status of the Operation object for ‘error’ results.
C
Use a Kubernetes Engine cluster to host your model. Monitor the status of the Jobs object for ‘failed’ job states.
D
Use a Kubernetes Engine cluster to host your model. Monitor the status of the Operation object for ‘error’ results.
Correct Answer:
A
Use the Cloud Machine Learning Engine to host your model. Monitor the status of the Jobs object for ‘failed’ job states.
Why is this correct?
Cloud Machine Learning Engine is the correct service for deep neural network models. You would correctly monitor Jobs for failures. https://linuxacademy.com/cp/courses/lesson/course/2247/lesson/1/module/208
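A hedged sketch of that monitoring step, using the Google API discovery client against the AI Platform (ML Engine) Training API: a training job's resource carries a state field that can be checked for FAILED. The project and job names are hypothetical.

```python
from googleapiclient import discovery

ml = discovery.build("ml", "v1")

# Hypothetical project and job IDs.
job_name = "projects/my-project/jobs/train_dnn_v1"

job = ml.projects().jobs().get(name=job_name).execute()
if job.get("state") == "FAILED":
    print("Training job failed:", job.get("errorMessage"))
```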
You have data stored in a Cloud Storage bucket and also in a BigQuery dataset. You need to secure the data and provide 3 different types of access levels for your Google Cloud Platform users: administrator, read/write, and read-only. You want to follow Google-recommended practices. What should you do?
A
At the Organization level, add your administrator user accounts to the Owner role, add your read/write user accounts to the Editor role, and add your read-only user accounts to the Viewer role.
B
At the Project level, add your administrator user accounts to the Owner role, add your read/write user accounts to the Editor role, and add your read-only user accounts to the Viewer role.
C
Create 3 custom IAM roles with appropriate policies for the access levels needed for Cloud Storage and BigQuery. Add your users to the appropriate roles.
D
Use the appropriate pre-defined IAM roles for each of the access levels needed for Cloud Storage and BigQuery. Add your users to those roles for each of the services.
Correct Answer:
D
Use the appropriate pre-defined IAM roles for each of the access levels needed for Cloud Storage and BigQuery. Add your users to those roles for each of the services.
Why is this correct?
The principle of least privilege favors using pre-defined roles for granular access. It is also best practice to prefer pre-defined roles over custom roles with associated policies when they match your requirements.
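As one example of granting pre-defined roles, the sketch below binds Cloud Storage roles on a bucket with the google-cloud-storage Python client; BigQuery has matching pre-defined roles (roles/bigquery.admin, roles/bigquery.dataEditor, roles/bigquery.dataViewer) that can be granted the same way on the dataset or project. The bucket name and Google Groups are hypothetical.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("analytics-data")  # hypothetical bucket

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {"role": "roles/storage.admin", "members": {"group:data-admins@example.com"}}        # administrator
)
policy.bindings.append(
    {"role": "roles/storage.objectAdmin", "members": {"group:data-editors@example.com"}}  # read/write
)
policy.bindings.append(
    {"role": "roles/storage.objectViewer", "members": {"group:data-viewers@example.com"}} # read-only
)
bucket.set_iam_policy(policy)
```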
You have a long-running, streaming Dataflow pipeline that you need to shut down. You do not need to preserve data currently in the processing pipeline and need it shut down as soon as possible. Which shutdown option should you use to complete the shutdown process?
A
Graceful shutdown
B
Cancel
C
Stop
D
Drain
Correct Answer:
B
Cancel
Why is this correct?
Cancel shuts the pipeline down immediately without allowing buffered data to finish processing. https://linuxacademy.com/cp/courses/lesson/course/2243/lesson/5/module/208
Incorrect Answer:
C
Stop
Why is this incorrect?
This is not a valid option. https://linuxacademy.com/cp/courses/lesson/course/2243/lesson/5/module/208
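Programmatically, cancelling (or draining) corresponds to updating the job's requested state through the Dataflow REST API, as in this hedged sketch using the discovery client. The project, region, and job ID are placeholders.

```python
from googleapiclient import discovery

dataflow = discovery.build("dataflow", "v1b3")

# Hypothetical project, region, and job ID.
dataflow.projects().locations().jobs().update(
    projectId="my-project",
    location="us-central1",
    jobId="2024-05-01_12_00_00-1234567890",
    body={"requestedState": "JOB_STATE_CANCELLED"},  # "JOB_STATE_DRAINED" would drain instead
).execute()
```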
What open source software is Cloud Pub/Sub most similar to?
A
Apache Beam
B
Apache Kafka
C
HBase
D
Apache Hadoop
Correct Answer:
B
Apache Kafka
Why is this correct?
Kafka is the open source streaming ingest framework you would use to build an equivalent messaging pipeline manually. https://linuxacademy.com/cp/courses/lesson/course/2241/lesson/2/module/208
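The analogy shows up directly in the client model: publishing to a Pub/Sub topic is much like a Kafka producer writing to a topic, with subscribers pulling messages downstream. A minimal sketch with the google-cloud-pubsub Python client follows; the project and topic names are hypothetical.

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "clickstream")  # hypothetical topic

# Attributes (here, user_id) ride along with the message payload, much like Kafka headers.
future = publisher.publish(topic_path, b'{"event": "page_view"}', user_id="42")
print("Published message", future.result())  # message ID once the publish succeeds
```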
You have hundreds of IoT devices that generate 1 TB of streaming data per day. Due to latency, messages will often be delayed compared to when they were generated. You must be able to account for data arriving late within your processing pipeline. What should you do?
A
Use Cloud SQL to process the delayed messages.
B
Enable your IoT devices to generate a timestamp when sending messages. Use Cloud Dataflow to process messages, and use windows, watermarks (timestamp), and triggers to process late data.
C
Use SQL queries in BigQuery to analyze data by timestamp.
D
Enable your IoT devices to generate a timestamp when sending messages. Use Cloud Pub/Sub to process messages by timestamp and fix out of order issues.
Correct Answer:
B
Enable your IoT devices to generate a timestamp when sending messages. Use Cloud Dataflow to process messages, and use windows, watermarks (timestamp), and triggers to process late data.
Why is this correct?
Dataflow is the service that handles out-of-order messages, using windows, watermarks, and triggers. https://linuxacademy.com/cp/courses/lesson/course/2243/lesson/3/module/208
Incorrect Answer:
D
Enable your IoT devices to generate a timestamp when sending messages. Use Cloud Pub/Sub to process messages by timestamp and fix out of order issues.
Why is this incorrect?
Pub/Sub does not guarantee message order; you would use Dataflow to process out-of-order messages by timestamp. https://linuxacademy.com/cp/courses/lesson/course/2243/lesson/3/module/208
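Here is a hedged Apache Beam (Python SDK) sketch of the windows/watermarks/triggers combination for late-arriving data. The topic, the "ts" timestamp attribute, the window size, the allowed lateness, and the device-id keying are all illustrative choices, not requirements.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger, window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # Hypothetical topic; the device-generated "ts" attribute becomes the event timestamp.
        | "ReadIoT" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/iot-events", timestamp_attribute="ts")
        | "WindowWithLateness" >> beam.WindowInto(
            window.FixedWindows(60),                                     # 1-minute event-time windows
            trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),  # re-fire for each late element
            allowed_lateness=10 * 60,                                    # accept data up to 10 minutes late
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
        )
        | "KeyByDevice" >> beam.Map(lambda msg: (msg[:8], 1))            # hypothetical device-id prefix
        | "CountPerDevice" >> beam.CombinePerKey(sum)
    )
```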
You work at a very large organization that has a very large analyst team. You use the default pricing model for BigQuery. During heavy usage, your analyst group occasionally runs out of the 2,000 slots available for BigQuery jobs. You do not want to create additional projects for the sole purpose of increasing slot count. What can you do to resolve this?
A
You must create an additional project to increase your slot count, then spread the BigQuery loads across both projects.
B
Force-enable the ‘use cached results’ option for all available queries.
C
Switch to flat rate pricing to enable a higher total slot quota for your project.
D
Use the quotas page to increase your BigQuery slot count to 3000 as needed.
Correct Answer:
C
Switch to flat rate pricing to enable a higher total slot quota for your project.
Why is this correct?
Flat-rate pricing lets you purchase a dedicated number of slots for your project, higher than the default 2,000-slot on-demand allocation.
You are an administrator for several organizations in the same company. Each organization has data in its own BigQuery table within a single project. For application access reasons, all of the tables must remain in the same project. You think each organization should be able to view and run queries against their own data without exposing the data of other organizations to unauthorized viewers. What should you recommend?
A
You must separate the tables by project, and use a service account in your application to access data in each project. Give out project-wide roles to each organization.
B
Place the tables in a single dataset, and apply IAM roles to each table, limiting access per table to each organization.
C
Create a separate dataset for each organization in the same project. Place each organization’s table in each dataset. Restrict access to the organization’s dataset to only that company, from which they can view their table but no one else’s.
D
Place all data in a single table, create authorized views restricting access by row based on the SESSION_USER() field. Add that same SESSION_USER() field with the same email addresses according to which company needs access to which roles.
Correct Answer:
C
Create a separate dataset for each organization in the same project. Place each organization’s table in each dataset. Restrict access to the organization’s dataset to only that company, from which they can view their table but no one else’s.
Why is this correct?
You can assign roles at the dataset level. Placing tables in different datasets allows you to limit access per dataset. https://linuxacademy.com/cp/courses/lesson/course/2238/lesson/1/module/208
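A minimal sketch of granting dataset-level access with the BigQuery Python client follows; the project, dataset, and group names are hypothetical, and each organization's dataset would get its own entry.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset holding one organization's table.
dataset = client.get_dataset("shared-project.org_alpha")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",                                  # query/view access only
        entity_type="groupByEmail",
        entity_id="org-alpha-analysts@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```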
You are training a machine learning model to predict the likelihood of rain based on an available dataset of weather data. In reviewing your input data, the amount of humidity in the air has a very strong influence on the chance of rain, especially compared to less relevant data. How can you incorporate this more important data so that it properly influences the model?
A
Create a feature from the humidity data point, and use L2 regularization to optimize the model.
B
Tune your hyperparameters to give greater weighting to the humidity feature over others.
C
Create a feature from the humidity data point, and use L1 regularization to optimize the model.
D
Reduce your epochs except for humidity features.
Correct Answer:
C
Create a feature from the humidity data point, and use L1 regularization to optimize the model.
Why is this correct?
L1 regularization is able to reduce the weights of less important features to zero or near zero.
A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression. The key difference between the two is the penalty term. Ridge Regression adds the "squared magnitude" of the coefficients as a penalty term to the loss function (loss + λ·Σβj²). If lambda is zero, this reduces to ordinary least squares (OLS); if lambda is very large, the penalty carries too much weight and leads to under-fitting, so how lambda is chosen matters. This technique works very well for avoiding over-fitting. Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the "absolute value of the magnitude" of the coefficients as the penalty term (loss + λ·Σ|βj|). Again, a lambda of zero gives back OLS, while a very large value drives coefficients to zero and under-fits. The key difference between these techniques is that Lasso shrinks the coefficients of less important features to zero, removing some features altogether, which makes it useful for feature selection when you have a huge number of features. Traditional methods such as cross-validation and stepwise regression handle overfitting and feature selection well with a small set of features, but these regularization techniques are a great alternative when dealing with a large set of features.
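As a minimal illustration of applying an L1 penalty in practice, the Keras sketch below adds an L1 kernel regularizer to a small rain-prediction network; the layer sizes and the 0.01 penalty strength are illustrative, not tuned values.

```python
import tensorflow as tf

# Each weather input (humidity, pressure, temperature, ...) becomes a feature; the L1
# penalty pushes the weights of less useful features toward zero while the strongly
# predictive humidity feature keeps a meaningful weight.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        16,
        activation="relu",
        kernel_regularizer=tf.keras.regularizers.l1(0.01),  # L1 penalty: lambda * sum(|w|)
    ),
    tf.keras.layers.Dense(1, activation="sigmoid"),          # probability of rain
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```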
You are building a machine learning model to predict the number of lightning strikes during a storm. Your model has thousands of input features to train on. You want to improve the training speed of the model by removing features, but do not want to negatively affect your model’s accuracy. What action should you take?
A
Combine highly co-dependent and redundant features into one representative feature.
B
Implement L2 regularization to automatically ‘prune’ unneeded features
C
Remove the features that have null values for the majority of your records.
D
Remove features that have high correlation to your output labels.
Correct Answer:
A
Combine highly co-dependent and redundant features into one representative feature.
Why is this correct?
Combining co-dependent and redundant features allows you to reduce the total number of features trained without sacrificing accuracy.
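A rough pandas sketch of that idea follows: find highly correlated (co-dependent) feature pairs and collapse each into one representative feature, here simply their mean. The file name, the 0.95 correlation threshold, and the assumption of all-numeric columns are illustrative.

```python
import pandas as pd

df = pd.read_csv("storm_features.csv")  # hypothetical training data, numeric columns only

# Find highly correlated (co-dependent) feature pairs.
corr = df.corr().abs()
redundant_pairs = [
    (a, b)
    for a in corr.columns
    for b in corr.columns
    if a < b and corr.loc[a, b] > 0.95
]

# Collapse each redundant pair into one representative feature (here, their mean).
for a, b in redundant_pairs:
    if a in df.columns and b in df.columns:
        df[f"{a}_{b}_combined"] = df[[a, b]].mean(axis=1)
        df = df.drop(columns=[a, b])
```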
When training a machine learning model on AI Platform on a distributed scale tier, what types of machines are part of that distributed resource? (Choose all that apply)
A
Host
B
Worker
C
Master
D
Parameter server
Correct Answer:
B
Worker
Why is this correct?
You can have multiple workers, which divide up the work of training the model. https://linuxacademy.com/cp/courses/lesson/course/2247/lesson/1/module/208
Correct Answer:
C
Master
Why is this correct?
You have a single master instance per training job. https://linuxacademy.com/cp/courses/lesson/course/2247/lesson/1/module/208
Correct Answer:
D
Parameter server
Why is this correct?
Parameter servers coordinate shared model states between the workers. https://linuxacademy.com/cp/courses/lesson/course/2247/lesson/1/module/208
Incorrect Answer:
A
Host
Why is this incorrect?
This is not one of the scale tier machine types. https://linuxacademy.com/cp/courses/lesson/course/2247/lesson/1/module/208
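To see where master, worker, and parameter server show up, here is a hedged sketch of the training input you might submit to AI Platform Training for a CUSTOM scale tier (for example via a job config passed to the jobs.create call). Machine types, counts, bucket, and module names are placeholders.

```python
# Hypothetical training job spec for a CUSTOM scale tier on AI Platform Training.
training_input = {
    "scaleTier": "CUSTOM",
    "masterType": "n1-standard-8",           # exactly one master coordinates the job
    "workerType": "n1-standard-8",
    "workerCount": 4,                         # workers divide up the training work
    "parameterServerType": "n1-standard-4",
    "parameterServerCount": 2,                # parameter servers hold the shared model state
    "packageUris": ["gs://my-bucket/trainer-0.1.tar.gz"],
    "pythonModule": "trainer.task",
    "region": "us-central1",
}
```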
Your company needs to run analytics on their incoming inventory data. They need to use their existing Hadoop workloads to perform this task. What two steps must be performed to accomplish this? (Choose two answers)
A
Stream inventory data to Cloud Pub/Sub, process data with Cloud Dataflow into Bigtable and Cloud Storage.
B
Stream from Cloud Pub/Sub into Cloud Dataproc, which can then place relevant data in the appropriate storage location
C
Use Spark to accept the streaming ingest on the Dataproc cluster, and then process jobs on HDFS.
D
Connect Cloud Dataproc to Bigtable and Cloud Storage, running analytics on the data in both services.
Correct Answer:
B
Stream from Cloud Pub/Sub into Cloud Dataproc, which can then place relevant data in the appropriate storage location
Why is this correct?
Dataproc can connect to Pub/Sub for the streaming ingest, and it can then process the data and place it in the correct storage location.
Correct Answer:
D
Connect Cloud Dataproc to Bigtable and Cloud Storage, running analytics on the data in both services.
Why is this correct?
Dataproc can natively connect to both services and can run analytics on both.
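As one concrete example of that native connectivity, a PySpark job running on the Dataproc cluster can read Cloud Storage paths directly through the built-in Cloud Storage connector (Bigtable is reachable similarly through its HBase-compatible client). The bucket path and column name below are hypothetical.

```python
# PySpark job submitted to the Dataproc cluster: the built-in Cloud Storage
# connector lets Spark read gs:// paths directly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inventory-analytics").getOrCreate()

inventory = spark.read.json("gs://my-bucket/inventory/*.json")  # hypothetical bucket
inventory.groupBy("warehouse").count().show()
```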
You are developing an application on Google Cloud that will label famous landmarks in users’ photos. You are under competitive pressure to develop the predictive model quickly. You need to keep service costs low. What should you do?
A
Build and train a classification model with TensorFlow. Deploy the model using the Cloud Machine Learning Engine. Inspect the generated MID values to supply the image labels.
B
Build an application that calls the Cloud Vision API. Pass client image locations as base64-encoded strings.
C
Build an application that calls the Cloud Vision API. Inspect the generated MID values to supply the image labels.
D
Build and train a classification model with TensorFlow. Deploy the model using the Cloud Machine Learning Engine. Pass client image locations as base64-encoded strings.
Correct Answer:
B
Build an application that calls the Cloud Vision API. Pass client image locations as base64-encoded strings.
Why is this correct?
Cloud Vision API supports the ability to generate landmark labels from photos. You would want to pass along the images as base64 encoded strings, not MID. https://linuxacademy.com/cp/courses/lesson/course/2248/lesson/1/module/208
Incorrect Answer:
C
Build an application that calls the Cloud Vision API. Inspect the generated MID values to supply the image labels.
Why is this incorrect?
You would want to pass along base64 encoded strings, not MID. https://linuxacademy.com/cp/courses/lesson/course/2248/lesson/1/module/208
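A minimal sketch of the landmark call with the google-cloud-vision Python client follows; the filename is hypothetical. The Python client takes the raw image bytes, and over the underlying REST API that content travels as a base64-encoded string.

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("user_photo.jpg", "rb") as image_file:  # hypothetical client image
    content = image_file.read()

# Raw bytes here; the REST API representation of this content is base64-encoded.
image = vision.Image(content=content)
response = client.landmark_detection(image=image)
for landmark in response.landmark_annotations:
    print(landmark.description, landmark.score)
```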