LINUX ACADEMY Google Cloud Data Engineer - Final Exam Flashcards
As part of your backup plan, you create regular boot-disk snapshots of Compute Engine instances that are running. You want to be able to restore these snapshots using the fewest possible steps for replacement instances. What should you do?
A
Export the snapshots to Cloud Storage. Create images from the exported snapshot files.
B
Use the snapshots to create replacement disks. Use the disks to create instances as needed.
C
Use the snapshots to create replacement instances as needed.
D
Export the snapshots to Cloud Storage. Create disks from the exported snapshot files. Create images from the new disks.
Correct Answer: C
Why is this correct?
You can create a replacement instance directly from a boot-disk snapshot, which requires the fewest possible steps.
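If it helps to see this in code, below is a minimal sketch using the google-cloud-compute Python client to create an instance whose boot disk is initialized from a snapshot. The project, zone, snapshot, instance, and machine type names are placeholders.

```python
from google.cloud import compute_v1

# Hypothetical names for illustration.
project, zone = "my-project", "us-central1-a"

boot_disk = compute_v1.AttachedDisk(
    boot=True,
    auto_delete=True,
    initialize_params=compute_v1.AttachedDiskInitializeParams(
        source_snapshot=f"projects/{project}/global/snapshots/web-boot-snapshot"
    ),
)

instance = compute_v1.Instance(
    name="web-restored",
    machine_type=f"zones/{zone}/machineTypes/e2-medium",
    disks=[boot_disk],
    network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
)

operation = compute_v1.InstancesClient().insert(
    project=project, zone=zone, instance_resource=instance
)
operation.result()  # wait for the replacement instance to be created
```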
You currently have a Bigtable instance you’ve been using for development, running a development instance type and using HDDs for storage. You are ready to upgrade your development instance to a production instance for increased performance. You also want to upgrade your storage to SSDs, as you need maximum performance for your instance. What should you do?
A
Export your Bigtable data into a new instance, and configure the new instance type as production with SSDs
B
Upgrade your development instance to a production instance, and switch your storage type from HDD to SSD.
C
Run parallel instances where one instance is using HDD and the other is using SSD.
D
Use the Bigtable instance sync tool in order to automatically synchronize two different instances, with one having the new storage configuration.
Correct Answer:
A
Export your Bigtable data into a new instance, and configure the new instance type as production with SSDs
Why is this correct?
Since you cannot change the disk type on an existing Bigtable instance, you will need to export/import your Bigtable data into a new instance with the desired storage type. You will need to export to Cloud Storage and then import back into the new Bigtable instance. https://linuxacademy.com/cp/courses/lesson/course/2111/lesson/2/module/208
Incorrect Answer:
D
Use the Bigtable instance sync tool in order to automatically synchronize two different instances, with one having the new storage configuration.
Why is this incorrect?
This is not an option that exists in Bigtable. https://linuxacademy.com/cp/courses/lesson/course/2111/lesson/2/module/208
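In practice the export/import is usually done with the Cloud Storage-based Dataflow templates, but as a rough illustration of moving data between a development (HDD) and production (SSD) instance, here is a naive row-copy sketch with the google-cloud-bigtable Python client. The project, instance, and table names are hypothetical, and the target instance and table are assumed to already exist.

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=True)

# Hypothetical instance and table names; the SSD production instance already exists.
source_table = client.instance("dev-instance").table("events")
target_table = client.instance("prod-ssd-instance").table("events")

batch = []
for source_row in source_table.read_rows():
    new_row = target_table.direct_row(source_row.row_key)
    for family, columns in source_row.cells.items():
        for qualifier, cells in columns.items():
            for cell in cells:
                new_row.set_cell(family, qualifier, cell.value, timestamp=cell.timestamp)
    batch.append(new_row)
    if len(batch) >= 500:          # flush mutations in batches
        target_table.mutate_rows(batch)
        batch = []
if batch:
    target_table.mutate_rows(batch)
```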
Which of these statements do not apply to preemptible worker nodes on Cloud Dataproc? Choose two answers.
A
You must have a max of 2:1 ratio of preemptible to standard workers.
B
Preemptible workers only function as processing nodes.
C
Your cluster can be created with only preemptible workers
D
Preemptible workers can be added after the cluster is created.
Correct Answer:
A
You must have a max of 2:1 ratio of preemptible to standard workers.
Why is this correct?
There is no ratio requirement. However, be aware that preemptible workers can be reclaimed at any time, so you will want some standard workers that are always available.
Video for reference: Configure Dataproc Cluster and Submit Job – Part 1
Correct Answer:
C
Your cluster can be created with only preemptible workers
Why is this correct?
You must have at least one standard worker in a cluster.
Video for reference: Configure Dataproc Cluster and Submit Job – Part 1
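To make the standard-versus-preemptible split concrete, here is a hedged sketch of creating a cluster with both worker types using the google-cloud-dataproc Python client. The project, region, cluster name, machine types, and worker counts are placeholders, and the exact field used to mark secondary workers as preemptible may vary by client version.

```python
from google.cloud import dataproc_v1

project, region = "my-project", "us-central1"

cluster = {
    "project_id": project,
    "cluster_name": "etl-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        # Standard (primary) workers persist for the life of the cluster and host HDFS.
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Preemptible (secondary) workers only process data and can be reclaimed at any time.
        "secondary_worker_config": {"num_instances": 4, "preemptibility": "PREEMPTIBLE"},
    },
}

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)
client.create_cluster(
    request={"project_id": project, "region": region, "cluster": cluster}
).result()
```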
You are building storage for files for a data pipeline on Google Cloud. You want to support JSON files. The schema of these files will occasionally change. Your analyst teams will run aggregate ANSI SQL queries on this data. What should you do?
A
Use Cloud Storage for storage. Link data as permanent tables in BigQuery and turn on the Automatically detect option in the Schema section of BigQuery.
B
Use BigQuery for storage. Provide format files for data load. Update the format files as needed.
C
Use BigQuery for storage. Select Automatically detect in the Schema section.
D
Use Cloud Storage for storage. Link data as temporary tables in BigQuery and turn on the Automatically detect option in the Schema section of BigQuery.
Correct Answer:
C
Use BigQuery for storage. Select Automatically detect in the Schema section.
Why is this correct?
This is correct because of the requirement to support occasionally (schema) changing JSON files and aggregate ANSI SQL queries; you need to use BigQuery, and it is quickest to use Automatically detect for schema changes. https://linuxacademy.com/cp/courses/lesson/course/2238/lesson/3/module/208
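As a minimal sketch of what this looks like with the BigQuery Python client, the load job below enables schema auto-detection for newline-delimited JSON and allows new fields to be added when the schema changes. The bucket path and destination table are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # the "Automatically detect" schema option
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # Let occasional schema changes add new fields to the table.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.json",    # hypothetical source files
    "my-project.analytics.events",     # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # analysts can then run standard (ANSI) SQL against the table
```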
Your online shopping company needs to know when a user has not interacted with the site in 30 minutes. They need the website to alert the user once they have been idle for too long. You use Cloud Dataflow to process the interaction events and decide if an alert should be sent. How should you design the pipeline?
A
Implement a session window with a gap time duration of 30 minutes.
B
Implement a fixed-time window with a duration of 30 minutes.
C
Implement a global window with a time-based trigger with a delay of 30 minutes.
D
Implement a sliding time window with a duration of 30 minutes.
Correct Answer:
A
Implement a session window with a gap time duration of 30 minutes.
Why is this correct?
You need a window to be based around the last activity event, which a session window provides.
Incorrect Answer:
D
Why is this incorrect?
A sliding window is evaluated at fixed intervals regardless of user activity. You need a window based around the last activity event, which a session window provides.
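Below is a minimal Apache Beam (Python SDK) sketch of a session window with a 30-minute gap, which is what Dataflow runs. The topic name and the helper functions are hypothetical stand-ins for your own parsing and alerting logic.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window


def extract_user(message: bytes) -> str:
    # Hypothetical: the user ID is the first comma-separated field of the payload.
    return message.decode("utf-8").split(",")[0]


def emit_alert(keyed_session):
    user, events = keyed_session
    print(f"User {user} idle; session closed after {len(list(events))} events")


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/interactions")
        | "KeyByUser" >> beam.Map(lambda msg: (extract_user(msg), msg))
        | "SessionWindow" >> beam.WindowInto(window.Sessions(30 * 60))  # 30-minute gap duration
        | "GroupPerSession" >> beam.GroupByKey()
        | "AlertWhenSessionCloses" >> beam.Map(emit_alert)
    )
```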
What is the purpose of hyperparameters in a machine learning training model?
A
Form the basis of labels on your training data.
B
Hyperparameters adjust the training process itself.
C
Train for a regression machine learning problem.
D
They help your model learn from the training data.
Correct Answer:
B
Hyperparameters adjust the training process itself.
Why is this correct?
Hyperparameters such as the learning rate and the number of hidden layers are variables that adjust the training process itself; they are not learned from the training data. https://linuxacademy.com/cp/courses/lesson/course/2246/lesson/2/module/208
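A short Keras sketch makes the distinction visible: the values set up front are hyperparameters, while the layer weights are the parameters learned from data. The specific values and layer sizes are illustrative only.

```python
import tensorflow as tf

# Hyperparameters: chosen before training and tuned by you; they shape the
# training process itself and are not learned from the data.
LEARNING_RATE = 0.001
HIDDEN_UNITS = [64, 32]
BATCH_SIZE = 128
EPOCHS = 20

model = tf.keras.Sequential(
    [tf.keras.layers.Dense(units, activation="relu") for units in HIDDEN_UNITS]
    + [tf.keras.layers.Dense(1)]
)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
    loss="mse",
)
# Parameters (the layer weights) are what the model learns from the training data:
# model.fit(features, labels, epochs=EPOCHS, batch_size=BATCH_SIZE)
```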
You are developing an application that will only recognize and tag specific business-to-business product logos in images. You do not have an extensive background working with machine learning models but need to get your application working. What is the current best method to accomplish this task?
A
Create a custom machine learning model to recognize specific logos in photos, then train it on AI Platform.
B
Use the Cloud Vision API to recognize logos in the images.
C
Use the AutoML Vision service to train a custom model using the Vision API
D
Use the Cloud Vision API to recognize all logos in images, then use the Cloud Natural Language API to recognize specific logos by name.
Correct Answer:
C
Use the AutoML Vision service to train a custom model using the Vision API
Why is this correct?
The newly added AutoML services allow you to train custom image (and other) models using Google’s pre-trained APIs as a base. Training a custom model on AI Platform would also work, but the AutoML route requires less manual model overhead. https://linuxacademy.com/cp/courses/lesson/course/2248/lesson/1/module/208
Incorrect Answer:
B
Use the Cloud Vision API to recognize logos in the images.
Why is this incorrect?
Cloud Vision API can recognize logos; however, to narrow it down to specific logos by business category, you would need to train a custom machine learning model. https://linuxacademy.com/cp/courses/lesson/course/2248/lesson/1/module/208
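For contrast, here is what the generic logo detection call looks like with the google-cloud-vision Python client; it only returns logos the pre-trained model already knows, which is why a custom AutoML Vision model is needed for niche business-to-business logos. The image filename is hypothetical.

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("storefront.jpg", "rb") as image_file:  # hypothetical local image
    image = vision.Image(content=image_file.read())

response = client.logo_detection(image=image)
for logo in response.logo_annotations:
    # Only well-known, pre-trained logos are returned here.
    print(logo.description, logo.score)
```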
You are setting up multiple MySQL databases on Compute Engine. You need to collect logs from your MySQL applications for audit purposes. How should you approach this?
A
Configure Cloud Composer to monitor and report on instance performance metrics.
B
Install the Stackdriver Logging agent on your database instances and configure the fluentd plugin to read and export your MySQL logs into Stackdriver Logging.
C
Install the Stackdriver Monitoring agent on your instances, configure the MySQL plugin, and export logs to Stackdriver Monitoring.
D
Configure Stackdriver Logging to natively monitor application logs, which will appear in Stackdriver Logging.
Correct Answer:
B
Install the Stackdriver Logging agent on your database instances and configure the fluentd plugin to read and export your MySQL logs into Stackdriver Logging.
Why is this correct?
The Stackdriver Logging agent requires the fluentd plugin to be configured to read logs from your database application.
Incorrect Answer:
D
Configure Stackdriver Logging to natively monitor application logs, which will appear in Stackdriver Logging.
Why is this incorrect?
Stackdriver Logging requires the logging agent to be installed and configured to read application logs.
You are building a data pipeline on Google Cloud. You need to select services that will host a deep neural network machine learning model also hosted on Google Cloud. You also need to monitor and run jobs that could occasionally fail. What should you do?
A
Use the Cloud Machine Learning Engine to host your model. Monitor the status of the Jobs object for ‘failed’ job states.
B
Use the Cloud Machine Learning Engine to host your model. Monitor the status of the Operation object for ‘error’ results.
C
Use a Kubernetes Engine cluster to host your model. Monitor the status of the Jobs object for ‘failed’ job states.
D
Use a Kubernetes Engine cluster to host your model. Monitor the status of the Operation object for ‘error’ results.
Correct Answer:
A
Use the Cloud Machine Learning Engine to host your model. Monitor the status of the Jobs object for ‘failed’ job states.
Why is this correct?
Cloud Machine Learning Engine is the correct service for deep neural network models. You would correctly monitor Jobs for failures. https://linuxacademy.com/cp/courses/lesson/course/2247/lesson/1/module/208
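A hedged sketch of that monitoring step, using the Google API discovery client against the AI Platform (ML Engine) Training API: a training job's resource carries a state field that can be checked for FAILED. The project and job names are hypothetical.

```python
from googleapiclient import discovery

ml = discovery.build("ml", "v1")

# Hypothetical project and job IDs.
job_name = "projects/my-project/jobs/train_dnn_v1"

job = ml.projects().jobs().get(name=job_name).execute()
if job.get("state") == "FAILED":
    print("Training job failed:", job.get("errorMessage"))
```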
You have data stored in a Cloud Storage bucket and also in a BigQuery dataset. You need to secure the data and provide 3 different types of access levels for your Google Cloud Platform users: administrator, read/write, and read-only. You want to follow Google-recommended practices. What should you do?
A
At the Organization level, add your administrator user accounts to the Owner role, add your read/write user accounts to the Editor role, and add your read-only user accounts to the Viewer role.
B
At the Project level, add your administrator user accounts to the Owner role, add your read/write user accounts to the Editor role, and add your read-only user accounts to the Viewer role.
C
Create 3 custom IAM roles with appropriate policies for the access levels needed for Cloud Storage and BigQuery. Add your users to the appropriate roles.
D
Use the appropriate pre-defined IAM roles for each of the access levels needed for Cloud Storage and BigQuery. Add your users to those roles for each of the services.
Correct Answer:
D
Use the appropriate pre-defined IAM roles for each of the access levels needed for Cloud Storage and BigQuery. Add your users to those roles for each of the services.
Why is this correct?
The principle of least privilege favors using pre-defined roles for granular access. It is also best practice to prefer pre-defined roles over custom roles with associated policies when they match your requirements.
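As one example of granting pre-defined roles, the sketch below binds Cloud Storage roles on a bucket with the google-cloud-storage Python client; BigQuery has matching pre-defined roles (roles/bigquery.admin, roles/bigquery.dataEditor, roles/bigquery.dataViewer) that can be granted the same way on the dataset or project. The bucket name and Google Groups are hypothetical.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("analytics-data")  # hypothetical bucket

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {"role": "roles/storage.admin", "members": {"group:data-admins@example.com"}}        # administrator
)
policy.bindings.append(
    {"role": "roles/storage.objectAdmin", "members": {"group:data-editors@example.com"}}  # read/write
)
policy.bindings.append(
    {"role": "roles/storage.objectViewer", "members": {"group:data-viewers@example.com"}} # read-only
)
bucket.set_iam_policy(policy)
```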
You have a long-running, streaming Dataflow pipeline that you need to shut down. You do not need to preserve data currently in the processing pipeline and need it shut down as soon as possible. Which shutdown option should you use to complete the shutdown process?
A
Graceful shutdown
B
Cancel
C
Stop
D
Drain
Correct Answer:
B
Cancel
Why is this correct?
Cancel shuts the pipeline down immediately without allowing buffered data to finish processing. https://linuxacademy.com/cp/courses/lesson/course/2243/lesson/5/module/208
Incorrect Answer:
C
Stop
Why is this incorrect?
This is not a valid option. https://linuxacademy.com/cp/courses/lesson/course/2243/lesson/5/module/208
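Programmatically, cancelling (or draining) corresponds to updating the job's requested state through the Dataflow REST API, as in this hedged sketch using the discovery client. The project, region, and job ID are placeholders.

```python
from googleapiclient import discovery

dataflow = discovery.build("dataflow", "v1b3")

# Hypothetical project, region, and job ID.
dataflow.projects().locations().jobs().update(
    projectId="my-project",
    location="us-central1",
    jobId="2024-05-01_12_00_00-1234567890",
    body={"requestedState": "JOB_STATE_CANCELLED"},  # "JOB_STATE_DRAINED" would drain instead
).execute()
```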
What open source software is Cloud Pub/Sub most similar to?
A
Apache Beam
B
Apache Kafka
C
HBase
D
Apache Hadoop
Correct Answer:
B
Apache Kafka
Why is this correct?
Kafka is the open source streaming ingest framework you would use to build an equivalent messaging pipeline manually. https://linuxacademy.com/cp/courses/lesson/course/2241/lesson/2/module/208
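The analogy shows up directly in the client model: publishing to a Pub/Sub topic is much like a Kafka producer writing to a topic, with subscribers pulling messages downstream. A minimal sketch with the google-cloud-pubsub Python client follows; the project and topic names are hypothetical.

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "clickstream")  # hypothetical topic

# Attributes (here, user_id) ride along with the message payload, much like Kafka headers.
future = publisher.publish(topic_path, b'{"event": "page_view"}', user_id="42")
print("Published message", future.result())  # message ID once the publish succeeds
```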
You have hundreds of IoT devices that generate 1 TB of streaming data per day. Due to latency, messages will often be delayed compared to when they were generated. You must be able to account for data arriving late within your processing pipeline. What should you do?
A
Use Cloud SQL to process the delayed messages.
B
Enable your IoT devices to generate a timestamp when sending messages. Use Cloud Dataflow to process messages, and use windows, watermarks (timestamp), and triggers to process late data.
C
Use SQL queries in BigQuery to analyze data by timestamp.
D
Enable your IoT devices to generate a timestamp when sending messages. Use Cloud Pub/Sub to process messages by timestamp and fix out of order issues.
Correct Answer:
B
Enable your IoT devices to generate a timestamp when sending messages. Use Cloud Dataflow to process messages, and use windows, watermarks (timestamp), and triggers to process late data.
Why is this correct?
Dataflow is the service that handles out-of-order messages, using windows, watermarks, and triggers. https://linuxacademy.com/cp/courses/lesson/course/2243/lesson/3/module/208
Incorrect Answer:
D
Enable your IoT devices to generate a timestamp when sending messages. Use Cloud Pub/Sub to process messages by timestamp and fix out of order issues.
Why is this incorrect?
Pub/Sub does not guarantee message order; you would use Dataflow to process out-of-order messages by timestamp. https://linuxacademy.com/cp/courses/lesson/course/2243/lesson/3/module/208
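Here is a hedged Apache Beam (Python SDK) sketch of the windows/watermarks/triggers combination for late-arriving data. The topic, the "ts" timestamp attribute, the window size, the allowed lateness, and the device-id keying are all illustrative choices, not requirements.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger, window

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # Hypothetical topic; the device-generated "ts" attribute becomes the event timestamp.
        | "ReadIoT" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/iot-events", timestamp_attribute="ts")
        | "WindowWithLateness" >> beam.WindowInto(
            window.FixedWindows(60),                                     # 1-minute event-time windows
            trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),  # re-fire for each late element
            allowed_lateness=10 * 60,                                    # accept data up to 10 minutes late
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
        )
        | "KeyByDevice" >> beam.Map(lambda msg: (msg[:8], 1))            # hypothetical device-id prefix
        | "CountPerDevice" >> beam.CombinePerKey(sum)
    )
```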
You work at a very large organization that has a very large analyst team. You use the default pricing model for BigQuery. During heavy usage, your analyst group occasionally runs out of the 2,000 slots available for BigQuery jobs. You do not want to create additional projects for the sole purpose of increasing slot count. What can you do to resolve this?
A
You must create an additional project to increase your slot count, then spread the BigQuery loads across both projects.
B
Force-enable the ‘use cached results’ option for all available queries.
C
Switch to flat rate pricing to enable a higher total slot quota for your project.
D
Use the quotas page to increase your BigQuery slot count to 3000 as needed.
Correct Answer:
C
Switch to flat rate pricing to enable a higher total slot quota for your project.
Why is this correct?
Flat-rate pricing lets you purchase a dedicated number of slots for your project, higher than the default 2,000-slot on-demand allocation.
You are an administrator for several organizations in the same company. Each organization has data in its own BigQuery table within a single project. For application access reasons, all of the tables must remain in the same project. You think each organization should be able to view and run queries against their own data without exposing the data of other organizations to unauthorized viewers. What should you recommend?
A
You must separate the tables by project, and use a service account in your application to access data in each project. Give out project-wide roles to each organization.
B
Place the tables in a single dataset, and apply IAM roles to each table, limiting access per table to each organization.
C
Create a separate dataset for each organization in the same project. Place each organization’s table in each dataset. Restrict access to the organization’s dataset to only that company, from which they can view their table but no one else’s.
D
Place all data in a single table, create authorized views restricting access by row based on the SESSION_USER() field. Add that same SESSION_USER() field with the same email addresses according to which company needs access to which roles.
Correct Answer:
C
Create a separate dataset for each organization in the same project. Place each organization’s table in each dataset. Restrict access to the organization’s dataset to only that company, from which they can view their table but no one else’s.
Why is this correct?
You can assign roles at the dataset level. Placing tables in different datasets allows you to limit access per dataset. https://linuxacademy.com/cp/courses/lesson/course/2238/lesson/1/module/208
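A minimal sketch of granting dataset-level access with the BigQuery Python client follows; the project, dataset, and group names are hypothetical, and each organization's dataset would get its own entry.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset holding one organization's table.
dataset = client.get_dataset("shared-project.org_alpha")

entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",                                  # query/view access only
        entity_type="groupByEmail",
        entity_id="org-alpha-analysts@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```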
You are training a machine learning model to predict the likelihood of rain based on an available dataset of weather data. In reviewing your input data, the amount of humidity in the air has a very strong influence on the chance of rain, especially compared to less relevant data. How can you incorporate this more important data so that it properly influences the model?
A
Create a feature from the humidity data point, and use L2 regularization to optimize the model.
B
Tune your hyperparameters to give greater weighting to the humidity feature over others.
C
Create a feature from the humidity data point, and use L1 regularization to optimize the model.
D
Reduce your epochs except for humidity features.
Correct Answer:
C
Create a feature from the humidity data point, and use L1 regularization to optimize the model.
Why is this correct?
L1 regularization is able to reduce the weights of less important features to zero or near zero.
A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression. The key difference between the two is the penalty term. Ridge Regression adds the "squared magnitude" of the coefficients as a penalty term to the loss function (loss + λ·Σβj²). If lambda is zero, this reduces to ordinary least squares (OLS); if lambda is very large, the penalty carries too much weight and leads to under-fitting, so how lambda is chosen matters. This technique works very well for avoiding over-fitting. Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the "absolute value of the magnitude" of the coefficients as the penalty term (loss + λ·Σ|βj|). Again, a lambda of zero gives back OLS, while a very large value drives coefficients to zero and under-fits. The key difference between these techniques is that Lasso shrinks the coefficients of less important features to zero, removing some features altogether, which makes it useful for feature selection when you have a huge number of features. Traditional methods such as cross-validation and stepwise regression handle overfitting and feature selection well with a small set of features, but these regularization techniques are a great alternative when dealing with a large set of features.
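As a minimal illustration of applying an L1 penalty in practice, the Keras sketch below adds an L1 kernel regularizer to a small rain-prediction network; the layer sizes and the 0.01 penalty strength are illustrative, not tuned values.

```python
import tensorflow as tf

# Each weather input (humidity, pressure, temperature, ...) becomes a feature; the L1
# penalty pushes the weights of less useful features toward zero while the strongly
# predictive humidity feature keeps a meaningful weight.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        16,
        activation="relu",
        kernel_regularizer=tf.keras.regularizers.l1(0.01),  # L1 penalty: lambda * sum(|w|)
    ),
    tf.keras.layers.Dense(1, activation="sigmoid"),          # probability of rain
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```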
You are building a machine learning model to predict the number of lightning strikes during a storm. Your model has thousands of input features to train on. You want to improve the training speed of the model by removing features, but do not want to negatively affect your model’s accuracy. What action should you take?
A
Combine highly co-dependent and redundant features into one representative feature.
B
Implement L2 regularization to automatically ‘prune’ unneeded features
C
Remove the features that have null values for the majority of your records.
D
Remove features that have high correlation to your output labels.
Correct Answer:
A
Combine highly co-dependent and redundant features into one representative feature.
Why is this correct?
Combining co-dependent and redundant features allows you to reduce the total number of features trained without sacrificing accuracy.
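A rough pandas sketch of that idea follows: find highly correlated (co-dependent) feature pairs and collapse each into one representative feature, here simply their mean. The file name, the 0.95 correlation threshold, and the assumption of all-numeric columns are illustrative.

```python
import pandas as pd

df = pd.read_csv("storm_features.csv")  # hypothetical training data, numeric columns only

# Find highly correlated (co-dependent) feature pairs.
corr = df.corr().abs()
redundant_pairs = [
    (a, b)
    for a in corr.columns
    for b in corr.columns
    if a < b and corr.loc[a, b] > 0.95
]

# Collapse each redundant pair into one representative feature (here, their mean).
for a, b in redundant_pairs:
    if a in df.columns and b in df.columns:
        df[f"{a}_{b}_combined"] = df[[a, b]].mean(axis=1)
        df = df.drop(columns=[a, b])
```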
When training a machine learning model on AI Platform on a distributed scale tier, what types of machines are part of that distributed resource? (Choose all that apply)
A
Host
B
Worker
C
Master
D
Parameter server
Correct Answer:
B
Worker
Why is this correct?
You can have multiple workers, which divide up the work of training the model. https://linuxacademy.com/cp/courses/lesson/course/2247/lesson/1/module/208
Correct Answer:
C
Master
Why is this correct?
You have a single master instance per training job. https://linuxacademy.com/cp/courses/lesson/course/2247/lesson/1/module/208
Correct Answer:
D
Parameter server
Why is this correct?
Parameter servers coordinate shared model states between the workers. https://linuxacademy.com/cp/courses/lesson/course/2247/lesson/1/module/208
Incorrect Answer:
A
Host
Why is this incorrect?
This is not one of the scale tier machine types. https://linuxacademy.com/cp/courses/lesson/course/2247/lesson/1/module/208
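To see where master, worker, and parameter server show up, here is a hedged sketch of the training input you might submit to AI Platform Training for a CUSTOM scale tier (for example via a job config passed to the jobs.create call). Machine types, counts, bucket, and module names are placeholders.

```python
# Hypothetical training job spec for a CUSTOM scale tier on AI Platform Training.
training_input = {
    "scaleTier": "CUSTOM",
    "masterType": "n1-standard-8",           # exactly one master coordinates the job
    "workerType": "n1-standard-8",
    "workerCount": 4,                         # workers divide up the training work
    "parameterServerType": "n1-standard-4",
    "parameterServerCount": 2,                # parameter servers hold the shared model state
    "packageUris": ["gs://my-bucket/trainer-0.1.tar.gz"],
    "pythonModule": "trainer.task",
    "region": "us-central1",
}
```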
Your company needs to run analytics on their incoming inventory data. They need to use their existing Hadoop workloads to perform this task. What two steps must be performed to accomplish this? (Choose two answers)
A
Stream inventory data to Cloud Pub/Sub, process data with Cloud Dataflow into Bigtable and Cloud Storage.
B
Stream from Cloud Pub/Sub into Cloud Dataproc, which can then place relevant data in the appropriate storage location
C
Use Spark to accept the streaming ingest on the Dataproc cluster, and then process jobs on HDFS.
D
Connect Cloud Dataproc to Bigtable and Cloud Storage, running analytics on the data in both services.
Correct Answer:
B
Stream from Cloud Pub/Sub into Cloud Dataproc, which can then place relevant data in the appropriate storage location
Why is this correct?
Dataproc can connect to Pub/Sub for the streaming ingest, and it can then process the data and place it in the correct storage location.
Correct Answer:
D
Connect Cloud Dataproc to Bigtable and Cloud Storage, running analytics on the data in both services.
Why is this correct?
Dataproc can natively connect to both services and can run analytics on both.
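As one concrete example of that native connectivity, a PySpark job running on the Dataproc cluster can read Cloud Storage paths directly through the built-in Cloud Storage connector (Bigtable is reachable similarly through its HBase-compatible client). The bucket path and column name below are hypothetical.

```python
# PySpark job submitted to the Dataproc cluster: the built-in Cloud Storage
# connector lets Spark read gs:// paths directly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inventory-analytics").getOrCreate()

inventory = spark.read.json("gs://my-bucket/inventory/*.json")  # hypothetical bucket
inventory.groupBy("warehouse").count().show()
```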
You are developing an application on Google Cloud that will label famous landmarks in users’ photos. You are under competitive pressure to develop the predictive model quickly. You need to keep service costs low. What should you do?
A
Build and train a classification model with TensorFlow. Deploy the model using the Cloud Machine Learning Engine. Inspect the generated MID values to supply the image labels.
B
Build an application that calls the Cloud Vision API. Pass client image locations as base64-encoded strings.
C
Build an application that calls the Cloud Vision API. Inspect the generated MID values to supply the image labels.
D
Build and train a classification model with TensorFlow. Deploy the model using the Cloud Machine Learning Engine. Pass client image locations as base64-encoded strings.
Correct Answer:
B
Build an application that calls the Cloud Vision API. Pass client image locations as base64-encoded strings.
Why is this correct?
Cloud Vision API supports the ability to generate landmark labels from photos. You would want to pass along the images as base64 encoded strings, not MID. https://linuxacademy.com/cp/courses/lesson/course/2248/lesson/1/module/208
Incorrect Answer:
C
Build an application that calls the Cloud Vision API. Inspect the generated MID values to supply the image labels.
Why is this incorrect?
You would want to pass along base64 encoded strings, not MID. https://linuxacademy.com/cp/courses/lesson/course/2248/lesson/1/module/208
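A minimal sketch of the landmark call with the google-cloud-vision Python client follows; the filename is hypothetical. The Python client takes the raw image bytes, and over the underlying REST API that content travels as a base64-encoded string.

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("user_photo.jpg", "rb") as image_file:  # hypothetical client image
    content = image_file.read()

# Raw bytes here; the REST API representation of this content is base64-encoded.
image = vision.Image(content=content)
response = client.landmark_detection(image=image)
for landmark in response.landmark_annotations:
    print(landmark.description, landmark.score)
```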