Exam Topics 3 Flashcards by Anh Do

You need to copy millions of sensitive patient records from a relational database to BigQuery. The total size of the database is 10 TB. You need to design a solution that is secure and time-efficient. What should you do?
A. Export the records from the database as an Avro file. Upload the file to GCS using gsutil, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
B. Export the records from the database as an Avro file. Copy the file onto a Transfer Appliance and send it to Google, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
C. Export the records from the database into a CSV file. Create a public URL for the CSV file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the CSV file into BigQuery using the BigQuery web UI in the GCP Console.
D. Export the records from the database as an Avro file. Create a public URL for the Avro file, and then use Storage Transfer Service to move the file to Cloud Storage. Load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.

A. Export the records from the database as an Avro file. Upload the file to GCS using gsutil, and then load the Avro file into BigQuery using the BigQuery web UI in the GCP Console.
Reason: Avro is more effecient so C is out. Since the data is sensitive, D is out. B is out to since we don’t want the risk of Transfer Appliance.

How well did you know this?

Not at all

Perfectly

You need to create a near real-time inventory dashboard that reads the main inventory tables in your BigQuery data warehouse. Historical inventory data is stored as inventory balances by item and location. You have several thousand updates to inventory every hour. You want to maximize performance of the dashboard and ensure that the data is accurate. What should you do?
A. Leverage BigQuery UPDATE statements to update the inventory balances as they are changing.
B. Partition the inventory balance table by item to reduce the amount of data scanned with each inventory update.
C. Use the BigQuery streaming the stream changes into a daily inventory movement table. Calculate balances in a view that joins it to the historical inventory balance table. Update the inventory balance table nightly.
D. Use the BigQuery bulk loader to batch load inventory changes into a daily inventory movement table. Calculate balances in a view that joins it to the historical inventory balance table. Update the inventory balance table nightly.

A. Leverage BigQuery UPDATE statements to update the inventory balances as they are changing.
Reason: Needs real time so C and D are out. B would create too many partition.

How well did you know this?

Not at all

Perfectly

You have a data stored in BigQuery. The data in the BigQuery dataset must be highly available. You need to define a storage, backup, and recovery strategy of this data that minimizes cost. How should you configure the BigQuery table?
A. Set the BigQuery dataset to be regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
B. Set the BigQuery dataset to be regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.
C. Set the BigQuery dataset to be multi-regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
D. Set the BigQuery dataset to be multi-regional. Create a scheduled query to make copies of the data to tables suffixed with the time of the backup. In the event of an emergency, use the backup copy of the table.

C. Set the BigQuery dataset to be multi-regional. In the event of an emergency, use a point-in-time snapshot to recover the data.
Reason: highly available = multi-regional. There is snapshot built-in so no need for D

How well did you know this?

Not at all

Perfectly

You used Cloud Dataprep to create a recipe on a sample of data in a BigQuery table. You want to reuse this recipe on a daily upload of data with the same schema, after the load job with variable execution time completes. What should you do?
A. Create a cron schedule in Cloud Dataprep.
B. Create an App Engine cron job to schedule the execution of the Cloud Dataprep job.
C. Export the recipe as a Cloud Dataprep template, and create a job in Cloud Scheduler.
D. Export the Cloud Dataprep job as a Cloud Dataflow template, and incorporate it into a Cloud Composer job.

D. Export the Cloud Dataprep job as a Cloud Dataflow template, and incorporate it into a Cloud Composer job.
Reason: cron job and Cloud Scheduler can only run at a fixed time. The requirements is follow a load job with variable execution time, so Cloud Composer is needed.

How well did you know this?

Not at all

Perfectly

You want to automate execution of a multi-step data pipeline running on Google Cloud. The pipeline includes Cloud Dataproc and Cloud Dataflow jobs that have multiple dependencies on each other. You want to use managed services where possible, and the pipeline will run every day. Which tool should you use?
A. cron
B. Cloud Composer
C. Cloud Scheduler
D. Workflow Templates on Cloud Dataproc

B. Cloud Composer

How well did you know this?

Not at all

Perfectly

You are managing a Cloud Dataproc cluster. You need to make a job run faster while minimizing costs, without losing work in progress on your clusters. What should you do?
A. Increase the cluster size with more non-preemptible workers.
B. Increase the cluster size with preemptible worker nodes, and configure them to forcefully decommission.
C. Increase the cluster size with preemptible worker nodes, and use Cloud Stackdriver to trigger a script to preserve work.
D. Increase the cluster size with preemptible worker nodes, and configure them to use graceful decommissioning.

D. Increase the cluster size with preemptible worker nodes, and configure them to use graceful decommissioning.
Reason: preemptible worker is cheaper. Use graceful decommission to not losing work in progress.

How well did you know this?

Not at all

Perfectly

You work for a shipping company that uses handheld scanners to read shipping labels. Your company has strict data privacy standards that require scanners to only transmit recipients’ personally identifiable information (PII) to analytics systems, which violates user privacy rules. You want to quickly build a scalable solution using cloud-native managed services to prevent exposure of PII to the analytics systems. What should you do?
A. Create an authorized view in BigQuery to restrict access to tables with sensitive data.
B. Install a third-party data validation tool on Compute Engine virtual machines to check the incoming data for sensitive information.
C. Use Stackdriver logging to analyze the data passed through the total pipeline to identify transactions that may contain sensitive information.
D. Build a Cloud Function that reads the topics and makes a call to the Cloud Data Loss Prevention API. Use the tagging and confidence levels to either pass or quarantine the data in a bucket for review.

D. Build a Cloud Function that reads the topics and makes a call to the Cloud Data Loss Prevention API. Use the tagging and confidence levels to either pass or quarantine the data in a bucket for review.
Reason: DLP is for scanning sensitive data

How well did you know this?

Not at all

Perfectly

You have developed three data processing jobs. One executes a Cloud Dataflow pipeline that transforms data uploaded to Cloud Storage and writes results to BigQuery. The second ingests data from on-premises servers and uploads it to Cloud Storage. The third is a Cloud Dataflow pipeline that gets information from third-party data providers and uploads the information to Cloud Storage. You need to be able to schedule and monitor the execution of these three workflows and manually execute them when needed. What should you do?
A. Create a Direct Acyclic Graph in Cloud Composer to schedule and monitor the jobs.
B. Use Stackdriver Monitoring and set up an alert with a Webhook notification to trigger the jobs.
C. Develop an App Engine application to schedule and request the status of the jobs using GCP API calls.
D. Set up cron jobs in a Compute Engine instance to schedule and monitor the pipelines using GCP API calls.

A. Create a Direct Acyclic Graph in Cloud Composer to schedule and monitor the jobs.

How well did you know this?

Not at all

Perfectly

You have Cloud Functions written in Node.js that pull messages from Cloud Pub/Sub and send the data to BigQuery. You observe that the message processing rate on the Pub/Sub topic is orders of magnitude higher than anticipated, but there is no error logged in Stackdriver Log Viewer. What are the two most likely causes of this problem? (Choose two.)
A. Publisher throughput quota is too small.
B. Total outstanding messages exceed the 10-MB maximum.
C. Error handling in the subscriber code is not handling run-time errors properly.
D. The subscriber code cannot keep up with the messages.
E. The subscriber code does not acknowledge the messages that it pulls.

C. Error handling in the subscriber code is not handling run-time errors properly.
E. The subscriber code does not acknowledge the messages that it pulls.
Reason: A, B is wrong. D could be the underlying reason for errors, but not for why no error logged.

How well did you know this?

Not at all

Perfectly

You are creating a new pipeline in Google Cloud to stream IoT data from Cloud Pub/Sub through Cloud Dataflow to BigQuery. While previewing the data, you notice that roughly 2% of the data appears to be corrupt. You need to modify the Cloud Dataflow pipeline to filter out this corrupt data. What should you do?
A. Add a SideInput that returns a Boolean if the element is corrupt.
B. Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
C. Add a Partition transform in Cloud Dataflow to separate valid data from corrupt data.
D. Add a GroupByKey transform in Cloud Dataflow to group all of the valid data together and discard the rest.

B. Add a ParDo transform in Cloud Dataflow to discard corrupt elements.
Reason: Use ParDo when possible. A is wrong, no need SideInput. C and D are wrong since Partition and GroupByKey is not related to corruption.

How well did you know this?

Not at all

Perfectly

You have historical data covering the last three years in BigQuery and a data pipeline that delivers new data to BigQuery daily. You have noticed that when the Data Science team runs a query filtered on a date column and limited to 30”“90 days of data, the query scans the entire table. You also noticed that your bill is increasing more quickly than you expected. You want to resolve the issue as cost-effectively as possible while maintaining the ability to conduct SQL queries.
What should you do?
A. Re-create the tables using DDL. Partition the tables by a column containing a TIMESTAMP or DATE Type.
B. Recommend that the Data Science team export the table to a CSV file on Cloud Storage and use Cloud Datalab to explore the data by reading the files directly.
C. Modify your pipeline to maintain the last 30”“90 days of data in one table and the longer history in a different table to minimize full table scans over the entire history.
D. Write an Apache Beam pipeline that creates a BigQuery table per day. Recommend that the Data Science team use wildcards on the table name suffixes to select the data they need.

A. Re-create the tables using DDL. Partition the tables by a column containing a TIMESTAMP or DATE Type.
Reason: Partition works.

How well did you know this?

Not at all

Perfectly

You operate a logistics company, and you want to improve event delivery reliability for vehicle-based sensors. You operate small data centers around the world to capture these events, but leased lines that provide connectivity from your event collection infrastructure to your event processing infrastructure are unreliable, with unpredictable latency. You want to address this issue in the most cost-effective way. What should you do?
A. Deploy small Kafka clusters in your data centers to buffer events.
B. Have the data acquisition devices publish data to Cloud Pub/Sub.
C. Establish a Cloud Interconnect between all remote data centers and Google.
D. Write a Cloud Dataflow pipeline that aggregates all data in session windows.

B. Have the data acquisition devices publish data to Cloud Pub/Sub.
Reason: Pub/Sub is available and cost effective. A is more complicated than B. C is expensive.

How well did you know this?

Not at all

Perfectly

You are a retailer that wants to integrate your online sales capabilities with different in-home assistants, such as Google Home. You need to interpret customer voice commands and issue an order to the backend systems. Which solutions should you choose?
A. Cloud Speech-to-Text API
B. Cloud Natural Language API
C. Dialogflow Enterprise Edition
D. Cloud AutoML Natural Language

C. Dialogflow Enterprise Edition
Reason: A just provide the text file which doesn’t do anything. B has sentiment analysis, entity analysis, but not intent. D is wrong since it allows the build of custom model.

How well did you know this?

Not at all

Perfectly

Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers. Which cloud-native service should you use to orchestrate the entire pipeline?
A. Cloud Dataflow
B. Cloud Composer
C. Cloud Dataprep
D. Cloud Dataproc

B. Cloud Composer

How well did you know this?

Not at all

Perfectly

You use a dataset in BigQuery for analysis. You want to provide third-party companies with access to the same dataset. You need to keep the costs of data sharing low and ensure that the data is current. Which solution should you choose?
A. Create an authorized view on the BigQuery table to control data access, and provide third-party companies with access to that view.
B. Use Cloud Scheduler to export the data on a regular basis to Cloud Storage, and provide third-party companies with access to the bucket.
C. Create a separate dataset in BigQuery that contains the relevant data to share, and provide third-party companies with access to the new dataset.
D. Create a Cloud Dataflow job that reads the data in frequent time intervals, and writes it to the relevant BigQuery dataset or Cloud Storage bucket for third-party companies to use.

A. Create an authorized view on the BigQuery table to control data access, and provide third-party companies with access to that view.
Reason: View is the way to go. B and D are completed and not up to date. C is wrong since new dataset is costly.

How well did you know this?

Not at all

Perfectly

A shipping company has live package-tracking data that is sent to an Apache Kafka stream in real time. This is then loaded into BigQuery. Analysts in your company want to query the tracking data in BigQuery to analyze geospatial trends in the lifecycle of a package. The table was originally created with ingest-date partitioning. Over time, the query processing time has increased. You need to implement a change that would improve query performance in BigQuery. What should you do?
A. Implement clustering in BigQuery on the ingest date column.
B. Implement clustering in BigQuery on the package-tracking ID column.
C. Tier older data onto Cloud Storage files, and leverage extended tables.
D. Re-create the table using data partitioning on the package delivery date.

B. Implement clustering in BigQuery on the package-tracking ID column.
Reason: clustering helps. A is wrong since it is already partitioned by date.

How well did you know this?

Not at all

Perfectly

You are designing a data processing pipeline. The pipeline must be able to scale automatically as load increases. Messages must be processed at least once and must be ordered within windows of 1 hour. How should you design the solution?
A. Use Apache Kafka for message ingestion and use Cloud Dataproc for streaming analysis.
B. Use Apache Kafka for message ingestion and use Cloud Dataflow for streaming analysis.
C. Use Cloud Pub/Sub for message ingestion and Cloud Dataproc for streaming analysis.
D. Use Cloud Pub/Sub for message ingestion and Cloud Dataflow for streaming analysis.

D. Use Cloud Pub/Sub for message ingestion and Cloud Dataflow for streaming analysis.

How well did you know this?

Not at all

Perfectly

You need to set access to BigQuery for different departments within your company. Your solution should comply with the following requirements:
✑ Each department should have access only to their data.
Each department will have one or more leads who need to be able to create and update tables and provide them to their team.

✑ Each department has data analysts who need to be able to query but not modify data.
How should you set access to the data in BigQuery?
A. Create a dataset for each department. Assign the department leads the role of OWNER, and assign the data analysts the role of WRITER on their dataset.
B. Create a dataset for each department. Assign the department leads the role of WRITER, and assign the data analysts the role of READER on their dataset.
C. Create a table for each department. Assign the department leads the role of Owner, and assign the data analysts the role of Editor on the project the table is in.
D. Create a table for each department. Assign the department leads the role of Editor, and assign the data analysts the role of Viewer on the project the table is in.

B. Create a dataset for each department. Assign the department leads the role of WRITER, and assign the data analysts the role of READER on their dataset.

How well did you know this?

Not at all

Perfectly

You operate a database that stores stock trades and an application that retrieves average stock price for a given company over an adjustable window of time. The data is stored in Cloud Bigtable where the datetime of the stock trade is the beginning of the row key. Your application has thousands of concurrent users, and you notice that performance is starting to degrade as more stocks are added. What should you do to improve the performance of your application?
A. Change the row key syntax in your Cloud Bigtable table to begin with the stock symbol.
B. Change the row key syntax in your Cloud Bigtable table to begin with a random number per second.
C. Change the data pipeline to use BigQuery for storing stock trades, and update your application.
D. Use Cloud Dataflow to write summary of each day’s stock trades to an Avro file on Cloud Storage. Update your application to read from Cloud Storage and Cloud Bigtable to compute the responses.

A. Change the row key syntax in your Cloud Bigtable table to begin with the stock symbol.
Reason: In row key, timestamp should be after

How well did you know this?

Not at all

Perfectly

You are operating a Cloud Dataflow streaming pipeline. The pipeline aggregates events from a Cloud Pub/Sub subscription source, within a window, and sinks the resulting aggregation to a Cloud Storage bucket. The source has consistent throughput. You want to monitor an alert on behavior of the pipeline with Cloud
Stackdriver to ensure that it is processing data. Which Stackdriver alerts should you create?
A. An alert based on a decrease of subscription/num_undelivered_messages for the source and a rate of change increase of instance/storage/ used_bytes for the destination
B. An alert based on an increase of subscription/num_undelivered_messages for the source and a rate of change decrease of instance/storage/ used_bytes for the destination
C. An alert based on a decrease of instance/storage/used_bytes for the source and a rate of change increase of subscription/ num_undelivered_messages for the destination
D. An alert based on an increase of instance/storage/used_bytes for the source and a rate of change decrease of subscription/ num_undelivered_messages for the destination

B. An alert based on an increase of subscription/num_undelivered_messages for the source and a rate of change decrease of instance/storage/ used_bytes for the destination
Reason: source is Pub/Sub so must be subscription and number of undelivered messages, increasing is bad. Destination is GCS so must be instance/storage/used bytes, decreasing means some errors in the pipeline, which is bad.

How well did you know this?

Not at all

Perfectly

You currently have a single on-premises Kafka cluster in a data center in the us-east region that is responsible for ingesting messages from IoT devices globally.
Because large parts of globe have poor internet connectivity, messages sometimes batch at the edge, come in all at once, and cause a spike in load on your
Kafka cluster. This is becoming difficult to manage and prohibitively expensive. What is the Google-recommended cloud native architecture for this scenario?
A. Edge TPUs as sensor devices for storing and transmitting the messages.
B. Cloud Dataflow connected to the Kafka cluster to scale the processing of incoming messages.
C. An IoT gateway connected to Cloud Pub/Sub, with Cloud Dataflow to read and process the messages from Cloud Pub/Sub.
D. A Kafka cluster virtualized on Compute Engine in us-east with Cloud Load Balancing to connect to the devices around the world.

C. An IoT gateway connected to Cloud Pub/Sub, with Cloud Dataflow to read and process the messages from Cloud Pub/Sub.

You decided to use Cloud Datastore to ingest vehicle telemetry data in real time. You want to build a storage system that will account for the long-term data growth, while keeping the costs low. You also want to create snapshots of the data periodically, so that you can make a point-in-time (PIT) recovery, or clone a copy of the data for Cloud Datastore in a different environment. You want to archive these snapshots for a long time. Which two methods can accomplish this?
(Choose two.)
A. Use managed export, and store the data in a Cloud Storage bucket using Nearline or Coldline class.
B. Use managed export, and then import to Cloud Datastore in a separate project under a unique namespace reserved for that export.
C. Use managed export, and then import the data into a BigQuery table created just for that export, and delete temporary export files.
D. Write an application that uses Cloud Datastore client libraries to read all the entities. Treat each entity as a BigQuery table row via BigQuery streaming insert. Assign an export timestamp for each export, and attach it as an extra column for each row. Make sure that the BigQuery table is partitioned using the export timestamp column.
E. Write an application that uses Cloud Datastore client libraries to read all the entities. Format the exported data into a JSON file. Apply compression before storing the data in Cloud Source Repositories.

A. Use managed export, and store the data in a Cloud Storage bucket using Nearline or Coldline class.
D. Write an application that uses Cloud Datastore client libraries to read all the entities. Treat each entity as a BigQuery table row via BigQuery streaming insert. Assign an export timestamp for each export, and attach it as an extra column for each row. Make sure that the BigQuery table is partitioned using the export timestamp column.

Reason: A is the best solution in term of cost. D is more expensive, but can do point-in-time recovery. The rest cannot do recovery.

You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis.
Every hour, thousands of transactions are updated with a new status. The size of the intitial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt? (Choose two.)
A. Denormalize the data as must as possible.
B. Preserve the structure of the data as much as possible.
C. Use BigQuery UPDATE to further reduce the size of the dataset.
D. Develop a data pipeline where status updates are appended to BigQuery instead of updated.
E. Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery’s support for external data sources to query.

A. Denormalize the data as must as possible.
D. Develop a data pipeline where status updates are appended to BigQuery instead of updated.
Reason: Denormalize is good for query so A is correct. D is correct because update takes a long time. E is wrong because avro doesn’t help.

You are designing a cloud-native historical data processing system to meet the following conditions:
✑ The data being analyzed is in CSV, Avro, and PDF formats and will be accessed by multiple analysis tools including Cloud Dataproc, BigQuery, and Compute Engine.
✑ A streaming data pipeline stores new data daily.
✑ Peformance is not a factor in the solution.
✑ The solution design should maximize availability.
How should you design data storage for this solution?
A. Create a Cloud Dataproc cluster with high availability. Store the data in HDFS, and peform analysis as needed.
B. Store the data in BigQuery. Access the data using the BigQuery Connector on Cloud Dataproc and Compute Engine.
C. Store the data in a regional Cloud Storage bucket. Access the bucket directly using Cloud Dataproc, BigQuery, and Compute Engine.
D. Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine.

D. Store the data in a multi-regional Cloud Storage bucket. Access the data directly using Cloud Dataproc, BigQuery, and Compute Engine.
Reason: Only GCS can store different file formats. Maximum availability = multi-regional.

You have a petabyte of analytics data and need to design a storage and processing platform for it. You must be able to perform data warehouse-style analytics on the data in Google Cloud and expose the dataset as files for batch analysis tools in other cloud providers. What should you do? A. Store and process the entire dataset in BigQuery. B. Store and process the entire dataset in Cloud Bigtable. C. Store the full dataset in BigQuery, and store a compressed copy of the data in a Cloud Storage bucket. D. Store the warm data as files in Cloud Storage, and store the active data in BigQuery. Keep this ratio as 80% warm and 20% active.

C. Store the full dataset in BigQuery, and store a compressed copy of the data in a Cloud Storage bucket. Reason: Analytics = Big Query. Expose = GCS. Need to expose the dataset to other cloud providers so all data has to be stored in both places. Thus, D is wrong

You work for a manufacturing company that sources up to 750 different components, each from a different supplier. You've collected a labeled dataset that has on average 1000 examples for each unique component. Your team wants to implement an app to help warehouse workers recognize incoming components based on a photo of the component. You want to implement the first working version of this app (as Proof-Of-Concept) within a few working days. What should you do? A. Use Cloud Vision AutoML with the existing dataset. B. Use Cloud Vision AutoML, but reduce your dataset twice. C. Use Cloud Vision API by providing custom labels as recognition hints. D. Train your own image recognition model leveraging transfer learning techniques.

B. Use Cloud Vision AutoML, but reduce your dataset twice. Reason: This is custom dataset so needs Cloud Vision AutoML. Since it is Proof-Of-Concept, reduce dataset will reduce training time.

You are working on a niche product in the image recognition domain. Your team has developed a model that is dominated by custom C++ TensorFlow ops your team has implemented. These ops are used inside your main training loop and are performing bulky matrix multiplications. It currently takes up to several days to train a model. You want to decrease this time significantly and keep the cost low by using an accelerator on Google Cloud. What should you do? A. Use Cloud TPUs without any additional adjustment to your code. B. Use Cloud TPUs after implementing GPU kernel support for your customs ops. C. Use Cloud GPUs after implementing GPU kernel support for your customs ops. D. Stay on CPUs, and increase the size of the cluster you're training your model on.

D. Stay on CPUs, and increase the size of the cluster you're training your model on. Reason: dominated by custom C++ TensorFlow ops means CPU. TPUs are only for Tensorflow with no custom ops.

You work on a regression problem in a natural language processing domain, and you have 100M labeled exmaples in your dataset. You have randomly shuffled your data and split your dataset into train and test samples (in a 90/10 ratio). After you trained the neural network and evaluated your model on a test set, you discover that the root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set. How should you improve the performance of your model? A. Increase the share of the test sample in the train-test split. B. Try to collect more data and increase the size of your dataset. C. Try out regularization techniques (e.g., dropout of batch normalization) to avoid overfitting. D. Increase the complexity of your model by, e.g., introducing an additional layer or increase sizing the size of vocabularies or n-grams used.

D. Increase the complexity of your model by, e.g., introducing an additional layer or increase sizing the size of vocabularies or n-grams used. Reason: more error on train set means underfitting. B and C is for overfitting. A doesn't do much.

You use BigQuery as your centralized analytics platform. New data is loaded every day, and an ETL pipeline modifies the original data and prepares it for the final users. This ETL pipeline is regularly modified and can generate errors, but sometimes the errors are detected only after 2 weeks. You need to provide a method to recover from these errors, and your backups should be optimized for storage costs. How should you organize your data in BigQuery and store your backups? A. Organize your data in a single table, export, and compress and store the BigQuery data in Cloud Storage. B. Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage. C. Organize your data in separate tables for each month, and duplicate your data on a separate dataset in BigQuery. D. Organize your data in separate tables for each month, and use snapshot decorators to restore the table to a time prior to the corruption.

B. Organize your data in separate tables for each month, and export, compress, and store the data in Cloud Storage. Reason: optimize for storage cost means GCS. D is wrong because snapshot decorator only allow to travel back 7 days.

The marketing team at your organization provides regular updates of a segment of your customer dataset. The marketing team has given you a CSV with 1 million records that must be updated in BigQuery. When you use the UPDATE statement in BigQuery, you receive a quotaExceeded error. What should you do? A. Reduce the number of records updated each day to stay within the BigQuery UPDATE DML statement limit. B. Increase the BigQuery UPDATE DML statement limit in the Quota management section of the Google Cloud Platform Console. C. Split the source CSV file into smaller CSV files in Cloud Storage to reduce the number of BigQuery UPDATE DML statements per BigQuery job. D. Import the new records from the CSV file into a new BigQuery table. Create a BigQuery job that merges the new records with the existing records and writes the results to a new BigQuery table.

D. Import the new records from the CSV file into a new BigQuery table. Create a BigQuery job that merges the new records with the existing records and writes the results to a new BigQuery table. Reason: BigQuery favor append-only. Update DML should not be used extensively. Thus, A, B, and C only delays the problem.

As your organization expands its usage of GCP, many teams have started to create their own projects. Projects are further multiplied to accommodate different stages of deployments and target audiences. Each project requires unique access control configurations. The central IT team needs to have access to all projects. Furthermore, data from Cloud Storage buckets and BigQuery datasets must be shared for use in other projects in an ad hoc way. You want to simplify access control management by minimizing the number of policies. Which two steps should you take? (Choose two.) A. Use Cloud Deployment Manager to automate access provision. B. Introduce resource hierarchy to leverage access control policy inheritance. C. Create distinct groups for various teams, and specify groups in Cloud IAM policies. D. Only use service accounts when sharing data for Cloud Storage buckets and BigQuery datasets. E. For each Cloud Storage bucket or BigQuery dataset, decide which projects need access. Find all the active members who have access to these projects, and create a Cloud IAM policy to grant access to all these users.

B. Introduce resource hierarchy to leverage access control policy inheritance. C. Create distinct groups for various teams, and specify groups in Cloud IAM policies. Reason: A is wrong because each project is unique in configuration. D is wrong because need user accounts. E is too manual.

Your United States-based company has created an application for assessing and responding to user actions. The primary table's data volume grows by 250,000 records per second. Many third parties use your application's APIs to build the functionality into their own frontend applications. Your application's APIs should comply with the following requirements: ✑ Single global endpoint ✑ ANSI SQL support ✑ Consistent access to the most up-to-date data What should you do? A. Implement BigQuery with no region selected for storage or processing. B. Implement Cloud Spanner with the leader in North America and read-only replicas in Asia and Europe. C. Implement Cloud SQL for PostgreSQL with the master in Norht America and read replicas in Asia and Europe. D. Implement Cloud Bigtable with the primary cluster in North America and secondary clusters in Asia and Europe.

B. Implement Cloud Spanner with the leader in North America and read-only replicas in Asia and Europe. Reason: A is wrong because no region selected. C is wrong because there is a lot of data. D is wrong because needs SQL support.

A data scientist has created a BigQuery ML model and asks you to create an ML pipeline to serve predictions. You have a REST API application with the requirement to serve predictions for an individual user ID with latency under 100 milliseconds. You use the following query to generate predictions: SELECT predicted_label, user_id FROM ML.PREDICT (MODEL "˜dataset.model', table user_features). How should you create the ML pipeline? A. Add a WHERE clause to the query, and grant the BigQuery Data Viewer role to the application service account. B. Create an Authorized View with the provided query. Share the dataset that contains the view with the application service account. C. Create a Cloud Dataflow pipeline using BigQueryIO to read results from the query. Grant the Dataflow Worker role to the application service account. D. Create a Cloud Dataflow pipeline using BigQueryIO to read predictions for all users from the query. Write the results to Cloud Bigtable using BigtableIO. Grant the Bigtable Reader role to the application service account so that the application can read predictions for individual users from Cloud Bigtable.

D. Create a Cloud Dataflow pipeline using BigQueryIO to read predictions for all users from the query. Write the results to Cloud Bigtable using BigtableIO. Grant the Bigtable Reader role to the application service account so that the application can read predictions for individual users from Cloud Bigtable. Reason: To be under 100ms, query needs to be run already and store in Bigtable for fast access. Other options need to query which is slow.

You are building an application to share financial market data with consumers, who will receive data feeds. Data is collected from the markets in real time. Consumers will receive the data in the following ways: ✑ Real-time event stream ✑ ANSI SQL access to real-time stream and historical data ✑ Batch historical exports Which solution should you use? A. Cloud Dataflow, Cloud SQL, Cloud Spanner B. Cloud Pub/Sub, Cloud Storage, BigQuery C. Cloud Dataproc, Cloud Dataflow, BigQuery D. Cloud Pub/Sub, Cloud Dataproc, Cloud SQL

B. Cloud Pub/Sub, Cloud Storage, BigQuery | Reason: Pub/Sub for event stream, SQL for SQL, GCS for batch export

You are building a new application that you need to collect data from in a scalable way. Data arrives continuously from the application throughout the day, and you expect to generate approximately 150 GB of JSON data per day by the end of the year. Your requirements are: ✑ Decoupling producer from consumer ✑ Space and cost-efficient storage of the raw ingested data, which is to be stored indefinitely ✑ Near real-time SQL query ✑ Maintain at least 2 years of historical data, which will be queried with SQL Which pipeline should you use to meet these requirements? A. Create an application that provides an API. Write a tool to poll the API and write data to Cloud Storage as gzipped JSON files. B. Create an application that writes to a Cloud SQL database to store the data. Set up periodic exports of the database to write to Cloud Storage and load into BigQuery. C. Create an application that publishes events to Cloud Pub/Sub, and create Spark jobs on Cloud Dataproc to convert the JSON data to Avro format, stored on HDFS on Persistent Disk. D. Create an application that publishes events to Cloud Pub/Sub, and create a Cloud Dataflow pipeline that transforms the JSON event payloads to Avro, writing the data to Cloud Storage and BigQuery.

D. Create an application that publishes events to Cloud Pub/Sub, and create a Cloud Dataflow pipeline that transforms the JSON event payloads to Avro, writing the data to Cloud Storage and BigQuery. Reason: Pub/Sub to decouple. Dataflow is better than Dataproc.

You are running a pipeline in Cloud Dataflow that receives messages from a Cloud Pub/Sub topic and writes the results to a BigQuery dataset in the EU. Currently, your pipeline is located in europe-west4 and has a maximum of 3 workers, instance type n1-standard-1. You notice that during peak periods, your pipeline is struggling to process records in a timely fashion, when all 3 workers are at maximum CPU utilization. Which two actions can you take to increase performance of your pipeline? (Choose two.) A. Increase the number of max workers B. Use a larger instance type for your Cloud Dataflow workers C. Change the zone of your Cloud Dataflow pipeline to run in us-central1 D. Create a temporary table in Cloud Bigtable that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Bigtable to BigQuery E. Create a temporary table in Cloud Spanner that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Spanner to BigQuery

A. Increase the number of max workers B. Use a larger instance type for your Cloud Dataflow workers Reason: The current pipeline already use all resources.

You have a data pipeline with a Cloud Dataflow job that aggregates and writes time series metrics to Cloud Bigtable. This data feeds a dashboard used by thousands of users across the organization. You need to support additional concurrent users and reduce the amount of time required to write the data. Which two actions should you take? (Choose two.) A. Configure your Cloud Dataflow pipeline to use local execution B. Increase the maximum number of Cloud Dataflow workers by setting maxNumWorkers in PipelineOptions C. Increase the number of nodes in the Cloud Bigtable cluster D. Modify your Cloud Dataflow pipeline to use the Flatten transform before writing to Cloud Bigtable E. Modify your Cloud Dataflow pipeline to use the CoGroupByKey transform before writing to Cloud Bigtable

B. Increase the maximum number of Cloud Dataflow workers by setting maxNumWorkers in PipelineOptions C. Increase the number of nodes in the Cloud Bigtable cluster Reason: C is correct to support more concurrent users. B is correct to reduce write time. D and E are wrong because there is no mention of query patterns.

You have several Spark jobs that run on a Cloud Dataproc cluster on a schedule. Some of the jobs run in sequence, and some of the jobs run concurrently. You need to automate this process. What should you do? A. Create a Cloud Dataproc Workflow Template B. Create an initialization action to execute the jobs C. Create a Directed Acyclic Graph in Cloud Composer D. Create a Bash script that uses the Cloud SDK to create a cluster, execute jobs, and then tear down the cluster

C. Create a Directed Acyclic Graph in Cloud Composer

You are building a new data pipeline to share data between two different types of applications: jobs generators and job runners. Your solution must scale to accommodate increases in usage and must accommodate the addition of new applications without negatively affecting the performance of existing ones. What should you do? A. Create an API using App Engine to receive and send messages to the applications B. Use a Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them C. Create a table on Cloud SQL, and insert and delete rows with the job information D. Create a table on Cloud Spanner, and insert and delete rows with the job information

B. Use a Cloud Pub/Sub topic to publish jobs, and use subscriptions to execute them Reason: Pub/Sub for decoupling

You need to create a new transaction table in Cloud Spanner that stores product sales data. You are deciding what to use as a primary key. From a performance perspective, which strategy should you choose? A. The current epoch time B. A concatenation of the product name and the current epoch time C. A random universally unique identifier number (version 4 UUID) D. The original order identification number from the sales system, which is a monotonically increasing integer

C. A random universally unique identifier number (version 4 UUID) Reason: Primary key in Cloud Spanner should avoid create hotspot which slow down the writes. Other options is monotonically increase and create hotspot

Data Analysts in your company have the Cloud IAM Owner role assigned to them in their projects to allow them to work with multiple GCP products in their projects. Your organization requires that all BigQuery data access logs be retained for 6 months. You need to ensure that only audit personnel in your company can access the data access logs for all projects. What should you do? A. Enable data access logs in each Data Analyst's project. Restrict access to Stackdriver Logging via Cloud IAM roles. B. Export the data access logs via a project-level export sink to a Cloud Storage bucket in the Data Analysts' projects. Restrict access to the Cloud Storage bucket. C. Export the data access logs via a project-level export sink to a Cloud Storage bucket in a newly created projects for audit logs. Restrict access to the project with the exported logs. D. Export the data access logs via an aggregated export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project that contains the exported logs.

D. Export the data access logs via an aggregated export sink to a Cloud Storage bucket in a newly created project for audit logs. Restrict access to the project that contains the exported logs. Reason: Need a separate project for audit logs. Use aggregated export sink to get logs from all projects

Each analytics team in your organization is running BigQuery jobs in their own projects. You want to enable each team to monitor slot usage within their projects. What should you do? A. Create a Stackdriver Monitoring dashboard based on the BigQuery metric query/scanned_bytes B. Create a Stackdriver Monitoring dashboard based on the BigQuery metric slots/allocated_for_project C. Create a log export for each project, capture the BigQuery job execution logs, create a custom metric based on the totalSlotMs, and create a Stackdriver Monitoring dashboard based on the custom metric D. Create an aggregated log export at the organization level, capture the BigQuery job execution logs, create a custom metric based on the totalSlotMs, and create a Stackdriver Monitoring dashboard based on the custom metric

B. Create a Stackdriver Monitoring dashboard based on the BigQuery metric slots/allocated_for_project Reason: Use Stackdriver instead of custom metric. B is correct for slot allocated.

You are operating a streaming Cloud Dataflow pipeline. Your engineers have a new version of the pipeline with a different windowing algorithm and triggering strategy. You want to update the running pipeline with the new version. You want to ensure that no data is lost during the update. What should you do? A. Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to the existing job name B. Update the Cloud Dataflow pipeline inflight by passing the --update option with the --jobName set to a new unique job name C. Stop the Cloud Dataflow pipeline with the Cancel option. Create a new Cloud Dataflow job with the updated code D. Stop the Cloud Dataflow pipeline with the Drain option. Create a new Cloud Dataflow job with the updated code

D. Stop the Cloud Dataflow pipeline with the Drain option. Create a new Cloud Dataflow job with the updated code Reason: Stop job with drain will lose any data.

You need to move 2 PB of historical data from an on-premises storage appliance to Cloud Storage within six months, and your outbound network capacity is constrained to 20 Mb/sec. How should you migrate this data to Cloud Storage? A. Use Transfer Appliance to copy the data to Cloud Storage B. Use gsutil cp ""J to compress the content being uploaded to Cloud Storage C. Create a private URL for the historical data, and then use Storage Transfer Service to copy the data to Cloud Storage D. Use trickle or ionice along with gsutil cp to limit the amount of bandwidth gsutil utilizes to less than 20 Mb/sec so it does not interfere with the production traffic

A. Use Transfer Appliance to copy the data to Cloud Storage | Reason: Transfer Appliance to copy from on-premises

You receive data files in CSV format monthly from a third party. You need to cleanse this data, but every third month the schema of the files changes. Your requirements for implementing these transformations include: ✑ Executing the transformations on a schedule ✑ Enabling non-developer analysts to modify transformations ✑ Providing a graphical tool for designing transformations What should you do? A. Use Cloud Dataprep to build and maintain the transformation recipes, and execute them on a scheduled basis B. Load each month's CSV data into BigQuery, and write a SQL query to transform the data to a standard schema. Merge the transformed tables together with a SQL query C. Help the analysts write a Cloud Dataflow pipeline in Python to perform the transformation. The Python code should be stored in a revision control system and modified as the incoming data's schema changes D. Use Apache Spark on Cloud Dataproc to infer the schema of the CSV file before creating a Dataframe. Then implement the transformations in Spark SQL before writing the data out to Cloud Storage and loading into BigQuery

A. Use Cloud Dataprep to build and maintain the transformation recipes, and execute them on a scheduled basis Reason: Dataprep is graphical

You want to migrate an on-premises Hadoop system to Cloud Dataproc. Hive is the primary tool in use, and the data format is Optimized Row Columnar (ORC). All ORC files have been successfully copied to a Cloud Storage bucket. You need to replicate some data to the cluster's local Hadoop Distributed File System (HDFS) to maximize performance. What are two ways to start using Hive in Cloud Dataproc? (Choose two.) A. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to HDFS. Mount the Hive tables locally. B. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to any node of the Dataproc cluster. Mount the Hive tables locally. C. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to the master node of the Dataproc cluster. Then run the Hadoop utility to copy them do HDFS. Mount the Hive tables from HDFS. D. Leverage Cloud Storage connector for Hadoop to mount the ORC files as external Hive tables. Replicate external Hive tables to the native ones. E. Load the ORC files into BigQuery. Leverage BigQuery connector for Hadoop to mount the BigQuery tables as external Hive tables. Replicate external Hive tables to the native ones.

B. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to any node of the Dataproc cluster. Mount the Hive tables locally. C. Run the gsutil utility to transfer all ORC files from the Cloud Storage bucket to the master node of the Dataproc cluster. Then run the Hadoop utility to copy them do HDFS. Mount the Hive tables from HDFS. Reason: Need gsutil to copy files. A is wrong since data needs to copy to node.

``` You are implementing several batch jobs that must be executed on a schedule. These jobs have many interdependent steps that must be executed in a specific order. Portions of the jobs involve executing shell scripts, running Hadoop jobs, and running queries in BigQuery. The jobs are expected to run for many minutes up to several hours. If the steps fail, they must be retried a fixed number of times. Which service should you use to manage the execution of these jobs? A. Cloud Scheduler B. Cloud Dataflow C. Cloud Functions D. Cloud Composer ```

D. Cloud Composer

You work for a shipping company that has distribution centers where packages move on delivery lines to route them properly. The company wants to add cameras to the delivery lines to detect and track any visual damage to the packages in transit. You need to create a way to automate the detection of damaged packages and flag them for human review in real time while the packages are in transit. Which solution should you choose? A. Use BigQuery machine learning to be able to train the model at scale, so you can analyze the packages in batches. B. Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications. C. Use the Cloud Vision API to detect for damage, and raise an alert through Cloud Functions. Integrate the package tracking applications with this function. D. Use TensorFlow to create a model that is trained on your corpus of images. Create a Python notebook in Cloud Datalab that uses this model so you can analyze for damaged packages.

B. Train an AutoML model on your corpus of images, and build an API around that model to integrate with the package tracking applications. Reason: This is custom so needs AutoML. C is wrong because Cloud Vision cannot do that.

You are migrating your data warehouse to BigQuery. You have migrated all of your data into tables in a dataset. Multiple users from your organization will be using the data. They should only see certain tables based on their team membership. How should you set user permissions? A. Assign the users/groups data viewer access at the table level for each table B. Create SQL views for each team in the same dataset in which the data resides, and assign the users/groups data viewer access to the SQL views C. Create authorized views for each team in the same dataset in which the data resides, and assign the users/groups data viewer access to the authorized views D. Create authorized views for each team in datasets created for each team. Assign the authorized views data viewer access to the dataset in which the data resides. Assign the users/groups data viewer access to the datasets in which the authorized views reside

D. Create authorized views for each team in datasets created for each team. Assign the authorized views data viewer access to the dataset in which the data resides. Assign the users/groups data viewer access to the datasets in which the authorized views reside Reason: Easier to manage as each team has their own dataset. Although A is possible now, it requires more bindings for each group and table. C works but also needs more bindings

You want to build a managed Hadoop system as your data lake. The data transformation process is composed of a series of Hadoop jobs executed in sequence. To accomplish the design of separating storage from compute, you decided to use the Cloud Storage connector to store all input data, output data, and intermediary data. However, you noticed that one Hadoop job runs very slowly with Cloud Dataproc, when compared with the on-premises bare-metal Hadoop environment (8-core nodes with 100-GB RAM). Analysis shows that this particular Hadoop job is disk I/O intensive. You want to resolve the issue. What should you do? A. Allocate sufficient memory to the Hadoop cluster, so that the intermediary data of that particular Hadoop job can be held in memory B. Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS C. Allocate more CPU cores of the virtual machine instances of the Hadoop cluster so that the networking bandwidth for each instance can scale up D. Allocate additional network interface card (NIC), and configure link aggregation in the operating system to use the combined throughput when working with Cloud Storage

B. Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS Reason: Use native HDFS to avoid disk I/O