Exam - 6 Flashcards
You work for a manufacturing plant that batches application log files together into a single log file once a day at 2:00 AM. You have written a Google Cloud
Dataflow job to process that log file. You need to make sure the log file is processed once per day as inexpensively as possible. What should you do?
A. Change the processing job to use Google Cloud Dataproc instead.
B. Manually start the Cloud Dataflow job each morning when you get into the office.
C. Create a cron job with Google App Engine Cron Service to run the Cloud Dataflow job.
D. Configure the Cloud Dataflow job as a streaming job so that it processes the log data immediately.
Answer: C
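For reference, the scheduled approach in option C amounts to an App Engine cron entry hitting a handler that launches the batch Dataflow job. A minimal sketch, assuming the job has already been staged as a Dataflow template at a hypothetical gs:// path and that the project, bucket, and job names below are placeholders:

```python
# Hypothetical handler invoked by an App Engine cron.yaml entry such as:
#   cron:
#   - description: daily log processing
#     url: /run-dataflow
#     schedule: every day 02:05
from googleapiclient.discovery import build


def run_daily_dataflow_job():
    """Launches a staged Dataflow template; all names and paths are placeholders."""
    dataflow = build("dataflow", "v1b3")
    request = dataflow.projects().locations().templates().launch(
        projectId="example-project",
        location="us-central1",
        gcsPath="gs://example-bucket/templates/daily-log-template",
        body={
            "jobName": "daily-log-processing",
            "parameters": {"inputFile": "gs://example-bucket/logs/batched.log"},
        },
    )
    return request.execute()
```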
You are building a real-time prediction engine that streams files, which may contain PII (personally identifiable information) data, into Cloud Storage and eventually
into BigQuery. You want to ensure that the sensitive data is masked but still maintains referential integrity, because names and emails are often used as join keys.
How should you use the Cloud Data Loss Prevention API (DLP API) to ensure that the PII data is not accessible by unauthorized individuals?
A. Create a pseudonym by replacing the PII data with cryptographic tokens, and store the non-tokenized data in a locked-down bucket.
B. Redact all PII data, and store a version of the unredacted data in a locked-down bucket.
C. Scan every table in BigQuery, and mask the data it finds that has PII.
D. Create a pseudonym by replacing PII data with a cryptographic format-preserving token.
Answer: A
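The tokenization options rely on the DLP API's de-identification transforms. A minimal sketch, using a hypothetical project and a transient key (a real deployment would use a KMS-wrapped key), that replaces names and emails with deterministic cryptographic tokens so equal inputs still produce equal tokens and joins keep working:

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/example-project/locations/global"  # placeholder project

inspect_config = {"info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PERSON_NAME"}]}

deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PERSON_NAME"}],
                "primitive_transformation": {
                    "crypto_deterministic_config": {
                        # Transient key for illustration only; production would use kms_wrapped.
                        "crypto_key": {"transient": {"name": "example-transient-key"}},
                        "surrogate_info_type": {"name": "PII_TOKEN"},
                    }
                },
            }
        ]
    }
}

item = {"value": "Contact jane.doe@example.com about the late shipment."}

response = dlp.deidentify_content(
    request={
        "parent": parent,
        "deidentify_config": deidentify_config,
        "inspect_config": inspect_config,
        "item": item,
    }
)
print(response.item.value)  # email replaced by a deterministic token
```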
An online brokerage company requires a high volume trade processing architecture. You need to create a secure queuing system that triggers jobs. The jobs will
run in Google Cloud and call the company's Python API to execute trades. You need to efficiently implement a solution. What should you do?
A. Use Cloud Composer to subscribe to a Pub/Sub topic and call the Python API.
B. Use a Pub/Sub push subscription to trigger a Cloud Function to pass the data to the Python API.
C. Write an application that makes a queue in a NoSQL database.
D. Write an application hosted on a Compute Engine instance that makes a push subscription to the Pub/Sub topic.
Answer: C
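For reference only, the Pub/Sub-triggered path described in option B would look roughly like the Cloud Function below; `execute_trade` is a hypothetical stand-in for the company's internal Python trading API:

```python
import base64
import json


def execute_trade(order):
    # Placeholder for a call into the company's internal Python trading API.
    print(f"Executing trade: {order}")


def process_trade(event, context):
    """Cloud Function (1st gen) entry point triggered by a Pub/Sub message.

    `event["data"]` carries the base64-encoded trade order published to the topic.
    """
    order = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    execute_trade(order)
```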
Your company is migrating its on-premises data warehousing solution to BigQuery. The existing data warehouse uses trigger-based change data capture (CDC) to
apply daily updates from transactional database sources. Your company wants to use BigQuery to improve its handling of CDC and to optimize the performance of
the data warehouse. Source system changes must be available for query in near-real time using log-based CDC streams. You need to ensure that changes in the
BigQuery reporting table are available with minimal latency and reduced overhead. What should you do? Choose 2 answers.
A. Perform a DML INSERT, UPDATE, or DELETE to replicate each CDC record in the reporting table in real time.
B. Periodically DELETE outdated records from the reporting table.
C. Periodically use a DML MERGE to simultaneously perform DML INSERT, UPDATE, and DELETE operations in the reporting table.
D. Insert each new CDC record and corresponding operation type into a staging table in real time.
E. Insert each new CDC record and corresponding operation type into the reporting table in real time, and use a materialized view to expose only the current version of each unique record.
Answer: BD
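For reference, the staging-table-plus-MERGE pattern referenced in the options can be sketched with the BigQuery client; the dataset, table, and column names below are illustrative only:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical layout: CDC rows land in `cdc_staging` in real time, and a
# periodic MERGE folds them into the reporting table `orders_reporting`.
merge_sql = """
MERGE `example_dw.orders_reporting` AS r
USING `example_dw.cdc_staging` AS s
ON r.order_id = s.order_id
WHEN MATCHED AND s.operation = 'DELETE' THEN
  DELETE
WHEN MATCHED AND s.operation = 'UPDATE' THEN
  UPDATE SET status = s.status, updated_at = s.updated_at
WHEN NOT MATCHED AND s.operation = 'INSERT' THEN
  INSERT (order_id, status, updated_at) VALUES (s.order_id, s.status, s.updated_at)
"""

client.query(merge_sql).result()  # run on a schedule, e.g. every few minutes
```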
You need to give new website users a globally unique identifier (GUID) using a service that takes in data points and returns a GUID. This data is sourced from both
internal and external systems via HTTP calls that you will make via microservices within your pipeline. There will be tens of thousands of messages per second, and
they can be multithreaded, and you worry about the backpressure on the system. How should you design your pipeline to minimize that backpressure?
A. Call out to the service via HTTP
B. Create the pipeline statically in the class definition
C. Create a new object in the startBundle method of DoFn
D. Batch the job into ten-second increments
Answer: A
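As a rough Apache Beam (Python) sketch of calling the GUID service over HTTP from inside the pipeline: the service URL is a placeholder, and the HTTP session is created once per DoFn instance in setup() (Beam Python's counterpart to the Java startBundle/setup hooks mentioned in option C) so connections are reused across elements rather than re-established per call:

```python
import apache_beam as beam
import requests


class AttachGuid(beam.DoFn):
    """Attach a GUID from an external HTTP service to each element."""

    def setup(self):
        # One session per DoFn instance: TCP connections are reused across
        # elements, reducing per-call overhead and backpressure.
        self.session = requests.Session()

    def process(self, element):
        # Placeholder endpoint for the GUID service.
        resp = self.session.post(
            "https://guid-service.example.com/guid", json=element, timeout=5
        )
        resp.raise_for_status()
        yield {**element, "guid": resp.json()["guid"]}
```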
As your organization expands its usage of GCP, many teams have started to create their own projects. Projects are further multiplied to accommodate different
stages of deployments and target audiences. Each project requires unique access control configurations. The central IT team needs to have access to all projects.
Furthermore, data from Cloud Storage buckets and BigQuery datasets must be shared for use in other projects in an ad hoc way. You want to simplify access
control management by minimizing the number of policies. Which two steps should you take? Choose 2 answers.
A. Use Cloud Deployment Manager to automate access provision.
B. Introduce resource hierarchy to leverage access control policy inheritance.
C. Create distinct groups for various teams, and specify groups in Cloud IAM policies.
D. Only use service accounts when sharing data for Cloud Storage buckets and BigQuery datasets.
E. For each Cloud Storage bucket or BigQuery dataset, decide which projects need access. Find all the active members who have access to these projects, and create a Cloud IAM policy to grant access to all these users.
Answer: AC
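Granting roles to groups rather than to individual users (option C) is what keeps the number of IAM policies small. A minimal sketch with the Cloud Storage client, using placeholder bucket and group names:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-shared-analytics")  # placeholder bucket

# Bind a role to a Google group instead of enumerating individual users.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append(
    {
        "role": "roles/storage.objectViewer",
        "members": {"group:data-analysts@example.com"},  # placeholder group
    }
)
bucket.set_iam_policy(policy)
```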
You want to optimize your queries for cost and performance. How should you structure your data?
A. Partition table data by create_date, location_id, and device_version.
B. Partition table data by create_date; cluster table data by location_id and device_version.
C. Cluster table data by create_date, location_id, and device_version.
D. Cluster table data by create_date; partition by location_id and device_version.
Answer: B
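Partitioning by a date column and clustering by the remaining filter columns is expressed directly in DDL; the schema below is a placeholder built from the column names in the options:

```python
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE TABLE `example_dataset.device_events`
(
  create_date TIMESTAMP,
  location_id STRING,
  device_version STRING,
  reading FLOAT64
)
PARTITION BY DATE(create_date)
CLUSTER BY location_id, device_version
"""

client.query(ddl).result()
```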
You’ve migrated a Hadoop job from an on-premises cluster to Dataproc and Cloud Storage. Your Spark job is a complex analytical workload that consists of many
shuffling operations, and the initial data are Parquet files (on average 200-400 MB in size each). You see some degradation in performance after the migration to
Dataproc, so you’d like to optimize for it. Your organization is very cost-sensitive, so you’d like to continue using Dataproc on preemptibles (with only 2 non-preemptible
workers) for this workload. What should you do?
A. Switch from HDDs to SSDs; override the preemptible VMs' configuration to increase the boot disk size.
B. Increase the size of your Parquet files to ensure they are at least 1 GB each.
C. Switch to TFRecords format (approximately 200 MB per file) instead of Parquet files.
D. Switch from HDDs to SSDs, copy initial data from Cloud Storage to Hadoop Distributed File System (HDFS), run the Spark job, and copy results back to Cloud Storage.
Answer: A
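Overriding the boot disks to SSDs for both the primary and the preemptible (secondary) workers is done in the cluster config. A rough sketch with the Dataproc Python client, using placeholder project, region, machine, and sizing values:

```python
from google.cloud import dataproc_v1

project_id = "example-project"
region = "europe-west4"

cluster = {
    "project_id": project_id,
    "cluster_name": "spark-shuffle-cluster",
    "config": {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-standard-8",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
        # The 2 non-preemptible workers.
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n1-standard-8",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
        # Preemptible workers with larger SSD boot disks for shuffle spill.
        "secondary_worker_config": {
            "num_instances": 8,
            "is_preemptible": True,
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
    },
}

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
operation.result()
```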
What are two of the characteristics of using online prediction rather than batch prediction?
A. It is optimized to handle a high volume of data instances in a job and to run more complex models.
B. Predictions are returned in the response message.
C. Predictions are written to output files in a Cloud Storage location that you specify.
D. It is optimized to minimize the latency of serving predictions
Answer: BD
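Options B and D describe online prediction's behavior: results come back synchronously in the response message, and latency is the optimization target. A minimal sketch against the AI Platform Prediction REST API, with placeholder project, model, and feature names:

```python
from googleapiclient.discovery import build

service = build("ml", "v1")
name = "projects/example-project/models/example_model"  # placeholders

response = service.projects().predict(
    name=name,
    body={"instances": [{"feature_a": 1.0, "feature_b": 0.3}]},
).execute()

# Online prediction returns predictions directly in the response message.
print(response["predictions"])
```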
You are running a pipeline in Cloud Dataflow that receives messages from a Cloud Pub/Sub topic and writes the results to a BigQuery dataset in the EU. Currently,
your pipeline is located in europe-west4 and has a maximum of 3 workers, instance type n1-standard-1. You notice that during peak periods, your pipeline is
struggling to process records in a timely fashion, when all 3 workers are at maximum CPU utilization. Which two actions can you take to increase performance of
your pipeline? (Choose two.)
A. Increase the number of max workers.
B. Use a larger instance type for your Cloud Dataflow workers.
C. Change the zone of your Cloud Dataflow pipeline to run in us-central1.
D. Create a temporary table in Cloud Bigtable that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Bigtable to BigQuery.
E. Create a temporary table in Cloud Spanner that will act as a buffer for new data. Create a new step in your pipeline to write to this table first, and then create a new pipeline to write from Cloud Spanner to BigQuery.
Answer: BE
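For reference, both the worker-count ceiling (option A) and the worker machine type (option B) are set through Dataflow pipeline options. A minimal Apache Beam sketch with placeholder project and bucket values:

```python
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=example-project",
    "--region=europe-west4",
    "--temp_location=gs://example-bucket/tmp",
    "--streaming",
    # Option A: allow autoscaling beyond 3 workers.
    "--max_num_workers=10",
    # Option B: use a larger machine type per worker.
    "--worker_machine_type=n1-standard-4",
])
```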