Google Cloud Data Engineer Practice Exam Flashcards
You are designing a streaming pipeline for ingesting player interaction data for a mobile game. You want the pipeline to handle out-of-order data delayed up to 15 minutes on a per-player basis and exponential growth in global users. What should you do?
A. Design a Dataflow streaming pipeline with session windowing and a minimum gap duration of 15 minutes. Use “individual player” as the key. Use Pub/Sub as a message bus for ingestion.
B. Design a Dataflow streaming pipeline with session windowing and a minimum gap duration of 15 minutes. Use “individual player” as the key. Use Apache Kafka as a message bus for ingestion.
C. Design a Dataflow streaming pipeline with a single global window of 15 minutes. Use Pub/Sub as a message bus for ingestion.
D. Design a Dataflow streaming pipeline with a single global window of 15 minutes. Use Apache Kafka as a message bus for ingestion.
Feedback
A is correct because the requirement is to handle out-of-order delays on a per-player basis, which session windowing keyed by individual player provides. Pub/Sub scales automatically to absorb exponential growth in traffic coming from around the globe.
B is not correct because a self-managed Apache Kafka cluster will not handle exponential global growth in users as readily as Pub/Sub.
C is not correct because a global window does not meet the requirements of handling out-of-order delay on a per-player basis.
D is not correct because a global window does not meet the requirements of handling out-of-order delay on a per-player basis.
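A minimal Apache Beam (Python) sketch of option A, assuming hypothetical topic and field names such as player-events and player_id: events are keyed by player and grouped into session windows with a 15-minute gap, so out-of-order data delayed up to 15 minutes still lands in the right per-player session.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import window

# Hypothetical topic name for illustration.
TOPIC = "projects/my-project/topics/player-events"

def run():
    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=TOPIC)
            | "Parse" >> beam.Map(json.loads)
            # Key each event by the individual player so windows are per-player.
            | "KeyByPlayer" >> beam.Map(lambda e: (e["player_id"], e))
            # Session windows with a 15-minute minimum gap duration.
            | "SessionWindow" >> beam.WindowInto(window.Sessions(15 * 60))
            | "EventsPerSession" >> beam.combiners.Count.PerKey()
            | "Print" >> beam.Map(print)
        )

if __name__ == "__main__":
    run()
```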
You are building storage for files for a data pipeline on Google Cloud. You want to support JSON files. The schema of these files will occasionally change. Your analyst teams will use running aggregate ANSI SQL queries on this data. What should you do?
A. Use BigQuery for storage. Provide format files for data load. Update the format files as needed.
B. Use BigQuery for storage. Select “Automatically detect” in the Schema section.
C. Use Cloud Storage for storage. Link data as temporary tables in BigQuery and turn on the “Automatically detect” option in the Schema section of BigQuery.
D. Use Cloud Storage for storage. Link data as permanent tables in BigQuery and turn on the “Automatically detect” option in the Schema section of BigQuery.
Correct answer
B. Use BigQuery for storage. Select “Automatically detect” in the Schema section.
Feedback
A is not correct because you do not need to maintain separate format files: turning on the ‘Automatically detect’ schema option is sufficient.
B is correct because of the requirement to support occasionally (schema) changing JSON files and aggregate ANSI SQL queries: you need to use BigQuery, and it is quickest to use ‘Automatically detect’ for schema changes.
C, D are not correct because you should not use Cloud Storage for this scenario: it is cumbersome and doesn’t add value.
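A hedged sketch of option B with the BigQuery Python client, using hypothetical table and bucket names: the load job enables schema auto-detection and allows field additions so the occasionally changing JSON schema is picked up, and analysts can then run ANSI SQL directly against the table.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and source URI for illustration.
table_id = "my-project.analytics.events"
uri = "gs://my-bucket/events/*.json"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # the "Automatically detect" schema option
    # Tolerate the occasional schema change by allowing new fields.
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish
print(client.get_table(table_id).num_rows, "rows loaded")
```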
You use a Hadoop cluster both for serving analytics and for processing and transforming data. The data is currently stored on HDFS in Parquet format. The data processing jobs run for 6 hours each night. Analytics users can access the system 24 hours a day. Phase 1 is to quickly migrate the entire Hadoop environment without a major re-architecture. Phase 2 will include migrating to BigQuery for analytics and to Dataflow for data processing. You want to make the future migration to BigQuery and Dataflow easier by following Google-recommended practices and managed services. What should you do?
A. Lift and shift Hadoop/HDFS to Dataproc.
B. Lift and shift Hadoop/HDFS to Compute Engine.
C. Create a single Dataproc cluster to support both analytics and data processing, and point it at a Cloud Storage bucket that contains the Parquet files that were previously stored on HDFS.
D. Create separate Dataproc clusters to support analytics and data processing, and point both at the same Cloud Storage bucket that contains the Parquet files that were previously stored on HDFS.
Correct answer
D. Create separate Dataproc clusters to support analytics and data processing, and point both at the same Cloud Storage bucket that contains the Parquet files that were previously stored on HDFS.
Feedback
A is not correct because it is not recommended to attach persistent HDFS to Dataproc clusters in Google Cloud (see the reference link below).
B is not correct because the goal is to leverage managed services, which means Dataproc rather than self-managed Hadoop on Compute Engine.
C is not correct because it is recommended that Dataproc clusters be job specific.
D is correct because it leverages a managed service (Dataproc), the data is stored on Cloud Storage in Parquet format, which can easily be loaded into BigQuery in the future, and the Dataproc clusters are job specific.
https://cloud.google.com/solutions/migration/hadoop/hadoop-gcp-migration-jobs
You are building a new real-time data warehouse for your company and will use BigQuery streaming inserts. There is no guarantee that data will only be sent in once but you do have a unique ID for each row of data and an event timestamp. You want to ensure that duplicates are not included while interactively querying data. Which query type should you use?
A. Include ORDER BY DESC on timestamp column and LIMIT to 1.
B. Use GROUP BY on the unique ID column and timestamp column and SUM on the values.
C. Use the LAG window function with PARTITION by unique ID along with WHERE LAG IS NOT NULL.
D. Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.
Correct answer
D. Use the ROW_NUMBER window function with PARTITION by unique ID along with WHERE row equals 1.
Feedback
A is not correct because this will just return one row.
B is not correct because GROUP BY with SUM does not return the latest value; it sums the same event over time, which gives a wrong result whenever duplicates exist.
C is not correct because rows that have no duplicate return NULL from LAG and would be excluded by the WHERE clause.
D is correct because it will just pick out a single row for each set of duplicates.
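A sketch of the option D query pattern, run here through the BigQuery Python client with hypothetical table and column names (unique_id, event_timestamp): ROW_NUMBER partitioned by the unique ID keeps exactly one row per logical event.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and column names for illustration.
query = """
SELECT * EXCEPT(row_num)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY unique_id          -- one partition per logical event
      ORDER BY event_timestamp DESC   -- keep the most recent copy
    ) AS row_num
  FROM `my-project.realtime.events`
)
WHERE row_num = 1                     -- drop every duplicate after the first
"""

for row in client.query(query).result():
    print(dict(row))
```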
Your company is loading comma-separated values (CSV) files into BigQuery. The data is fully imported successfully; however, the imported data is not matching byte-to-byte to the source file. What is the most likely cause of this problem?
A. The CSV data loaded in BigQuery is not flagged as CSV.
B. The CSV data had invalid rows that were skipped on import.
C. The CSV data loaded in BigQuery is not using BigQuery’s default encoding.
D. The CSV data has not gone through an ETL phase before loading into BigQuery.
Correct answer
C. The CSV data loaded in BigQuery is not using BigQuery’s default encoding.
Feedback
A is not correct because if a format other than CSV had been selected, the data would not have imported successfully.
B is not correct because the data was fully imported meaning no rows were skipped.
C is correct because a non-default encoding is the only listed cause that lets the import succeed while the stored data no longer matches the source file byte for byte.
D is not correct because whether the data has been previously transformed will not affect whether the source file will match the BigQuery table.
https://cloud.google.com
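A hedged sketch showing where the encoding is declared at load time (hypothetical table and URI): BigQuery assumes UTF-8 unless told otherwise, so a source file in, say, ISO-8859-1 imports successfully but no longer matches byte for byte unless the encoding is set.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and source URI for illustration.
table_id = "my-project.staging.imported_csv"
uri = "gs://my-bucket/exports/data.csv"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    # BigQuery decodes CSV as UTF-8 by default; declare the file's real
    # encoding so characters are not silently converted on import.
    encoding="ISO-8859-1",
)

client.load_table_from_uri(uri, table_id, job_config=job_config).result()
```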
Your company is migrating their 30-node Apache Hadoop cluster to the cloud. They want to re-use Hadoop jobs they have already created and minimize the management of the cluster as much as possible. They also want to be able to persist data beyond the life of the cluster. What should you do?
A. Create a Dataflow job to process the data.
B. Create a Dataproc cluster that uses persistent disks for HDFS.
C. Create a Hadoop cluster on Compute Engine that uses persistent disks.
D. Create a Dataproc cluster that uses the Cloud Storage connector.
E. Create a Hadoop cluster on Compute Engine that uses Local SSD disks.
Feedback
A is not correct because the goal is to re-use their Hadoop jobs and MapReduce and/or Spark jobs cannot simply be moved to Dataflow.
B is not correct because the goal is to persist the data beyond the life of the ephemeral clusters, and if HDFS is used as the primary attached storage mechanism, it will also disappear at the end of the cluster’s life.
C is not correct because the goal is to use managed services as much as possible, and this is the opposite.
D is correct because it uses managed services, and also allows for the data to persist on Cloud Storage beyond the life of the cluster.
E is not correct because the goal is to use managed services as much as possible, and this is the opposite.
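A minimal PySpark sketch of option D with a hypothetical bucket name: on Dataproc the Cloud Storage connector is preinstalled, so existing jobs swap hdfs:// paths for gs:// paths and the data outlives the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-transform").getOrCreate()

# Hypothetical bucket and paths for illustration; gs:// replaces hdfs://.
events = spark.read.parquet("gs://my-bucket/raw/events/")

daily = events.groupBy("event_date", "event_type").count()

# The output persists on Cloud Storage after the cluster is deleted.
daily.write.mode("overwrite").parquet("gs://my-bucket/curated/daily_counts/")
```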
You have 250,000 devices which produce a JSON device status event every 10 seconds. You want to capture this event data for outlier time series analysis. What should you do?
A. Ship the data into BigQuery. Develop a custom application that uses the BigQuery API to query the dataset and displays device outlier data based on your business requirements.
B. Ship the data into BigQuery. Use the BigQuery console to query the dataset and display device outlier data based on your business requirements.
C. Ship the data into Cloud Bigtable. Use the Cloud Bigtable cbt tool to display device outlier data based on your business requirements.
D. Ship the data into Cloud Bigtable. Install and use the HBase shell for Cloud Bigtable to query the table for device outlier data based on your business requirements.
Correct answer
C. Ship the data into Cloud Bigtable. Use the Cloud Bigtable cbt tool to display device outlier data based on your business requirements.
Feedback
A & B are not correct because you do not need to use BigQuery for the query pattern in this scenario.
C is correct because the data type, volume, and query pattern best fits BigTable capabilities and also Google best practices as linked below.
D is not correct because you can use the simpler method of ‘cbt tool’ to support this scenario.
https://cloud.google.com/bigtable/docs/go/cbt-overview
https://cloud.google.com/big
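A hedged sketch of the write side of option C using the google-cloud-bigtable Python client, with hypothetical instance, table, and column family names; the device#reversed_timestamp row key is one common time-series pattern that keeps a device's recent events together without hotspotting. Reads for the outlier analysis could then go through the cbt tool.

```python
import time
from google.cloud import bigtable

# Hypothetical project, instance, table, and column family names.
client = bigtable.Client(project="my-project")
table = client.instance("device-telemetry").table("device_status")

def write_status(device_id: str, status_json: bytes) -> None:
    # Reverse the timestamp so newer events sort first within a device.
    reversed_ts = 2**63 - int(time.time() * 1000)
    row_key = f"{device_id}#{reversed_ts}".encode()
    row = table.direct_row(row_key)
    row.set_cell("status", "payload", status_json)
    row.commit()

write_status("device-000123", b'{"temp": 71, "battery": 0.82}')
```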
You are selecting a messaging service for log messages that must include final result message ordering as part of building a data pipeline on Google Cloud. You want to stream input for 5 days and be able to query the current status. You will be storing the data in a searchable repository. How should you set up the input messages?
A. Use Pub/Sub for input. Attach a timestamp to every message in the publisher.
B. Use Pub/Sub for input. Attach a unique identifier to every message in the publisher.
C. Use Apache Kafka on Compute Engine for input. Attach a timestamp to every message in the publisher.
D. Use Apache Kafka on Compute Engine for input. Attach a unique identifier to every message in the publisher.
Feedback
A is correct because attaching a publish-time timestamp to every message lets you reconstruct the final result ordering downstream, which is the Google-recommended practice; see the links below.
B is not correct because a unique identifier helps with deduplication, not with establishing message order.
C & D are not correct because you should not use Apache Kafka for this scenario (it is overly complex compared to using Pub/Sub, which can support all of the requirements).
https://cloud.google.com/pubsub/docs/ordering
http://www.jesse-anderson
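A sketch of option A with the Pub/Sub Python client and hypothetical project, topic, and attribute names: the publisher attaches a timestamp attribute to every message, and the consumer uses it to reconstruct the final result ordering before writing to the searchable repository.

```python
import json
import time
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names for illustration.
topic_path = publisher.topic_path("my-project", "log-messages")

def publish_log(message: dict) -> None:
    data = json.dumps(message).encode("utf-8")
    # Pub/Sub does not guarantee delivery order, so the publisher-side
    # timestamp attribute is what downstream ordering is based on.
    future = publisher.publish(
        topic_path,
        data,
        event_timestamp=str(int(time.time() * 1000)),
    )
    future.result()  # block until the message is accepted

publish_log({"severity": "INFO", "text": "job started"})
```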
You want to publish system metrics to Google Cloud from a large number of on-prem hypervisors and VMs for analysis and creation of dashboards. You have an existing custom monitoring agent deployed to all the hypervisors, and your on-prem metrics system is unable to handle the load. You want to design a system that can collect and store metrics at scale. You don’t want to manage your own time series database. Metrics from all agents should be written to the same table but agents must not have permission to modify or read data written by other agents. What should you do?
A. Modify the monitoring agent to publish protobuf messages to Pub/Sub. Use a Dataproc cluster or Dataflow job to consume messages from Pub/Sub and write to Bigtable.
B. Modify the monitoring agent to write protobuf messages directly to Bigtable.
C. Modify the monitoring agent to write protobuf messages to HBase deployed on Compute Engine VM instances.
D. Modify the monitoring agent to write protobuf messages to Pub/Sub. Use a Dataproc cluster or Dataflow job to consume messages from Pub/Sub and write to Cassandra deployed on Compute Engine VM instances.
Feedback
A is correct because Bigtable can store and analyze time-series data, and the solution uses managed services, which is what the requirements call for.
B is not correct because agents writing directly to Bigtable would need access to the table, and Bigtable cannot stop an agent from reading or modifying data written by other agents.
C is not correct because it requires deployment of an HBase cluster.
D is not correct because it requires deployment of a Cassandra cluster.
You are designing storage for CSV files and using an I/O-intensive custom Apache Spark transform as part of deploying a data pipeline on Google Cloud. You intend to use ANSI SQL to run queries for your analysts. How should you transform the input data?
A. Use BigQuery for storage. Use Dataflow to run the transformations.
B. Use BigQuery for storage. Use Dataproc to run the transformations.
C. Use Cloud Storage for storage. Use Dataflow to run the transformations.
D. Use Cloud Storage for storage. Use Dataproc to run the transformations.
Correct answer
B. Use BigQuery for storage. Use Dataproc to run the transformations.
Feedback
A is not correct because Dataflow does not run custom Apache Spark transforms.
B is correct because the custom Spark transforms require Dataproc, and ANSI SQL queries for the analysts require BigQuery.
C & D are not correct because Cloud Storage does not support SQL, and you should not use Dataflow, either.
You are designing a relational data repository on Google Cloud to grow as needed. The data will be transactionally consistent and added from any location in the world. You want to monitor and adjust node count for input traffic, which can spike unpredictably. What should you do?
A. Use Cloud Spanner for storage. Monitor storage usage and increase node count if more than 70% utilized.
B. Use Cloud Spanner for storage. Monitor CPU utilization and increase node count if more than 70% utilized for your time span.
C. Use Cloud Bigtable for storage. Monitor data stored and increase node count if more than 70% utilized.
D. Use Cloud Bigtable for storage. Monitor CPU utilization and increase node count if more than 70% utilized for your time span.
Correct answer
B. Use Cloud Spanner for storage. Monitor CPU utilization and increase node count if more than 70% utilized for your time span.
Feedback
A is not correct because you should not use storage utilization as a scaling metric.
B is correct because the requirement for globally scalable, transactionally consistent storage calls for Cloud Spanner, and CPU utilization is the recommended scaling metric per Google best practices, linked below.
C & D are not correct because you should not use Cloud Bigtable for this scenario.
https://cloud.google.com/bigtable/docs/monitoring-instance
https://cloud.google.com/sp
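A hedged Python sketch of option B with hypothetical project and instance names: it reads the Cloud Spanner CPU utilization metric from Cloud Monitoring and adds a node when the recent maximum exceeds 70%. Exact client signatures may differ by library version.

```python
import time
from google.cloud import monitoring_v3, spanner

# Hypothetical project and instance names for illustration.
PROJECT = "my-project"
INSTANCE_ID = "global-orders"

def recent_cpu_utilization(window_secs: int = 600) -> float:
    """Return the highest Spanner CPU utilization sample (0.0-1.0) in the window."""
    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - window_secs}}
    )
    series = client.list_time_series(
        request={
            "name": f"projects/{PROJECT}",
            "filter": (
                'metric.type="spanner.googleapis.com/instance/cpu/utilization" '
                f'AND resource.labels.instance_id="{INSTANCE_ID}"'
            ),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )
    samples = [point.value.double_value for ts in series for point in ts.points]
    return max(samples, default=0.0)

def scale_up_if_needed(threshold: float = 0.70) -> None:
    if recent_cpu_utilization() > threshold:
        instance = spanner.Client(project=PROJECT).instance(INSTANCE_ID)
        instance.reload()
        instance.node_count += 1  # add capacity while CPU stays above 70%
        instance.update().result()  # wait for the resize to finish

scale_up_if_needed()
```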
You have a Spark application that writes data to Cloud Storage in Parquet format. You scheduled the application to run daily using DataProcSparkOperator and Apache Airflow DAG by Cloud Composer. You want to add tasks to the DAG to make the data available to BigQuery users. You want to maximize query speed and configure partitioning and clustering on the table. What should you do?
A. Use “BashOperator” to call “bq insert”.
B. Use “BashOperator” to call “bq cp” with the “--append” flag.
C. Use “GoogleCloudStorageToBigQueryOperator” with “schema_object” pointing to a schema JSON in Cloud Storage and “source_format” set to “PARQUET”.
D. Use “BigQueryCreateExternalTableOperator” with “schema_object” pointing to a schema JSON in Cloud Storage and “source_format” set to “PARQUET”.
Feedback
A is not correct because bq insert does not set partitioning or clustering and only accepts newline-delimited JSON.
B is not correct because bq cp is for existing BigQuery tables only.
C is correct because it loads the data and sets partitioning and clustering.
D is not correct because an external table will not satisfy the query speed requirement.
https://cloud.google.com/bigquery/docs/loading-data
https://cloud.google.com/bigquery/docs/bq-command-line-tool
https://airflow.incubator.apache.org/integration.html#bigquerycreateemptytableoperator
https://airflow.incubator.apache.org/integration.html#googlecloudstoragetobigqueryoperator
https://cloud.google.com/bigquery/docs/reference/bq-cli-reference
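A hedged Airflow (1.10-era, matching the contrib links above) sketch of option C with hypothetical bucket, dataset, and field names, assuming a version of the operator that exposes time_partitioning and cluster_fields; in the real DAG this task would be set downstream of the existing DataProcSparkOperator task.

```python
from airflow import DAG
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.utils.dates import days_ago

# Hypothetical bucket, dataset, schema object, and field names.
with DAG("spark_to_bq", schedule_interval="@daily", start_date=days_ago(1)) as dag:
    load_parquet = GoogleCloudStorageToBigQueryOperator(
        task_id="load_parquet_to_bq",
        bucket="my-bucket",
        source_objects=["spark-output/dt={{ ds }}/*.parquet"],
        source_format="PARQUET",
        schema_object="schemas/events.json",  # schema JSON stored in Cloud Storage
        destination_project_dataset_table="my-project.analytics.events",
        write_disposition="WRITE_APPEND",
        # Partitioning and clustering maximize query speed for BigQuery users.
        time_partitioning={"type": "DAY", "field": "event_date"},
        cluster_fields=["user_id"],
    )
```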
You have a website that tracks page visits for each user and then creates a Pub/Sub message with the session ID and URL of the page. You want to create a Dataflow pipeline that sums the total number of pages visited by each user and writes the result to BigQuery. User sessions timeout after 30 minutes. Which type of Dataflow window should you choose?
A. A single global window
B. Fixed-time windows with a duration of 30 minutes
C. Session-based windows with a gap duration of 30 minutes
D. Sliding-time windows with a duration of 30 minutes and a new window every 5 minutes
Feedback
A is incorrect because a single global window never closes, so no per-session sum is emitted when a user's session times out.
B is incorrect because fixed 30-minute windows cut across sessions, so a sum can be produced for a user while they are still browsing the site.
C is correct because it continues to sum user page visits during their browsing session and completes at the same time as the session timeout.
D is incorrect because sliding windows just sum arbitrary 30-minute spans of time staggered by 5 minutes; no session-aligned, user-specific sum is calculated.
You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules: (a) no interaction by the user on the site for 1 hour; (b) has added more than $30 worth of products to the basket; (c) has not completed a transaction. You use Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?
A. Use a fixed-time window with a duration of 60 minutes.
B. Use a sliding time window with a duration of 60 minutes.
C. Use a session window with a gap time duration of 60 minutes.
D. Use a global window with a time-based trigger with a delay of 60 minutes.
Feedback
A is not correct because assuming there is one key per user, a message will be sent every 60 minutes.
B is not correct because assuming there is one key per user, a message will be sent 60 minutes after they first started browsing even if they are still browsing.
C is correct because it will send a message per user after that user is inactive for 60 minutes.
D is not correct because it will cause messages to be sent out every 60 minutes to all users regardless of where they are in their current session.
https://beam.apache.org/doc
You need to stream time-series data in Avro format, and then write this to both BigQuery and Cloud Bigtable simultaneously using Dataflow. You want to achieve minimal end-to-end latency. Your business requirements state this needs to be completed as quickly as possible. What should you do?
A. Create a pipeline and use ParDo transform.
B. Create a pipeline that groups the data into a PCollection and uses the Combine transform.
C. Create a pipeline that groups data using a PCollection and then uses Bigtable and BigQueryIO transforms.
D. Create a pipeline that groups data using a PCollection, and then use Avro I/O transform to write to Cloud Storage. After the data is written, load the data from Cloud Storage into BigQuery and Bigtable.
Feedback
A is not correct because a ParDo on its own does not write to BigQuery or Bigtable.
B is not correct because Combine doesn’t write to BigQuery or Bigtable.
C is correct because this is the right set of transformations that accepts and writes to the required data stores.
D is not correct because staging the data in Cloud Storage and loading it afterwards adds latency; writing to both sinks directly from Dataflow (answer C) is faster and simpler.
https://cloud.google.com/b
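A hedged Apache Beam (Python) sketch of answer C with hypothetical project, dataset, instance, and field names, assuming the SDK's WriteToBigQuery and bigtableio.WriteToBigTable transforms: one PCollection branches into both sinks so the writes happen in parallel with no intermediate staging. In the real pipeline the Create step would be ReadFromPubSub plus an Avro decode.

```python
import apache_beam as beam
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from google.cloud.bigtable import row as bt_row

# Hypothetical BigQuery table spec and schema for illustration.
BQ_TABLE = "my-project:telemetry.readings"
BQ_SCHEMA = "sensor_id:STRING,ts:TIMESTAMP,value:FLOAT"

def to_bigtable_row(record):
    # Convert each element into a Bigtable DirectRow for the Bigtable sink.
    direct_row = bt_row.DirectRow(f"{record['sensor_id']}#{record['ts']}".encode())
    direct_row.set_cell("readings", "value", str(record["value"]).encode())
    return direct_row

with beam.Pipeline() as p:
    # Stand-in for ReadFromPubSub plus an Avro decode step.
    readings = p | beam.Create([
        {"sensor_id": "s-1", "ts": "2024-01-01 00:00:00", "value": 20.5},
        {"sensor_id": "s-2", "ts": "2024-01-01 00:00:10", "value": 21.0},
    ])

    # The same PCollection feeds both sinks in parallel.
    readings | "ToBigQuery" >> beam.io.WriteToBigQuery(
        BQ_TABLE,
        schema=BQ_SCHEMA,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )
    (
        readings
        | "ToDirectRows" >> beam.Map(to_bigtable_row)
        | "ToBigtable" >> WriteToBigTable(
            project_id="my-project", instance_id="telemetry", table_id="readings"
        )
    )
```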