Exam Topics 2 Flashcards by Anh Do

You are training a spam classifier. You notice that you are overfitting the training data. Which three actions can you take to resolve this problem? (Choose three.)
A. Get more training examples
B. Reduce the number of training examples
C. Use a smaller set of features
D. Use a larger set of features
E. Increase the regularization parameters
F. Decrease the regularization parameters

ACE

Reason: More training data, fewer features, and stronger regularization all reduce overfitting.
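To see why a larger regularization parameter helps, here is a minimal sketch using one-dimensional ridge regression with no intercept. The data points and the `ridge_weight` helper are made up for illustration; the closed form comes from minimizing the squared error plus `lam * w**2`.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form 1-D ridge weight (no intercept): w = sum(x*y) / (sum(x*x) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Hypothetical training data, roughly y = 2x with noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# A larger regularization parameter pulls the weight toward zero,
# giving a simpler model that is less able to fit noise.
weights = [ridge_weight(xs, ys, lam) for lam in (0.0, 1.0, 10.0)]
```

The weights shrink as `lam` grows, which is the effect option E relies on.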

You are implementing security best practices on your data pipeline. Currently, you are manually executing jobs as the Project Owner. You want to automate these jobs by taking nightly batch files containing non-public information from Google Cloud Storage, processing them with a Spark Scala job on a Google Cloud Dataproc cluster, and depositing the results into Google BigQuery.
How should you securely run this workload?
A. Restrict the Google Cloud Storage bucket so only you can see the files
B. Grant the Project Owner role to a service account, and run the job with it
C. Use a service account with the ability to read the batch files and to write to BigQuery
D. Use a user account with the Project Viewer role on the Cloud Dataproc cluster to read the batch files and write to BigQuery

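C. Use a service account with the ability to read the batch files and to write to BigQuery
Reason: A service account is required, and it should follow least privilege. B would also run, but Project Owner grants far more permissions than the job needs. If the account also needs a Dataproc role, grant that role rather than Project Owner.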

You are using Google BigQuery as your data warehouse. Your users report that the following simple query is running very slowly, no matter when they run the query:
SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP BY country
You check the query plan for the query and see the following output in the Read section of Stage:1:
(A bar with 25% blue, then 75% purple)
What is the most likely cause of the delay for this query?
A. Users are running too many concurrent queries in the system
B. The [myproject:mydataset.mytable] table has too many partitions
C. Either the state or the city columns in the [myproject:mydataset.mytable] table have too many NULL values
D. Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew

D. Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew
Reason: Data skew. About 75% of the stage's time is not spent doing useful work, which points to most rows sharing one country value.
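A toy illustration of skew, assuming rows are routed to workers by their grouping key. This is not how BigQuery schedules work internally; the `rows_per_worker` helper and the country counts are hypothetical.

```python
from collections import Counter


def rows_per_worker(keys, num_workers):
    """Count the rows each worker receives when rows are routed by grouping key."""
    assignment = {}  # grouping key -> worker index (toy stand-in for a shuffle)
    counts = Counter()
    for key in keys:
        worker = assignment.setdefault(key, len(assignment) % num_workers)
        counts[worker] += 1
    return counts


# Hypothetical table in which most rows share the same country value.
keys = ["US"] * 75 + ["CA"] * 10 + ["MX"] * 10 + ["BR"] * 5
counts = rows_per_worker(keys, num_workers=4)
busiest_share = max(counts.values()) / sum(counts.values())
```

One worker ends up with three quarters of the rows while the others sit mostly idle, so the stage is only as fast as its busiest worker.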

Your globally distributed auction application allows users to bid on items. Occasionally, users place identical bids at nearly identical times, and different application servers process those bids. Each bid event contains the item, amount, user, and timestamp. You want to collate those bid events into a single location in real time to determine which user bid first. What should you do?
A. Create a file on a shared file and have the application servers write all bid events to that file. Process the file with Apache Hadoop to identify which user bid first.
B. Have each application server write the bid events to Cloud Pub/Sub as they occur. Push the events from Cloud Pub/Sub to a custom endpoint that writes the bid event information into Cloud SQL.
C. Set up a MySQL database for each application server to write bid events into. Periodically query each of those distributed MySQL databases and update a master MySQL database with bid event information.
D. Have each application server write the bid events to Google Cloud Pub/Sub as they occur. Use a pull subscription to pull the bid events using Google Cloud Dataflow. Give the bid for each item to the user in the bid event that is processed first.

B. Have each application server write the bid events to Cloud Pub/Sub as they occur. Push the events from Cloud Pub/Sub to a custom endpoint that writes the bid event information into Cloud SQL.
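Reason: Pub/Sub is needed to collect events from all application servers. A push subscription delivers events to the endpoint as they arrive, and Cloud SQL lets you order the bids by timestamp to see who bid first. D awards the item by processing order, which need not match the order in which the bids were placed.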
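Once the bid events sit in one place, the first bidder is decided by the event timestamp, not by arrival order. A minimal sketch, assuming each event carries the four fields named in the question; the `Bid` class and `first_bidder` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    item: str
    amount: float
    user: str
    timestamp: float  # event time recorded when the bid was placed


def first_bidder(bids, item, amount):
    """Return the user whose matching bid has the earliest event timestamp."""
    matching = [b for b in bids if b.item == item and b.amount == amount]
    if not matching:
        return None
    return min(matching, key=lambda b: b.timestamp).user


# Events listed in arrival order: bob's bid arrives first but was placed later.
bids = [
    Bid("vase", 100.0, "bob", timestamp=1000.020),
    Bid("vase", 100.0, "alice", timestamp=1000.005),
]
winner = first_bidder(bids, "vase", 100.0)
```

In Cloud SQL the same idea would be a query ordered by the timestamp column.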

Your organization has been collecting and analyzing data in Google BigQuery for 6 months. The majority of the data analyzed is placed in a time-partitioned table named events_partitioned. To reduce the cost of queries, your organization created a view called events, which queries only the last 14 days of data. The view is described in legacy SQL. Next month, existing applications will be connecting to BigQuery to read the events data via an ODBC connection. You need to ensure the applications can connect. Which two actions should you take? (Choose two.)
A. Create a new view over events using standard SQL
B. Create a new partitioned table using a standard SQL query
C. Create a new view over events_partitioned using standard SQL
D. Create a service account for the ODBC connection to use for authentication
E. Create a Google Cloud Identity and Access Management (Cloud IAM) role for the ODBC connection and shared “events”

CD

Reason: A standard SQL view over the original table (C) is needed, plus a service account for the ODBC connection to authenticate with (D).

You have enabled the free integration between Firebase Analytics and Google BigQuery. Firebase now automatically creates a new table daily in BigQuery in the format app_events_YYYYMMDD. You want to query all of the tables for the past 30 days in legacy SQL. What should you do?
A. Use the TABLE_DATE_RANGE function
B. Use the WHERE_PARTITIONTIME pseudo column
C. Use WHERE date BETWEEN YYYY-MM-DD AND YYYY-MM-DD
D. Use SELECT IF.(date >= YYYY-MM-DD AND date <= YYYY-MM-DD

A. Use the TABLE_DATE_RANGE function

Reason: TABLE_DATE_RANGE lets a legacy SQL query span multiple daily tables whose date suffix falls within a given range.
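A sketch of what such a query could look like, built as a string in Python. The dataset name `mydataset`, the `COUNT(*)` projection, and the `last_n_days_query` helper are assumptions for illustration; the `TABLE_DATE_RANGE` and `DATE_ADD` calls follow the legacy SQL documentation.

```python
def last_n_days_query(dataset, prefix, days):
    """Build a legacy SQL query over the daily tables for the past `days` days."""
    return (
        "SELECT COUNT(*) AS events "
        f"FROM TABLE_DATE_RANGE([{dataset}.{prefix}], "
        f"DATE_ADD(CURRENT_TIMESTAMP(), -{days}, 'DAY'), "
        "CURRENT_TIMESTAMP())"
    )


# Hypothetical dataset name; the table prefix comes from the question.
query = last_n_days_query("mydataset", "app_events_", 30)
```

The function matches every table named `app_events_YYYYMMDD` whose suffix falls between the two timestamps.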

Your company is currently setting up data pipelines for their campaign. For all the Google Cloud Pub/Sub streaming data, one of the important business requirements is to be able to periodically identify the inputs and their timings during their campaign. Engineers have decided to use windowing and transformation in Google Cloud Dataflow for this purpose. However, when testing this feature, they find that the Cloud Dataflow job fails for all streaming inserts. What is the most likely cause of this problem?
A. They have not assigned the timestamp, which causes the job to fail
B. They have not set the triggers to accommodate the data coming in late, which causes the job to fail
C. They have not applied a global windowing function, which causes the job to fail when the pipeline is created
D. They have not applied a non-global windowing function, which causes the job to fail when the pipeline is created

D. They have not applied a non-global windowing function, which causes the job to fail when the pipeline is created
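Reason: Beam applies a global window by default, so C is incorrect. Grouping or aggregating an unbounded stream in the global window without a trigger fails when the pipeline is constructed, which is what D describes. A and B would not by themselves make the job fail.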
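A non-global window splits the unbounded stream into finite buckets that can be aggregated. The sketch below shows the idea for fixed windows in plain Python; it is not the Beam API, and the helper names and timestamps are hypothetical.

```python
from collections import defaultdict


def fixed_window_start(timestamp, size):
    """Return the start of the fixed window of `size` seconds containing `timestamp`."""
    return timestamp - (timestamp % size)


def count_per_window(timestamps, size):
    """Count events per fixed window, keyed by window start time."""
    counts = defaultdict(int)
    for ts in timestamps:
        counts[fixed_window_start(ts, size)] += 1
    return dict(counts)


# Event times in seconds, bucketed into 60-second windows.
counts = count_per_window([3, 59, 60, 61, 125], size=60)
```

Each window is finite, so a per-window count can be emitted even though the stream itself never ends.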

You architect a system to analyze seismic data. Your extract, transform, and load (ETL) process runs as a series of MapReduce jobs on an Apache Hadoop cluster. The ETL process takes days to process a data set because some steps are computationally expensive. Then you discover that a sensor calibration step has been omitted. How should you change your ETL process to carry out sensor calibration systematically in the future?
A. Modify the transform MapReduce jobs to apply sensor calibration before they do anything else.
B. Introduce a new MapReduce job to apply sensor calibration to raw data, and ensure all other MapReduce jobs are chained after this.
C. Add sensor calibration data to the output of the ETL process, and document that all users need to apply sensor calibration themselves.
D. Develop an algorithm through simulation to predict variance of data output from the last MapReduce job based on calibration factors, and apply the correction to all data.

B. Introduce a new MapReduce job to apply sensor calibration to raw data, and ensure all other MapReduce jobs are chained after this.
Reason: Cleaner approach than A and doesn’t require changes to existing jobs

An online retailer has built their current application on Google App Engine. A new initiative at the company mandates that they extend their application to allow their customers to transact directly via the application. They need to manage their shopping transactions and analyze combined data from multiple datasets using a business intelligence (BI) tool. They want to use only a single database for this purpose. Which Google Cloud database should they choose?
A. BigQuery
B. Cloud SQL
C. Cloud Bigtable
D. Cloud Datastore

B. Cloud SQL

Reason: Cloud SQL is a transactional database and still supports BI tool connections.

You launched a new gaming app almost three years ago. You have been uploading log files from the previous day to a separate Google BigQuery table with the table name format LOGS_yyyymmdd. You have been using table wildcard functions to generate daily and monthly reports for all time ranges. Recently, you discovered that some queries that cover long date ranges are exceeding the limit of 1,000 tables and failing. How can you resolve this issue?
A. Convert all daily log tables into date-partitioned tables
B. Convert the sharded tables into a single partitioned table
C. Enable query caching so you can cache data from previous months
D. Create separate views to cover each month, and query from these views

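B. Convert the sharded tables into a single partitioned table
Reason: A single date-partitioned table holds one partition per day, and the limit of 4,000 partitions is enough for three years of daily logs. A still leaves one table per day, so long date ranges would keep exceeding 1,000 tables. C and D do not solve the actual problem.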

Your analytics team wants to build a simple statistical model to determine which customers are most likely to work with your company again, based on a few different metrics. They want to run the model on Apache Spark, using data housed in Google Cloud Storage, and you have recommended using Google Cloud Dataproc to execute this job. Testing has shown that this workload can run in approximately 30 minutes on a 15-node cluster, outputting the results into Google BigQuery. The plan is to run this workload weekly. How should you optimize the cluster for cost?
A. Migrate the workload to Google Cloud Dataflow
B. Use pre-emptible virtual machines (VMs) for the cluster
C. Use a higher-memory node so that the job runs faster
D. Use SSDs on the worker nodes so that the job can run faster

B. Use pre-emptible virtual machines (VMs) for the cluster

Reason: Preemptible VMs cost less, which suits a batch job that runs for only 30 minutes each week.

Your company receives both batch- and stream-based event data. You want to process the data using Google Cloud Dataflow over a predictable time period. However, you realize that in some instances data can arrive late or out of order. How should you design your Cloud Dataflow pipeline to handle data that is late or out of order?
A. Set a single global window to capture all the data.
B. Set sliding windows to capture all the lagged data.
C. Use watermarks and timestamps to capture the lagged data.
D. Ensure every datasource type (stream or batch) has a timestamp, and use the timestamps to define the logic for lagged data.

C. Use watermarks and timestamps to capture the lagged data.

Reason: Watermarks, together with event timestamps, are how Dataflow handles late and out-of-order data.
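A toy model of the idea, assuming the watermark is simply the largest event time seen so far. Real runners estimate the watermark heuristically, so treat this as an illustration only; `classify_events` and the sample events are hypothetical.

```python
def classify_events(events, allowed_lateness):
    """Label each (event_time, arrival_time) pair as on_time, late, or dropped."""
    watermark = float("-inf")  # toy watermark: largest event time seen so far
    labels = []
    for event_time, _arrival_time in sorted(events, key=lambda e: e[1]):
        if event_time >= watermark:
            labels.append((event_time, "on_time"))
        elif event_time >= watermark - allowed_lateness:
            labels.append((event_time, "late"))  # behind the watermark but still accepted
        else:
            labels.append((event_time, "dropped"))  # beyond the allowed lateness
        watermark = max(watermark, event_time)
    return labels


# (event_time, arrival_time) pairs; the last two arrive out of order.
events = [(10, 11), (20, 21), (15, 22), (5, 23)]
labels = classify_events(events, allowed_lateness=8)
```

The event timestamp says when something happened, and the watermark says how far event time has progressed, so comparing the two identifies lagged data.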

You have some data, which is shown in the graphic below. The two dimensions are X and Y, and the shade of each dot represents what class it is. You want to classify this data accurately using a linear algorithm. To do this you need to add a synthetic feature. What should the value of that feature be?

The graph shows two groups: a smaller circle and a larger circle.

A. X^2+Y^2
B. X^2
C. Y^2
D. cos(X)

A. X^2+Y^2

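Reason: X^2+Y^2 is the squared distance from the origin (the squared radius). The two circles differ in radius, so a single threshold on this feature separates them and a linear algorithm can be used.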
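A minimal sketch of the synthetic feature in action, assuming both circles are centered on the origin. The sample points and the threshold are made up for illustration.

```python
def radius_squared(x, y):
    """Synthetic feature: squared distance from the origin."""
    return x * x + y * y


def classify(x, y, threshold):
    """Linear decision rule on the single synthetic feature."""
    return "outer" if radius_squared(x, y) > threshold else "inner"


# Hypothetical points on a small circle and on a large circle.
inner_points = [(0.5, 0.0), (0.0, -0.6), (-0.4, 0.4)]
outer_points = [(2.0, 0.0), (0.0, -2.1), (-1.5, 1.5)]
threshold = 1.5  # assumed value between the two circles' squared radii

inner_labels = [classify(x, y, threshold) for x, y in inner_points]
outer_labels = [classify(x, y, threshold) for x, y in outer_points]
```

No straight line in the original X-Y plane separates the two groups, but one threshold on the new feature does.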

You are integrating one of your internal IT applications and Google BigQuery, so users can query BigQuery from the application’s interface. You do not want individual users to authenticate to BigQuery and you do not want to give them access to the dataset. You need to securely access BigQuery from your IT application. What should you do?
A. Create groups for your users and give those groups access to the dataset
B. Integrate with a single sign-on (SSO) platform, and pass each user’s credentials along with the query request
C. Create a service account and grant dataset access to that account. Use the service account’s private key to access the dataset
D. Create a dummy user and grant dataset access to that user. Store the username and password for that user in a file on the file system, and use those credentials to access the BigQuery dataset

C. Create a service account and grant dataset access to that account. Use the service account’s private key to access the dataset
Reason: A service account is the right approach for an application, so no user credentials are needed.

You are building a data pipeline on Google Cloud. You need to prepare data using a casual method for a machine-learning process. You want to support a logistic regression model. You also need to monitor and adjust for null values, which must remain real-valued and cannot be removed. What should you do?
A. Use Cloud Dataprep to find null values in sample source data. Convert all nulls to 'none' using a Cloud Dataproc job.
B. Use Cloud Dataprep to find null values in sample source data. Convert all nulls to 0 using a Cloud Dataprep job.
C. Use Cloud Dataflow to find null values in sample source data. Convert all nulls to 'none' using a Cloud Dataprep job.
D. Use Cloud Dataflow to find null values in sample source data. Convert all nulls to 0 using a custom script.

B. Use Cloud Dataprep to find null values in sample source data. Convert all nulls to 0 using a Cloud Dataprep job.
Reason: Dataprep can handle simple transformations, and a real-valued replacement means 0, not 'none'.
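The transformation itself is simple. Here is a plain-Python sketch of monitoring the null rate and replacing nulls with 0, using hypothetical helper names and sample rows; in the scenario above this step would run as a Cloud Dataprep job.

```python
def null_rate(rows):
    """Fraction of cells that are null, useful for monitoring the input."""
    cells = [value for row in rows for value in row]
    return sum(value is None for value in cells) / len(cells)


def fill_nulls(rows, fill_value=0.0):
    """Replace nulls with a real number so every feature stays real-valued."""
    return [[fill_value if value is None else float(value) for value in row] for row in rows]


# Hypothetical feature rows with missing values.
rows = [[1.5, None], [None, 2.0], [3.0, 4.0]]
rate = null_rate(rows)
filled = fill_nulls(rows)
```

A string such as 'none' would not work because logistic regression needs numeric inputs.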

You set up a streaming data insert into a Redis cluster via a Kafka cluster. Both clusters are running on Compute Engine instances. You need to encrypt data at rest with encryption keys that you can create, rotate, and destroy as needed. What should you do?
A. Create a dedicated service account, and use encryption at rest to reference your data stored in your Compute Engine cluster instances as part of your API service calls.
B. Create encryption keys in Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.
C. Create encryption keys locally. Upload your encryption keys to Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.
D. Create encryption keys in Cloud Key Management Service. Reference those keys in your API service calls when accessing the data in your Compute Engine cluster instances.

B. Create encryption keys in Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.
Reason: Create the keys in Cloud Key Management Service rather than locally. Encrypting data at rest should not depend on API service calls.

You are developing an application that uses a recommendation engine on Google Cloud. Your solution should display new videos to customers based on past views. Your solution needs to generate labels for the entities in videos that the customer has viewed. Your design must be able to provide very fast filtering suggestions based on data from other customer preferences on several TB of data. What should you do?
A. Build and train a complex classification model with Spark MLlib to generate labels and filter the results. Deploy the models using Cloud Dataproc. Call the model from your application.
B. Build and train a classification model with Spark MLlib to generate labels. Build and train a second classification model with Spark MLlib to filter results to match customer preferences. Deploy the models using Cloud Dataproc. Call the models from your application.
C. Build an application that calls the Cloud Video Intelligence API to generate labels. Store data in Cloud Bigtable, and filter the predicted labels to match the user’s viewing history to generate preferences.
D. Build an application that calls the Cloud Video Intelligence API to generate labels. Store data in Cloud SQL, and join and filter the predicted labels to match the user’s viewing history to generate preferences.

C. Build an application that calls the Cloud Video Intelligence API to generate labels. Store data in Cloud Bigtable, and filter the predicted labels to match the user’s viewing history to generate preferences.
Reason: Use the pre-built Cloud Video Intelligence API to generate labels. Bigtable is needed for fast filtering over several TB of data.

You are selecting services to write and transform JSON messages from Cloud Pub/Sub to BigQuery for a data pipeline on Google Cloud. You want to minimize service costs. You also want to monitor and accommodate input data volume that will vary in size with minimal manual intervention. What should you do?
A. Use Cloud Dataproc to run your transformations. Monitor CPU utilization for the cluster. Resize the number of worker nodes in your cluster via the command line.
B. Use Cloud Dataproc to run your transformations. Use the diagnose command to generate an operational output archive. Locate the bottleneck and adjust cluster resources.
C. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.
D. Use Cloud Dataflow to run your transformations. Monitor the total execution time for a sampling of jobs. Configure the job to use non-default Compute Engine machine types when needed.

C. Use Cloud Dataflow to run your transformations. Monitor the job system lag with Stackdriver. Use the default autoscaling setting for worker instances.
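Reason: Dataflow is a managed service whose default autoscaling adapts to varying input volume with minimal manual intervention.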

Your infrastructure includes a set of YouTube channels. You have been tasked with creating a process for sending the YouTube channel data to Google Cloud for analysis. You want to design a solution that allows your world-wide marketing teams to perform ANSI SQL and other types of analysis on up-to-date YouTube channels log data. How should you set up the log data transfer into Google Cloud?
A. Use Storage Transfer Service to transfer the offsite backup files to a Cloud Storage Multi-Regional storage bucket as a final destination.
B. Use Storage Transfer Service to transfer the offsite backup files to a Cloud Storage Regional bucket as a final destination.
C. Use BigQuery Data Transfer Service to transfer the offsite backup files to a Cloud Storage Multi-Regional storage bucket as a final destination.
D. Use BigQuery Data Transfer Service to transfer the offsite backup files to a Cloud Storage Regional storage bucket as a final destination.

A. Use Storage Transfer Service to transfer the offsite backup files to a Cloud Storage Multi-Regional storage bucket as a final destination.
Reason: Use Storage Transfer Service to move data into Cloud Storage. The world-wide requirement means a Multi-Regional bucket. C and D are incorrect because the destination is Cloud Storage, not BigQuery.

You are designing storage for very large text files for a data pipeline on Google Cloud. You want to support ANSI SQL queries. You also want to support compression and parallel load from the input locations using Google recommended practices. What should you do?
A. Transform text files to compressed Avro using Cloud Dataflow. Use BigQuery for storage and query.
B. Transform text files to compressed Avro using Cloud Dataflow. Use Cloud Storage and BigQuery permanent linked tables for query.
C. Compress text files to gzip using the Grid Computing Tools. Use BigQuery for storage and query.
D. Compress text files to gzip using the Grid Computing Tools. Use Cloud Storage, and then import into Cloud Bigtable for query.

B. Transform text files to compressed Avro using Cloud Dataflow. Use Cloud Storage and BigQuery permanent linked tables for query.
Reason: Avro is the recommended format. Both A and B would work, but keeping the data in Cloud Storage is recommended to save cost.

You are developing an application on Google Cloud that will automatically generate subject labels for users’ blog posts. You are under competitive pressure to add this feature quickly, and you have no additional developer resources. No one on your team has experience with machine learning. What should you do?
A. Call the Cloud Natural Language API from your application. Process the generated Entity Analysis as labels.
B. Call the Cloud Natural Language API from your application. Process the generated Sentiment Analysis as labels.
C. Build and train a text classification model using TensorFlow. Deploy the model using Cloud Machine Learning Engine. Call the model from your application and process the results as labels.
D. Build and train a text classification model using TensorFlow. Deploy the model using a Kubernetes Engine cluster. Call the model from your application and process the results as labels.

A. Call the Cloud Natural Language API from your application. Process the generated Entity Analysis as labels.
Reason: With no spare developer resources and no machine learning experience, use a pre-trained model.

You are designing storage for 20 TB of text files as part of deploying a data pipeline on Google Cloud. Your input data is in CSV format. You want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which storage service and schema design should you use?
A. Use Cloud Bigtable for storage. Install the HBase shell on a Compute Engine instance to query the Cloud Bigtable data.
B. Use Cloud Bigtable for storage. Link as permanent tables in BigQuery for query.
C. Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.
D. Use Cloud Storage for storage. Link as temporary tables in BigQuery for query.

C. Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.
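Reason: Bigtable is expensive for this workload. Permanent linked tables can be reused by multiple users, whereas temporary tables cannot be shared and suit only one-off queries.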

You are designing storage for two relational tables that are part of a 10-TB database on Google Cloud. You want to support transactions that scale horizontally. You also want to optimize data for range queries on non-key columns. What should you do?
A. Use Cloud SQL for storage. Add secondary indexes to support query patterns.
B. Use Cloud SQL for storage. Use Cloud Dataflow to transform data to support query patterns.
C. Use Cloud Spanner for storage. Add secondary indexes to support query patterns.
D. Use Cloud Spanner for storage. Use Cloud Dataflow to transform data to support query patterns.

C. Use Cloud Spanner for storage. Add secondary indexes to support query patterns.
Reason: Cloud Spanner for scaling. Add indexes to query non-key columns

Your financial services company is moving to cloud technology and wants to store 50 TB of financial time-series data in the cloud. This data is updated frequently and new data will be streaming in all the time. Your company also wants to move their existing Apache Hadoop jobs to the cloud to get insights into this data. Which product should they use to store the data?
A. Cloud Bigtable
B. Google BigQuery
C. Google Cloud Storage
D. Google Cloud Datastore

A. Cloud Bigtable

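Reason: Bigtable supports large time-series data with frequent updates and streaming writes, and Hadoop jobs can reach it through its HBase-compatible API.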

An organization maintains a Google BigQuery dataset that contains tables with user-level data. They want to expose aggregates of this data to other Google Cloud projects, while still controlling access to the user-level data. Additionally, they need to minimize their overall storage cost and ensure the analysis cost for other projects is assigned to those projects. What should they do?
A. Create and share an authorized view that provides the aggregate results.
B. Create and share a new dataset and view that provides the aggregate results.
C. Create and share a new dataset and table that contains the aggregate results.
D. Create dataViewer Identity and Access Management (IAM) roles on the dataset to enable sharing.

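A. Create and share an authorized view that provides the aggregate results.
Reason: An authorized view exposes only the aggregates without granting access to the underlying user-level tables, stores no extra copy of the data, and bills query costs to the project that runs the query.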

Government regulations in your industry mandate that you have to maintain an auditable record of access to certain types of data. Assuming that all expiring logs will be archived correctly, where should you store data that is subject to that mandate?
A. Encrypted on Cloud Storage with user-supplied encryption keys. A separate decryption key will be given to each authorized user.
B. In a BigQuery dataset that is viewable only by authorized personnel, with the Data Access log used to provide the auditability.
C. In Cloud SQL, with separate database user names to each user. The Cloud SQL Admin activity logs will be used to provide the auditability.
D. In a bucket on Cloud Storage that is accessible only by an AppEngine service that collects user information and logs the access before providing a link to the bucket.

B. In a BigQuery dataset that is viewable only by authorized personnel, with the Data Access log used to provide the auditability.
Reason: A and D are too complicated and hard to audit. C is wrong because Admin Activity logs record administrative changes, not who read the data.

Your neural network model is taking days to train. You want to increase the training speed. What can you do?
A. Subsample your test dataset.
B. Subsample your training dataset.
C. Increase the number of input features to your model.
D. Increase the number of layers in your neural network.

B. Subsample your training dataset.
Reason: Subsampling reduces the number of training examples, so each pass over the data takes less time.
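A minimal sketch of subsampling, using a fixed seed so the subset is reproducible. The `subsample` helper, the 10% fraction, and the stand-in dataset are assumptions for illustration.

```python
import random


def subsample(examples, fraction, seed=0):
    """Return a random subset containing roughly `fraction` of the examples."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    k = max(1, int(len(examples) * fraction))
    return rng.sample(examples, k)


# Stand-in for a training set of 1,000 examples.
training_set = list(range(1000))
subset = subsample(training_set, fraction=0.1)
```

Only the training set is subsampled; the test set stays intact so evaluation remains comparable.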

You are responsible for writing your company's ETL pipelines to run on an Apache Hadoop cluster. The pipeline will require some checkpointing and splitting pipelines. Which method should you use to write the pipelines?
A. PigLatin using Pig
B. HiveQL using Hive
C. Java using MapReduce
D. Python using MapReduce

A. PigLatin using Pig

Your company maintains a hybrid deployment with GCP, where analytics are performed on your anonymized customer data. The data are imported to Cloud Storage from your data center through parallel uploads to a data transfer server running on GCP. Management informs you that the daily transfers take too long and have asked you to fix the problem. You want to maximize transfer speeds. Which action should you take?
A. Increase the CPU size on your server.
B. Increase the size of the Google Persistent Disk on your server.
C. Increase your network bandwidth from your datacenter to GCP.
D. Increase your network bandwidth from Compute Engine to Cloud Storage.

C. Increase your network bandwidth from your datacenter to GCP.

MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world. The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network, allowing them to account for the impact of dynamic regional politics on location availability and cost. Their management and operations teams are situated all around the globe, creating a many-to-many relationship between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
✑ Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
✑ Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments (development/test, staging, and production) to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
✑ Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
✑ Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
✑ Provide reliable and timely access to data for analysis from distributed research workers.
✑ Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
✑ Ensure secure and efficient transport and storage of telemetry data.
✑ Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
✑ Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day.
✑ Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure. We also need environments in which our data scientists can carefully study and quickly adapt our models. Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
MJTelco is building a custom interface to share data. They have these requirements:
1. They need to do aggregations over their petabyte-scale datasets.
2. They need to scan specific time range rows with a very fast response time (milliseconds).
Which combination of Google Cloud Platform products should you recommend?
A. Cloud Datastore and Cloud Bigtable
B. Cloud Bigtable and Cloud SQL
C. BigQuery and Cloud Bigtable
D. BigQuery and Cloud Storage

C. BigQuery and Cloud Bigtable
Reason: BigQuery handles aggregations over petabyte-scale data, and Bigtable serves time-range row scans with millisecond latency.

You need to compose visualization for operations teams with the following requirements: ✑ Telemetry must include data from all 50,000 installations for the most recent 6 weeks (sampling once every minute) ✑ The report must not be more than 3 hours delayed from live data. ✑ The actionable report should only show suboptimal links. ✑ Most suboptimal links should be sorted to the top. ✑ Suboptimal links can be grouped and filtered by regional geography. ✑ User response time to load the report must be <5 seconds. You create a data source to store the last 6 weeks of data, and create visualizations that allow viewers to see multiple date ranges, distinct geographic regions, and unique installation types. You always show the latest data without any changes to your visualizations. You want to avoid creating and updating new visualizations each month. What should you do? A. Look through the current data and compose a series of charts and tables, one for each possible combination of criteria. B. Look through the current data and compose a small set of generalized charts and tables bound to criteria filters that allow value selection. C. Export the data to a spreadsheet, compose a series of charts and tables, one for each possible combination of criteria, and spread them across multiple tabs. D. Load the data into relational database tables, write a Google App Engine application that queries all rows, summarizes the data across each criteria, and then renders results using the Google Charts and visualization API.

B. Look through the current data and compose a small set of generalized charts and tables bound to criteria filters that allow value selection. Reason: C and D are overly complex. A is unnecessary because filters remove the need for one chart per combination of criteria.

MJTelco Case Study (same case study text as in the previous card). Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of Google BigQuery increasing. MJTelco asks you to provide a design solution. 
They require a single large data table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-grained analysis of each day's events. They also want to use streaming ingestion. What should you do? A. Create a table called tracking_table and include a DATE column. B. Create a partitioned table called tracking_table and include a TIMESTAMP column. C. Create sharded tables for each day following the pattern tracking_table_YYYYMMDD. D. Create a table called tracking_table with a TIMESTAMP column to represent the day.

B. Create a partitioned table called tracking_table and include a TIMESTAMP column. Reason: Partitioning limits each query to the partitions for the days of interest, which lowers the cost of daily queries. C would also work, but the requirement is a single large table.
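The cost argument can be checked with rough arithmetic. The sketch below is illustrative only: it takes the "100m records/day" and "2 years" figures from the case study and assumes the table is fully populated and that the query filters on the partitioning column. The function name `rows_scanned` is made up for this example.

```python
RECORDS_PER_DAY = 100_000_000   # case study: approximately 100m records/day
RETENTION_DAYS = 2 * 365        # case study: up to 2 years of data

def rows_scanned(days_queried: int, partitioned: bool) -> int:
    """Rows a query covering `days_queried` days has to read.

    Assumption: without partitioning the whole table is read; with
    partitioning only the partitions for the requested days are read.
    """
    if partitioned:
        return days_queried * RECORDS_PER_DAY
    return RETENTION_DAYS * RECORDS_PER_DAY

full = rows_scanned(1, partitioned=False)
pruned = rows_scanned(1, partitioned=True)
print(full, pruned, full // pruned)
```

Under these assumptions a one-day query on the partitioned table reads 1/730 of the rows that a full scan would read.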

Flowlogistic Case Study - Company Overview - Flowlogistic is a leading logistics and supply chain provider. They help businesses throughout the world manage their resources and transport them to their final destination. The company has grown rapidly, expanding their offerings to include rail, truck, aircraft, and oceanic shipping. Company Background - The company started as a regional trucking company, and then expanded into other logistics markets. Because they have not updated their infrastructure, managing and tracking orders and shipments has become a bottleneck. To improve operations, Flowlogistic developed proprietary technology for tracking shipments in real time at the parcel level. However, they are unable to deploy it because their technology stack, based on Apache Kafka, cannot support the processing volume. In addition, Flowlogistic wants to further analyze their orders and shipments to determine how best to deploy their resources. Solution Concept - Flowlogistic wants to implement two concepts using the cloud: ✑ Use their proprietary technology in a real-time inventory-tracking system that indicates the location of their loads ✑ Perform analytics on all their orders and shipment logs, which contain both structured and unstructured data, to determine how best to deploy resources and which markets to expand into. They also want to use predictive analytics to learn earlier when a shipment will be delayed. 
Existing Technical Environment - Flowlogistic architecture resides in a single data center: ✑ Databases - 8 physical servers in 2 clusters - SQL Server: user data, inventory, static data - 3 physical servers - Cassandra: metadata, tracking messages - 10 Kafka servers: tracking message aggregation and batch insert ✑ Application servers: customer front end, middleware for order/customs - 60 virtual machines across 20 physical servers - Tomcat: Java services - Nginx: static content - Batch servers ✑ Storage appliances - iSCSI for virtual machine (VM) hosts - Fibre Channel storage area network (FC SAN): SQL Server storage - Network-attached storage (NAS): image storage, logs, backups ✑ 10 Apache Hadoop/Spark servers - Core Data Lake - Data analysis workloads ✑ 20 miscellaneous servers - Jenkins, monitoring, bastion hosts. Business Requirements - ✑ Build a reliable and reproducible environment with scaled parity of production. ✑ Aggregate data in a centralized Data Lake for analysis ✑ Use historical data to perform predictive analytics on future shipments ✑ Accurately track every shipment worldwide using proprietary technology ✑ Improve business agility and speed of innovation through rapid provisioning of new resources ✑ Analyze and optimize architecture for performance in the cloud ✑ Migrate fully to the cloud if all other requirements are met Technical Requirements - ✑ Handle both streaming and batch data ✑ Migrate existing Hadoop workloads ✑ Ensure architecture is scalable and elastic to meet the changing demands of the company. ✑ Use managed services whenever possible ✑ Encrypt data in flight and at rest ✑ Connect a VPN between the production data center and cloud environment CEO Statement - We have grown so quickly that our inability to upgrade our infrastructure is really hampering further growth and efficiency. We are efficient at moving shipments around the world, but we are inefficient at moving data around. 
We need to organize our information so we can more easily understand where our customers are and what they are shipping. CTO Statement - IT has never been a priority for us, so as our data has grown, we have not invested enough in our technology. I have a good staff to manage IT, but they are so busy managing our infrastructure that I cannot get them to do the things that really matter, such as organizing our data, building the analytics, and figuring out how to implement the CFO's tracking technology. CFO Statement - Part of our competitive advantage is that we penalize ourselves for late shipments and deliveries. Knowing where our shipments are at all times has a direct correlation to our bottom line and profitability. Additionally, I don't want to commit capital to building out a server environment. Flowlogistic's management has determined that the current Apache Kafka servers cannot handle the data volume for their real-time inventory tracking system. You need to build a new system on Google Cloud Platform (GCP) that will feed the proprietary tracking software. The system must be able to ingest data from a variety of global sources, process and query in real-time, and store the data reliably. Which combination of GCP products should you choose? A. Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage B. Cloud Pub/Sub, Cloud Dataflow, and Local SSD C. Cloud Pub/Sub, Cloud SQL, and Cloud Storage D. Cloud Load Balancing, Cloud Dataflow, and Cloud Storage E. Cloud Dataflow, Cloud SQL, and Cloud Storage

A. Cloud Pub/Sub, Cloud Dataflow, and Cloud Storage | Reason: Recommended approach

After migrating ETL jobs to run on BigQuery, you need to verify that the output of the migrated jobs is the same as the output of the original. You've loaded a table containing the output of the original job and want to compare the contents with output from the migrated job to show that they are identical. The tables do not contain a primary key column that would enable you to join them together for comparison. What should you do? A. Select random samples from the tables using the RAND() function and compare the samples. B. Select random samples from the tables using the HASH() function and compare the samples. C. Use a Dataproc cluster and the BigQuery Hadoop connector to read the data from each table and calculate a hash from non-timestamp columns of the table after sorting. Compare the hashes of each table. D. Create stratified random samples using the OVER() function and compare equivalent samples from each table.

C. Use a Dataproc cluster and the BigQuery Hadoop connector to read the data from each table and calculate a hash from non-timestamp columns of the table after sorting. Compare the hashes of each table. Reason: The goal is to confirm that everything is identical, not just to spot-check random samples.
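The idea behind option C (sort, then hash the non-timestamp columns) can be sketched in plain Python. This is a local illustration of the technique, not the Dataproc job itself. The function name `table_fingerprint` and the example column names are hypothetical.

```python
import hashlib

def table_fingerprint(rows, exclude=()):
    """Return one SHA-256 digest for a whole table.

    rows: iterable of dicts (column name -> value).
    exclude: columns to ignore, e.g. load timestamps that legitimately differ.
    Rows are canonicalized and sorted first, so row order does not matter.
    """
    canonical = []
    for row in rows:
        items = sorted((k, repr(v)) for k, v in row.items() if k not in exclude)
        canonical.append(repr(items))
    canonical.sort()
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

# Hypothetical sample data: same content, different order and load time.
original = [{"sku": "a", "qty": 1, "loaded_at": "t1"}, {"sku": "b", "qty": 2, "loaded_at": "t1"}]
migrated = [{"sku": "b", "qty": 2, "loaded_at": "t9"}, {"sku": "a", "qty": 1, "loaded_at": "t9"}]
print(table_fingerprint(original, exclude=("loaded_at",)) ==
      table_fingerprint(migrated, exclude=("loaded_at",)))
```

Because every row contributes to the digest, a single differing value changes the result, which is what distinguishes this approach from sampling.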

You are a head of BI at a large enterprise company with multiple business units that each have different priorities and budgets. You use on-demand pricing for BigQuery with a quota of 2K concurrent on-demand slots per project. Users at your organization sometimes don't get slots to execute their query and you need to correct this. You'd like to avoid introducing new projects to your account. What should you do? A. Convert your batch BQ queries into interactive BQ queries. B. Create an additional project to overcome the 2K on-demand per-project quota. C. Switch to flat-rate pricing and establish a hierarchical priority model for your projects. D. Increase the amount of concurrent slots per project at the Quotas page at the Cloud Console.

C. Switch to flat-rate pricing and establish a hierarchical priority model for your projects. Reason: BigQuery supports this, and it is typically used by large companies. A is simply wrong. B contradicts the requirement to avoid new projects. D does not fix the root cause.

You have an Apache Kafka cluster on-prem with topics containing web application logs. You need to replicate the data to Google Cloud for analysis in BigQuery and Cloud Storage. The preferred replication method is mirroring to avoid deployment of Kafka Connect plugins. What should you do? A. Deploy a Kafka cluster on GCE VM Instances. Configure your on-prem cluster to mirror your topics to the cluster running in GCE. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS. B. Deploy a Kafka cluster on GCE VM Instances with the PubSub Kafka connector configured as a Sink connector. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS. C. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Source connector. Use a Dataflow job to read from PubSub and write to GCS. D. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Sink connector. Use a Dataflow job to read from PubSub and write to GCS.

A. Deploy a Kafka cluster on GCE VM Instances. Configure your on-prem cluster to mirror your topics to the cluster running in GCE. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS. Reason: C and D are out because they deploy the connector on the on-prem cluster instead of mirroring to GCP. B is out because it also relies on the PubSub Kafka connector, which the requirement rules out.

You've migrated a Hadoop job from an on-prem cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload that consists of many shuffling operations and initial data are parquet files (on average 200-400 MB size each). You see some degradation in performance after the migration to Dataproc, so you'd like to optimize for it. You need to keep in mind that your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptibles (with 2 non-preemptible workers only) for this workload. What should you do? A. Increase the size of your parquet files to ensure them to be 1 GB minimum. B. Switch to TFRecords formats (appr. 200MB per file) instead of parquet files. C. Switch from HDDs to SSDs, copy initial data from GCS to HDFS, run the Spark job and copy results back to GCS. D. Switch from HDDs to SSDs, override the preemptible VMs configuration to increase the boot disk size.

A. Increase the size of your parquet files to ensure them to be 1 GB minimum. Reason: C and D are costly because of the SSDs. B would not make a meaningful difference.

Your team is responsible for developing and maintaining ETLs in your company. One of your Dataflow jobs is failing because of some errors in the input data, and you need to improve reliability of the pipeline (incl. being able to reprocess all failing data). What should you do? A. Add a filtering step to skip these types of errors in the future, extract erroneous rows from logs. B. Add a try… catch block to your DoFn that transforms the data, extract erroneous rows from logs. C. Add a try… catch block to your DoFn that transforms the data, write erroneous rows to PubSub directly from the DoFn. D. Add a try… catch block to your DoFn that transforms the data, use a sideOutput to create a PCollection that can be stored to PubSub later.

C. Add a try… catch block to your DoFn that transforms the data, write erroneous rows to PubSub directly from the DoFn. Reason: The failing data must be kept for reprocessing, so A and B are wrong. C is closer to real time than D.
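The try/catch pattern in this answer can be sketched without Beam. The code below is a plain-Python stand-in for the body of a DoFn. `transform`, `process`, and the `publish_failed` callback are hypothetical names, and the callback merely takes the place of a real Pub/Sub publish call.

```python
import json

def transform(raw: str) -> dict:
    """Parse one input row; raises on malformed input."""
    record = json.loads(raw)
    return {"id": int(record["id"]), "value": float(record["value"])}

def process(elements, publish_failed):
    """Transform every element; hand failures to `publish_failed` instead of crashing.

    publish_failed: callable receiving a dict with the raw row and the error,
    so the failing data is preserved for later reprocessing.
    """
    good = []
    for raw in elements:
        try:
            good.append(transform(raw))
        except (ValueError, KeyError, TypeError) as err:
            publish_failed({"raw": raw, "error": repr(err)})
    return good

dead_letters = []
ok = process(['{"id": "1", "value": "2.5"}', "not json", '{"id": 2}'], dead_letters.append)
print(ok, len(dead_letters))
```

The pipeline keeps running on bad input, and each failing row is kept verbatim together with the error that rejected it.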

You're training a model to predict housing prices based on an available dataset with real estate properties. Your plan is to train a fully connected neural net, and you've discovered that the dataset contains latitude and longitude of the property. Real estate professionals have told you that the location of the property is highly influential on price, so you'd like to engineer a feature that incorporates this physical dependency. What should you do? A. Provide latitude and longitude as input vectors to your neural net. B. Create a numeric column from a feature cross of latitude and longitude. C. Create a feature cross of latitude and longitude, bucketize at the minute level and use L1 regularization during optimization. D. Create a feature cross of latitude and longitude, bucketize it at the minute level and use L2 regularization during optimization.

C. Create a feature cross of latitude and longitude, bucketize at the minute level and use L1 regularization during optimization. Reason: A is out because location needs to be a single feature. The description says this feature is highly influential, so L1 fits: use L1 regularization when you need to assign greater importance to more influential features, since it shrinks the weights of less important features to 0. L2 regularization performs better when all input features influence the output and the weights are of roughly equal size.
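A minimal sketch of bucketizing at the minute level and crossing the two coordinates, assuming "minute" means one arc-minute (1/60 of a degree). The function names and the string form of the crossed feature are illustrative choices, not a specific library's API.

```python
import math

def minute_bucket(degrees: float) -> int:
    """Index of the arc-minute bucket (1/60 degree wide) containing `degrees`."""
    return math.floor(degrees * 60)

def lat_lon_cross(lat: float, lon: float) -> str:
    """Cross the two bucket indices into one categorical feature value."""
    return f"{minute_bucket(lat)}x{minute_bucket(lon)}"

# Two nearby points fall into the same cell; a point further north does not.
print(lat_lon_cross(37.7749, -122.4194))
print(lat_lon_cross(37.7750, -122.4190))
print(lat_lon_cross(37.8100, -122.4194))
```

Each distinct cell becomes one sparse categorical value, which is why a regularizer that drives the weights of unused cells to zero is attractive here.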

You are deploying MariaDB SQL databases on GCE VM Instances and need to configure monitoring and alerting. You want to collect metrics including network connections, disk IO and replication status from MariaDB with minimal development effort and use StackDriver for dashboards and alerts. What should you do? A. Install the OpenCensus Agent and create a custom metric collection application with a StackDriver exporter. B. Place the MariaDB instances in an Instance Group with a Health Check. C. Install the StackDriver Logging Agent and configure fluentd in_tail plugin to read MariaDB logs. D. Install the StackDriver Agent and configure the MySQL plugin.

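D. Install the StackDriver Agent and configure the MySQL plugin. Reason: MariaDB is compatible with MySQL, so the MySQL plugin works with minimal effort. A requires custom development, B only provides health checks, and C collects logs rather than metrics.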

You work for a bank. You have a labelled dataset that contains information on already granted loan applications and whether these applications have been defaulted. You have been asked to train a model to predict default rates for credit applicants. What should you do? A. Increase the size of the dataset by collecting additional data. B. Train a linear regression to predict a credit default risk score. C. Remove the bias from the data and collect applications that have been declined loans. D. Match loan applicants with their social profiles to enable feature engineering.

B. Train a linear regression to predict a credit default risk score. Reason: A and C are irrelevant. D is too detailed and may not be possible for legal reasons. B is simple but good enough.

You need to migrate a 2TB relational database to Google Cloud Platform. You do not have the resources to significantly refactor the application that uses this database and cost to operate is of primary concern. Which service do you select for storing and serving your data? A. Cloud Spanner B. Cloud Bigtable C. Cloud Firestore D. Cloud SQL

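D. Cloud SQL | Reason: The database is relational and only 2TB, and cost is the primary concern. A (Cloud Spanner) is not needed because it is aimed at much larger datasets (on the order of 10TB).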

You're using Bigtable for a real-time application, and you have a heavy load that is a mix of read and writes. You've recently identified an additional use case and need to perform hourly an analytical job to calculate certain statistics across the whole database. You need to ensure both the reliability of your production application as well as the analytical workload. What should you do? A. Export Bigtable dump to GCS and run your analytical job on top of the exported files. B. Add a second cluster to an existing instance with a multi-cluster routing, use live-traffic app profile for your regular workload and batch-analytics profile for the analytics workload. C. Add a second cluster to an existing instance with a single-cluster routing, use live-traffic app profile for your regular workload and batch-analytics profile for the analytics workload. D. Increase the size of your existing cluster twice and execute your analytics workload on your new resized cluster.

C. Add a second cluster to an existing instance with a single-cluster routing, use live-traffic app profile for your regular workload and batch-analytics profile for the analytics workload. Reason: Need a second cluster to handle the analytic workload. Only single-cluster routing is needed since it is for analytics. Multi-cluster routing is for high availability.

You are designing an Apache Beam pipeline to enrich data from Cloud Pub/Sub with static reference data from BigQuery. The reference data is small enough to fit in memory on a single worker. The pipeline should write enriched results to BigQuery for analysis. Which job type and transforms should this pipeline use? A. Batch job, PubSubIO, side-inputs B. Streaming job, PubSubIO, JdbcIO, side-outputs C. Streaming job, PubSubIO, BigQueryIO, side-inputs D. Streaming job, PubSubIO, BigQueryIO, side-outputs

C. Streaming job, PubSubIO, BigQueryIO, side-inputs Reason: side input to enrich data. BigQueryIO is needed for BigQuery. For option A, batch job is ok but missing BigQueryIO

You have a data pipeline that writes data to Cloud Bigtable using well-designed row keys. You want to monitor your pipeline to determine when to increase the size of your Cloud Bigtable cluster. Which two actions can you take to accomplish this? (Choose two.) A. Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Read pressure index is above 100. B. Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Write pressure index is above 100. C. Monitor the latency of write operations. Increase the size of the Cloud Bigtable cluster when there is a sustained increase in write latency. D. Monitor storage utilization. Increase the size of the Cloud Bigtable cluster when utilization increases above 70% of max capacity. E. Monitor latency of read operations. Increase the size of the Cloud Bigtable cluster if read operations take longer than 100 ms.

C. Monitor the latency of write operations. Increase the size of the Cloud Bigtable cluster when there is a sustained increase in write latency. D. Monitor storage utilization. Increase the size of the Cloud Bigtable cluster when utilization increases above 70% of max capacity. Reason: Since the pipeline writes to Bigtable, the read-based options should be eliminated. Google recommends scaling Bigtable based on storage utilization.
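The two signals in this answer can be combined into one scaling check. In this sketch only the 70% storage threshold comes from the question. The definition of "sustained" (the last three samples all above 1.5 times a baseline) and the function name are assumptions made for the example.

```python
def should_add_nodes(storage_used_fraction, write_latencies_ms, baseline_ms,
                     storage_limit=0.70, latency_factor=1.5, window=3):
    """Return True when either scaling signal fires.

    storage_limit comes from the question (70% of max capacity).
    latency_factor and window are assumed values that define a
    "sustained increase" as `window` consecutive recent samples
    above baseline_ms * latency_factor.
    """
    storage_pressure = storage_used_fraction > storage_limit
    recent = write_latencies_ms[-window:]
    latency_pressure = (len(recent) == window and
                        all(ms > baseline_ms * latency_factor for ms in recent))
    return storage_pressure or latency_pressure

print(should_add_nodes(0.75, [5, 5, 5], baseline_ms=5))     # storage above 70%
print(should_add_nodes(0.50, [5, 20, 5], baseline_ms=5))    # single spike only
print(should_add_nodes(0.50, [9, 10, 12], baseline_ms=5))   # sustained latency rise
```

A single latency spike does not trigger scaling, whereas a run of elevated samples does.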

You want to analyze hundreds of thousands of social media posts daily at the lowest cost and with the fewest steps. You have the following requirements: ✑ You will batch-load the posts once per day and run them through the Cloud Natural Language API. ✑ You will extract topics and sentiment from the posts. ✑ You must store the raw posts for archiving and reprocessing. ✑ You will create dashboards to be shared with people both inside and outside your organization. You need to store both the data extracted from the API to perform analysis as well as the raw social media posts for historical archiving. What should you do? A. Store the social media posts and the data extracted from the API in BigQuery. B. Store the social media posts and the data extracted from the API in Cloud SQL. C. Store the raw social media posts in Cloud Storage, and write the data extracted from the API into BigQuery. D. Feed the social media posts into the API directly from the source, and write the extracted data from the API into BigQuery.

C. Store the raw social media posts in Cloud Storage, and write the data extracted from the API into BigQuery. Reason: Need both GCS and BigQuery to store posts data

You store historic data in Cloud Storage. You need to perform analytics on the historic data. You want to use a solution to detect invalid data entries and perform data transformations that will not require programming or knowledge of SQL. What should you do? A. Use Cloud Dataflow with Beam to detect errors and perform transformations. B. Use Cloud Dataprep with recipes to detect errors and perform transformations. C. Use Cloud Dataproc with a Hadoop job to detect errors and perform transformations. D. Use federated tables in BigQuery with queries to detect errors and perform transformations.

B. Use Cloud Dataprep with recipes to detect errors and perform transformations. Reason: Dataprep suits use cases that require no programming or SQL knowledge.

Your company needs to upload their historic data to Cloud Storage. The security rules don't allow access from external IPs to their on-premises resources. After an initial upload, they will add new data from existing on-premises applications every day. What should they do? A. Execute gsutil rsync from the on-premises servers. B. Use Cloud Dataflow and write the data to Cloud Storage. C. Write a job template in Cloud Dataproc to perform the data transfer. D. Install an FTP server on a Compute Engine VM to receive the files and move them to Cloud Storage.

A. Execute gsutil rsync from the on-premises servers. | Reason: gsutil runs on-premises and initiates the connection outbound, so no access from external IPs to on-premises resources is needed.

You have a query that filters a BigQuery table using a WHERE clause on timestamp and ID columns. By using bq query --dry_run you learn that the query triggers a full scan of the table, even though the filter on timestamp and ID select a tiny fraction of the overall data. You want to reduce the amount of data scanned by BigQuery with minimal changes to existing SQL queries. What should you do? A. Create a separate table for each ID. B. Use the LIMIT keyword to reduce the number of rows returned. C. Recreate the table with a partitioning column and clustering column. D. Use the bq query --maximum_bytes_billed flag to restrict the number of bytes billed.

C. Recreate the table with a partitioning column and clustering column. Reason: Partitioning and clustering limit the amount of data scanned.

You have a requirement to insert minute-resolution data from 50,000 sensors into a BigQuery table. You expect significant growth in data volume and need the data to be available within 1 minute of ingestion for real-time analysis of aggregated trends. What should you do? A. Use bq load to load a batch of sensor data every 60 seconds. B. Use a Cloud Dataflow pipeline to stream data into the BigQuery table. C. Use the INSERT statement to insert a batch of data every 60 seconds. D. Use the MERGE statement to apply updates in batch every 60 seconds.

B. Use a Cloud Dataflow pipeline to stream data into the BigQuery table. Reason: Streaming makes the data available within the 1-minute requirement and scales as the volume grows.
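A quick worked calculation of the volume involved, using only the figures in the question (50,000 sensors, one reading per minute). The variable names are illustrative.

```python
SENSORS = 50_000            # from the question
READINGS_PER_MINUTE = 1     # minute-resolution data

rows_per_minute = SENSORS * READINGS_PER_MINUTE
rows_per_second = rows_per_minute / 60
rows_per_day = rows_per_minute * 60 * 24
batches_per_day = 60 * 24   # one batch every 60 seconds, as in options A, C and D

print(rows_per_minute, round(rows_per_second), rows_per_day, batches_per_day)
```

At today's volume that is about 833 rows per second and 72 million rows per day, and the batch options would mean 1,440 batches per day before any growth.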