Google Sample Questions Flashcards
You are working on optimizing BigQuery for a query that is run repeatedly
on a single table. The data queried is about 1 GB, and some rows are
expected to change about 10 times every hour. You have optimized the SQL
statements as much as possible. You want to further optimize the query’s
performance. What should you do?
A. Create a materialized view based on the table, and query that view.
B. Enable caching of the queried data so that subsequent queries are faster.
C. Create a scheduled query, and run it a few minutes before the report has to be
created.
D. Reserve a larger number of slots in advance so that you have maximum compute power to execute the query.
A. Create a materialized view based on the table, and query that view.
A: Option A is correct because materialized views periodically cache the results of a query
for increased performance. Materialized views are suited to small datasets that are frequently queried. When underlying table data changes, the materialized view invalidates the affected portions and re-reads them.
B: Option B is not correct because caching is automatically enabled but is not performant when the underlying data changes.
C: Option C is not correct because scheduled queries let you schedule recurring queries but do not provide specific performance optimizations. Also, running a query too early could use old/stale data.
D: Option D is not correct because reserving more slots guarantees the availability of BigQuery slots but does not improve performance.
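For illustration, option A could be set up with a statement along the following lines, here submitted through the BigQuery Python client (the project, dataset, table, and column names are hypothetical). BigQuery keeps the materialized view incrementally up to date as the roughly ten hourly changes land in the base table, and the repeated query reads the view instead of rescanning the table.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical project, dataset, and column names.
    client.query(
        """
        CREATE MATERIALIZED VIEW `my-project.reporting.daily_totals_mv` AS
        SELECT store_id, DATE(order_ts) AS order_date, SUM(amount) AS total
        FROM `my-project.reporting.orders`
        GROUP BY store_id, order_date
        """
    ).result()

    # The repeated report query now targets the view, not the base table.
    rows = client.query(
        "SELECT * FROM `my-project.reporting.daily_totals_mv` "
        "WHERE order_date = CURRENT_DATE()"
    ).result()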
Several years ago, you built a machine learning model for an ecommerce
company. Your model made good predictions. Then a global pandemic
occurred, lockdowns were imposed, and many people started working from
home. Now the quality of your model has degraded. You want to improve
the quality of your model and prevent future performance degradation.
What should you do?
A. Retrain the model with data from the first 30 days of the lockdown.
B. Monitor data until usage patterns normalize, and then retrain the model.
C. Retrain the model with data from the last 30 days. After one year, return to the
older model.
D. Retrain the model with data from the last 30 days. Add a step to continuously
monitor model input data for changes, and retrain the model.
D. Retrain the model with data from the last 30 days. Add a step to continuously
monitor model input data for changes, and retrain the model.
A: Option A is not correct because retraining based on the data from the first 30 days of the lockdown might only be useful for predictions during similar lockdowns and not for regular periods.
B: Option B is not correct because usage patterns might have changed permanently and might continue to change in the future.
C: Option C is not correct because the older model might not be indicative of user behavior after a year.
D: Option D is correct because the data used to build the original model is no longer relevant. Retraining the model with data from the last 30 days will improve the predictions. To guard against future drift, continuously monitor the incoming data and retrain the model when its distribution changes.
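As a minimal sketch of the monitoring step in option D (not a specific Google Cloud API; the feature values and threshold are hypothetical), one way to watch for input drift is to compare the distribution of recent feature values against the training data and trigger retraining when they diverge:

    import numpy as np
    from scipy.stats import ks_2samp

    def drift_detected(training_values: np.ndarray, recent_values: np.ndarray,
                       p_threshold: float = 0.01) -> bool:
        """Return True when recent data no longer looks like the training data."""
        # Two-sample Kolmogorov-Smirnov test: a small p-value means the two
        # samples are unlikely to come from the same distribution.
        _, p_value = ks_2samp(training_values, recent_values)
        return p_value < p_threshold

    # Stand-in data: order values shifted sharply after the lockdown.
    baseline = np.random.normal(50, 10, size=5_000)
    last_30_days = np.random.normal(80, 25, size=5_000)
    if drift_detected(baseline, last_30_days):
        print("Input drift detected: retrain on the last 30 days of data.")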
A new member of your development team works remotely. The developer
will write code locally on their laptop, which will connect to a MySQL
instance on Cloud SQL. The instance has an external (public) IP address. You
want to follow Google-recommended practices when you give access to
Cloud SQL to the new team member. What should you do?
A. Ask the developer for their laptop’s IP address, and add it to the authorized
networks list.
B. Remove the external IP address, and replace it with an internal IP address. Add only the IP address for the remote developer’s laptop to the authorized list.
C. Give instance access permissions in Identity and Access Management (IAM),
and have the developer run Cloud SQL Auth proxy to connect to a MySQL
instance.
D. Give instance access permissions in Identity and Access Management (IAM),
change the access to “private service access” for security, and allow the developer to access Cloud SQL from their laptop.
C. Give instance access permissions in Identity and Access Management (IAM),
and have the developer run Cloud SQL Auth proxy to connect to a MySQL
instance.
A: Option A is not correct because although adding the laptop’s IP address to the authorized networks list is possible, tracking changing client IP addresses takes more effort and is less secure.
B: Option B is not correct, because if you remove the external IP address, access for those who work remotely will be more complicated because they also have to be within private RFC 1918 address space.
C: Option C is correct because the recommended approach is to use Cloud SQL Auth proxy. Permissions can be controlled by IAM. You don’t need to track authorization lists for changing user IP addresses.
D: Option D is not correct because private service access will require access from a private RFC 1918 address space, which might not be available to developers who work remotely.
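As a rough sketch of the developer's workflow in option C, assuming the Cloud SQL Auth proxy is already running locally and forwarding the (hypothetical) instance to port 3306, the application code then connects as if to a local MySQL server while IAM and the proxy handle authentication and encryption:

    import pymysql

    # Proxy started separately on the laptop, for example:
    #   ./cloud_sql_proxy -instances=my-project:us-central1:my-instance=tcp:3306
    # It authenticates with the developer's IAM credentials; no authorized
    # networks entry is needed.
    conn = pymysql.connect(
        host="127.0.0.1",       # the proxy listens locally
        port=3306,
        user="app_user",        # hypothetical database user
        password="db-password",
        database="app_db",
    )
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")
        print(cur.fetchone())
    conn.close()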
Your Cloud Spanner database stores customer address information that is
frequently accessed by the marketing team. When a customer enters the
country and the state where they live, this information is stored in different
tables connected by a foreign key. The current architecture has
performance issues. You want to follow Google-recommended practices to
improve performance. What should you do?
A. Create interleaved tables, and store states under the countries.
B. Denormalize the data, and have a row for each state with its corresponding
country.
C. Retain the existing architecture, but use short, two-letter codes for the countries and states.
D. Combine each country and its states into a single cell’s text, for example “country:state1,state2, …”, and split the data when required.
A. Create interleaved tables, and store states under the countries.
A: Option A is correct because Cloud Spanner supports interleaving that guarantees data being stored in the same split, which is performant when you need a strong data locality relationship.
B: Option B is not correct because denormalizing is not a preferred approach in relational databases. It leads to multiple rows with repeated data.
C: Option C is not correct because shortening the country and state values to two-letter codes has little impact; the joins across tables remain the bigger performance issue.
D: Option D is not correct because packing multiple types of data into the same cell is not recommended for relational databases.
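For reference, interleaving is declared in the Spanner DDL. A sketch with hypothetical instance, database, table, and column names, submitted through the Python client, could look like this; the interleaved States rows are stored physically under their parent Countries row, so both live in the same split:

    from google.cloud import spanner

    database = spanner.Client().instance("my-instance").database("customers-db")

    operation = database.update_ddl([
        """CREATE TABLE Countries (
               CountryId INT64 NOT NULL,
               Name      STRING(MAX)
           ) PRIMARY KEY (CountryId)""",
        """CREATE TABLE States (
               CountryId INT64 NOT NULL,
               StateId   INT64 NOT NULL,
               Name      STRING(MAX)
           ) PRIMARY KEY (CountryId, StateId),
             INTERLEAVE IN PARENT Countries ON DELETE CASCADE""",
    ])
    operation.result()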
Your company runs its business-critical system on PostgreSQL. The system
is accessed simultaneously from many locations around the world and
supports millions of customers. Your database administration team
manages the redundancy and scaling manually. You want to migrate the
database to Google Cloud. You need a solution that will provide global scale
and availability and require minimal maintenance. What should you do?
A. Migrate to BigQuery.
B. Migrate to Cloud Spanner.
C. Migrate to a Cloud SQL for PostgreSQL instance.
D. Migrate to bare metal machines with PostgreSQL installed.
B. Migrate to Cloud Spanner.
A: Option A is not correct because BigQuery doesn’t support global scale. BigQuery also isn’t the best option for migrating a transactional database like PostgreSQL because it is
more analytics-focused.
B: Option B is correct because Cloud Spanner provides a global-scale, highly available database that supports relational data.
C: Option C is not correct because Cloud SQL options are regional and have less scalability compared to Cloud Spanner.
D: Option D is not correct because running PostgreSQL on bare metal machines requires a greater maintenance effort.
Your company collects data about customers to regularly check their health
vitals. You have millions of customers around the world. Data is ingested at
an average rate of two events per 10 seconds per user. You need to be able
to visualize data in Bigtable on a per user basis. You need to construct the
Bigtable key so that the operations are performant. What should you do?
A. Construct the key as user-id#device-id#activity-id#timestamp.
B. Construct the key as timestamp#user-id#device-id#activity-id.
C. Construct the key as timestamp#device-id#activity-id#user-id.
D. Construct the key as user-id#timestamp#device-id#activity-id.
A. Construct the key as user-id#device-id#activity-id#timestamp.
A: Option A is correct because the key starts with the user ID rather than a monotonically increasing value, so writes are spread across the keyspace and hotspots are avoided, while each user’s events stay contiguous for per-user reads.
B: Option B is not correct because the key starts with a timestamp, so it monotonically increases and concentrates writes on one node, causing hotspots.
C: Option C is not correct because the key also starts with a timestamp and monotonically increases, causing hotspots.
D: Option D is not correct because placing the timestamp so early in the key still makes each user’s writes monotonically increasing, which causes hotspots.
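A small sketch of how such a row key might be assembled and written with the Bigtable Python client (instance, table, column family, and field values are hypothetical); starting the key with user-id spreads writes across users while keeping each user's events contiguous for per-user visualization:

    import datetime
    from google.cloud import bigtable

    def build_row_key(user_id: str, device_id: str, activity_id: str,
                      ts: datetime.datetime) -> bytes:
        # user-id first avoids hotspots; the timestamp last keeps one user's
        # events sorted chronologically within that user's key range.
        return f"{user_id}#{device_id}#{activity_id}#{ts.isoformat()}".encode()

    table = (bigtable.Client(admin=False)
             .instance("vitals-instance")
             .table("health_events"))

    row = table.direct_row(build_row_key(
        "user-4211", "device-09", "heart-rate", datetime.datetime.utcnow()))
    row.set_cell("vitals", "bpm", b"72")
    row.commit()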
Your company is hiring several business analysts who are new to BigQuery.
The analysts will use BigQuery to analyze large quantities of data. You need
to control costs in BigQuery and ensure that there is no budget overrun
while you maintain the quality of query results. What should you do?
A. Set a customized project-level or user-level daily quota to acceptable values.
B. Reduce the data in the BigQuery table so that the analysts query less data, and
then archive the remaining data.
C. Train the analysts to use the query validator or --dry_run to estimate costs so that the analysts can self-regulate usage.
D. Export the BigQuery daily costs, and visualize the data on Looker on a per-analyst basis so that the analysts can self-regulate usage.
A. Set a customized project-level or user-level daily quota to acceptable values.
A: Option A is correct because if you have multiple BigQuery projects and users, you can manage costs by requesting a custom quota that specifies a limit on the amount of query data processed per day.
B: Option B is not correct because giving only partial data to the analysts might not
produce accurate query results.
C: Option C is not correct because costs could still overrun budgets. This approach
assumes that analysts always follow guidelines.
D: Option D is not correct because your costs could still overrun budgets. This approach also assumes that the analysts look at the charts daily and adjust their behavior.
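For completeness, the dry-run estimate mentioned in option C looks roughly like this in the BigQuery Python client (project, dataset, and table names are hypothetical). It reports the bytes a query would scan without running it, but unlike the custom quota in option A it enforces nothing, which is why it only supports self-regulation:

    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

    job = client.query(
        "SELECT customer_id, SUM(amount) "
        "FROM `my-project.sales.orders` GROUP BY customer_id",
        job_config=job_config,
    )
    # A dry run never executes; it only returns the estimated bytes scanned.
    print(f"This query would process {job.total_bytes_processed / 1e9:.2f} GB")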
Your Bigtable database was recently deployed into production. The scale of
data ingested and analyzed has increased significantly, but the
performance has degraded. You want to identify the performance issue.
What should you do?
A. Use Key Visualizer to analyze performance.
B. Use Cloud Trace to identify the performance issue.
C. Add logging statements into the code to see which inserts cause the delay.
D. Add more nodes to the cluster to see if that resolves the performance issue.
A. Use Key Visualizer to analyze performance.
A: Option A is correct because Key Visualizer for Bigtable generates visual reports for your tables that detail your usage based on the row keys that you access, show you how Bigtable operates, and can help you troubleshoot performance issues.
B: Option B is not correct because Cloud Trace is used to debug latency in applications.
C: Option C is not correct because adding logging statements won’t help you understand the performance issues within Bigtable.
D: Option D is not correct because adding more nodes might improve the performance, but the database could continue to have performance issues if the keys are not designed well.
Your company is moving your data analytics to BigQuery. Your other
operations will remain on-premises. You need to transfer 800 TB of historic
data. You also need to plan for 30 Gbps of daily data transfers that must be
appended for analysis the next day. You want to follow Google-recommended practices to transfer your data. What should you do?
A. As early as possible every day, use Cloud VPN to transfer the existing data over the internet.
B. Use a Transfer Appliance to move the existing data to Google Cloud. Use Cloud VPN to transfer data daily.
C. Use a Transfer Appliance to move the existing data to Google Cloud. Use VPC Network Peering to transfer data daily.
D. Use a Transfer Appliance to move the existing data to Google Cloud. Set up a
Dedicated or Partner Interconnect for daily transfers.
D. Use a Transfer Appliance to move the existing data to Google Cloud. Set up a
Dedicated or Partner Interconnect for daily transfers.
A: Option A is not correct because the public internet is generally less stable and much slower, so transferring this volume of data over it is not viable.
B: Option B is not correct because Cloud VPN is useful for data transfers at a rate of a few Gbps (1.5 Gbps to 3 Gbps).
C: Option C is not correct because VPC Network Peering is used for data transfers within Google Cloud Organizations.
D: Option D is correct because using a Transfer Appliance is recommended to transfer hundreds of terabytes of data. For large data transfers that occur regularly, a dedicated, hybrid networking connection is recommended.
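A quick back-of-the-envelope check shows why the existing 800 TB belongs on a Transfer Appliance rather than on a VPN (assuming an optimistic, sustained 3 Gbps of Cloud VPN throughput):

    # 800 TB over a 3 Gbps link, ignoring protocol overhead and retries.
    TERABYTE = 10**12
    bits_to_move = 800 * TERABYTE * 8
    seconds = bits_to_move / (3 * 10**9)
    print(f"~{seconds / 86400:.0f} days at a sustained 3 Gbps")  # roughly 25 days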
Your team runs Dataproc workloads in which each worker node takes about 45 minutes to process its share of the work. You have been exploring various options to optimize
the system for cost, including shutting down worker nodes aggressively.
However, in your metrics you see that the entire job takes even longer. You
want to optimize the system for cost without increasing job completion
time. What should you do?
A. Set a graceful decommissioning timeout greater than 45 minutes.
B. Rewrite the processing in Cloud Data Fusion, and run the job automatically.
C. Rewrite the processing in Dataflow, and use stream processing of the same
data.
D. Increase the number of vCPUs on each worker node so that the processing
finishes sooner.
A. Set a graceful decommissioning timeout greater than 45 minutes.
A: Option A is correct because graceful decommissioning will finish work in progress on a worker node before it is removed from the Dataproc cluster.
B: Option B is not correct because rebuilding the data pipeline in Cloud Data Fusion will increase effort, cost, and time.
C: Option C is not correct because rewriting the code in Dataflow will increase effort, cost, and time.
D: Option D is not correct because increasing the number of vCPUs will greatly increase the cost.
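As a sketch of option A (project, region, and cluster names are hypothetical), the graceful decommissioning timeout is supplied when the cluster is scaled down, for example through the Dataproc API; any value above the roughly 45-minute task time lets in-flight work finish before a worker is removed:

    from google.cloud import dataproc_v1
    from google.protobuf import duration_pb2, field_mask_pb2

    client = dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
    )

    request = dataproc_v1.UpdateClusterRequest(
        project_id="my-project",
        region="us-central1",
        cluster_name="etl-cluster",
        cluster={"config": {"worker_config": {"num_instances": 2}}},
        update_mask=field_mask_pb2.FieldMask(
            paths=["config.worker_config.num_instances"]),
        # Longer than the ~45-minute task time, so work in progress completes
        # before a node is decommissioned.
        graceful_decommission_timeout=duration_pb2.Duration(seconds=3600),
    )
    client.update_cluster(request=request).result()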
Your customer has a SQL Server database that contains about 5 TB of data
in another public cloud. You expect the data to grow to a maximum of 25 TB.
The database is the backend of an internal reporting application that is used
once a week. You want to migrate the application to Google Cloud to
reduce administrative effort while keeping costs the same or reducing
them. What should you do?
A. Migrate the database to Bigtable.
B. Migrate the database to Cloud Spanner.
C. Install SQL Server on a Compute Engine VM.
D. Migrate the database to SQL Server in Cloud SQL.
D. Migrate the database to SQL Server in Cloud SQL.
A: Option A is not correct because Bigtable is a NoSQL database and is not a suitable destination from a SQL Server source.
B: Option B is not correct because Cloud Spanner will be costlier than Cloud SQL. Although Spanner has global availability, it is unnecessary for this application requirement.
C: Option C is not correct because installing a custom SQL Server instance on a Compute Engine VM will require more administrative effort.
D: Option D is correct because Cloud SQL provides managed MySQL, PostgreSQL, and SQL Server databases, which will reduce administrative effort. Twenty-five TB can be
accommodated efficiently on Cloud SQL.
Your IT team uses BigQuery for storing structured data. Your finance team
recently moved to Google Workspace Enterprise edition from a standalone,
desktop-based spreadsheet processor. When the finance team needs data
insights, the IT team runs a query on BigQuery, exports the data to a CSV
file, and sends the file as an email attachment to the finance team
members. You want to improve the process while you retain familiar
methods of data analysis for the finance team. What should you do?
A. Run the query in BigQuery, and give the finance team access to the results view, which can be analyzed.
B. Run the query in BigQuery, and give the finance team access to the data
visualizations in Google Data Studio.
C. Run the query in BigQuery, export the data to CSV, upload the file to a Cloud
Storage bucket, and share the file with the finance team.
D. Run the query in BigQuery, and save the results to a Google Sheets shared
spreadsheet that can be accessed and analyzed by the finance team.
D. Run the query in BigQuery, and save the results to a Google Sheets shared
spreadsheet that can be accessed and analyzed by the finance team.
A: Option A is not correct because the finance team will have to be given Google Cloud access and be trained on using BigQuery, which is not a familiar method.
B: Option B is not correct because giving only visualizations in Data Studio won’t let the finance team analyze the underlying data.
C: Option C is not correct because the finance team will have to be given Google Cloud access and be trained on using Cloud Storage, which is not a familiar method.
D: Option D is correct because Connected Sheets gives you a direct and easy way to share BigQuery data through Google Sheets.
Your scooter-sharing company collects information about their scooters,
such as location, battery level, and speed. The company visualizes this data
in real time. To guard against intermittent connectivity, each scooter sends
repeats of certain messages within a short interval. Occasional data errors
have been noticed. The messages are received in Pub/Sub and stored in
BigQuery. You need to ensure that the data does not contain duplicates and
that erroneous data with empty fields is rejected. What should you do?
A. Store the data in BigQuery, and run delete queries on erroneous and duplicate data.
B. Use Dataflow to subscribe to Pub/Sub, process the data, and store the data in
BigQuery.
C. Use Kubernetes to create a microservices application that can remove duplicates and erroneous data. Then insert the data into BigQuery.
D. Create an application on Compute Engine with Managed Instance Groups that can remove duplicates and erroneous data. Then insert the data into BigQuery.
B. Use Dataflow to subscribe to Pub/Sub, process the data, and store the data in
BigQuery.
A: Option A is not correct because directly storing data in BigQuery could cause data to be overwritten, and erroneous data could be present before it is deleted. Workarounds within BigQuery to circumvent these concerns would cost more effort, time, and money.
B: Option B is correct because Dataflow is the recommended data processing product for streaming data. Dataflow can be programmed to remove duplicates, delete empty fields,
and perform other custom data processing.
C: Option C is not correct because creating a custom application for streaming processing on Kubernetes is a significant effort and is not recommended.
D: Option D is not correct because creating a custom application for streaming processing on Compute Engine is a significant effort and is not recommended.
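A compressed Apache Beam sketch of option B (the subscription, table, field names, and the event_id message attribute are all hypothetical; the BigQuery table is assumed to exist). The id_label setting lets the Dataflow runner discard repeated messages that carry the same attribute value, and the filter rejects rows with empty required fields:

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    REQUIRED = ("scooter_id", "location", "battery_level", "speed")

    def parse(msg: bytes) -> dict:
        return json.loads(msg.decode("utf-8"))

    def is_valid(row: dict) -> bool:
        # Reject messages that are missing, or have empty, required fields.
        return all(row.get(field) not in (None, "") for field in REQUIRED)

    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        (
            p
            | "Read" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/scooter-events",
                id_label="event_id")          # deduplicate repeated messages
            | "Parse" >> beam.Map(parse)
            | "DropInvalid" >> beam.Filter(is_valid)
            | "Write" >> beam.io.WriteToBigQuery(
                "my-project:telemetry.scooter_events",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )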
Your cryptocurrency trading company visualizes prices to help your
customers make trading decisions. Because different trades happen in real
time, the price data is fed to a data pipeline that uses Dataflow for
processing. You want to compute moving averages. What should you do?
A. Use hopping windows in Dataflow.
B. Use session windows in Dataflow.
C. Use tumbling windows in Dataflow.
D. Use Dataflow SQL, and compute averages grouped by time.
A. Use hopping windows in Dataflow.
A: Option A is correct because hopping windows overlap: each window covers the most recent period and starts at a fixed interval, so the average computed per window is a moving average.
B: Option B is not correct because session windows are defined by gaps in activity and are not used to calculate moving averages.
C: Option C is not correct because tumbling windows do not overlap, so they produce discrete per-interval averages rather than moving averages.
D: Option D is not correct because grouping by time alone does not give you a moving average.
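In the Beam SDK that Dataflow runs, a hopping window is called a sliding window. A minimal sketch of a five-minute moving average per symbol, recomputed every minute (the topic name and message format are hypothetical):

    import json
    import apache_beam as beam
    from apache_beam import window
    from apache_beam.options.pipeline_options import PipelineOptions

    def to_symbol_price(msg: bytes):
        record = json.loads(msg.decode("utf-8"))
        return record["symbol"], float(record["price"])

    with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
        (
            p
            | "Read" >> beam.io.ReadFromPubSub(
                topic="projects/my-project/topics/trades")
            | "Parse" >> beam.Map(to_symbol_price)
            # Hopping ("sliding") windows: 5-minute windows starting every 60 s,
            # so each window's mean is a moving average over the latest prices.
            | "Window" >> beam.WindowInto(window.SlidingWindows(size=300, period=60))
            | "MovingAvg" >> beam.CombinePerKey(beam.combiners.MeanCombineFn())
            | "Print" >> beam.Map(print)
        )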
You are building the trading platform for a stock exchange with millions of
traders. Trading data is written rapidly. You need to retrieve data quickly to
show visualizations to the traders, such as the changing price of a particular
stock over time. You need to choose a storage solution in Google Cloud.
What should you do?
A. Use Bigtable.
B. Use Firestore.
C. Use Cloud SQL.
D. Use Memorystore.
A. Use Bigtable.
A: Option A is correct because Bigtable is the recommended database for time series data that requires high throughput reads and writes.
B: Option B is not correct because Firestore does not have the high throughput capabilities that are suitable for time-series data.
C: Option C is not correct because Cloud SQL does not have the high throughput capabilities that are suitable for time-series data.
D: Option D is not correct because Memorystore is a fast in-memory database that is not suitable for persistently storing large amounts of data.
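To make the per-stock retrieval concrete, a hedged sketch of reading one symbol's recent ticks back from Bigtable with a row-key range scan (instance, table, column family, and key layout are hypothetical); a key such as "GOOG#<timestamp>" keeps a symbol's ticks contiguous and time-ordered, so a range scan returns its price history for visualization:

    from google.cloud import bigtable

    table = (bigtable.Client(admin=False)
             .instance("trading-instance")
             .table("ticks"))

    # "GOOG$" is the first key after every "GOOG#..." key, so this range
    # scan returns only that symbol's rows, already sorted by time.
    for row in table.read_rows(start_key=b"GOOG#", end_key=b"GOOG$"):
        cell = row.cells["quotes"][b"price"][0]
        print(row.row_key.decode(), float(cell.value.decode()))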