DevOps 41-81 Flashcards

1
Q

41.
You support a high-traffic web application that runs on Google Cloud Platform (GCP). You need to measure application reliability from a user perspective without making any engineering changes to it. What should you do? (Choose two.)

A. Review current application metrics and add new ones as needed.

B. Modify the code to capture additional information for user interaction.

C. Analyze the web proxy logs only and capture response time of each request

D. Create new synthetic clients to simulate a user journey using the application.

E. Use current and historic Request Logs to trace customer interaction with the application.

A

D, E

D, E – synthetic clients simulate a user journey, and current/historic Request Logs trace real customer interaction; neither requires engineering changes to the application.

2
Q

42.
You manage an application that is writing logs to Stackdriver Logging. You need to give some team members the ability to export logs. What should you do?

A. Grant the team members the IAM role of logging.configWriter on Cloud IAM

B. Configure Access Context Manager to allow only these members to export logs

C. Create and grant a custom IAM role with the permissions logging.sinks.list and logging.sinks.get

D. Create an Organizational Policy in Cloud IAM to allow only these members to create log exports.

A

A.
Grant the team members the IAM role of logging.configWriter on Cloud IAM

Logs Configuration Writer
(roles/logging.configWriter)
- Provides permissions to read and write the configurations of logs-based metrics and sinks for exporting logs.
https://cloud.google.com/logging/docs/access-control
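For reference, the role can be granted at the project level with a single gcloud command; the project ID and member below are placeholders:

  gcloud projects add-iam-policy-binding my-project \
      --member="user:team-member@example.com" \
      --role="roles/logging.configWriter"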

3
Q

43.
Your application services run in Google Kubernetes Engine (GKE). You want to make sure that only images from your centrally-managed Google Container Registry (GCR)
image registry in the altostrat-images project can be deployed to the cluster while minimizing development time. What should you do?

A.
Create a custom builder for Cloud Build that will only push images to gcr.io/altostrat-images

B.
Use a Binary Authorization policy that includes the whitelist name pattern gcr.io/altostrat-images/*.

C.
Add logic to the deployment pipeline to check that all manifests contain only images from gcr.io/altostrat-images

D.
Add a tag to each image in gcr.io/altostrat-images and check that this tag is present when the image is deployed

A

B.
Use a Binary Authorization policy that includes the whitelist name pattern gcr.io/altostrat-images/*.

https://cloud.google.com/binary-authorization/docs/cloud-build
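As an illustration only, a minimal sketch of such a Binary Authorization policy (the registry path follows the question; everything else is an assumption, not the exact required configuration):

  admissionWhitelistPatterns:
  - namePattern: gcr.io/altostrat-images/*
  defaultAdmissionRule:
    evaluationMode: ALWAYS_DENY
    enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG

A policy file like this can be applied with gcloud container binauthz policy import policy.yaml.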

4
Q

44.
Your team has recently deployed an NGINX-based application into Google Kubernetes Engine (GKE) and has exposed it to the public via an HTTP Google Cloud Load Balancer (GCLB) ingress. You want to scale the deployment of the application’s frontend using an appropriate Service Level Indicator (SLI).
What should you do?

A.
Configure the horizontal pod autoscaler to use the average response time from the Liveness and Readiness probes.

B.
Configure the vertical pod autoscaler in GKE and enable the cluster autoscaler to scale the cluster as pods expand.

C.
Install the Stackdriver custom metrics adapter and configure a horizontal pod autoscaler to use the number of requests provided by the GCLB

D.
Expose the NGINX stats endpoint and configure the horizontal pod autoscaler to use the request metrics exposed by the NGINX deployment

A

C.
Install the Stackdriver custom metrics adapter and configure a horizontal pod autoscaler to use the number of requests provided by the GCLB

C is correct

A. Configure the horizontal pod autoscaler to use the average response time from the Liveness and Readiness Probes.
–> Using health-check response time as a scaling trigger is unreliable: a slow probe response is usually a symptom of resource pressure (CPU, memory, and so on), so the underlying resource or request metrics are the better SLIs.

B. Configure the vertical pod autoscaler in GKE and enable the cluster autoscaler to scale the cluster as pods expand.
–> This does not address horizontal pod autoscaling of the frontend.

D. Expose the NGINX stats endpoint and configure the horizontal pod autoscaler to use the request metrics exposed by the NGINX deployment.
–> The NGINX request metrics would have to be exposed as custom metrics anyway, which duplicates what the GCLB already provides, so it is redundant extra effort.

https://cloud.google.com/kubernetes-engine/docs/tutorials/autoscaling-metrics
You want to scale horizontally
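A minimal sketch of such an autoscaler, assuming the Stackdriver custom metrics adapter is already installed in the cluster; the Deployment name and target value are placeholders:

  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: frontend-hpa
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: nginx-frontend
    minReplicas: 3
    maxReplicas: 20
    metrics:
    - type: External
      external:
        metric:
          name: loadbalancing.googleapis.com|https|request_count
        target:
          type: AverageValue
          averageValue: "100"   # target requests per pod; illustrative value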

5
Q

45.
Your company follows Site Reliability Engineering practices. You are the Incident Commander for a new, customer-impacting incident. You need to immediately assign two
incident management roles to assist you in an effective incident response. What roles should you assign? (Choose two.)

A. Operations Lead
B. Engineering Lead
C. Communications Lead
D. Customer Impact Assessor
E. External Customer Communications Lead
A

A, C

https://sre.google/workbook/incident-response/

“The main roles in incident response are the Incident Commander (IC), Communications Lead (CL), and Operations or Ops Lead (OL).”

6
Q

46.
You support an application running on GCP and want to configure SMS notifications to your team for the most critical alerts in Stackdriver Monitoring. You have already identified the alerting policies you want to configure this for.
What should you do?

A.
Download and configure a third-party integration between Stackdriver Monitoring and an SMS gateway. Ensure that your team members add their SMS/phone numbers to the external tool.

B.
Select the Webhook notifications option for each alerting policy, and configure it to use a third-party integration tool. Ensure that your team members add their SMS/phone numbers to the external tool.

C.
Ensure that your team members set their SMS/phone numbers in their Stackdriver Profile. Select the SMS notification option for each alerting policy, and then select the appropriate SMS/phone numbers from the list.

D.
Configure a Slack notification for each alerting policy. Set up a Slack-to-SMS integration to send SMS messages when Slack messages are received. Ensure that your team members add their SMS/phone numbers to the external integration.

A

C.
Ensure that your team members set their SMS/phone numbers in their Stackdriver Profile. Select the SMS notification option for each alerting policy, and then select the appropriate SMS/phone numbers from the list.

https://cloud.google.com/monitoring/support/notification-options#creating_channels
To configure SMS notifications, do the following:

In the SMS section, click Add new and follow the instructions.
Click Save.
When you set up your alerting policy, select the SMS notification type and choose a verified phone number from the list.

7
Q

47.
You are managing an application that exposes an HTTP endpoint without using a load balancer. The latency of the HTTP responses is important for the user experience. You want to understand what HTTP latencies all of your users are experiencing. You use Stackdriver Monitoring.
What should you do?

A
•In your application, create a metric with a metricKind set to DELTA and a valueType set to DOUBLE.
•In Stackdriver’s Metrics Explorer, use a Stacked Bar graph to visualize the metric.

B.
•In your application, create a metric with a metricKind set to CUMULATIVE and a valueType set to DOUBLE.
•In Stackdriver’s Metrics Explorer, use a Line graph to visualize the metric.

C.
•In your application, create a metric with a metricKind set to GAUGE and a valueType set to DISTRIBUTION.
•In Stackdriver’s Metrics Explorer, use a Heatmap graph to visualize the metric.

D.
•In your application, create a metric with a metricKind set to METRIC_KIND_UNSPECIFIED and a valueType set to INT64.
•In Stackdriver’s Metrics Explorer, use a Stacked Area graph to visualize the metric

A

C.
•In your application, create a metric with a metricKind set to GAUGE and a valueType set to DISTRIBUTION.
•In Stackdriver’s Metrics Explorer, use a Heatmap graph to visualize the metric.

A DISTRIBUTION value captures the full latency histogram from every request, and a heatmap is the natural way to visualize a distribution over time.
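A rough sketch of registering such a metric with the Python client library (the metric type name and project ID are made up for illustration):

  from google.api import metric_pb2 as ga_metric
  from google.cloud import monitoring_v3

  client = monitoring_v3.MetricServiceClient()

  descriptor = ga_metric.MetricDescriptor()
  descriptor.type = "custom.googleapis.com/http/response_latencies"  # hypothetical name
  descriptor.metric_kind = ga_metric.MetricDescriptor.MetricKind.GAUGE
  descriptor.value_type = ga_metric.MetricDescriptor.ValueType.DISTRIBUTION
  descriptor.description = "Per-request HTTP latency distribution."

  client.create_metric_descriptor(
      name="projects/my-project", metric_descriptor=descriptor
  )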

8
Q

48.
Your team is designing a new application for deployment both inside and outside Google Cloud Platform (GCP). You need to collect detailed metrics such as system resource utilization. You want to use centralized GCP services while minimizing the amount of work required to set up this collection system.
What should you do?

A.
Import the Stackdriver Profiler package, and configure it to relay function timing data to Stackdriver for further analysis

B.
Import the Stackdriver Debugger package, and configure the application to emit debug messages with timing information.

C.
Instrument the code using a timing library, and publish the metrics via a health check endpoint that is scraped by Stackdriver

D.
Install an Application Performance Monitoring (APM) tool in both locations, and configure an export to a central data storage location for analysis

A

A.
Import the Stackdriver Profiler package, and configure it to relay function timing data to Stackdriver for further analysis
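For a Python service, enabling the Profiler agent is roughly a one-liner (service name and version are placeholders, and the google-cloud-profiler package must be installed):

  import googlecloudprofiler

  # Starts the profiling agent; it samples the running service and uploads
  # profiles to Cloud Profiler for analysis, with no further code changes.
  googlecloudprofiler.start(service="my-api", service_version="1.0.0")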

9
Q

49.
You need to reduce the cost of virtual machines (VM) for your organization. After reviewing different options, you decide to leverage preemptible VM instances. Which application is suitable for preemptible VMs?

A.
A scalable in-memory caching system

B.
The organization’s public-facing website

C.
A distributed, eventually consistent NoSQL database cluster with sufficient quorum

D.
A GPU-accelerated video rendering platform that retrieves and stores videos in a storage bucket

A

D.
A GPU-accelerated video rendering platform that retrieves and stores videos in a storage bucket

https://cloud.google.com/compute/docs/instances/preemptible

10
Q

50.
Your organization recently adopted a container-based workflow for application development. Your team develops numerous applications that are deployed continuously through an automated build pipeline to a Kubernetes cluster in the production environment. The security auditor is concerned that developers or operators could circumvent automated testing and push code changes to production without approval. What should you do to enforce approvals?

A. Configure the build system with protected branches that require pull request approval

B. Use an Admission Controller to verify that incoming requests originate from approved sources

C. Leverage Kubernetes Role-Based Access Control (RBAC) to restrict access to only approved users

D. Enable binary authorization inside the Kubernetes cluster and configure the build pipeline as an attestor

A

D. Enable binary authorization inside the Kubernetes cluster and configure the build pipeline as an attestor

“This question is a little strange, but first we need to remove the invalid answers:

B: Incorrect. An admission controller is a piece of code that intercepts requests to the Kubernetes API server prior to persistence of the object, but after the request is authenticated and authorized. (It is a security control, but it does not enforce approvals.)
C: Incorrect. We need to enforce approvals; RBAC roles only apply inside the cluster, and operators could still push to production without approval.
A: Incorrect. Protected branches with pull request approval sound reasonable, but they do not enforce that deployments actually go through the approved pipeline.
D: Correct. Images that have not gone through the approved pipeline are not signed, so code cannot reach production without approval.”
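A sketch of the corresponding Binary Authorization policy, assuming an attestor named build-pipeline-attestor has been created for the build pipeline (all names are placeholders):

  defaultAdmissionRule:
    evaluationMode: REQUIRE_ATTESTATION
    enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
    requireAttestationsBy:
    - projects/my-project/attestors/build-pipeline-attestor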

11
Q

51.
You support a stateless web-based API that is deployed on a single Compute Engine instance in the europe-west2-a zone. The Service Level Indicator (SLI) for service availability is below the specified Service Level Objective (SLO). A postmortem has revealed that requests to the API regularly time out. The timeouts are due to the API receiving a high number of requests and running out of memory. You want to improve service availability.
What should you do?

A. Change the specified SLO to match the measured SLI

B. Move the service to higher-specification compute instances with more memory

C. Set up additional service instances in other zones and load balance the traffic between all instances

D. Set up additional service instances in other zones and use them as a failover in case the primary instance is unavailable

A

C. Set up additional service instances in other zones and load balance the traffic between all instances
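One way to sketch this is a regional managed instance group that is then added as a backend of a load balancer; all names and thresholds below are placeholders:

  gcloud compute instance-groups managed create api-mig \
      --region=europe-west2 --template=api-template --size=3
  gcloud compute instance-groups managed set-autoscaling api-mig \
      --region=europe-west2 --min-num-replicas=3 --max-num-replicas=10 \
      --target-cpu-utilization=0.6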

12
Q

52.
You are running a real-time gaming application on Compute Engine that has a production and a testing environment. Each environment has its own Virtual Private Cloud (VPC) network. The application frontend and backend servers are located on different subnets in the environment’s VPC. You suspect there is a malicious process communicating intermittently on your production frontend servers. You want to ensure that network traffic is captured for analysis.
What should you do?

A.
Enable VPC Flow Logs on the production VPC network frontend and backend subnets only, with a sample volume scale of 0.5.

B.
Enable VPC Flow Logs on the production VPC network frontend and backend subnets only, with a sample volume scale of 1.0.

C.
Enable VPC Flow Logs on the testing and production VPC network frontend and backend subnets with a volume scale of 0.5. Apply changes in testing before production.

D.
Enable VPC Flow Logs on the testing and production VPC network frontend and backend subnets with a volume scale of 1.0. Apply changes in testing before production.

A

B.
Enable VPC Flow Logs on the production VPC network frontend and backend subnets only, with a sample volume scale of 1.0.

https://cloud.google.com/vpc/docs/flow-logs#log-sampling
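Flow Logs are enabled per subnet; for example (subnet name and region are placeholders):

  gcloud compute networks subnets update prod-frontend-subnet \
      --region=europe-west2 \
      --enable-flow-logs \
      --logging-flow-sampling=1.0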

13
Q

53.
Your team of Infrastructure DevOps Engineers is growing, and you are starting to use Terraform to manage infrastructure. You need a way to implement code versioning and to share code with other team members.
What should you do?

A.
Store the Terraform code in a version-control system. Establish procedures for pushing new versions and merging with the master.

B.
Store the Terraform code in a network shared folder with child folders for each version release. Ensure that everyone works on different files.

C.
Store the Terraform code in a Cloud Storage bucket using object versioning. Give access to the bucket to every team member so they can download the files.

D.
Store the Terraform code in a shared Google Drive folder so it syncs automatically to every team member’s computer. Organize files with a naming convention that identifies each new version.

A

A.
Store the Terraform code in a version-control system. Establish procedures for pushing new versions and merging with the master.

14
Q

54.
You are using Stackdriver to monitor applications hosted on Google Cloud Platform (GCP). You recently deployed a new application, but its logs are not appearing on the Stackdriver dashboard. You need to troubleshoot the issue.
What should you do?

A.
Confirm that the Stackdriver agent has been installed in the hosting virtual machine

B.
Confirm that your account has the proper permissions to use the Stackdriver dashboard

C.
Confirm that port 25 has been opened in the firewall to allow messages through to Stackdriver

D.
Confirm that the application is using the required client library and the service account key has proper permissions

A

A.
Confirm that the Stackdriver agent has been installed in the hosting virtual machine

“Why not D? Because if A is not in place, D is useless. The question says you are using Stackdriver Monitoring, not that an agent is already installed. You need the agent to export the logs, so the first thing to check is that the agent is installed and running; next the service account, then the client libraries. I hope this clears your doubts.”
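As a quick check, the (legacy) Logging agent can be verified or installed on the VM roughly like this:

  curl -sSO https://dl.google.com/cloudagents/add-logging-agent-repo.sh
  sudo bash add-logging-agent-repo.sh --also-install
  sudo service google-fluentd status   # confirm the agent is running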

15
Q

55.
Your organization recently adopted a container-based workflow for application development. Your team develops numerous applications that are deployed continuously through
an automated build pipeline to the production environment. A recent security audit alerted your team that the code pushed to production could contain vulnerabilities and that
the existing tooling around virtual machine (VM) vulnerabilities no longer applies to the containerized environment. You need to ensure the security and patch level of all code
running through the pipeline. What should you do?

A.
Set up Container Analysis to scan and report Common Vulnerabilities and Exposures.

B.
Configure the containers in the build pipeline to always update themselves before release

C.
Reconfigure the existing operating system vulnerability software to exist inside the container

D.
Implement static code analysis tooling against the Docker files used to create the containers

A

A.

Set up Container Analysis to scan and report Common Vulnerabilities and Exposures
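As a sketch, scanning is enabled per project and results can be inspected from the CLI (the image path is a placeholder):

  gcloud services enable containerscanning.googleapis.com
  gcloud beta container images describe \
      gcr.io/my-project/my-app:latest --show-package-vulnerability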

16
Q

56.
You use Cloud Build to build your application. You want to reduce the build time while minimizing cost and development effort. What should you do?

A.
Use Cloud Storage to cache intermediate artifacts

B.
Run multiple Jenkins agents to parallelize the build

C.
Use multiple smaller build steps to minimize execution time

D.
Use larger Cloud Build virtual machines (VMs) by using the machine-type option.

A

A.
Use Cloud Storage to cache intermediate artifacts

https://cloud.google.com/storage/docs/best-practices

https://cloud.google.com/build/docs/speeding-up-builds#caching_directories_with_google_cloud_storage
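The caching pattern from the linked guide, sketched as Cloud Build steps (bucket and archive names are placeholders):

  steps:
  - name: gcr.io/cloud-builders/gsutil
    args: ['cp', 'gs://my-build-cache/cache.tar.gz', 'cache.tar.gz']
  # ... build steps that reuse cache.tar.gz and refresh it ...
  - name: gcr.io/cloud-builders/gsutil
    args: ['cp', 'cache.tar.gz', 'gs://my-build-cache/cache.tar.gz']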

17
Q

57.
You support a web application that is hosted on Compute Engine. The application provides a booking service for thousands of users. Shortly after the release of a new feature, your monitoring dashboard shows that all users are experiencing latency at login. You want to mitigate the impact of the incident on the users of your service.
What should you do first?

A. Roll back the recent release

B.
Review the Stackdriver monitoring

C.
Upsize the virtual machines running the login services.

D.
Deploy a new release to see whether it fixes the problem

A

A. Roll back the recent release

A rollback is needed first to mitigate the impact. Once that is done, the review of the monitoring data can follow.

18
Q

58.
You are deploying an application that needs to access sensitive information. You need to ensure that this information is encrypted and the risk of exposure is minimal if a breach occurs.
What should you do?

A.
Store the encryption keys in Cloud Key Management Service (KMS) and rotate the keys frequently

B.
Inject the secret at the time of instance creation via an encrypted configuration management system

C.
Integrate the application with a Single sign-on (SSO) system and do not expose secrets to the application

D. Leverage a continuous build pipeline that produces multiple versions of the secret for each instance of the application

A

A.
Store the encryption keys in Cloud Key Management Service (KMS) and rotate the keys frequently

https://cloud.google.com/security-key-management
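For example, a key with automatic rotation can be created like this (key ring, key name, and schedule are placeholders):

  gcloud kms keyrings create app-keyring --location=global
  gcloud kms keys create app-data-key \
      --keyring=app-keyring --location=global --purpose=encryption \
      --rotation-period=30d --next-rotation-time=2025-01-01T00:00:00Z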

19
Q

59.
You encounter a large number of outages in the production systems you support. You receive alerts for all the outages, which wake you up at night. The alerts are due to unhealthy systems that are automatically restarted within a minute. You want to set up a process that would prevent staff burnout while following Site Reliability Engineering practices.
What should you do?

A.
Eliminate unactionable alerts

B.
Create an incident report for each of the alerts.

C.
Distribute the alerts to engineers in different time zones

D.
Redefine the related Service Level Objective so that the error budget is not exhausted

A

A.
Eliminate unactionable alerts

I reckon it’s A. The problem is automatically fixed by a restart of the service after a minute, so engineers don’t really need to be woken up about these events. If it failed multiple times, or if the restart itself failed, then the engineer should be woken up.
https://cloud.google.com/blog/products/management-tools/meeting-reliability-challenges-with-sre-principles

19
Q

60.
You have migrated an e-commerce application to Google Cloud Platform (GCP). You want to prepare the application for the upcoming busy season. What should you do first to prepare for the busy season?

A.
Load test the application to profile its performance for scaling

B.
Enable AutoScaling on the production clusters, in case there is growth.

C.
Pre-provision double the compute power used last season, expecting growth

D. Create a runbook on inflating the disaster recovery (DR) environment if there is growth

A

A.
Load test the application to profile its performance for scaling

https://cloud.google.com/architecture/black-friday-production-readiness#preparation_stage

The objective of the preparation stage is to test the system’s ability to scale for peak user traffic and to document the results. Completing the preparation stage results in architecture refinement to handle peak traffic more efficiently and increase system reliability. This stage also yields procedures for operations and support that help streamline processes for handling the peak event and any issues that might occur. Consider this stage as practice for the peak event from a system and operations perspective.

A is exactly what mentioned above.
B is the step after the preparation stage.

20
Q

61.
You support a web application that runs on App Engine and uses CloudSQL and Cloud Storage for data storage. After a short spike in website traffic, you notice a big increase in latency for all user requests, an increase in CPU use, and an increase in the number of processes running the application. Initial troubleshooting reveals:

• After the initial spike in traffic, load levels returned to normal but users still experience high latency.
• Requests for content from the CloudSQL database and images from Cloud Storage show the same high latency.
• No changes were made to the website around the time the latency increased.
• There is no increase in the number of errors to the users.

You expect another spike in website traffic in the coming days and want to make sure users don’t experience latency.
What should you do?

A.
Upgrade the GCS buckets to Multi-Regional

B.
Enable high availability on the CloudSQL instances,

C.
Move the application from App Engine to Compute Engine

D.
Modify the App Engine configuration to have additional idle instances

A

D.
Modify the App Engine configuration to have additional idle instances

“Scaling App Engine scales the number of instances automatically in response to processing volume. This scaling factors in the automatic_scaling settings that are provided on a per-version basis in the configuration file. A service with basic scaling is configured by setting the maximum number of instances in the max_instances parameter of the basic_scaling setting. The number of live instances scales with the processing volume. You configure the number of instances of each version in that service’s configuration file. The number of instances usually corresponds to the size of a dataset being held in memory or the desired throughput for offline work. You can adjust the number of instances of a manually-scaled version very quickly, without stopping instances that are currently running, using the Modules API set_num_instances function.”

https://cloud.google.com/appengine/docs/standard/python/how-instances-are-managed
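In the App Engine standard environment, idle instances are configured in app.yaml under automatic_scaling; the values below are only illustrative:

  automatic_scaling:
    min_idle_instances: 10
    max_idle_instances: automatic
    min_pending_latency: 30ms
    max_pending_latency: automatic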

21
Q

62.
Your application runs on Google Cloud Platform (GCP). You need to implement Jenkins for deploying application releases to GCP. You want to streamline the release process, lower operational toil, and keep user data secure.
What should you do?

A.
Implement Jenkins on local workstations.

B.
Implement Jenkins on Kubernetes on-premises.

C.
Implement Jenkins on Google Cloud Functions

D.
Implement Jenkins on Compute Engine virtual machines.

A

D.

Implement Jenkins on Compute Engine virtual machines.

22
Q

63.
You are working with a government agency that requires you to archive application logs for seven years. You need to configure Stackdriver to export and store the logs while minimizing costs of storage. What should you do?

A.
Create a Cloud Storage bucket and develop your application to send logs directly to the bucket

B.
Develop an App Engine application that pulls the logs from Stackdriver and saves them in BigQuery.

C.
Create an export in Stackdriver and configure Cloud Pub/Sub to store logs in permanent storage for seven years.

D.
Create a sink in Stackdriver, name it, create a bucket on Cloud Storage for storing archived logs, and then select the bucket as the log export destination.

A

D.
Create a sink in Stackdriver, name it, create a bucket on Cloud Storage for storing archived logs, and then select the bucket as the log export destination.

https://cloud.google.com/logging/docs/routing/overview
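A sketch of the equivalent gcloud commands (bucket name, sink name, and filter are placeholders); the sink’s writer identity still needs write access to the bucket, and an archival storage class keeps long-term storage costs low:

  gsutil mb -c archive -l us-central1 gs://my-log-archive-bucket
  gcloud logging sinks create archive-sink \
      storage.googleapis.com/my-log-archive-bucket \
      --log-filter='resource.type="gae_app"'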

23
Q

64.
You support a trading application written in Python and hosted on App Engine flexible environment. You want to customize the error information being sent to Stackdriver Error Reporting.
What should you do?

A.
Install the Stackdriver Error Reporting library for Python, and then run your code on a Compute Engine VM

B.
Install the Stackdriver Error Reporting library for Python, and then run your code on Google Kubernetes Engine

C.
Install the Stackdriver Error Reporting library for Python, and then run your code on App Engine flexible environment.

D.
Use the Stackdriver Error Reporting API to write errors from your application to ReportedErrorEvent, and then generate log entries with properly formatted error messages in Stackdriver Logging

A

C.
Install the Stackdriver Error Reporting library for Python, and then run your code on App Engine flexible environment.

“The answer is C; the linked page starts with a pip install, which means Python requires installing the library first: pip install --upgrade google-cloud-error-reporting”

https://cloud.google.com/error-reporting/docs/setup/python#app-engine
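A minimal sketch of customized reporting with the client library (service name, version, and the business logic are placeholders):

  from google.cloud import error_reporting

  client = error_reporting.Client(service="trading-app", version="1.0.0")

  def place_order(order):
      try:
          process(order)              # hypothetical business logic
      except Exception:
          # Sends the current exception and stack trace to Error Reporting
          client.report_exception()
          raise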

24
Q

65.
You need to define Service Level Objectives (SLOs) for a high-traffic multi-region web application. Customers expect the application to always be available and have fast response times. Customers are currently happy with the application performance and availability. Based on current measurements, you observe that the 90th percentile of latency is 120ms and the 95th percentile of latency is 275ms over a 28-day window. What latency SLO would you recommend to the team to publish?

A.
90th percentile - 100ms
95th percentile - 250ms

B.
90th percentile - 120ms
95th percentile - 275ms

C.
90th percentile - 150ms
95th percentile - 300ms

D.
90th percentile - 250ms
95th percentile - 400ms

A

C.
90th percentile - 150ms
95th percentile - 300ms

https://sre.google/sre-book/service-level-objectives/
“Don’t pick a target based on current performance”

25
Q

66.
You support a large service with a well-defined Service Level Objective (SLO). The development team deploys new releases of the service multiple times a week. If a major incident causes the service to miss its SLO, you want the development team to shift its focus from working on features to improving service reliability.
What should you do before a major incident occurs?

A.
Develop an appropriate error budget policy in cooperation with all service stakeholders

B.
Negotiate with the product team to always prioritize service reliability over releasing new features.

C.
Negotiate with the development team to reduce the release frequency to no more than once a week.

D.
Add a plugin to your Jenkins pipeline that prevents new releases whenever your service is out of SLO

A

A.
Develop an appropriate error budget policy in cooperation with all service stakeholders

Reason: the incident has not occurred yet, even though the development team is already pushing new features multiple times a week.
Option A says to define an error budget “policy”, not to define the error budget (that is already present). It simply means bringing in all stakeholders and deciding how to consume the error budget in a way that balances feature deployment and reliability.

26
Q

67.
Your company is developing applications that are deployed on Google Kubernetes Engine (GKE). Each team manages a different application. You need to create the development and production environments for each team, while minimizing costs. Different teams should not be able to access other teams’ environments.
What should you do?

A.
Create one GCP Project per team. In each project, create a cluster for Development and one for Production. Grant the teams IAM access to their respective clusters.

B.
Create one GCP Project per team. In each project, create a cluster with a Kubernetes namespace for Development and one for Production. Grant the teams IAM access to their respective clusters.

C.
Create a Development and a Production GKE cluster in separate projects. In each cluster, create a Kubernetes namespace per team, and then configure Identity-Aware Proxy so that each team can only access its own namespace.

D.
Create a Development and a Production GKE cluster in separate projects. In each cluster, create a Kubernetes namespace per team, and then configure Kubernetes Role-Based Access Control (RBAC) so that each team can only access its own namespace.

A

D.
Create a Development and a Production GKE cluster in separate projects. In each cluster, create a Kubernetes namespace per team, and then configure Kubernetes Role-Based Access Control (RBAC) so that each team can only access its own namespace.

https://cloud.google.com/architecture/prep-kubernetes-engine-for-prod#roles_and_groups
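A sketch of per-team namespace access inside each cluster, assuming Google Groups are used as the RBAC subjects (all names are placeholders):

  apiVersion: rbac.authorization.k8s.io/v1
  kind: RoleBinding
  metadata:
    name: team-a-edit
    namespace: team-a
  subjects:
  - kind: Group
    name: team-a@example.com
    apiGroup: rbac.authorization.k8s.io
  roleRef:
    kind: ClusterRole
    name: edit          # built-in edit role, scoped to team-a by this binding
    apiGroup: rbac.authorization.k8s.io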

27
Q

68.
Some of your production services are running in Google Kubernetes Engine (GKE) in the eu-west-1 region. Your build system runs in the us-west-1 region. You want to push the container images from your build system to a scalable registry to maximize the bandwidth for transferring the images to the cluster. What should you do?

A.
Push the images to Google Container Registry (GCR) using the gcr.io hostname

B.
Push the images to Google Container Registry (GCR) using the us.gcr.io hostname

C.
Push the images to Google Container Registry (GCR) using the eu.gcr.io hostname

D.
Push the images to a private image registry running on a Compute Engine instance in the eu-west-1 region

A

C.
Push the images to Google Container Registry (GCR) using the eu.gcr.io hostname

“To maximize the bandwidth for transferring the images to the cluster, the registry needs to be close to the production system where the images are deployed. I would go with C, since the production system is in Europe.”
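For example, the build system would tag and push to the European registry host (project ID is a placeholder):

  docker tag my-app:v1 eu.gcr.io/my-project/my-app:v1
  docker push eu.gcr.io/my-project/my-app:v1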

28
Q

69.
You manage several production systems that run on Compute Engine in the same Google Cloud Platform (GCP) project. Each system has its own set of dedicated Compute Engine instances. You want to know how much it costs to run each of the systems. What should you do?

A.
In the Google Cloud Platform Console, use the Cost Breakdown section to visualize the costs per system

B.
Assign all instances a label specific to the system they run. Configure BigQuery billing export, and query costs per label.

C.
Enrich all instances with metadata specific to the system they run. Configure Stackdriver Logging to export to BigQuery, and query costs based on the metadata.

D.
Name each virtual machine (VM) after the system it runs. Set up a usage report export to a Cloud Storage bucket. Configure the bucket as a source in BigQuery to query costs based on VM name.

A

B.
Assign all instances a label specific to the system they run. Configure BigQuery billing export, and query costs per label.

https://cloud.google.com/billing/docs/how-to/export-data-bigquery
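Once the billing export to BigQuery is enabled, per-system cost can be queried by label; a sketch, where the table name is the placeholder generated by the export and the label key 'system' is an assumption:

  SELECT l.value AS system, ROUND(SUM(cost), 2) AS total_cost
  FROM `my-project.billing.gcp_billing_export_v1_XXXXXX`,
       UNNEST(labels) AS l
  WHERE l.key = 'system'
  GROUP BY system
  ORDER BY total_cost DESC;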

29
Q

70.
You use Cloud Build to build and deploy your application. You want to securely incorporate database credentials and other application secrets into the build pipeline. You also want to minimize the development effort. What should you do?

A.
Create a Cloud Storage bucket and use the built-in encryption at rest. Store the secrets in the bucket and grant Cloud Build access to the bucket.

B.
Encrypt the secrets and store them in the application repository. Store a decryption key in a separate repository and grant Cloud Build access to the repository.

C.
Use client-side encryption to encrypt the secrets and store them in a Cloud Storage bucket. Store a decryption key in the bucket and grant Cloud Build access to the bucket.

D.
Use Cloud Key Management Service (Cloud KMS) to encrypt the secrets and include them in your Cloud Build deployment configuration. Grant Cloud Build access to the KeyRing

A

D.
Use Cloud Key Management Service (Cloud KMS) to encrypt the secrets and include them in your Cloud Build deployment configuration. Grant Cloud Build access to the KeyRing
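A sketch of the KMS-encrypted secret syntax in cloudbuild.yaml; the key path, ciphertext, and docker login step are placeholders:

  steps:
  - name: gcr.io/cloud-builders/docker
    entrypoint: bash
    args: ['-c', 'docker login --username=ci-user --password=$$DB_PASSWORD']
    secretEnv: ['DB_PASSWORD']
  secrets:
  - kmsKeyName: projects/my-project/locations/global/keyRings/ci/cryptoKeys/build-secrets
    secretEnv:
      DB_PASSWORD: <base64 ciphertext produced by gcloud kms encrypt>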

30
Q

71.
You support a popular mobile game application deployed on Google Kubernetes Engine (GKE) across several Google Cloud regions. Each region has multiple Kubernetes clusters. You receive a report that none of the users in a specific region can connect to the application. You want to resolve the incident while following Site Reliability
Engineering practices. What should you do first?

A.
Reroute the user traffic from the affected region to other regions that don’t report issues

B.
Use Stackdriver Monitoring to check for a spike in CPU or memory usage for the affected region

C.
Add an extra node pool that consists of high memory and high CPU machine type instances to the cluster

D.
Use Stackdriver Logging to filter on the clusters in the affected region, and inspect error messages in the logs

A

A.
Reroute the user traffic from the affected region to other regions that don’t report issues

“Google always aims to first stop the impact of an incident, and then find the root cause (unless the root cause just happens to be identified early on).”

31
Q

72.
You are writing a postmortem for an incident that severely affected users. You want to prevent similar incidents in the future. Which two of the following sections should you
include in the postmortem? (Choose two.)

A.
An explanation of the root cause of the incident

B.
A list of employees responsible for causing the incident

C.
A list of action items to prevent a recurrence of the incident

D.
Your opinion of the incident’s severity compared to past incidents

E.
Copies of the design documents for all the services impacted by the incident

A

A, C

32
Q

73.
You are ready to deploy a new feature of a web-based application to production. You want to use Google Kubernetes Engine (GKE) to perform a phased rollout to half of the
web server pods.
What should you do?

A.
Use a partitioned rolling update.

B.
Use Node taints with NoExecute

C.
Use a replica set in the deployment specification

D.
Use a stateful set with parallel pod management policy

A

A.
Use a partitioned rolling update.

https://cloud.google.com/kubernetes-engine/docs/how-to/updating-apps#partitioning_a_rollingupdate
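A sketch of the partition setting; with 10 replicas and partition: 5, only the pods with ordinal 5-9 receive the new image (all names, counts, and the image tag are illustrative):

  apiVersion: apps/v1
  kind: StatefulSet
  metadata:
    name: web-frontend
  spec:
    serviceName: web-frontend
    replicas: 10
    selector:
      matchLabels:
        app: web-frontend
    updateStrategy:
      type: RollingUpdate
      rollingUpdate:
        partition: 5   # pods with ordinal >= 5 are updated; pods 0-4 keep the old version
    template:
      metadata:
        labels:
          app: web-frontend
      spec:
        containers:
        - name: web
          image: gcr.io/my-project/web:v2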

32
Q

74.
You are responsible for the reliability of a high-volume enterprise application. A large number of users report that an important subset of the application’s functionality - a data-intensive reporting feature - is consistently failing with an HTTP 500 error. When you investigate your application’s dashboards, you notice a strong correlation between the failures and a metric that represents the size of an internal queue used for generating reports. You trace the failures to a reporting backend that is experiencing high I/O wait times. You quickly fix the issue by resizing the backend’s persistent disk (PD). Now you need to create an availability Service Level Indicator (SLI) for the report generation feature. How would you define it?

A.
As the I/O wait times aggregated across all report generation backends

B.
As the proportion of report generation requests that result in a successful response

C.
As the application’s report generation queue size compared to a known-good threshold

D.
As the reporting backend PD throughput capacity compared to a known-good threshold

A

B is correct.

The question asks you to “create an availability SLI for the report generation feature” => availability. The other options are not availability SLIs.
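Spelled out as a plain formula (treating any non-5xx response as “successful” is an assumption, not something the question states):

  availability SLI = (report generation requests that return a successful response) / (total report generation requests)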

33
Q

75.
You have an application running in Google Kubernetes Engine. The application invokes multiple services per request but responds too slowly. You need to identify which downstream service or services are causing the delay. What should you do?

A.
Analyze VPC flow logs along the path of the request

B.
Investigate the Liveness and Readiness probes for each service

C.
Create a Dataflow pipeline to analyze service metrics in real time.

D.
Use a distributed tracing framework such as OpenTelemetry or Stackdriver Trace

A

D.

Use a distributed tracing framework such as OpenTelemetry or Stackdriver Trace
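A minimal sketch of instrumenting one request path with OpenTelemetry and exporting the spans to Cloud Trace (requires the opentelemetry-exporter-gcp-trace package; span names are illustrative):

  from opentelemetry import trace
  from opentelemetry.sdk.trace import TracerProvider
  from opentelemetry.sdk.trace.export import BatchSpanProcessor
  from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter

  trace.set_tracer_provider(TracerProvider())
  trace.get_tracer_provider().add_span_processor(
      BatchSpanProcessor(CloudTraceSpanExporter())
  )
  tracer = trace.get_tracer(__name__)

  with tracer.start_as_current_span("handle-request"):
      with tracer.start_as_current_span("call-inventory-service"):
          pass  # the downstream call; its latency shows up as a child span in Trace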

34
Q

76.
You are creating and assigning action items in a postmortem for an outage. The outage is over, but you need to address the root causes. You want to ensure that your team handles the action items quickly and efficiently. How should you assign owners and collaborators to action items?

A.
Assign one owner for each action item and any necessary collaborators.

B.
Assign multiple owners for each item to guarantee that the team addresses items quickly

C.
Assign collaborators but no individual owners to the items to keep the postmortem blameless.

D.
Assign the team lead as the owner for all action items because they are in charge of the SRE team

A

A.
Assign one owner for each action item and any necessary collaborators.

“A is correct. Each action item should have one owner.”

https://sre.google/sre-book/example-postmortem/

35
Q

77.
Your development team has created a new version of their service’s API. You need to deploy the new version of the API with the least disruption to third-party developers and end users of third-party installed applications. What should you do?

A.

  • Introduce the new version of the API.
  • Announce deprecation of the old version of the API.
  • Deprecate the old version of the API.
  • Contact remaining users of the old API.
  • Provide best effort support to users of the old API.
  • Turn down the old version of the API

B.

  • Announce deprecation of the old version of the API.
  • Introduce the new version of the API.
  • Contact remaining users on the old API.
  • Deprecate the old version of the API.
  • Turn down the old version of the API.
  • Provide best effort support to users of the old API.

C.

  • Announce deprecation of the old version of the API.
  • Contact remaining users on the old API
  • Introduce the new version of the API.
  • Deprecate the old version of the API
  • Provide best effort support to users of the old API
  • Turn down the old version of the API

D.

  • Introduce the new version of the API.
  • Contact remaining users of the old API.
  • Announce deprecation of the old version of the API
  • Deprecate the old version of the API.
  • Turn down the old version of the API
  • Provide best effort support to users of the old API.
A

A.

  • Introduce the new version of the API.
  • Announce deprecation of the old version of the API.
  • Deprecate the old version of the API.
  • Contact remaining users of the old API.
  • Provide best effort support to users of the old API.
  • Turn down the old version of the API
36
Q

78.
You are running an application on Compute Engine and collecting logs through Stackdriver. You discover that some personally identifiable information (PII) is leaking into certain log entry fields. You want to prevent these fields from being written in new log entries as quickly as possible. What should you do?

A.
Use the filter_record_transformer Fluentd filter plugin to remove the fields from the log entries in flight

B.
Use the fluent-plugin-record-reformer Fluentd output plugin to remove the fields from the log entries in flight

C.
Wait for the application developers to patch the application, and then verify that the log entries are no longer exposing PII

D.
Stage log entries to Cloud Storage, and then trigger a Cloud Function to remove the fields and write the entries to Stackdriver via the Stackdriver Logging API

A

A.
Use the filter_record_transformer Fluentd filter plugin to remove the fields from the log entries in flight

“Both A and B seem like they would work; however, I will go with A, since record_transformer is included in the Fluentd core and does not require installing a new plugin.”

“The filter_record_transformer filter plugin mutates/transforms incoming event streams in a versatile manner. If there is a need to add/delete/modify events, this plugin is the first filter to try.
It is included in the Fluentd’s core.”

https://docs.fluentd.org/filter/record_transformer
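A sketch of the filter, assuming the leaking fields are called user_email and user_phone and the application’s log tag matches app.** (both are assumptions):

  <filter app.**>
    @type record_transformer
    remove_keys user_email,user_phone
  </filter>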

37
Q

79.
You support a service that recently had an outage. The outage was caused by a new release that exhausted the service’s memory resources. You rolled back the release successfully to mitigate the impact on users. You are now in charge of the post-mortem for the outage. You want to follow Site Reliability Engineering practices when developing the post-mortem.
What should you do?

A.
Focus on developing new features rather than avoiding the outages from recurring

B.
Focus on identifying the contributing causes of the incident rather than the individual responsible for the cause.

C.
Plan individual meetings with all the engineers involved. Determine who approved and pushed the new release to production.

D.
Use the Git history to find the related code commit. Prevent the engineer who made that commit from working on production services.

A

B.

Focus on identifying the contributing causes of the incident rather than the individual responsible for the cause.

38
Q

80.
You support a user-facing web application. When analyzing the application’s error budget over the previous six months, you notice that the application has never consumed more than 5% of its error budget in any given time window. You hold a Service Level Objective (SLO) review with business stakeholders and confirm that the SLO is set appropriately. You want your application’s SLO to more closely reflect its observed reliability. What steps can you take to further that goal while balancing velocity, reliability, and business needs? (Choose two.)

A.
Add more serving capacity to all of your application’s zones.

B.
Have more frequent or potentially risky application releases

C.
Tighten the SLO to match the application’s observed reliability.

D.
Implement and measure additional Service Level Indicators (SLIs) for the application.

E.
Announce planned downtime to consume more error budget and ensure that users are not depending on a tighter SLO

A

D, E

“I vote for D+E if you read ‘The Global Chubby Planned Outage’.”
https://sre.google/sre-book/service-level-objectives/

39
Q

81.
You support a service with a well-defined Service Level Objective (SLO). Over the previous 6 months, your service has consistently met its SLO and customer satisfaction has been consistently high. Most of your service’s operations tasks are automated and few repetitive tasks occur frequently. You want to optimize the balance between reliability and deployment velocity while following site reliability engineering best practices. What should you do? (Choose two.)

A.
Make the service’s SLO more strict

B.
Increase the service’s deployment velocity and/or risk.

C.
Shift engineering time to other services that need more reliability

D. Get the product team to prioritize reliability work over new features.

E. Change the implementation of your Service Level Indicators (SLIs) to increase coverage

A

B, C

https://sre.google/workbook/implementing-slos/#slo-decision-matrix

A: wrong – the SLO is already well-defined and customer satisfaction is high.
E: wrong – changing the SLIs would change how the SLO is measured, and the SLO is already well-defined.

C and D are both valid, but C is the better option because the current product is already quite reliable.