DevOps 1-40 Flashcards

1
Q

1. You support a Node.js application running on Google Kubernetes Engine (GKE) in production. The application makes several HTTP requests to dependent applications. You want to anticipate which dependent applications might cause performance issues. What should you do?

A. Instrument all applications with Stackdriver Profiler.

B. Instrument all applications with Stackdriver Trace and review inter-service HTTP requests.

C. Use Stackdriver Debugger to review the execution of logic within each application to instrument all applications.

D. Modify the Node.js application to log HTTP request and response times to dependent applications. Use Stackdriver Logging to find dependent applications that are performing poorly.

A

B. Instrument all applications with Stackdriver Trace and review inter-service HTTP requests.

“The key phrase is “makes several HTTP requests to dependent applications,” so you need tracing to see the inter-service calls (see the sketch below).

Cloud Trace: find performance bottlenecks in production.

Cloud Profiler: continuous CPU and heap profiling to improve performance and reduce costs.”

2
Q

2.
You created a Stackdriver chart for CPU utilization in a dashboard within your workspace project. You want to share the chart with your Site Reliability Engineering (SRE) team only. You want to ensure you follow the principle of least privilege. What should you do?

A. Share the workspace Project ID with the SRE team. Assign the SRE team the Monitoring Viewer IAM role in the workspace project.

B. Share the workspace Project ID with the SRE team. Assign the SRE team the Dashboard Viewer IAM role in the workspace project.

C. Click “Share chart by URL” and provide the URL to the SRE team. Assign the SRE team the Monitoring Viewer IAM role in the workspace project.

D. Click “Share chart by URL” and provide the URL to the SRE team. Assign the SRE team the Dashboard Viewer IAM role in the workspace project.

A

C. Click “Share chart by URL” and provide the URL to the SRE team. Assign the SRE team the Monitoring Viewer IAM role in the workspace project.

“It’s C. There is no IAM role literally named “Dashboard Viewer”; the closest role is Monitoring Dashboard Configuration Viewer, which only grants read-only access to dashboard configurations. The SRE team needs to view the chart’s data, not the configuration, so they need Monitoring Viewer, and sharing the chart by URL limits what you expose to just that chart.”

3
Q

3. Your organization wants to implement Site Reliability Engineering (SRE) culture and principles. Recently, a service that you support had a limited outage. A manager on another team asks you to provide a formal explanation of what happened so they can act on remediations. What should you do?

A. Develop a postmortem that includes the root causes, resolution, lessons learned, and a prioritized list of action items. Share it with the manager only.

B. Develop a postmortem that includes the root causes, resolution, lessons learned, and a prioritized list of action items. Share it on the engineering organization’s document portal.

C. Develop a postmortem that includes the root causes, resolution, lessons learned, the list of people responsible, and a list of action items for each person. Share it with the manager only.

D. Develop a postmortem that includes the root causes, resolution, lessons learned, the list of people responsible, and a list of action items for each person. Share it on the engineering organization’s document portal.

A

B. Postmortems should be blameless (no list of people responsible), and to maintain a healthy postmortem culture within an organization, it’s important to share postmortems as widely as possible.

4
Q

4. You have a set of applications running on a Google Kubernetes Engine (GKE) cluster, and you are using Stackdriver Kubernetes Engine Monitoring. You are bringing a new containerized application required by your company into production. This application is written by a third party and cannot be modified or reconfigured. The application writes its log information to /var/log/app_messages.log, and you want to send these log entries to Stackdriver Logging. What should you do?

A. Use the default Stackdriver Kubernetes Engine Monitoring agent configuration.

B. Deploy a Fluentd daemonset to GKE. Then create a customized input and output configuration to tail the log file in the application’s pods and write to Stackdriver Logging.

C. Install Kubernetes on Google Compute Engine (GCE) and redeploy your applications. Then customize the built-in Stackdriver Logging configuration to tail the log file in the application’s pods and write to Stackdriver Logging.

D. Write a script to tail the log file within the pod and write entries to standard output. Run the script as a sidecar container with the application’s pod. Configure a shared volume between the containers to allow the script to have read access to /var/log in the application container.

A

B. Deploy a Fluentd daemonset to GKE. Then create a customized input and output configuration to tail the log file in the application’s pods and write to Stackdriver Logging.

5
Q
5. You are running an application in a virtual machine (VM) using a custom Debian image. The image has the Stackdriver Logging agent installed. The VM has the cloud-platform scope. The application is logging information via syslog. You want to use Stackdriver Logging in the Google Cloud Platform Console to visualize the logs. You notice that syslog is not showing up in the “All logs” dropdown list of the Logs Viewer. What is the first thing you should do?

A. Look for the agent’s test log entry in the Logs Viewer

B. Install the most recent version of the Stackdriver agent

C. Verify the VM service account access scope includes the monitoring.write scope

D. SSH to the VM and execute the following commands on your VM: ps ax | grep fluentd.

A

I think D.

Reason: when an instance is created, you can specify which service account the instance uses when calling Google Cloud APIs, and the instance is automatically configured with access scopes (https://cloud.google.com/compute/docs/access/service-). monitoring.write is for publishing metric data and logging.write is for writing Compute Engine logs, and the cloud-platform scope already covers these, so option C does not address the problem.

Considering the above, I believe D is the answer: SSH to the VM and check whether the Logging agent (fluentd) process is actually running.

6
Q

6. You use a multi-step Cloud Build pipeline to build and deploy your application to Google Kubernetes Engine (GKE). You want to integrate with a third-party monitoring platform by performing an HTTP POST of the build information to a webhook. You want to minimize the development effort. What should you do?

A. Add logic to each Cloud Build step to HTTP POST the build information to a webhook.

B. Add a new step at the end of the pipeline in Cloud Build to HTTP POST the build information to a webhook

C. Use Stackdriver Logging to create a logs-based metric from the Cloud Build logs Create an Alert with a Webhook notification type

D. Create a Cloud Pub/Sub push subscription to the Cloud Build cloud-builds PubSub topic to HTTP POST the build information to a webhook.

A

D. Create a Cloud Pub/Sub push subscription to the Cloud Build cloud-builds PubSub topic to HTTP POST the build information to a webhook.

“A: No. There is no built-in way for a build step to make an HTTP request without custom logic, and remember you want to minimize the development effort.
B: Same objection as A.
C: Creating a logs-based metric and an alerting policy is more setup than necessary.
D: Correct.

To receive messages from push subscriptions, use a webhook and process the POST requests that Pub/Sub sends to the push endpoint. For more information about processing these POST requests in App Engine, see Writing and responding to Pub/Sub messages.”

https://cloud.google.com/pubsub/docs/push
https://cloud.google.com/build/docs/subscribe-build-notifications
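To make option D concrete, here is a minimal sketch of a push endpoint that receives cloud-builds messages and forwards build information onward. It assumes Flask and the requests library; the /pubsub/push route and the third-party webhook URL are made-up placeholders.

```python
# Sketch: Pub/Sub push endpoint that forwards Cloud Build notifications
# to a third-party monitoring webhook.
import base64
import json

import requests
from flask import Flask, request

app = Flask(__name__)
THIRD_PARTY_WEBHOOK = "https://example.com/build-events"  # assumed endpoint


@app.route("/pubsub/push", methods=["POST"])
def pubsub_push():
    envelope = request.get_json()
    # Pub/Sub push wraps the message; the build JSON is base64-encoded in "data".
    message = envelope.get("message", {})
    build = json.loads(base64.b64decode(message.get("data", b"")).decode("utf-8"))
    # Forward selected build information to the monitoring platform.
    requests.post(
        THIRD_PARTY_WEBHOOK,
        json={
            "id": build.get("id"),
            "status": build.get("status"),
            "images": build.get("images", []),
        },
        timeout=10,
    )
    return ("", 204)  # Ack the message so Pub/Sub does not redeliver it.


if __name__ == "__main__":
    app.run(port=8080)
```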

7
Q

7.
You use Spinnaker to deploy your application and have created a canary deployment stage in the pipeline. Your application has an in-memory cache that loads objects at start time. You want to automate the comparison of the canary version against the production version. How should you configure the canary analysis?

A. Compare the canary with a new deployment of the current production version.

B. Compare the canary with a new deployment of the previous production version

C. Compare the canary with the existing deployment of the current production version

D. Compare the canary with the average performance of a sliding window of previous production versions.

A

A. Compare the canary with a new deployment of the current production version.

“Ans A

https://spinnaker.io/guides/user/canary/best-practices/#compare-canary-against-baseline-not-against-production
You might be tempted to compare the canary deployment against your current production deployment. Instead always compare the canary against an equivalent baseline, deployed at the same time.

The baseline uses the same version and configuration that is currently running in production, but is otherwise identical to the canary:

Same time of deployment
Same size of deployment
Same type and amount of traffic
In this way, you control for version and configuration only, and you reduce factors that could affect the analysis, like the cache warmup time, the heap size, and so on.”

8
Q

8.
You support a high-traffic web application and want to ensure that the home page loads in a timely manner. As a first step, you decide to implement a Service Level Indicator (SLI) to represent home page request latency with an acceptable page load time set to 100 ms. What is the Google-recommended way of calculating this SLI?

A. Bucketize the request latencies into ranges, and then compute the percentile at 100 ms

B. Bucketize the request latencies into ranges, and then compute the median and 90th percentiles

C. Count the number of home page requests that load in under 100 ms, and then divide by the total number of home page requests.

D. Count the number of home page requests that load in under 100 ms, and then divide by the total number of all web application requests.

A

C. Count the number of home page requests that load in under 100 ms, and then divide by the total number of home page requests.

“Ans C

https://sre.google/workbook/implementing-slos/
In the SRE principles book, it’s recommended to treat the SLI as the ratio of two numbers: the number of good events divided by the total number of events. For example:
Number of successful HTTP requests / total HTTP requests (success rate)”
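As a tiny illustration of the good-events/total-events ratio, here is a sketch in Python; the function name and the sample latencies are made up.

```python
def latency_sli(latencies_ms, threshold_ms=100):
    """Good-events / total-events SLI: fraction of home page requests under the threshold."""
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms) if latencies_ms else 1.0


# Example: 97 of 100 home page requests load in under 100 ms -> SLI = 0.97
print(latency_sli([80] * 97 + [150] * 3))
```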

9
Q

9.
You deploy a new release of an internal application during a weekend maintenance window when there is minimal user traffic. After the window ends, you learn that one of the new features isn’t working as expected in the production environment. After an extended outage, you roll back the new release and deploy a fix. You want to modify your release process to reduce the mean time to recovery so you can avoid extended outages in the future. What should you do? (Choose two.)

A. Before merging new code, require 2 different peers to review the code changes.

B. Adopt the blue/green deployment strategy when releasing new code via a CD server.

C. Integrate a code linting tool to validate coding standards before any code is accepted into the repository.

D. Require developers to run automated integration tests on their local development environments before release.

E. Configure a CI server. Add a suite of unit tests to your code and have your CI server run them on commit and verify any changes.

A

Ans: B & E

A: No; more peer reviewers improve quality but do not automate anything or shorten recovery.
B: Yes; blue/green deployment through a CD server lets you switch traffic back to the previous environment quickly.
C: No; linting only enforces code formatting standards.
D: No; integration tests are needed, but they should run automatically in the pipeline rather than on developers’ local machines.
E: Yes; a CI server running unit tests on every commit catches regressions before release.

10
Q

10.
You have a pool of application servers running on Compute Engine. You need to provide a secure solution that requires the least amount of configuration and allows developers to easily access application logs for troubleshooting. How would you implement the solution on GCP?

A.
•Deploy the Stackdriver logging agent to the application servers.
•Give the developers the IAM Logs Viewer role to access Stackdriver and view logs

B.
•Deploy the Stackdriver logging agent to the application servers
•Give the developers the IAM Logs Private Logs Viewer role to access Stackdriver and view logs

C.
•Deploy the Stackdriver monitoring agent to the application servers
•Give the developers the IAM Monitoring Viewer role to access Stackdriver and view metrics

D.
•Install the gsutil command line tool on your application servers
•Write a script using gsutil to upload your application log to a Cloud Storage bucket, and then schedule it to run via cron every 5 minutes.
•Give the developers the IAM Object Viewer access to view the logs in the specified bucket.

A

A.
•Deploy the Stackdriver logging agent to the application servers.
•Give the developers the IAM Logs Viewer role to access Stackdriver and view logs

“Answer A
roles/logging.viewer (Logs Viewer) gives you read-only access to all features of Logging, except Access Transparency logs and Data Access audit logs.

https://cloud.google.com/logging/docs/access-control”

11
Q

11.
You support the backend of a mobile phone game that runs on a Google Kubernetes Engine (GKE) cluster.The application is serving HTTP requests from users. You need to implement a solution that will reduce the network cost. What should you do?

A. Configure the VPC as a Shared VPC Host project

B. Configure your network services on the Standard Tier

C. Configure your Kubernetes cluster as a Private Cluster

D. Configure a Google Cloud HTTP Load Balancer as Ingress

A

D. Configure a Google Cloud HTTP Load Balancer as Ingress.

“A: No; a Shared VPC host project does not reduce serving cost.
B: Nothing says the service is currently on the Premium Tier, so switching to the Standard Tier is not a given saving.
C: A private cluster does not change the cost of serving user traffic.
D: Correct.
Costs associated with a load balancer are charged to the project containing the load balancer components. Because of these benefits, container-native load balancing is the recommended solution for load balancing through Ingress. When NEGs are used with GKE Ingress, the Ingress controller facilitates the creation of all aspects of the L7 load balancer, including the virtual IP address, forwarding rules, health checks, and firewall rules.”

https://cloud.google.com/architecture/best-practices-for-running-cost-effective-kubernetes-applications-on-gke

12
Q

12.
You encountered a major service outage that affected all users of the service for multiple hours. After several hours of incident management, the service returned to normal, and user access was restored. You need to provide an incident summary to relevant stakeholders following the Site Reliability Engineering recommended practices. What should you do first?

A. Call individual stakeholders to explain what happened

B. Develop a post-mortem to be distributed to stakeholders

C. Send the Incident State Document to all the stakeholders

D. Require the engineer responsible to write an apology email to all stakeholders

A

B. Develop a post-mortem to be distributed to stakeholders

“B, blameless postmortem
https://sre.google/sre-book/postmortem-culture/”

13
Q

13.
You are performing a semi-annual capacity planning exercise for your flagship service. You expect a service user growth rate of 10% month-over-month over the next six months. Your service is fully containerized and runs on Google Cloud Platform (GCP), using a Google Kubernetes Engine (GKE) Standard regional cluster on three zones with cluster autoscaler enabled. You currently consume about 30% of your total deployed CPU capacity, and you require resilience against the failure of a zone. You want to ensure that your users experience minimal negative impact as a result of this growth or as a result of zone failure, while avoiding unnecessary costs.

How should you prepare to handle the predicted growth?

A. Verify the maximum node pool size, enable a horizontal pod autoscaler, and then perform a load test to verify your expected resource needs.

B. Because you are deployed on GKE and are using a cluster autoscaler, your GKE cluster will scale automatically, regardless of growth rate.

C. Because you are at only 30% utilization, you have significant headroom and you won’t need to add any additional capacity for this rate of growth

D. Proactively add 60% more node capacity to account for six months of 10% growth rate, and then perform a load test to make sure you have enough capacity

A

A.
Verify the maximum node pool size, enable a horizontal pod autoscaler, and then perform a load test to verify your expected resource needs.

“A: Correct. The Horizontal Pod Autoscaler changes the shape of your Kubernetes workload by automatically increasing or decreasing the number of Pods in response to the workload’s CPU or memory consumption, and verifying the maximum node pool size plus a load test confirms the cluster can actually absorb the growth.
B: Incorrect. The cluster autoscaler only adds nodes when Pods are unschedulable; it does not scale the workload itself, so a horizontal pod autoscaler is still needed.
C: No; hope is not a strategy, and 10% month-over-month compounds to roughly 77% growth over six months (see the arithmetic below).
D: No; pre-provisioning 60% more node capacity keeps resources idle and defeats the goal of avoiding unnecessary cost.”
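For intuition, here is a back-of-the-envelope calculation in Python using the numbers from the question; it is illustrative arithmetic only, not part of the official answer.

```python
# Capacity check with the figures from the question: 30% utilization today,
# 10% month-over-month growth for six months, three zones.
current_utilization = 0.30
monthly_growth = 0.10
months = 6
zones = 3

projected = current_utilization * (1 + monthly_growth) ** months
print(f"Projected utilization after 6 months: {projected:.0%}")  # ~53%

# If one of the three zones fails, the remaining two must absorb all traffic.
during_zone_failure = projected * zones / (zones - 1)
print(f"Utilization during a zone outage: {during_zone_failure:.0%}")  # ~80%
```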

14
Q

14.
Your application images are built and pushed to Google Container Registry (GCR). You want to build an automated pipeline that deploys the application when the image is updated while minimizing the development effort. What should you do?

A. Use Cloud Build to trigger a Spinnaker pipeline

B. Use Cloud Pub/Sub to trigger a Spinnaker pipeline

C. Use a custom builder in Cloud Build to trigger Jenkins pipeline

D. Use Cloud Pub/Sub to trigger a custom deployment service running in Google Kubernetes Engine (GKE).

A

B.
Use Cloud Pub/Sub to trigger a Spinnaker pipeline

B is correct: https://cloud.google.com/architecture/continuous-delivery-toolchain-spinnaker-cloud#triggering_a_spinnaker_pipeline_when_a_docker_image_is_pushed_to_container_registry

15
Q

15.
Your product is currently deployed in three Google Cloud Platform (GCP) zones with your users divided between the zones. You can fail over from one zone to another, but it
causes a 10-minute service disruption for the affected users. You typically experience a database failure once per quarter and can detect it within five minutes. You are cataloging the reliability risks of a new real-time chat feature for your product. You catalog the following information for each risk:
•Mean Time to Detect (MTTD) in minutes
•Mean Time to Repair (MTTR) in minutes
•Mean Time Between Failure (MTBF) in days
•User Impact Percentage

The chat feature requires a new database system that takes twice as long to successfully fail over between zones. You want to account for the risk of the new database failing in one zone. What would be the values for the risk of database failover with the new system?

A.
MTTD: 5
MTTR: 10
MTBF: 90
Impact: 33%

B.
MTTD: 5
MTTR: 20
MTBF: 90
Impact: 33%

C.
MTTD: 5
MTTR: 10
MTBF: 90
Impact: 50%

D.
MTTD: 5
MTTR: 20
MTBF: 90
Impact: 50%
A
B.
MTTD: 5
MTTR: 20
MTBF: 90
Impact: 33%

Detection is unchanged (5 minutes), the new database takes twice as long to fail over (2 × 10 = 20 minutes), it still fails about once per quarter (90 days), and a single-zone failure affects the third of users served from that zone (33%).
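The same values as a quick calculation; the variable names are just for illustration.

```python
# Risk catalog entry for the new database failing in one zone.
mttd_minutes = 5            # detection time is unchanged
mttr_minutes = 2 * 10       # new database takes twice as long to fail over
mtbf_days = 90              # one failure per quarter
user_impact = 1 / 3         # users are split evenly across three zones

print(mttd_minutes, mttr_minutes, mtbf_days, f"{user_impact:.0%}")
```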
16
Q

16.
You are managing the production deployment to a set of Google Kubernetes Engine (GKE) clusters. You want to make sure only images which are successfully built by your trusted CI/CD pipeline are deployed to production. What should you do?

A. Enable Cloud Security Scanner on the clusters

B. Enable Vulnerability Analysis on the Container Registry.

C. Set up the Kubernetes Engine clusters as private clusters

D. Set up the Kubernetes Engine clusters with Binary Authorization

A

D. Set up the Kubernetes Engine clusters with Binary Authorization

“D, because Binary Authorization is a deploy-time security control: it allows only trusted, attested container images to be deployed to GKE.”

17
Q
17. You support an e-commerce application that runs on a large Google Kubernetes Engine (GKE) cluster deployed on-premises and on Google Cloud Platform. The application consists of microservices that run in containers. You want to identify containers that are using the most CPU and memory. What should you do?

A. Use Stackdriver Kubernetes Engine Monitoring.

B. Use Prometheus to collect and aggregate logs per container, and then analyze the results in Grafana.

C. Use the Stackdriver Monitoring API to create custom metrics, and then organize your containers using groups.

D. Use Stackdriver Logging to export application logs to BigQuery, aggregate logs per container, and then analyze CPU and memory consumption.
A

A is correct.

https://cloud.google.com/anthos/clusters/docs/on-prem
GKE on-prem is also called Anthos clusters on VMware.

https://cloud.google.com/anthos/clusters/docs/on-prem/concepts/logging-and-monitoring
You have several logging and monitoring options for your Anthos clusters on VMware:
• Cloud Logging and Cloud Monitoring, enabled by in-cluster agents deployed with Anthos clusters on VMware.
• Prometheus and Grafana, disabled by default.
• Validated configurations with third-party solutions.

In other words, unless there is a special requirement, use the first option: Cloud Logging and Cloud Monitoring. Here we want metrics, so Monitoring (Cloud Monitoring, formerly Stackdriver Monitoring) applies, and because this is GKE, Kubernetes Engine Monitoring is the specific feature to use (https://cloud.google.com/kubernetes-engine-monitoring).

18
Q

18. Your company experiences bugs, outages, and slowness in its production systems. Developers use the production environment for new feature development and bug fixes. Configuration and experiments are done in the production environment, causing outages for users. Testers use the production environment for load testing, which often slows the production systems. You need to redesign the environment to reduce the number of bugs and outages in production and to enable testers to load test new features. What should you do?

A. Create an automated testing script in production to detect failures as soon as they occur.

B. Create a development environment with smaller server capacity and give access only to developers and testers

C. Secure the production environment to ensure that developers can’t change it and set up one controlled update per year

D. Create a development environment for writing code and a test environment for configurations, experiments, and load testing

A

D. Create a development environment for writing code and a test environment for configurations, experiments, and load testing

“D is the answer: keep production separate, use a development environment for writing code, and use a test environment for functionality, configuration, and load testing before deploying to production.”

19
Q

19. You support an application running on App Engine. The application is used globally and accessed from various device types. You want to know the number of connections. You are using Stackdriver Monitoring for App Engine. What metric should you use?

A. flex/connections/current

B. tcp_ssl_proxy/new_connections

C. tcp_ssl_proxy/open_connections

D. flex/instance/connections/current

A

A. flex/connections/current

Ans A
A: App Engine flexible environment metric, reported per version.
B: No; this is a Cloud Load Balancing (TCP/SSL proxy) metric.
C: No; this is a Cloud Load Balancing (TCP/SSL proxy) metric.
D: App Engine metric, but reported per instance.

An App Engine app is made up of a single application resource that consists of one or more services. Each service can be configured to use different runtimes and to operate with different performance settings. Within each service, you deploy versions of that service. Each version then runs within one or more instances, depending on how much traffic you configured it to handle.

Because each version runs within one or more instances, the per-version metric gives the total number of connections, which is what the question asks for.

https://cloud.google.com/monitoring/api/metrics_gcp#gcp-appengine
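For reference, a minimal sketch of reading that metric via the Cloud Monitoring API with the Python client. The project ID and the five-minute window are placeholders, and reading int64_value assumes the gauge is an INT64 value.

```python
# Sketch: list recent values of the App Engine flex current-connections metric.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project-id"  # assumed project ID

now = time.time()
interval = monitoring_v3.TimeInterval(
    {
        "end_time": {"seconds": int(now)},
        "start_time": {"seconds": int(now) - 300},  # last 5 minutes
    }
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "appengine.googleapis.com/flex/connections/current"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    version = series.resource.labels.get("version_id", "unknown")
    latest = series.points[0].value.int64_value if series.points else 0
    print(f"version={version} current_connections={latest}")
```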

20
Q

20.
You support an application deployed on Compute Engine. The application connects to a Cloud SQL instance to store and retrieve data. After an update to the application, users report errors showing database timeout messages. The number of concurrent active users remained stable. You need to find the most probable cause of the database timeout. What should you do?

A. Check the serial port logs of the Compute Engine instance

B. Use Stackdriver Profiler to visualize the resources utilization throughout the application

C. Determine whether there is an increased number of connections to the Cloud SQL instance

D. Use Cloud Security Scanner to see whether your Cloud SQL is under a Distributed Denial of Service (DDoS) attack.

A

B. Use Stackdriver Profiler to visualize the resources utilization throughout the application

Ans: B
Cloud Security Scanner does not cover Cloud SQL or DDoS detection; it is mostly used with App Engine and Compute Engine web applications, so D is out.

I go with Stackdriver Profiler, which visualizes resource utilization throughout the application and can show what changed after the update.

21
Q
21. Your application images are built using Cloud Build and pushed to Google Container Registry (GCR). You want to be able to specify a particular version of your application for deployment based on the release version tagged in source control. What should you do when you push the image?

A. Reference the image digest in the source control tag

B. Supply the source control tag as a parameter within the image name

C. Use Cloud Build to include the release version tag in the application image

D. Use GCR digest versioning to match the image to the tag in source control

A

C. Use Cloud Build to include the release version tag in the application image

“Ans C
Cloud Build provides the following default substitutions:
$TAG_NAME: build.Source.RepoSource.Revision.TagName”
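As an illustration of option C outside of Cloud Build itself, here is a sketch that derives the image tag from the release tag in source control before pushing. The project and app names and the use of the git and docker CLIs here are assumptions; inside a Cloud Build tag trigger the same value is available as the $TAG_NAME substitution (e.g. gcr.io/$PROJECT_ID/app:$TAG_NAME).

```python
# Sketch: tag and push an image using the release version tagged in source control.
import subprocess

release_tag = subprocess.check_output(
    ["git", "describe", "--tags", "--abbrev=0"], text=True
).strip()

image = f"gcr.io/my-project/my-app:{release_tag}"  # placeholder project/app names
subprocess.run(["docker", "build", "-t", image, "."], check=True)
subprocess.run(["docker", "push", image], check=True)
print(f"Pushed {image}; deploy this exact tag to select the release version.")
```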

22
Q
22. You are on-call for an infrastructure service that has a large number of dependent systems. You receive an alert indicating that the service is failing to serve most of its requests, and all of its dependent systems with hundreds of thousands of users are affected. As part of your Site Reliability Engineering (SRE) incident management protocol, you declare yourself Incident Commander (IC) and pull in two experienced people from your team as Operations Lead (OL) and Communications Lead (CL). What should you do next?

A. Look for ways to mitigate user impact and deploy the mitigations to production

B. Contact the affected service owners and update them on the status of the incident

C. Establish a communication channel where incident responders and leads can communicate with each other.

D. Start a postmortem, add incident information, circulate the draft internally, and ask internal stakeholders for input

A

C. Establish a communication channel where incident responders and leads can communicate with each other.

“Ans: C
Prepare Beforehand
In addition to incident response training, it helps to prepare for an incident beforehand. Use the following tips and strategies to be better prepared.

Decide on a communication channel
Decide and agree on a communication channel (Slack, a phone bridge, IRC, HipChat, etc.) beforehand.

Keep your audience informed
Unless you acknowledge that an incident is happening and actively being addressed, people will automatically assume nothing is being done to resolve the issue. Similarly, if you forget to call off the response once the issue has been mitigated or resolved, people will assume the incident is ongoing. You can preempt this dynamic by keeping your audience informed throughout the incident with regular status updates. Having a prepared list of contacts (see the next tip) saves valuable time and ensures you don’t miss anyone.”

https://sre.google/workbook/incident-response/

23
Q

23.
You are developing a strategy for monitoring your Google Cloud Platform (GCP) projects in production using Stackdriver Workspaces. One of the requirements is to be able to
quickly identify and react to production environment issues without false alerts from development and staging projects. You want to ensure that you adhere to the principle of
least privilege when providing relevant team members with access to Stackdriver Workspaces.
What should you do?

A. Grant relevant team members read access to all GCP production projects. Create Stackdriver workspaces inside each project.

B. Grant relevant team members the Project Viewer IAM role on all GCP production projects. Create Stackdriver workspaces inside each project.

C. Choose an existing GCP production project to host the monitoring workspace. Attach the production projects to this workspace. Grant relevant team members read access to the Stackdriver Workspace.

D. Create a new GCP monitoring project and create a Stackdriver Workspace inside it. Attach the production projects to this workspace. Grant relevant team members read access to the Stackdriver Workspace.

A

D. Create a new GCP monitoring project and create a Stackdriver Workspace inside it. Attach the production projects to this workspace. Grant relevant team members read access to the Stackdriver Workspace.

“Answer - D
When you want to manage metrics for multiple projects, we recommend that you create a project to be the scoping project for that metrics scope.”

https://cloud.google.com/monitoring/settings/multiple-projects

24
Q

24.
You currently store the virtual machine (VM) utilization logs in Stackdriver. You need to provide an easy-to-share interactive VM utilization dashboard that is updated in real time and contains information aggregated on a quarterly basis. You want to use Google Cloud Platform solutions. What should you do?

A.
1. Export VM utilization logs from Stackdriver to BigQuery.
2. Create a dashboard in Data Studio.
3. Share the dashboard with your stakeholders.

B.
1. Export VM utilization logs from Stackdriver to Cloud Pub/Sub.
2. From Cloud Pub/Sub, send the logs to a Security Information and Event Management (SIEM) system.
3. Build the dashboards in the SIEM system and share them with your stakeholders.

C.
1. Export VM utilization logs from Stackdriver to BigQuery.
2. From BigQuery, export the logs to a CSV file.
3. Import the CSV file into Google Sheets.
4. Build a dashboard in Google Sheets and share it with your stakeholders.

D.
1. Export VM utilization logs from Stackdriver to a Cloud Storage bucket.
2. Enable the Cloud Storage API to pull the logs programmatically.
3. Build a custom data visualization application.
4. Display the pulled logs in a custom dashboard.

A

A.
1. Export VM utilization logs from Stackdriver to BigQuery.
2. Create a dashboard in Data Studio.
3. Share the dashboard with your stakeholders.

“Answer - A
B and C are ruled out straight away. Between A and D, the requirement is a real-time, easy-to-share interactive dashboard, so D (a custom application over Cloud Storage) can be ruled out.”

https://cloud.google.com/logging/docs/export/configure_export_v2
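A minimal sketch of step 1 with the Python logging client follows; the project, dataset, sink name, and filter are placeholders, and the sink’s writer identity still needs to be granted access to the BigQuery dataset afterwards.

```python
# Sketch: create a log sink that exports VM utilization logs to a BigQuery
# dataset, which Data Studio can then use as a live data source.
from google.cloud import logging

client = logging.Client(project="my-project-id")
sink = client.sink(
    "vm-utilization-to-bq",
    filter_='resource.type="gce_instance"',
    destination="bigquery.googleapis.com/projects/my-project-id/datasets/vm_logs",
)
if not sink.exists():
    sink.create()
    print(f"Created sink {sink.name}; grant its writer identity access to the dataset.")
```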

25
Q

25.
You need to run a business-critical workload on a fixed set of Compute Engine instances for several months. The workload is stable, with the exact amount of resources allocated to it. You want to lower the costs for this workload without any performance implications. What should you do?

A. Purchase Committed Use Discounts.

B. Migrate the instances to a Managed Instance Group

C. Convert the instances to preemptible virtual machines.

D. Create an Unmanaged Instance Group for the instances used to run the workload

A

A. Purchase Committed Use Discounts.

https://cloud.google.com/compute/vm-instance-pricing#general-purpose_machine_type_family

26
Q

26.
You are part of an organization that follows SRE practices and principles. You are taking over the management of a new service from the Development Team, and you conduct a Production Readiness Review (PRR). After the PRR analysis phase, you determine that the service cannot currently meet its Service Level Objectives (SLOs). You want to ensure that the service can meet its SLOs in production. What should you do next?

A.
Adjust the SLO targets to be achievable by the service so you can bring it into production

B.
Notify the development team that they will have to provide production support for the service

C.
Identify recommended reliability improvements to the service to be completed before handover

D.
Bring the service into production with no SLOs and build them when you have collected operational data

A

C.
Identify recommended reliability improvements to the service to be completed before handover

“So C is correct.

According to the SRE book, in the Simple PRR Model the phase after analysis is to select the identified items and complete the improvements and refactoring before the service is handed over to the SRE team.”

https://sre.google/sre-book/evolving-sre-engagement-model/#improvements-and-refactoring-xqsrUdcyO

27
Q

27.
You are running an experiment to see whether your users like a new feature of a web application. Shortly after deploying the feature as a canary release, you receive a spike in the number of 500 errors sent to users, and your monitoring reports show increased latency. You want to quickly minimize the negative impact on users. What should you do first?

A. Roll back the experimental canary release

B. Start monitoring latency, traffic, errors, and saturation

C. Record data for the postmortem document of the incident

D. Trace the origin of 500 errors and the root cause of increased latency.

A

A. Roll back the experimental canary release

“Agree with A; this is why Spinnaker has a “Manual Judgment” stage, so that a canary deployment that looks dangerous can be cancelled immediately.”

28
Q
28. You are responsible for creating and modifying the Terraform templates that define your infrastructure. Because two new engineers will also be working on the same code, you need to define a process and adopt a tool that will prevent you from overwriting each other’s code. You also want to ensure that you capture all updates in the latest version. What should you do?

A.
•Store your code in a Git-based version control system.
•Establish a process that allows developers to merge their own changes at the end of each day.
•Package and upload code to a versioned Cloud Storage bucket as the latest master version.

B.
•Store your code in a Git-based version control system.
•Establish a process that includes code reviews by peers and unit testing to ensure integrity and functionality before integration of code
•Establish a process where the fully integrated code in the repository becomes the latest master version

C.
•Store your code as text files in Google Drive in a defined folder structure that organizes the files
•At the end of each day, confirm that all changes have been captured in the files within the folder structure
•Rename the folder structure with a predefined naming convention that increments the version

D.
•Store your code as text files in Google Drive in a defined folder structure that organizes the files.
•At the end of each day, confirm that all changes have been captured in the files within the folder structure and create a new zip archive with a predefined naming convention.
•Upload the zip archive to a versioned Cloud Storage bucket and accept it as the latest version.

A

B.
•Store your code in a Git-based version control system.
•Establish a process that includes code reviews by peers and unit testing to ensure integrity and functionality before integration of code
•Establish a process where the fully integrated code in the repository becomes the latest master version

“B: peer review and a source code management tool are required.”

29
Q

29.
You support a high-traffic web application with a microservice architecture. The home page of the application displays multiple widgets containing content such as the current weather, stock prices, and news headlines. The main serving thread makes a call to a dedicated microservice for each widget and then lays out the homepage for the user. The microservices occasionally fail; when that happens, the serving thread serves the homepage with some missing content. Users of the application are unhappy if this degraded mode occurs too frequently, but they would rather have some content served instead of no content at all. You want to set a Service Level Objective (SLO) to ensure that the user experience does not degrade too much.
What Service Level Indicator (SLI) should you use to measure this?

A. A quality SLI; the ratio of non-degraded responses to total responses

B. An availability SLI; the ratio of healthy microservices to the total number of microservices

C. A freshness SLI: the proportion of widgets that have been updated within the last 10 minutes.

D. A latency SLI: the ratio of microservice calls that complete in under 100 ms to the total number of microservice calls.

A

A. A quality SLI; the ratio of non-degraded responses to total responses

“Ans: A
Quality as an SLI
Quality is a helpful SLI for complex services that are designed to fail gracefully by degrading when dependencies are slow or unavailable. The SLI for quality is defined as follows:

The proportion of valid requests served without degradation of service.”

https://cloud.google.com/architecture/adopting-slos
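A small sketch of a quality SLI computed from per-response records follows; the field names status and missing_widgets are made-up placeholders.

```python
def quality_sli(responses):
    """Quality SLI: proportion of valid responses served without degradation."""
    valid = [r for r in responses if r["status"] == 200]
    non_degraded = [r for r in valid if not r["missing_widgets"]]
    return len(non_degraded) / len(valid) if valid else 1.0


# Example: 2 of 3 valid responses rendered every widget -> SLI ~= 0.67
print(quality_sli([
    {"status": 200, "missing_widgets": 0},
    {"status": 200, "missing_widgets": 2},
    {"status": 200, "missing_widgets": 0},
]))
```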

30
Q

30.
You support a multi-region web service running on Google Kubernetes Engine (GKE) behind a Global HTTP/S Cloud Load Balancer (CLB). For legacy reasons, user requests first go through a third-party Content Delivery Network (CDN), which then routes traffic to the CLB. You have already implemented an availability Service Level Indicator (SLI) at the CLB level. However, you want to increase coverage in case of a potential load balancer misconfiguration, CDN failure, or other global networking catastrophe. Where should you measure this new SLI? (Choose two.)

A. Your application servers’ logs.

B. Instrumentation coded directly in the client.

C. Metrics exported from the application servers.

D. GKE health checks for your application servers

E. A synthetic client that periodically sends simulated user requests

A

Ans B, E

https://cloud.google.com/architecture/adopting-slos#choosing_a_measurement_method
B > Using client instrumentation.
E > Implementing synthetic testing.
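Here is a minimal sketch of option E, a synthetic client that periodically sends a simulated user request through the same CDN-fronted URL; the URL, the probe interval, and printing instead of writing a metric are placeholders.

```python
# Sketch: a synthetic prober that records end-to-end availability from
# outside the serving infrastructure.
import time

import requests

TARGET = "https://www.example.com/"  # assumed public endpoint behind the CDN
INTERVAL_SECONDS = 60


def probe_once():
    try:
        response = requests.get(TARGET, timeout=5)
        return response.status_code == 200
    except requests.RequestException:
        return False


while True:
    ok = probe_once()
    # In practice, write this result as a custom metric or structured log entry
    # so it can feed the SLI; printing keeps the sketch self-contained.
    print({"probe_success": ok, "timestamp": time.time()})
    time.sleep(INTERVAL_SECONDS)
```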

31
Q

31.
Your team is designing a new application for deployment into Google Kubernetes Engine (GKE). You need to set up monitoring to collect and aggregate various application-level metrics in a centralized location. You want to use Google Cloud Platform services while minimizing the amount of work required to set up monitoring. What should you do?

A. Publish various metrics from the application directly to the Stackdriver Monitoring API, and then observe these custom metrics in Stackdriver.

B. Install the Cloud Pub/Sub client libraries, push various metrics from the application to various topics, and then observe the aggregated metrics in Stackdriver

C. Install the OpenTelemetry client libraries in the application, configure Stackdriver as the export destination for the metrics, and then observe the application’s metrics in
Stackdriver

D. Emit all metrics in the form of application-specific log messages, pass these messages from the containers to the Stackdriver logging collector, and then observe
metrics in Stackdriver.

A

A. Publish various metrics from the application directly to the Stackdriver Monitoring API, and then observe these custom metrics in Stackdriver.

https://cloud.google.com/kubernetes-engine/docs/concepts/custom-and-external-metrics#custom_metrics
https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/custom-metrics-stackdriver-adapter/README.md

Your application can report a custom metric to Cloud Monitoring, and you can configure Kubernetes to respond to these metrics and scale your workload automatically. Before you can use custom metrics, you must enable Monitoring in your Google Cloud project and install the Stackdriver adapter on your cluster. After custom metrics are exported to Monitoring, they can trigger autoscaling events by the Horizontal Pod Autoscaler to change the shape of the workload.
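A minimal sketch of writing one custom metric point with the Python Monitoring client follows; the project ID, the custom metric type name, and the gce_instance resource labels are placeholders.

```python
# Sketch: publish a single point of a custom metric to the Cloud Monitoring API.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project-id"  # assumed project ID

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/app/queue_depth"  # assumed metric name
series.resource.type = "gce_instance"
series.resource.labels["instance_id"] = "1234567890123456789"
series.resource.labels["zone"] = "us-central1-f"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
)
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 42}})
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])
```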

32
Q

32.
Your application artifacts are being built and deployed via a CI/CD pipeline. You want the CI/CD pipeline to securely access application secrets. You also want to more easily rotate secrets in case of a security breach.
What should you do?

A. Prompt developers for secrets at build time. Instruct developers to not store secrets at rest.

B. Store secrets in a separate configuration file on Git. Provide select developers with access to the configuration file.

C. Store secrets in Cloud Storage encrypted with a key from Cloud KMS. Provide the CI/CD pipeline with access to Cloud KMS via IAM.

D. Encrypt the secrets and store them in the source code repository. Store a decryption key in a separate repository and grant your pipeline access to it.

A

C. Store secrets in Cloud Storage encrypted with a key from Cloud KMS. Provide the CI/CD pipeline with access to Cloud KMS via IAM.

https://cloud.google.com/security-key-management
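Here is a minimal sketch of how a CI/CD step could fetch and decrypt a secret stored in Cloud Storage and encrypted with a Cloud KMS key. The bucket, object, and key names are placeholders; the pipeline's service account would need roles/cloudkms.cryptoKeyDecrypter and read access to the bucket.

```python
# Sketch: download an encrypted secret from Cloud Storage and decrypt it with
# Cloud KMS inside a pipeline step.
from google.cloud import kms, storage

BUCKET = "my-secrets-bucket"
OBJECT = "app/db-password.enc"
KEY_NAME = "projects/my-project/locations/global/keyRings/ci-ring/cryptoKeys/ci-key"

ciphertext = storage.Client().bucket(BUCKET).blob(OBJECT).download_as_bytes()
response = kms.KeyManagementServiceClient().decrypt(
    request={"name": KEY_NAME, "ciphertext": ciphertext}
)
db_password = response.plaintext.decode("utf-8")
# Use db_password in the deployment step. Rotating the secret only requires
# re-encrypting and replacing the object, or rotating the KMS key version.
```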

33
Q

33. You support a production service that runs on a single Compute Engine instance. You regularly need to spend time on recreating the service by deleting the crashing instance and creating a new instance based on the relevant image. You want to reduce the time spent performing manual operations while following Site Reliability Engineering principles. What should you do?

A. File a bug with the development team so they can find the root cause of the crashing instance.

B. Create a Managed Instance Group with a single instance and use health checks to determine the system status.

C. Add a Load Balancer in front of the Compute Engine instance and use health checks to determine the system status.

D. Create a Stackdriver Monitoring dashboard with SMS alerts to be able to start recreating the crashed instance promptly after it crashes.

A

B. A Managed Instance Group with autohealing health checks detects a crashed VM and immediately recreates it, removing the manual work.

34
Q

34.
Your company follows Site Reliability Engineering practices. You are the person in charge of Communications for a large, ongoing incident affecting your customer-facing applications. There is still no estimated time for a resolution of the outage. You are receiving emails from internal stakeholders who want updates on the outage, as well as emails from customers who want to know what is happening. You want to efficiently provide updates to everyone affected by the outage. What should you do?

A.
Focus on responding to internal stakeholders at least every 30 minutes. Commit to “next update” times.

B.
Provide periodic updates to all stakeholders in a timely manner. Commit to a “next update” time in all communications.

C.
Delegate the responding to internal stakeholder emails to another member of the Incident Response Team. Focus on providing responses directly to customers.

D.
Provide all internal stakeholder emails to the Incident Commander, and allow them to manage internal communications. Focus on providing responses directly to customers.

A

B.
Provide periodic updates to all stakeholders in a timely manner. Commit to a “next update” time in all communications.

“Ans: B (the Communications Lead should keep all audiences informed rather than hand part of that responsibility away)

When disaster strikes, the person who declares the incident typically steps into the IC role and directs the high-level state of the incident. The IC concentrates on the 3Cs and does the following:

Commands and coordinates the incident response, delegating roles as needed. By default, the IC assumes all roles that have not been delegated yet.
Communicates effectively.
Stays in control of the incident response.
Works with other responders to resolve the incident.”

https://sre.google/workbook/incident-response/

35
Q

35.
Your team uses Cloud Build for all CI/CD pipelines. You want to use the kubectl builder for Cloud Build to deploy new images to Google Kubernetes Engine (GKE). You need to authenticate to GKE while minimizing development effort.
What should you do?

A.
Assign the Container Developer role to the Cloud Build service account

B.
Specify the Container Developer role for Cloud Build in the cloudbuild.yaml file.

C.
Create a new service account with the Container Developer role and use it to run Cloud Build

D.
Create a separate step in Cloud Build to retrieve service account credentials and pass these to kubectl

A

A.
Assign the Container Developer role to the Cloud Build service account

Ans A
https://cloud.google.com/build/docs/securing-builds/configure-user-specified-service-accounts

36
Q

36.
You support an application that stores product information in cached memory. For every cache miss, an entry is logged in Stackdriver Logging. You want to visualize how often a cache miss happens over time.
What should you do?

A.
Link Stackdriver Logging as a source in Google Data Studio. Filter the logs on the cache misses.

B.
Configure Stackdriver Profiler to identify and visualize when the cache misses occur based on the logs.

C.
Create a logs-based metric in Stackdriver Logging and a dashboard for that metric in Stackdriver Monitoring.

D.
Configure BigQuery as a sink for Stackdriver Logging. Create a scheduled query to filter the cache miss logs and write them to a separate table.

A

C.
Create a logs-based metric in Stackdriver Logging and a dashboard for that metric in Stackdriver Monitoring

Ans C: https://cloud.google.com/logging/docs/logs-based-metrics#counter-metric
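A minimal sketch of creating such a counter logs-based metric with the Python logging client follows; the metric name and the log filter are placeholders for whatever the application actually logs, and the metric can then be charted on a Monitoring dashboard.

```python
# Sketch: create a counter logs-based metric for cache-miss log entries.
from google.cloud import logging

client = logging.Client(project="my-project-id")
metric = client.metric(
    "cache_miss_count",
    filter_='resource.type="gce_instance" AND textPayload:"cache miss"',
    description="Counts log entries written on a product-cache miss",
)
if not metric.exists():
    metric.create()
```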

37
Q

37.
You need to deploy a new service to production. The service needs to automatically scale using a Managed Instance Group (MIG) and should be deployed over multiple regions. The service needs a large number of resources for each instance and you need to plan for capacity.
What should you do?

A.
Use the n1-highcpu-96 machine type in the configuration of the MIG.

B.
Monitor results of Stackdriver Trace to determine the required amount of resources

C.
Validate that the resource requirements are within the available quota limits of each region

D.
Deploy the service in one region and use a global load balancer to route traffic to this region

A

C.
Validate that the resource requirements are within the available quota limits of each region

https://cloud.google.com/compute/quotas
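As an illustration of option C, here is a sketch that checks CPU quota headroom in each target region before sizing the MIG; the project ID, region list, and required CPU count are placeholders.

```python
# Sketch: compare regional CPU quota usage and limits against the planned need.
from google.cloud import compute_v1

PROJECT = "my-project-id"
REGIONS = ["us-central1", "europe-west1"]
REQUIRED_CPUS = 96 * 10  # e.g. ten large instances per region

client = compute_v1.RegionsClient()
for region_name in REGIONS:
    region = client.get(project=PROJECT, region=region_name)
    for quota in region.quotas:
        if quota.metric == "CPUS":
            headroom = quota.limit - quota.usage
            status = "OK" if headroom >= REQUIRED_CPUS else "INSUFFICIENT"
            print(f"{region_name}: CPUS {quota.usage}/{quota.limit} -> {status}")
```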

38
Q

38.
You are running an application on Compute Engine and collecting logs through Stackdriver. You discover that some personally identifiable information (PII) is leaking into certain log entry fields. All PII entries begin with the text userinfo. You want to capture these log entries in a secure location for later review and prevent them from leaking to Stackdriver Logging. What should you do?

A. Create a basic log filter matching userinfo, and then configure a log export in the Stackdriver console with Cloud Storage as a sink.

B.
Use a Fluentd filter plugin with the Stackdriver Agent to remove log entries containing userinfo, and then copy the entries to a Cloud Storage bucket.

C.
Create an advanced log filter matching userinfo, configure a log export in the Stackdriver console with Cloud Storage as a sink, and then configure a log exclusion with userinfo as a filter.

D.
Use a Fluentd filter plugin with the Stackdriver Agent to remove log entries containing userinfo, create an advanced log filter matching userinfo, and then configure a log export in the Stackdriver console with Cloud Storage as a sink.

A

B.
Use a Fluentd filter plugin with the Stackdriver Agent to remove log entries containing userinfo, and then copy the entries to a Cloud Storage bucket.

https://medium.com/google-cloud/fluentd-filter-plugin-for-google-cloud-data-loss-prevention-api-42bbb1308e76

39
Q

39.
You have a CI/CD pipeline that uses Cloud Build to build new Docker images and push them to Docker Hub. You use Git for code versioning. After making a change in the Cloud Build YAML configuration, you notice that no new artifacts are being built by the pipeline. You need to resolve the issue following Site Reliability Engineering practices.
What should you do?

A.
Disable the CI pipeline and revert to manually building and pushing the artifacts.

B.
Change the CI pipeline to push the artifacts to Container Registry instead of Docker Hub.

C.
Upload the configuration YAML file to Cloud Storage and use Error Reporting to identify and fix the issue

D.
Run a Git compare between the previous and current Cloud Build Configuration files to find and fix the bug

A

D.
Run a Git compare between the previous and current Cloud Build Configuration files to find and fix the bug


40
Q

40.
Your company follows Site Reliability Engineering principles. You are writing a postmortem for an incident, triggered by a software change, that severely affected users. You want to prevent severe incidents from happening in the future.
What should you do?

A.
Identify engineers responsible for the incident and escalate to their senior management

B.
Ensure that test cases that catch errors of this type are run successfully before new software releases

C.
Follow up with the employees who reviewed the changes and prescribe practices they should follow in the future.

D.
Design a policy that will require on-call teams to immediately call engineers and management to discuss a plan of action if an incident occurs

A

B.
Ensure that test cases that catch errors of this type are run successfully before new software releases

Agree with B. I found this answer in “Site Reliability Engineering: How Google Runs Production Systems.”