Final Part 3 Flashcards
Why is troubleshooting often viewed as an innate skill that some people have and others don’t?
For those who troubleshoot often, it’s an ingrained process; explaining how to troubleshoot is difficult, much like explaining how to ride a bike.
However, troubleshooting is both learnable and teachable.
What two factors explain why novices are often tripped up by troubleshooting?
An understanding of how to troubleshoot generically (i.e., without any particular system knowledge) and a solid knowledge of the system.
Hypothetico-deductive method
Given a set of observations about a system and a theoretical basis for understanding system behavior, we iteratively hypothesize potential causes for the failure and try to test those hypotheses.
What are the steps in an ideal troubleshooting model?
We’d start with a problem report telling us that something is wrong with the system.
Then we can look at the system’s telemetry and logs to understand its current state.
This information, combined with our knowledge of how the system is built, how it should operate, and its failure modes, enables us to identify some possible causes.
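The hypothesize-and-test loop described above can be sketched roughly as follows. This is only an illustration: the `Hypothesis` class, the `troubleshoot` function, and the telemetry values are hypothetical names and data invented for the example, not part of any real tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hypothesis:
    """A candidate cause plus a check that returns True if the evidence supports it."""
    cause: str
    test: Callable[[dict], bool]


def troubleshoot(observations: dict, hypotheses: list[Hypothesis]) -> Optional[str]:
    """Test each hypothesis against the observations; return the first supported cause."""
    for hypothesis in hypotheses:
        if hypothesis.test(observations):
            return hypothesis.cause
    # Nothing confirmed: gather more observations and form new hypotheses.
    return None


# Hypothetical telemetry, for illustration only.
observations = {"disk_used_pct": 99, "error_rate": 0.2, "recent_deploy": False}
hypotheses = [
    Hypothesis("bad release", lambda o: o["recent_deploy"]),
    Hypothesis("disk full", lambda o: o["disk_used_pct"] > 95),
]
likely_cause = troubleshoot(observations, hypotheses)
```

In practice the list of hypotheses comes from knowledge of how the system is built and how it fails, which is why system knowledge matters as much as the generic method.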
Postmortem
A written record of an incident,
its impact, the actions taken to mitigate or resolve it,
the root cause(s),
and the follow-up actions to prevent the incident from recurring.
Why are Postmortems needed?
Without a formalized process for learning from incidents, they can recur, multiply in complexity, or even cascade, overwhelming a system and its operators and ultimately impacting users.
Reasons to monitor a system include
Analyzing long-term trends
Comparing behavior over time or between experiment groups
Alerting
The four golden signals of monitoring
Latency, traffic, errors, and saturation
If you can only measure four metrics of your user-facing system, focus on these four.
Latency
The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. It’s important to track error latency, as opposed to just filtering out errors.
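A minimal sketch of tracking the two latencies separately, assuming hypothetical request records of the form (latency in milliseconds, HTTP status code). The sample data is made up for illustration.

```python
from statistics import median

# Hypothetical request records: (latency in milliseconds, HTTP status code).
requests = [(120, 200), (95, 200), (110, 200), (5, 500), (2400, 500)]

# Split latencies by outcome instead of discarding the failed requests.
ok_latencies = [ms for ms, status in requests if status < 500]
error_latencies = [ms for ms, status in requests if status >= 500]

ok_median_ms = median(ok_latencies)
error_median_ms = median(error_latencies)
```

Here one failed request returns almost instantly and another is very slow; mixing either into the successful-request figure would distort it, while dropping them would hide the slow error.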
Traffic
A measure of how much demand is being placed on your system, measured in a high-level system-specific metric.
Errors
The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly, or by policy (for example, “If you committed to one-second response times, any request over one second is an error”).
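The three kinds of failure can be counted together along these lines. This is a sketch under stated assumptions: the `is_error` helper, the `body_ok` flag (standing in for a content check), and the one-second threshold are hypothetical choices for the example.

```python
def is_error(status: int, latency_ms: float, body_ok: bool, slo_ms: float = 1000.0) -> bool:
    """Classify a request as failed explicitly, implicitly, or by policy."""
    explicit = status >= 500                   # e.g. HTTP 500s
    implicit = status < 400 and not body_ok    # success status, wrong content
    by_policy = latency_ms > slo_ms            # slower than the committed response time
    return explicit or implicit or by_policy


# Hypothetical requests: (status, latency in ms, whether the body was correct).
requests = [
    (200, 120.0, True),    # healthy
    (500, 30.0, True),     # explicit failure
    (200, 80.0, False),    # implicit failure
    (200, 1500.0, True),   # failure by policy
]
error_rate = sum(is_error(*r) for r in requests) / len(requests)
```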
Saturation
How “full” your service is. A measure of the fraction of your system’s capacity in use, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O).
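One simple way to express this is to report the utilization of the most constrained resource. The sketch below assumes hypothetical (used, capacity) figures; the resource names and numbers are invented for illustration.

```python
# Hypothetical resource usage: name -> (amount used, total capacity).
usage = {
    "cpu_cores": (6.0, 8.0),
    "memory_gb": (28.0, 32.0),
    "disk_io_mbps": (120.0, 400.0),
}

# Utilization is the fraction of each resource currently in use.
utilization = {name: used / capacity for name, (used, capacity) in usage.items()}

# The most constrained resource determines how "full" the service is.
bottleneck = max(utilization, key=utilization.get)
saturation = utilization[bottleneck]
```

With these numbers memory is the bottleneck, so a dashboard for this service would lead with memory rather than CPU or I/O.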
If you measure all four golden signals and page a human when one signal is problematic
Your service will be at least decently covered by monitoring
Why is boring a positive attribute of software?
You don’t want programs to be spontaneous and interesting; you want them to stick to the script and predictably accomplish their business goals.
What is the difference between essential complexity and accidental complexity?
Essential complexity is the complexity inherent in a given situation that cannot be removed from a problem definition, whereas accidental complexity is more fluid and can be resolved with engineering effort.
Jenkins
A continuous integration server written in Java. You can use it to test and report on changes in near real time. It helps developers find and fix bugs in their code quickly and automate the testing of their builds.
Jenkins Features
– Free Open-Source Tool
– Integrates all your DevOps stages with the help of around 1,000 plugins
– Lets you script a pipeline of one or more build jobs as a single workflow
– Starts easily from its WAR file (for example, java -jar jenkins.war)
– Provides multiple interfaces: web-based GUI, CLI, and REST API