Enterprise Flashcards

Question

What is validated learning?

Answer 1

Learning (quantitatively) how to build a sustainable business; everything else is a waste of time

Answer 2

No, in agile development the customer is known whereas the specifications are not. In a startup, neither is known

Answer 3

It consists of a single program run by a single process formed from a collection of modules that communicate by procedure calls

Answer 4

All the code is in one language in one place. This makes development easier because it avoids unnecessary translation between many different languages

Answer 5

It is straightforward to build the system under test as a single executable against which a suite of tests can be run automatically. It is then simple to locate the error

Answer 6

Vertical scaling, which involves replacing an existing machine with a new, more powerful one that can run the monolithic system better

Answer 7

Using a centralised database, which makes it easier to ensure the accuracy, completeness, and consistency of data

Answer 8

- Commits and the database is moved to a new consistent state
- Aborts and the database is restored to its previous consistent state

Answer 9

- Atomicity: transactions either succeed completely or fail completely
- Consistency: transactions begin with the database in one consistent state and end in another consistent state
- Isolation: the effect of performing transactions concurrently is the same as performing them sequentially
- Durability: once a transaction has succeeded, its effects persist, even in the presence of system failures
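
The commit/abort behaviour and the atomicity property above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a production design: the `accounts` table, the `transfer` helper, and the account names are all assumptions made up for the example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the debit, so the credit was undone too.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

ok = transfer(conn, "alice", "bob", 60)      # commits: new consistent state
failed = transfer(conn, "alice", "bob", 60)  # aborts: previous state restored
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The second transfer credits bob before the debit fails, yet bob's balance is unchanged afterwards, which is the "fail completely" half of atomicity.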

Answer 10

The first version should always be viewed as a prototype

Answer 11

A complex system that works evolves from a simple system that worked. For monolithic MVPs, it implies starting with a simpler design that can be refined over time

Answer 12

Functionality should not be developed unless it is required to avoid unnecessary complexity

Answer 13

A system's design mirrors the structure of the organisation that created it. For monolithic MVPs, the design will reflect the communication structure of the team.

Answer 14

Structuring the organisation based on the structure of the system design, e.g. one department for each module

Answer 15

Development teams should use their own product to identify and address issues, ensuring its quality and usability

Answer 16

Only what is necessary to learn whether the plan is correct or not

Answer 17

Does it minimize total time through the build-measure-learn loop?

Answer 18

Ones about per-customer behaviours that can be measured

Answer 19

The eBay V2 architecture consisted of a single 3.4 million lines-of-code C++ library, which hit the compiler's limit of 16k methods per class.

Answer 20

- No one starts with microservices.
- Past a certain scale, everyone ends up with microservices.
- Most enterprises never reach the scale where microservices become necessary; fewer than 1% do.

Answer 21

A monolithic architecture is appropriate, designed to be "just enough" to meet near-term, evolving customer needs as cheaply as possible.

Answer 22

"Just enough" architecture employs simple, familiar technology. This is often a rapid prototyping framework such as Ruby on Rails or PHP, allowing quick iterations and minimal complexity.

Answer 23

Buying software is typically faster, cheaper, and better than developing it in-house. Open-source solutions are preferred where possible to avoid unnecessary reinvention.

Answer 24

A microservices architecture is appropriate, allowing teams to design, develop, deploy, and operate their services independently. This provides scalability and flexibility as the enterprise grows.

Answer 25

Incremental changes should be as small as possible. Large changes should be decomposed into smaller ones while maintaining backward/forward compatibility of data and interfaces to minimize disruption.

Answer 26

The system of record is the single service that owns any given piece of data. Any other copies of that data are read-only, non-authoritative cached versions.

Answer 27

The architecture should be stable, focusing on sustainable, incremental improvements in functionality and efficiency rather than large-scale changes.

Answer 28

It consists of multiple programs that run as multiple processes and communicate by sending messages over a network

Answer 29

Representational State Transfer (REST) - a conventional form of the HyperText Transfer Protocol (HTTP)

Answer 30

- Document for file-like resources, where a GET operation reads the resource, a PUT operation updates it, and a DELETE operation deletes it
- Controller for external resources, where a POST operation causes the resources to carry out some task
- Collection for directory-like resources, where a GET operation lists the resources in the directory and a POST operation creates a new one with an invented name
- Store for directory-like resources, where a GET operation lists the resources in the directory and a PUT operation creates a new one with a given name
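
The difference between the Collection and Store archetypes above is who chooses the name of a new resource. The sketch below models only that difference with in-memory objects; the class and method names are hypothetical and no HTTP framework is involved.

```python
import itertools


class Collection:
    """Directory-like resource: POST creates a member with a server-invented name."""

    def __init__(self):
        self._items = {}
        self._ids = itertools.count(1)

    def get(self):
        return sorted(self._items)

    def post(self, body):
        name = str(next(self._ids))  # the server invents the name
        self._items[name] = body
        return name


class Store:
    """Directory-like resource: PUT creates a member with a client-given name."""

    def __init__(self):
        self._items = {}

    def get(self):
        return sorted(self._items)

    def put(self, name, body):
        self._items[name] = body  # the client supplies the name
        return name


orders = Collection()
first = orders.post({"item": "book"})
second = orders.post({"item": "pen"})

favourites = Store()
favourites.put("reading-list", ["book"])
```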

Answer 31

In a distributed database made up of databases accessed by individual microservices. The distributed organisation makes it difficult to ensure that the accuracy, completeness, and consistency of data are maintained

Answer 32

- A coordinator transaction asks a number of participant transactions to vote on whether they are prepared to commit to a change
- Each participant holds locks on its data involved in the transaction until the coordinator decides to commit or abort
- If all participants vote to commit, then the coordinator instructs all participants to commit. Otherwise, if any participant votes to abort or times out, the coordinator instructs all participants to abort
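
The voting protocol above can be sketched as a small in-process simulation. The `Participant` class and `two_phase_commit` function are illustrative names, and the sketch assumes no network, no real locks, and no time-outs; a time-out would simply count as a vote to abort.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: (notionally) take locks, then vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: the coordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: the coordinator broadcasts the same decision to everyone.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision


happy = [Participant("orders"), Participant("payments")]
happy_result = two_phase_commit(happy)

unhappy = [Participant("orders"), Participant("payments", can_commit=False)]
unhappy_result = two_phase_commit(unhappy)
```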

Answer 33

- A coordinator transaction asks each participant to commit or abort in sequence
- Each participant holds locks on its data involved in the transaction only until it decides to commit or abort
- If a participant aborts, the sequence ends immediately and compensating transactions are made to undo the work of those participants that have already committed
- A saga sacrifices atomicity and relies on eventual consistency
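
A saga as described above can be sketched as a list of (action, compensation) pairs run in sequence. The `run_saga` helper and the order-processing steps are hypothetical, and the sketch assumes compensations never fail, which a real system could not take for granted.

```python
def run_saga(steps):
    """Run each action in order; on failure, compensate committed steps in reverse."""
    compensations = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # The sequence ends immediately; undo what already committed.
            for undo in reversed(compensations):
                undo()
            return False
        compensations.append(compensate)
    return True


log = []


def fail_payment():
    raise RuntimeError("payment declined")


steps = [
    (lambda: log.append("reserve stock"), lambda: log.append("release stock")),
    (fail_payment, lambda: log.append("refund payment")),
    (lambda: log.append("ship order"), lambda: log.append("cancel shipment")),
]
result = run_saga(steps)
```

Between "reserve stock" and "release stock" other readers could observe the reservation, which is the loss of atomicity the answer mentions.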

Answer 34

By horizontal scaling, which involves adding machines, each of which can run one or more microservice instances. The number of machines allocated to run one microservice need not be the same as the number allocated to run another

Answer 35

It must be done quickly, as taking too long may lose customers

Answer 36

Initially, the modules of the monolith are put behind a facade, which serves as a proxy that manages all communication with and between them. Subsequently, modules may be replaced by microservices one-at-a-time, updating the facade after each replacement
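
The facade-based migration above can be sketched as a routing table that is updated one module at a time. The `Facade` class, the module names, and `billing_service` are hypothetical; in practice the replacement handler would make a network call to the new microservice.

```python
class Facade:
    """Proxy that routes every call to whichever implementation owns a module."""

    def __init__(self, handlers):
        self._handlers = dict(handlers)

    def call(self, module, *args):
        return self._handlers[module](*args)

    def replace(self, module, handler):
        # Swap one module for its microservice client; callers are unaffected.
        self._handlers[module] = handler


def monolith_billing(order_id):
    return f"monolith billed {order_id}"


def monolith_search(term):
    return f"monolith searched {term}"


def billing_service(order_id):
    # Hypothetical stand-in for a network call to the new microservice.
    return f"service billed {order_id}"


facade = Facade({"billing": monolith_billing, "search": monolith_search})
before = facade.call("billing", 42)
facade.replace("billing", billing_service)  # migrate one module
after = facade.call("billing", 42)
```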

Answer 37

It suggests that engineers tend to overcomplicate their second system. In microservice migration, teams must avoid unnecessary complexity when breaking apart a monolith

Answer 38

Teams should be small enough to be fed with two pizzas. In microservices, small, independent teams are ideal for maintaining and developing individual services efficiently

Answer 39

Code that has become an unstructured mess. Migration to microservices may occur when the monolith has become just this

Answer 40

Coding functionality in a quick and dirty fashion. Migration to a microservice may occur due to getting into technical debt and needing to code properly

Answer 41

- Componentization via services - Organized around business capabilities - Products not Projects - Smart endpoints and dumb pipes - Decentralised governance - Decentralised data management - Infrastructure automation - Design for failure - Evolutionary design

Answer 42

- Independently replaceable - Independently upgradable

Answer 43

The focus is on the customer not on internal metrics.

Answer 44

Endpoints should be smart, with the pipes around them being simple (dumb).

Answer 45

Every service should be responsible for its own data store. You can only talk to another data store through its API

Answer 46

You must assume things are going to break. Each part of the distributed system must be tested to ensure other parts are not affected if something breaks

Answer 47

- Wakeling: Size of the API
- Fowler: There is a wide, undefined range of how big a microservice is (e.g. 4 people with 200 services, or 30 people with 60 services)

Answer 48

- Rapid provisioning - Basic monitoring - Rapid application deployment - DevOps culture

Answer 49

- Monorepo: One giant repo for all microservices; any commit triggers the production of multiple microservices
- Multirepo: One repo per service; any commit triggers the production of a single service

Answer 50

- Feature-based development: Developers create new branches based on the needs of the project. Long-lived feature branches may be created that are merged back weeks or months later
- Trunk-based development: Developers work on a single main branch. Short-lived branches may be created and merged back within minutes

Answer 51

- Run commit tests locally - Wait for commit tests - Avoid commits on a broken build - Never go home on a broken build - Be prepared to revert - Avoid commenting out tests - Take responsibility for breakages

Answer 52

- Step back to safety - Share changes easily - Store changes somewhere safe

Answer 53

- Mono-repo: Everything in one big repository - Multi-repo: Independent things in repositories - Multi-repo': Interdependent things in repositories

Answer 54

- Step back to safety by stepping back all components
- Share changes easily by changing any component
- Store changes somewhere safe by saving all components/dependencies together

Answer 55

The communication between components and the specification of which versions of the components work together are not stored anywhere.

Answer 56

Two solutions to the multi-repo problem are to build independently deployable components that: - Have fixed, well-understood APIs; - Have flexible, backwards/forwards compatible APIs
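
One common way to get a backwards/forwards compatible API is a "tolerant reader": the consumer supplies defaults for fields that older producers omit and ignores fields that newer producers add. The sketch below assumes a hypothetical customer message; the field names are invented for illustration.

```python
def read_customer(message):
    """Read only the fields this consumer understands."""
    return {
        "id": message["id"],              # required in every version
        "name": message["name"],          # required in every version
        "email": message.get("email"),    # added later: tolerate its absence
    }                                     # unknown fields are simply ignored


old_message = {"id": 1, "name": "Ada"}
new_message = {
    "id": 2,
    "name": "Grace",
    "email": "grace@example.com",
    "tier": "gold",  # a field this consumer has never heard of
}

from_old = read_customer(old_message)
from_new = read_customer(new_message)
```

With this style, producer and consumer components can be deployed in either order without breaking each other.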

Answer 57

The solutions to the multi-repo problem provide the three benefits of a version control system because it is possible to:
- Step back to safety by stepping back any component
- Share changes easily by coordinating updates (not easy)
- Store changes somewhere safe by storing components separately

Answer 58

Components cannot be developed independently or deployed independently.

Answer 59

Continuous Integration (CI) is the practice of quickly integrating newly developed code with the rest of the application code. This saves time when the application is ready to be released. This process is usually automated and produces a build artefact at the end of the process.

Answer 60

- The alpha release - The beta release - The release candidate - The release

Answer 61

- The development environment: The work of a single development team is put together. Updated throughout a two-week sprint.
- The staging environment: The work of multiple development teams is put together. Updated at the end of a two-week sprint.
- The production environment: The work of multiple development teams becomes available to customers. Updated when the business considers the time is right.

Answer 62

An approach in software development that emphasizes moving testing activities earlier in the development process for improved software quality, better test coverage, continuous feedback and a faster time to market.

Answer 63

- Unit tests - Service tests - End-to-end tests

Answer 64

Unit tests are run to ensure that functions work properly. There may be thousands of unit tests, performed in seconds by testing frameworks.

Answer 65

Service tests are run to ensure that services work properly. There may be hundreds of service tests, performed in a few minutes by testing frameworks

Answer 66

End-to-end tests are run to ensure that the application works properly. There may be tens of end-to-end tests, performed in several minutes by mimicking user interaction, often through a GUI.

Answer 67

This phenomenon appears when test automation mainly focuses on end-to-end (E2E) tests, with fewer integration tests and even fewer unit tests. In the software testing ice cream cone, the majority of testing is done manually; automated UI tests are a close second, integration tests sit in the middle, and unit tests lag far behind. This is not scalable and is something to avoid.

Answer 68

A test that fails because another service fails

Answer 69

A test that sometimes fails because another service fails - perhaps due to a time-out or race condition

Answer 70

The idea that over time we become so accustomed to things being wrong that we start to accept them as being normal and not a problem. This means that we need to find and eliminate flaky tests as soon as we can before we start to accept failing tests as being normal and not a problem — “it always fails like that”.

Answer 71

A build light indicator displays the current status of a continuous integration pipeline — green when the build is successful, and red when it fails. As the number of build targets increases, build light indicators have to be replaced by monitor screens throughout the building to display the current status of a continuous integration pipeline

Answer 72

An anti-pattern of software development that brings together the pieces of a software system (far too) late.

Answer 73

The point of Rule 1: run commit tests locally is that the deployment pipeline is a valuable shared resource that one should avoid blocking with unnecessary test failures.

Answer 74

The point of Rule 2: Wait for the results is that those who make changes are there ready to fix any problems immediately.

Answer 75

The point of Rule 3: Fix or Revert Failures Within 10 Minutes is to avoid blocking useful progress by others.

Answer 76

The point of Rule 4: If a team mate breaks the rules, revert their changes is to avoid others blocking useful progress.

Answer 77

The point of Rule 5: If someone else notices you caused a failure before you notice, it’s a build sin is to encourage you to pay more attention.

Answer 78

The point of Rule 6: Once commit passes, move on to your next task is that rapid, automated testing frees up time to do new, useful work.

Answer 79

The point of Rule 7: If any test fails, it is the responsibility of the committer is that someone takes responsibility for a failure and its fix

Answer 80

The point of Rule 8: It is the responsibility of everyone who may be responsible to agree who will fix a failure is that someone (of many people) takes responsibility for a failure and its fix.

Answer 81

The point of Rule 9: Monitor the progress of your change is that the software can be rejected as soon as it is shown not to be in a releasable state

Answer 82

The point of Rule 10: address any pipeline failure immediately is to keep the pipeline clear for other changes, whatever that costs.

Answer 83

Continuous Delivery (CD) automatically moves a software product from a source code repository through to the staging environment. At the press of a “release” button, it could be moved on to the production environment for use by customers.

Answer 84

Continuous Deployment (CD) automatically moves a software product from a source code repository to the production environment. Without the need to press a “release” button, it is available for use by customers.

Answer 85

- Create a repeatable process - Automate almost everything - Version control for everything - If it hurts, do it more frequently - Build quality in - Done means released - Everyone is responsible - Continuous improvement

Answer 86

A small percentage of customer traffic is sent to a new, working interface in the production environment. If customers appear unhappy, all customer traffic is sent to the old interface

Answer 87

A small percentage of customer traffic is sent to a new, possibly working version in the production environment. If customers appear unhappy, all customer traffic is sent to the old version.
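
The traffic split above can be sketched by hashing each customer into one of 100 buckets, so that a given customer consistently sees the same version. The `choose_version` function, its parameters, and the use of a customer ID string are assumptions for illustration; real deployments usually do this in a load balancer or service mesh.

```python
import hashlib


def choose_version(customer_id, canary_percent, canary_enabled=True):
    """Return "new" for roughly canary_percent% of customers, else "old"."""
    if not canary_enabled:
        # Customers appear unhappy: send all traffic to the old version.
        return "old"
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return "new" if bucket < canary_percent else "old"


customers = [f"customer-{i}" for i in range(1000)]
on_new = sum(choose_version(c, 5) == "new" for c in customers)
```

Here `on_new` should be near 50 of the 1000 customers, though the exact figure depends on how the IDs happen to hash.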

Answer 88

The production environment (blue) is exchanged with the staging environment (green) - this may be done by updating a routing table. If customers appear unhappy, the exchange is reversed; otherwise, it is made permanent.
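
The exchange above amounts to swapping two entries in a routing table. The sketch below is a toy model with hypothetical names (`Router`, `live`, `idle`); a real routing table would live in a load balancer or DNS.

```python
class Router:
    def __init__(self):
        # "live" serves customers; "idle" holds the other environment.
        self.table = {"live": "blue", "idle": "green"}

    def exchange(self):
        # Swapping the entries both performs the release and, if repeated,
        # reverses it.
        self.table["live"], self.table["idle"] = (
            self.table["idle"],
            self.table["live"],
        )

    def route(self):
        return self.table["live"]


router = Router()
router.exchange()
after_release = router.route()
router.exchange()  # customers appear unhappy: reverse the exchange
after_rollback = router.route()
```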

Answer 89

Through fast, automated feedback on the production readiness of your applications every time there is a change — to code, infrastructure, or configuration.

Answer 90

The condition software should always be in is production-ready or releasable

Answer 91

Continuous delivery helps to avoid the biggest source of waste in the software development process because it is so much easier to get new, experimental features into production.

Answer 92

All the time, not just once the software has been developed.

Answer 93

Everyone is responsible for quality.

Answer 94

Keeping the system working and in a good state is more important than delivering functionality.

Answer 95

Continuous delivery reduces the risk of release because releasing a small, extensively tested change, and being able to revert immediately is not a risky thing to do

Answer 96

Cloud computing is the delivery of computing services—including servers, storage, databases, networking, software, and more—over the internet ("the cloud"). This allows businesses to avoid the costs of owning and maintaining physical data centers and servers.

Answer 97

Cloud providers centralize computing resources in large data centers, allowing them to optimize resource usage, reduce operational costs, and pass savings onto customers. This model mirrors the way electricity utilities function.

Answer 98

A staging environment is a scaled-down replica of the production environment where applications are tested before deployment. In cloud computing, many companies rent staging environments rather than owning them.

Answer 99

- Broad network access - On-demand self-service - Measured service - Rapid elasticity - Resource pooling

Answer 100

It means cloud services are accessible over standard networks, including Virtual Private Networks (VPNs), allowing users to connect from anywhere.

Answer 101

This allows users to provision and manage computing resources as needed without requiring human interaction with the provider, typically through a web interface or API.

Answer 102

Cloud providers monitor and measure resource usage (such as compute power and storage) for billing and optimization purposes.

Answer 103

Rapid elasticity allows users to quickly scale computing resources up or down based on demand, ensuring efficient resource utilization.

Answer 104

Cloud providers allocate virtual machines to physical ones dynamically, enabling multitenancy where multiple customers share the same infrastructure securely.

Answer 105

- Serverful computing - Serverless computing

Answer 106

Serverful computing involves renting virtualized computing resources where users manage their applications and infrastructure.

Answer 107

- Infrastructure-as-a-Service (IaaS): Provides raw computing resources (e.g., virtual machines).
- Platform-as-a-Service (PaaS): Provides computing platforms with built-in tools and services.
- Software-as-a-Service (SaaS): Provides access to applications on a subscription basis.

Answer 108

VMs are software-based emulations of physical computers, managed by a hypervisor, that allow multiple operating systems to run on a single physical server.

Answer 109

Containers are lightweight, isolated environments that run applications without needing a full operating system. They share the host OS kernel, making them more efficient than VMs.

Answer 110

Serverful computing charges customers based on resource allocation on a rental basis, similar to renting a car for transportation.

Answer 111

Microservices in a serverful model can be implemented by running each microservice on either a dedicated virtual machine or within a container.

Answer 112

Serverless computing abstracts infrastructure management away from developers, allowing them to deploy code that runs only when needed, without provisioning or managing servers.

Answer 113

- Backend-as-a-Service (BaaS): Provides pre-built backend services (e.g., authentication, database storage).
- Function-as-a-Service (FaaS): Runs code in response to triggers or events without requiring persistent infrastructure.

Answer 114

Serverless implementations use "hidden" containers to execute function code on-demand, with cloud providers managing scaling and resource allocation.

Answer 115

Serverless computing charges customers based on execution time (pay-as-you-go), similar to paying for a taxi ride rather than renting a car.

Answer 116

Microservices can be implemented by mapping a single microservice to a single function instance or multiple function instances. The latter may introduce maintenance and performance challenges.

Answer 117

- Maintenance complexity (tracking multiple function instances). - Performance issues (cold start delays and instance lifecycle management).

Answer 118

Getting a new server at the FT ready for code to be deployed took:
- In an FT data centre: 120 days
- In an AWS data centre: minutes

Answer 119

One should worry less about vendor lock-in than about moving slowly by choosing to do everything oneself.

Answer 120

The deployment frequency before the FT moved to the cloud was 12 releases per year and afterwards was about 30,000 changes per year.

Answer 121

You do not have to choose between speed and stability — moving fast means breaking things less, and fixing things faster.

Answer 122

You should use a queue to avoid coupling with synchronous calls — producers and consumers are not reliant on each other.
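
The decoupling above can be sketched with Python's standard `queue` and `threading` modules standing in for a message broker. The `run_pipeline` function and the `None` end-of-stream marker are assumptions for illustration; a production system would use a durable queue service between separate processes.

```python
import queue
import threading


def run_pipeline(messages):
    """Send messages through a queue to a consumer running in another thread."""
    q = queue.Queue()
    received = []

    def consumer():
        while True:
            item = q.get()
            if item is None:  # end-of-stream marker (an assumption of this sketch)
                break
            received.append(item.upper())

    worker = threading.Thread(target=consumer)
    worker.start()
    for message in messages:
        # put() on an unbounded queue does not wait for the consumer,
        # so the producer is never blocked by a slow consumer.
        q.put(message)
    q.put(None)
    worker.join()
    return received


result = run_pipeline(["order placed", "order paid"])
```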

Answer 123

One should focus on resilience and redundancy when developing a distributed system.

Answer 124

One should adopt business-focused monitoring because these few key capabilities show that fundamentally, the system is OK.

Answer 125

One should test infrastructure recovery plans because until you do, you cannot be sure that the plan works.

Answer 126

The team that builds a system has to be the one that runs it too, because only the team that works on a system day-to-day has a chance of working out what is wrong with it, and you build things differently if you have to respond at 3am.

Answer 127

DevOps is a set of practices that combines software development (Dev) and IT operations (Ops). Its goal is to shorten the system development life cycle while delivering high-quality software continuously.

Answer 128

DevOps is in a state of flux, with some viewing it as a concrete methodology and others as an evolving concept.

Answer 129

This phrase, attributed to Amazon CTO Werner Vogels, emphasizes that developers should take operational responsibility for their code. This approach fosters better service quality, customer interaction, and continuous improvement through feedback loops.

Answer 130

- Culture - Automation - Lean - Measurement - Sharing

Answer 131

A strong DevOps culture promotes collaboration with shared values, reducing conflicts and fostering innovation.

Answer 132

A blameless culture focuses on learning from mistakes instead of assigning blame. This promotes continuous improvement and knowledge sharing.

Answer 133

At NUMMI, Toyota retrained GM workers with a high-trust, continuous improvement culture, which led to the production of the highest quality cars in America within three months.

Answer 134

Automation minimizes manual tasks, reducing the probability of deployment failures and increasing operational efficiency.

Answer 135

Automation ensures repeatable, documented processes, improving velocity, transparency, and freeing up time for innovation.

Answer 136

Jidoka, or "automation with a human touch," integrates human wisdom into automation. Machines can detect abnormalities and halt processes, while human operators can intervene when necessary.

Answer 137

Lean in DevOps focuses on eliminating waste to enhance efficiency and reduce unnecessary delays.

Answer 138

- Limiting work in progress (WIP) to prevent interruptions. - Reducing handoffs to enhance communication and coordination.

Answer 139

Kanban boards visualize work, helping teams identify inefficiencies such as waiting, overproduction, and unnecessary motion.

Answer 140

Measurement involves obsessively monitoring metrics and logs to detect and resolve problems quickly.

Answer 141

Metrics are recorded values that measure system behavior over time, providing insights into system performance and potential issues.

Answer 142

Toyota marks factory floors in tenths of their length to track bottlenecks and guide managers to areas needing improvement.

Answer 143

Sharing knowledge fosters collaboration between development and operations teams, leading to quicker problem detection and resolution.

Answer 144

Teams can build relationships by inviting members from different departments to meetings, informal gatherings, and problem-solving discussions.

Answer 145

Genchi Genbutsu ("go and see") is a Toyota principle where managers observe processes firsthand, ensuring a deeper understanding of operations and fostering collaboration.

Answer 146

The differing concerns of developers and operators are agility and stability.

Answer 147

DevOps in its purest form is about breaking down the (metaphorical) wall between developers and operators.

Answer 148

One should reduce organisation silos because success comes from cooperation between cross-functional teams.

Answer 149

One should accept failure as normal because any system that humans build is inherently unreliable.

Answer 150

One should implement gradual change because it is hard to find bugs in large, million-line changes.

Answer 151

One should leverage tooling and automation because work must be turned into repeatable patterns that can be automated.

Answer 152

One should measure everything because we must have numbers to support the DevOps investment and there must be clear metrics for success.

Answer 153

SRE reduces organisational silos by sharing ownership with developers and using the same shared tooling and by adopting measures of availability that force conversations between SRE and development.

Answer 154

SRE accepts failure as normal by using Service Level Objectives (SLOs), which force one to admit a system may be unreliable, and by conducting blameless postmortems when that unreliability occurs.

Answer 155

SRE implements gradual change by moving fast to reduce the cost of failure through small iterative deployments.

Answer 156

SRE leverages tooling and automation by ensuring that tasks done manually this year are done automatically next year, thereby eliminating toil.

Answer 157

According to Vargo, SRE measures everything by not only measuring system metrics, such as reliability, but also human metrics, such as the amount of toil

Answer 158

Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. Its main goals are to create scalable and highly reliable software systems. It emerged from Google as a way to ensure that large-scale services remain reliable and scalable.

Answer 159

SRE can be seen as an implementation of DevOps principles. In particular, SRE teams often implement the CALMS principles—Culture, Automation, Lean, Measurement, and Sharing—as a structured way of improving reliability and collaboration between development and operations.

Answer 160

- Culture: Promoting a culture of shared responsibility.
- Automation: Automating repetitive tasks.
- Lean: Reducing waste and inefficiency.
- Measurement: Tracking performance with metrics.
- Sharing: Open communication and knowledge exchange.

Answer 161

SRE often has its own team or embeds engineers within development teams. A key cultural practice is conducting blameless postmortems after incidents, which aim to foster a safe environment for learning rather than assigning blame.

Answer 162

A blameless postmortem is a document created after an incident that focuses on understanding what went wrong and how to prevent it in the future, without blaming individuals. This encourages transparency and continuous improvement.

Answer 163

Organizational learning in SRE involves circulating postmortem reports to improve collective knowledge and resilience. It transforms failures into opportunities for system-wide improvement.

Answer 164

SREs use software to eliminate toil—manual, repetitive operations work that adds little enduring value. Automation allows teams to focus more on engineering tasks than on reactive support.

Answer 165

Toil is work that is manual, repetitive, automatable, tactical, devoid of lasting value, and scales linearly with service growth. Reducing toil through automation is a key objective of SRE.

Answer 166

At companies like Google, SREs are expected to spend at least 50% of their time on engineering work, with the remainder handling support tickets, incidents, and on-call duties.

Answer 167

Pager fatigue occurs when SREs are overwhelmed with too many incidents during on-call shifts. Best practice suggests handling no more than two incidents per 8–12 hour shift to allow thorough resolution and proper postmortems.

Answer 168

SRE reduces waste by limiting work-in-progress using control loops like error budgets, and by polarizing time—clearly separating development and operational tasks by time blocks.

Answer 169

An error budget is defined as the difference between the agreed reliability level (SLO) and the observed reliability. If the system performs within this budget, new features can be released; if not, development is paused to focus on stability.
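
The release rule above can be sketched as a small calculation. The function names, the choice to measure reliability as the fraction of successful requests, and the sign convention (observed minus SLO, so negative means overspent) are assumptions of this sketch rather than a standard definition.

```python
def error_budget_remaining(slo, good_events, total_events):
    """Return observed reliability minus the SLO target (negative = overspent)."""
    observed = good_events / total_events
    return observed - slo


def may_release(slo, good_events, total_events):
    # Within budget: keep releasing features. Overspent: focus on stability.
    return error_budget_remaining(slo, good_events, total_events) >= 0


remaining = error_budget_remaining(0.999, 9995, 10000)
healthy = may_release(0.999, 9995, 10000)
overspent = may_release(0.999, 9980, 10000)
```

With a 99.9% SLO, 9,995 good requests out of 10,000 leaves about 0.05 percentage points of budget, while 9,980 good requests overspends it.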

Answer 170

Polarizing time means dedicating distinct periods solely to development or operations work, reducing context-switching and enhancing focus and productivity.

Answer 171

SREs obsessively monitor a few key metrics chosen based on user needs, intuition, and experience. These metrics guide decisions and inform about service health.

Answer 172

- SLI (Service Level Indicator): A quantitative metric that measures service performance.
- SLO (Service Level Objective): A target value or range for an SLI.
- SLA (Service Level Agreement): A formal agreement that outlines the consequences of meeting or failing to meet an SLO.

Answer 173

Obsessive monitoring ensures that the system remains within acceptable performance boundaries and enables early detection and resolution of problems before they impact users.

Answer 174

Through open communication and the dissemination of knowledge, tools, and techniques. Both development and operations must share insights to align their objectives and enhance service reliability.

Answer 175

Sharing tools ensures consistency in managing environments and enables self-service deployments, reducing dependencies and delays in workflows.

Answer 176

Knowledge sharing ensures that development is aware of operational concerns and vice versa. This bidirectional communication improves system design and reliability.

Answer 177

A good alert is one that is actionable, and is for something that could not be fixed without a human being - if automated remediation is possible, at least try that. A Site Reliability Engineer cares about good alerts, because they lose sleep over bad ones.

Answer 178

A traditional Network Operations Centre (NOC) or war room is seen as a reliability theatre that impresses only the general public. An SRE cares about a reliability theatre because it may limit the effectiveness of incident response.

Answer 179

A snowflake is a production server that is kept running through regular manual configuration tweaks made via the command line. An SRE cares about snowflakes because they are hard to reproduce and debug

Answer 180

Pets are virtual (snowflake) servers with names that need individual attention; cattle are virtual servers with numbers that need group attention; poultry are virtual containers with numbers that need group attention. An SRE cares about pets, cattle and poultry because of their (decreasing) administrative cost.

Answer 181

Autonomous > automated because it is less work. An SRE cares about this because autonomous systems can take away a world of pain from the on-call rotation.

Answer 182

The advantages of embedding an SRE in a development team are that it builds trust between development and SRE, and that SRE gets input into system design from the very beginning

Answer 183

The right number of nines is a decision made on the basis of how much downtime the business can tolerate.
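
To make the trade-off concrete, the downtime implied by a given number of nines can be computed directly. The function name is hypothetical and the sketch assumes a 365-day year with availability measured over wall-clock time.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(nines):
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines = 99.9% available = 0.001
    return MINUTES_PER_YEAR * unavailability


three_nines = downtime_minutes_per_year(3)
five_nines = downtime_minutes_per_year(5)
```

Three nines allows roughly 525.6 minutes (under nine hours) of downtime a year, whereas five nines allows only about 5.3 minutes, so each extra nine cuts the tolerable downtime by a factor of ten.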

Answer 184

It is dangerous to improve a system without revising its Service Level Agreement because customers will consider the delivered level of reliability to be the agreed level