Overall Flashcards
Data System Categories
Database (SQL)
Message Bus (Kafka)
Cache (Redis)
Logging (Logstash)
The interfaces are blurring, e.g. many data systems can provide SQL-style queries, or database-level reliability.
What is a Distributed System?
The functionality of an application is distributed across different components within a system. It is the result of many patterns.
Distribution of Functionality (components)
Reuse/Share/Common Design/Language/Independent
Monolithic vs Micro-services
Divide and Conquer
Separation of Concern
Do one thing and do one thing well
Allows creating a new, special-purpose data system/application from smaller, general-purpose components (Redis, Kafka, MySQL, Elasticsearch)
Fault vs Failure
Faults (HW or SW) refer to issues in one or more components; the system should keep functioning if it is fault-tolerant.
Failure refers to the whole system failing (downtime).
Testing should purposely introduce faults to exercise the system's fault tolerance (e.g. Netflix's Chaos Monkey; see the sketch below).
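A minimal, hypothetical Python sketch of that idea (not Netflix's actual tooling): with_chaos randomly injects faults into a component so a retrying caller's fault tolerance can be exercised. All names and the fault rate are illustrative.

    import random

    def with_chaos(func, fault_rate=0.2):
        # Wrap func so it randomly raises, simulating a component fault.
        def wrapper(*args, **kwargs):
            if random.random() < fault_rate:
                raise ConnectionError("injected fault")
            return func(*args, **kwargs)
        return wrapper

    def fetch_user(user_id):
        return {"id": user_id}  # stand-in for a real component call

    flaky_fetch = with_chaos(fetch_user)

    def fetch_with_retry(user_id, attempts=5):
        # A fault-tolerant caller: individual faults do not become a failure.
        for _ in range(attempts):
            try:
                return flaky_fetch(user_id)
            except ConnectionError:
                continue
        raise RuntimeError("failure: all retries exhausted")

    print(fetch_with_retry(42))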
Response Time =
Delays (queues, network) +
Latency (time spent waiting to be handled) +
Service time
Response time varies even for the same request.
We therefore need to think of response time not as a single number, but as a distribution of values that you can measure (histogram).
Long-Tailed Distribution … the small percentage of requests having a long response time
Small percentage * large customer base ==> a big number of customers experiencing long response times!
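A minimal Python sketch of treating response time as a distribution and reporting percentiles rather than an average. The log-normal samples are synthetic (chosen because they naturally produce a long tail), and the nearest-rank percentile helper is illustrative.

    import random

    # Synthetic response times with a long tail (log-normal), in ms.
    response_times_ms = sorted(random.lognormvariate(3, 0.8) for _ in range(10_000))

    def percentile(sorted_values, p):
        # Nearest-rank percentile; p is in [0, 100].
        idx = min(len(sorted_values) - 1, int(len(sorted_values) * p / 100))
        return sorted_values[idx]

    for p in (50, 95, 99, 99.9):
        print(f"p{p}: {percentile(response_times_ms, p):.1f} ms")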
Load Parameters
Numbers that describe system loads
- CCR (concurrent requests to the service)
- Read/write ratio, volume, etc.
Define various load parameters of your system and how to handle them efficiently
An architecture that scales well for a particular application is built around assumptions
of which operations will be common and which will be rare—the load parameters.
Service SLA/SLO/SLI
Define response-time objectives at p50 (median), p99, or p999.
Long Tail Distribution
Tail Latency (the latency at p999)
Large latency for a small number of requests
Tail Latency Amplification
Users making the most requests are the ones most affected by tail latency, and these heavy users are often the most valuable users
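Tail latency amplification comes from fan-out: when serving one user request requires several parallel backend calls, the user waits for the slowest call, so even a rare slow call dominates at high fan-out. A back-of-the-envelope Python sketch (assuming independent calls):

    # If each backend call lands in the slow tail with probability p,
    # the whole fanned-out request is slow whenever ANY call is slow.
    def p_request_slow(p_call_slow, fanout):
        return 1 - (1 - p_call_slow) ** fanout

    # Even a 1-in-1000 slow call (the p999 tail) dominates at high fan-out:
    for n in (1, 10, 100):
        print(f"fan-out {n:>3}: {p_request_slow(0.001, n):.1%} of requests hit the tail")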
Head of Line Blocking
A server/VM can process only N tasks at the same time (limited by CPUs, threads, etc.).
A task at the head of the queue that takes a long time (e.g. due to a large dataset) adds to the "latency" of the tasks waiting behind it, even though the service time of those waiting tasks would be quick (small datasets).
Mitigation idea: dedicate a certain number of threads to long-running tasks and the rest to small tasks? (see the sketch below)
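A hypothetical Python sketch of that idea with two thread pools; the pool sizes, the estimated_seconds parameter, and the 1-second threshold are all assumptions for illustration.

    from concurrent.futures import ThreadPoolExecutor

    # Long tasks get their own pool, so they can no longer sit at the
    # head of the queue in front of quick tasks (head-of-line blocking).
    long_pool = ThreadPoolExecutor(max_workers=2)   # few slots for heavy tasks
    short_pool = ThreadPoolExecutor(max_workers=8)  # many slots for quick tasks

    def submit(task, estimated_seconds):
        # Route by a (hypothetical) cost estimate for the task.
        pool = long_pool if estimated_seconds > 1.0 else short_pool
        return pool.submit(task)

    future = submit(lambda: "done", estimated_seconds=0.01)
    print(future.result())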
Scale Cube
X: Horizontal scale / scale by duplication (cloning) / Scale Out
Y: Scale by (functional) decomposition / Microservices / Distribution
Z: Scale by data/network partitioning (sharding)
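A minimal Python sketch of the Z-axis: route each record to a partition by hashing its key. N_PARTITIONS and the md5-based routing are illustrative; real systems typically use consistent hashing so partitions can be added without remapping every key.

    import hashlib

    N_PARTITIONS = 4

    def partition_for(key: str) -> int:
        # A stable hash (unlike Python's built-in hash()) so every node
        # routes the same key to the same partition.
        digest = hashlib.md5(key.encode()).digest()
        return int.from_bytes(digest[:4], "big") % N_PARTITIONS

    print(partition_for("user:42"))  # the same key always maps to the same partition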
A big ball of mud: overcomplicated software (many tightly coupled collaborators) with
low maintainability.
Symptoms:
- Large state space (#ifdef or if(edge_case))
- Tangled dependencies (loops, meshes)
- Hacks/workarounds here and there
- Inconsistent naming/conventions
- Hidden assumptions
Remove "accidental complexity", the complexity that comes not from the problem itself but from the implementation of the solution, via:
Design patterns
Abstraction
Clean interface
…
Service mesh
– an infrastructure layer that addresses the cross-cutting concerns of multiple microservices.
The more microservices an application is made of, the more the application needs a service mesh layer.
Without a service mesh,
… each microservice implements business logic and cross-cutting concerns (CCCs) such as logging, caching, security, and load balancing by itself.
With a service mesh,
… many CCCs, like traffic metrics, routing, and encryption, are moved out of the microservice and into a proxy; business logic and business metrics stay in the microservices. Incoming and outgoing requests are transparently routed through the proxies. In addition to this layer of proxies (the data plane), a service mesh adds a so-called control plane, which distributes configuration updates to all proxies and receives metrics collected by the proxies for further processing, e.g. by a monitoring infrastructure such as Prometheus.
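A toy Python sketch of the proxy idea only, not a real mesh (real meshes such as Istio or Linkerd interpose at the network layer via sidecar proxies like Envoy): the service keeps pure business logic, while a wrapper plays the role of a data-plane proxy handling one cross-cutting concern (timing metrics).

    import time

    def order_service(order_id):
        # Business logic only; no cross-cutting concerns in here.
        return {"order": order_id, "status": "shipped"}

    def sidecar(handler):
        # Stand-in for a data-plane proxy: requests pass through
        # transparently while metrics are collected on the way.
        def proxied(request):
            start = time.perf_counter()
            response = handler(request)
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(f"[metrics] {handler.__name__} took {elapsed_ms:.2f} ms")
            return response
        return proxied

    proxied_service = sidecar(order_service)
    print(proxied_service(7))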
WHAT: Observability vs. Monitoring (observability is the broader concept)
Observability: the ability to ask questions from the outside (e.g. Splunk queries) in order to understand the inside of the system.
3 Pillars:
Metrics
Traces
Logs
[ORIGIN] As the Twitter engineering team wrote on their blog a few years ago, the 3 pillars of observability are:
Metrics (alerting)
Traces (across distributed systems and services)
Logs (aggregation/analytics)
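A toy Python sketch exercising all three pillars around a single request; in practice the metric would be scraped by something like Prometheus, the trace ID would flow to a tracing backend, and the structured log would go to an aggregator. All names are illustrative.

    import json, logging, time, uuid

    logging.basicConfig(level=logging.INFO, format="%(message)s")

    request_counter = 0  # metric: a counter a scraper could read

    def handle_request():
        global request_counter
        trace_id = str(uuid.uuid4())  # trace: an ID to correlate services
        start = time.perf_counter()
        request_counter += 1
        # ... business logic would run here ...
        logging.info(json.dumps({     # log: one structured event per request
            "trace_id": trace_id,
            "event": "request_handled",
            "duration_ms": round((time.perf_counter() - start) * 1000, 3),
            "requests_total": request_counter,
        }))

    handle_request()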
https://blog.twitter.com/engineering/en_us/a/2016/observability-at-twitter-technical-overview-part-i.html
https://www.scalyr.com/blog/three-pillars-of-observability/