Systems Design Flashcards
What is ML systems design?
It takes a systems approach to MLOps, which means that the ML system is considered holistically to ensure that all the components (business requirements, the data stack, infrastructure, deployment, monitoring) and their stakeholders can work together to satisfy the specified objectives and requirements
What are the four general requirements for ML systems?
- Reliability
- Scalability
- Maintainability
- Adaptability
Why is reliability a requirement for ML systems?
The system should continue to perform the correct function at the desired level of performance even in the face of adversity (hardware or software faults, human error). Unlike traditional software, which typically raises an error when something goes wrong, ML systems can fail silently
Why is scalability a requirement for ML systems?
ML systems can grow in multiple ways: complexity (more parameters), traffic volume (more predictions per given time), model count (more use cases). These are examples of resource scaling, but handling growth also includes artifact management
Why is maintainability a requirement for ML systems?
Workloads should be structured and infrastructure set up so that all contributors can work with the tools they prefer. Code should be documented. Code, data, and artifacts should be versioned. Models should be sufficiently reproducible.
Why is adaptability a requirement for ML systems?
To adapt to shifting data distributions and business requirements, the system should have some capacity both for discovering opportunities to improve performance and for allowing updates without service interruption
What are some examples of nonprobability sampling?
- Convenience sampling
- Snowball sampling
- Judgment sampling
- Quota sampling
What is convenience sampling?
A nonprobability sampling method where samples of data are selected based on their availability. This sampling method is popular because it’s convenient
What is snowball sampling?
A nonprobability sampling method where future samples are selected based on existing samples. For example, to scrape legitimate Twitter accounts, you start with a small number of accounts, then you scrape all the accounts they follow, and so on
What is judgment sampling?
A nonprobability sampling method where the experts decide what samples to include
What is quota sampling?
A nonprobability sampling method where samples are selected based on quotas for certain slices of data without any randomization. For example, when the same number of samples are selected per age group for a survey, regardless of the actual age distribution
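The age-group example can be sketched in a few lines. This is a minimal illustration, assuming records arrive as (age_group, respondent_id) tuples; `quota_sample`, the group names, and the respondent IDs are hypothetical. It takes the first records that fit each quota, with no randomization:

```python
from collections import defaultdict

def quota_sample(items, key, quota):
    """Keep the first `quota` items seen for each group (no randomization)."""
    counts = defaultdict(int)
    selected = []
    for item in items:
        group = key(item)
        if counts[group] < quota:
            counts[group] += 1
            selected.append(item)
    return selected

# Hypothetical survey respondents: (age group, respondent id)
respondents = [
    ("18-29", "r1"), ("18-29", "r2"), ("18-29", "r3"),
    ("30-44", "r4"),
    ("45+", "r5"), ("45+", "r6"), ("45+", "r7"),
]
sample = quota_sample(respondents, key=lambda r: r[0], quota=2)
```

Each group contributes at most two respondents, regardless of how many respondents it has in the full list.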
What are some examples of random sampling?
- Simple random sampling
- Stratified sampling
- Weighted sampling
- Reservoir sampling
- Importance sampling
What is simple random sampling?
In this form of random sampling, all samples in the population are given equal probabilities of being selected
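A minimal sketch using Python's standard library, assuming the whole population fits in memory; `simple_random_sample` and the toy population are hypothetical names chosen for illustration:

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k items without replacement; every item is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(population, k)

population = list(range(100))
sample = simple_random_sample(population, k=10, seed=0)
```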
What is a drawback of simple random sampling?
Rare categories of data might not appear in your selection
What is stratified sampling?
A random sampling method where the population is divided into relevant groups and each group is sampled separately. For example, to sample 1% of data that has classes A and B, 1% can be sampled from each class separately. This way, both classes will be included in the selection, no matter how rare class A or B is
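The two-class example can be sketched as follows. This is an illustration rather than a reference implementation: `stratified_sample` is a hypothetical helper, and taking at least one item per group is an added assumption so that a very small class is never rounded down to zero.

```python
import random
from collections import defaultdict

def stratified_sample(items, key, fraction, seed=None):
    """Sample `fraction` of each group separately.

    Assumption: at least one item is taken from every group.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    selected = []
    for members in groups.values():
        n = max(1, round(len(members) * fraction))
        selected.extend(rng.sample(members, n))
    return selected

# 1,000 rows of class A and only 20 rows of class B
data = [("A", i) for i in range(1000)] + [("B", i) for i in range(20)]
sample = stratified_sample(data, key=lambda row: row[0], fraction=0.01, seed=0)
classes_in_sample = {row[0] for row in sample}
```

Both classes appear in `sample` even though class B is rare.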
What is a drawback of stratified sampling?
It is not always possible to divide all samples into groups. For instance, when a sample belongs to multiple groups, as in the case of multilabel tasks
What is weighted sampling?
A random sampling method where each sample is given a weight, which determines the probability of it being selected
What is the advantage of using weighted sampling?
This method leverages domain expertise. For example, if a certain subpopulation of data, such as more recent data, is more valuable to the model and needs to have a higher chance of being selected, a higher weight can be given to this subpopulation. It also helps when the available data follows a different distribution from the real-world data: samples that are underrepresented in the available data can be given a higher weight
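A minimal sketch of the recent-data example using `random.choices` from the standard library, which samples with replacement in proportion to the given weights. The labels and the 1:3 weights are made up for illustration:

```python
import random

rng = random.Random(0)

samples = ["old", "recent"]
weights = [1, 3]  # hypothetical weights: recent data is 3x as likely to be picked

# random.choices draws with replacement, proportionally to `weights`
draws = rng.choices(samples, weights=weights, k=10_000)
recent_share = draws.count("recent") / len(draws)
```

With these weights, `recent_share` should come out close to 0.75.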
What is reservoir sampling?
A random sampling method that is especially useful when dealing with streaming data, where the total number of samples is not known in advance. Reservoir sampling ensures that every sample has an equal probability of being selected, and if the algorithm is stopped at any time, the samples seen so far have been selected with the correct probability
How does the reservoir sampling algorithm work?
The algorithm maintains a reservoir, which can be an array of size k, and consists of three steps:
- Put the first k elements into the reservoir
- For each incoming nth element (n > k), generate a random integer i such that 1 ≤ i ≤ n
- If 1 ≤ i ≤ k: replace the ith element in the reservoir with the nth element. Else, do nothing
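The three steps above can be sketched in Python as follows. The function name and the use of a list as the reservoir are illustrative choices; the code uses 1-based positions to match the description, so the list index is `i - 1`.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Step 1: the first k elements fill the reservoir
            reservoir.append(item)
        else:
            # Step 2: random integer i with 1 <= i <= n (randint is inclusive)
            i = rng.randint(1, n)
            # Step 3: replace the ith reservoir element if i falls within 1..k
            if i <= k:
                reservoir[i - 1] = item
    return reservoir

sample = reservoir_sample(range(1000), k=5, seed=0)
short = reservoir_sample(range(3), k=5, seed=0)  # stream shorter than k
```

The nth element enters the reservoir with probability k/n, which is what keeps the sample uniform over everything seen so far.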
What is importance sampling?
A random sampling method that allows sampling from a distribution when there is only access to another distribution. For example, if distribution P(x) is expensive or infeasible to sample from and Q(x) is easy to sample from, a sample x is drawn from Q(x) and then weighted by P(x)/Q(x). Q(x) is called the proposal distribution or the importance distribution. Q(x) can be any distribution as long as Q(x) > 0 whenever P(x) ≠ 0
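A common use of this weighting is to estimate an expectation under P using only draws from Q. The sketch below is an illustration under assumed distributions: P is a standard normal, Q is a wider normal with standard deviation 2, and the quantity estimated is E_P[x^2], whose true value is 1. The function names are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def importance_estimate(f, n, seed=None):
    """Estimate E_P[f(x)] with P = N(0, 1) using samples from Q = N(0, 2^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)  # draw from the proposal distribution Q
        weight = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)  # P(x)/Q(x)
        total += f(x) * weight
    return total / n

estimate = importance_estimate(lambda x: x * x, n=100_000, seed=0)
```

Here Q is wider than P, so Q(x) > 0 wherever P(x) ≠ 0, and `estimate` should land close to 1.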
What is label ambiguity or label multiplicity?
When there are multiple conflicting labels for a data instance
What is an example of label multiplicity?
When a company uses multiple data sources and relies on multiple annotators with different levels of expertise, who may assign different labels to the same data instance
What is weak supervision?
A labeling technique where labeling functions (LFs) are used to label samples
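As a rough illustration, a labeling function can be written as a small heuristic that either assigns a label or abstains. Everything below is hypothetical: the spam/ham task, the keyword rules, and the majority-vote combiner are assumptions made for this sketch, not a specific library's API.

```python
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_free(text):
    """Hypothetical heuristic: messages mentioning 'free' are likely spam."""
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_starts_with_greeting(text):
    """Hypothetical heuristic: messages opening with a greeting are likely ham."""
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN

def combine(text, lfs):
    """Majority vote over the LFs that did not abstain; ties also abstain."""
    votes = Counter(v for v in (lf(text) for lf in lfs) if v != ABSTAIN)
    if not votes:
        return ABSTAIN
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]

LFS = [lf_contains_free, lf_starts_with_greeting]
label_spam = combine("Claim your FREE prize now", LFS)
label_ham = combine("Hello, are we still meeting today?", LFS)
label_none = combine("Quarterly report attached", LFS)
```

Samples on which every labeling function abstains, or on which the functions disagree evenly, stay unlabeled in this sketch.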