Systems Design Flashcards by Zarif Atai

What is ML systems design?

It takes a systems approach to MLOps: the ML system is considered holistically so that all of its components (business requirements, the data stack, infrastructure, deployment, monitoring) and their stakeholders can work together to satisfy the specified objectives and requirements.

What are the four general requirements for ML systems?

Reliability
Scalability
Maintainability
Adaptability

Why is reliability a requirement for ML systems?

The system should continue to perform the correct function at the desired level of performance even in the face of adversity (hardware or software faults, human error). Traditional software usually fails visibly, for example with a crash or an error, whereas ML systems can fail silently by continuing to return predictions that are wrong.

Why is scalability a requirement for ML systems?

ML systems can grow in multiple ways: complexity (more parameters), traffic volume (more predictions per given time), model count (more use cases). These are examples of resource scaling, but handling growth also includes artifact management

Why is maintainability a requirement for ML systems?

Workloads should be structured and infrastructure set up so that all contributors can work with the tools they prefer. Code should be documented. Code, data, and artifacts should be versioned. Models should be sufficiently reproducible.

Why is adaptability a requirement for ML systems?

To adapt to shifting data distributions and business requirements, the system should have some capacity for both discovering aspects for performance improvement and allowing updates without service interruption

What are some examples of nonprobability sampling?

Convenience sampling
Snowball sampling
Judgment sampling
Quota sampling

What is convenience sampling?

A nonprobability sampling method where samples of data are selected based on their availability. This sampling method is popular because it’s convenient

What is snowball sampling?

A nonprobability sampling method where future samples are selected based on existing samples. For example, to scrape legitimate Twitter accounts, you start with a small number of accounts, then you scrape all the accounts they follow, and so on

What is judgment sampling?

A nonprobability sampling method where the experts decide what samples to include

What is quota sampling?

A nonprobability sampling method where samples are selected based on quotas for certain slices of data without any randomization. For example, a survey might select the same number of respondents from each age group, regardless of the actual age distribution.

What are some examples of random sampling?

Simple random sampling
Stratified sampling
Weighted sampling
Reservoir sampling
Importance sampling

What is simple random sampling?

In this form of random sampling, all samples in the population are given equal probabilities of being selected

What is a drawback of simple random sampling?

Rare categories of data might not appear in your selection

What is stratified sampling?

A random sampling method where the population is divided into relevant groups and each group is sampled separately. For example, to sample 1% of data that has classes A and B, 1% can be sampled from each class separately. This way, both classes will be included in the selected samples, no matter how rare class A or B is.
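
The idea can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `stratified_sample`, the `get_group` callback, and the toy data are all hypothetical, and keeping at least one item per group is an assumption added here so that very small groups are not rounded down to zero.

```python
import random
from collections import defaultdict

def stratified_sample(items, get_group, fraction, seed=0):
    """Sample `fraction` of each group separately (at least one item per group)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[get_group(item)].append(item)
    sample = []
    for members in groups.values():
        # Assumption: round to the nearest count, but never drop a group entirely.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))  # sampling without replacement
    return sample

# Hypothetical data: class A is common, class B is rare.
data = [("A", i) for i in range(990)] + [("B", i) for i in range(10)]
picked = stratified_sample(data, get_group=lambda row: row[0], fraction=0.01)
# picked holds 10 rows from class A and 1 row from class B
```

With simple random sampling at 1%, class B would often be missing from a selection of this size; sampling per group avoids that.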

What is a drawback of stratified sampling?

It is not always possible to divide all samples into groups, for instance when a sample can belong to multiple groups, as in multilabel tasks.

What is weighted sampling?

A random sampling method where each sample is given a weight, which determines the probability of it being selected

What is the advantage of using weighted sampling?

This method leverages domain expertise. For example, if a certain subpopulation of data, such as more recent data, is more valuable to the model and needs a higher chance of being selected, it can be given a higher weight. Weights can also correct for a mismatch in distributions: if the available data comes from a different distribution than the real-world data, samples that are underrepresented can be given a higher weight.
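
A small sketch using `random.choices` from the Python standard library, which accepts per-item weights and samples with replacement. The records and the weight values below are made up for illustration; here the most recent year is assumed to be four times as valuable as the others.

```python
import random

def weighted_sample(items, weights, k, seed=0):
    """Draw k items with replacement, with probability proportional to each weight."""
    rng = random.Random(seed)
    return rng.choices(items, weights=weights, k=k)

# Hypothetical records tagged by year; more recent data gets a higher weight.
records = ["2021", "2022", "2023"]
weights = [1, 1, 4]

draws = weighted_sample(records, weights, k=10_000)
share_recent = draws.count("2023") / len(draws)  # close to 4 / (1 + 1 + 4)
```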

What is reservoir sampling?

A random sampling method that is especially useful when dealing with streaming data, where the total number of samples is not known in advance. Reservoir sampling ensures that every sample seen so far has an equal probability of being selected, so if the algorithm is stopped at any time, the current selection is still a correctly drawn random sample.

How does the reservoir sampling algorithm work?

The algorithm uses a reservoir of size k, which can be an array, and consists of three steps:

Put the first k elements into the reservoir
For each incoming nth element, generate a random number i such that 1 ≤ i ≤ n
If 1 ≤ i ≤ k: replace the ith element in the reservoir with the nth element. Else, do nothing
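
The three steps above can be sketched directly in Python. The function name `reservoir_sample` and the `seed` parameter are illustrative choices; the list is 0-indexed, so the 1-based position i from the description maps to index i - 1.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Step 1: the first k elements go straight into the reservoir.
            reservoir.append(item)
        else:
            # Step 2: draw i uniformly from 1..n (both ends inclusive).
            i = rng.randint(1, n)
            # Step 3: replace the ith slot only if i falls inside the reservoir.
            if i <= k:
                reservoir[i - 1] = item
    return reservoir

sample = reservoir_sample(range(1000), k=5, seed=42)
```

At any point in the stream the reservoir is a valid sample of the elements seen so far, which is why the loop can be stopped early.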

What is importance sampling?

A random sampling method that allows sampling from a distribution when there is only access to another distribution. For example, if distribution P(x) is expensive or infeasible to sample from and Q(x) is easy to sample from, a sample x is drawn from Q(x) and then weighted by P(x)/Q(x). Q(x) is called the proposal distribution or the importance distribution. Q(x) can be any distribution as long as Q(x) > 0 whenever P(x) ≠ 0.
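
As a rough illustration, the weighting can be used to estimate an expectation under P using only samples drawn from Q. The sketch below uses `statistics.NormalDist` from the standard library; the two Gaussians, the sample size, and the function name `importance_estimate` are arbitrary choices made for this example.

```python
from statistics import NormalDist

def importance_estimate(f, p, q, n=50_000, seed=0):
    """Estimate E_P[f(x)] from samples of Q, weighting each sample by P(x)/Q(x)."""
    xs = q.samples(n, seed=seed)                 # draw from the proposal Q only
    weights = [p.pdf(x) / q.pdf(x) for x in xs]  # importance weights P(x)/Q(x)
    return sum(w * f(x) for w, x in zip(weights, xs)) / n

# Assumed setup: the target P is N(3, 1), the proposal Q is the wider N(2, 2).
target = NormalDist(mu=3, sigma=1)
proposal = NormalDist(mu=2, sigma=2)

estimate = importance_estimate(lambda x: x, target, proposal)  # approximates the mean of P
```

A Gaussian proposal is positive everywhere, so the condition Q(x) > 0 whenever P(x) ≠ 0 holds. A proposal that is much narrower than the target would still be valid but would give high-variance weights.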

What is label ambiguity or label multiplicity?

When there are multiple conflicting labels for a data instance

What is an example of label multiplicity?

When a company uses multiple sources and relies on multiple annotators who have different levels of expertise

What is weak supervision?

A labeling technique where labeling functions (LFs) are used to label samples

What are labeling functions in weak supervision?

Functions that encode heuristics. They can be developed with subject matter expertise. Examples of LFs include keyword heuristics (if this is mentioned, then assign that label), regular expressions, database lookups, and the outputs of other models.
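
A toy sketch of what labeling functions can look like for a spam task. Everything here is hypothetical: the label constants, the three heuristics, and the majority vote used to combine them (real weak supervision tools typically combine and denoise LF outputs in more sophisticated ways).

```python
import re
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1  # hypothetical label values

def lf_keyword(text):
    # Keyword heuristic: a known spam phrase.
    return SPAM if "free money" in text.lower() else ABSTAIN

def lf_regex(text):
    # Regular expression heuristic: messages containing a URL.
    return SPAM if re.search(r"https?://\S+", text) else ABSTAIN

def lf_short_greeting(text):
    # Short messages that open with a greeting are assumed to be legitimate.
    is_greeting = text.lower().startswith(("hi", "hello"))
    return HAM if is_greeting and len(text) < 40 else ABSTAIN

LABELING_FUNCTIONS = [lf_keyword, lf_regex, lf_short_greeting]

def weak_label(text):
    """Combine LF votes by simple majority; abstain if no LF fires."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]
```

Each LF is allowed to be noisy and to abstain; the value comes from applying many cheap heuristics to a large amount of unlabeled data.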

What is semi-supervision?

A labeling technique that leverages structural assumptions to generate new labels based on a small set of initial labels

What are examples of semi-supervision?

1. Self-training: a model is trained on the existing labeled data and used to make predictions for unlabeled samples. Assuming that predictions with high raw probability scores are correct, those predicted labels are added to the training set, and a new model is trained on the expanded set.
2. Similarity-based labeling: assuming that samples sharing similar characteristics share the same label (e.g. an unlabeled hashtag that appears in the same tweet as a labeled hashtag).
3. Perturbation: assuming that small changes to a sample do not change its label (e.g. small adjustments to labeled images to create new training instances).

What is transfer learning?

Refers to the family of methods where a model developed for a task is reused as the starting point for a model on a second task

What is active learning?

A method for improving the efficiency of data labeling. Rather than labeling data samples at random, the samples that are most helpful to the model according to some metrics or heuristics are labeled. For example, only the samples that the model is least certain about are labeled (e.g. the samples with the lowest probability for the predicted class).
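
The "least certain" heuristic can be sketched as follows. The function name `least_confident`, the `budget` parameter, and the probability values are illustrative; the probabilities stand in for the outputs of whatever model is being trained.

```python
def least_confident(probabilities, budget):
    """Return indices of the `budget` samples whose predicted class has the lowest probability."""
    confidence = [max(p) for p in probabilities]  # probability of the predicted class
    ranked = sorted(range(len(probabilities)), key=lambda i: confidence[i])
    return ranked[:budget]

# Hypothetical predicted class probabilities for four unlabeled samples.
probs = [
    [0.98, 0.02],
    [0.55, 0.45],
    [0.10, 0.90],
    [0.51, 0.49],
]
to_label = least_confident(probs, budget=2)  # → [3, 1]
```

The selected samples are sent to annotators, the model is retrained with the new labels, and the selection is repeated.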

What are some ways to handle class imbalance?

- Using the right evaluation metrics (precision, recall, and F1 score, rather than overall accuracy alone)
- Resampling (undersampling, oversampling)
- Cost-sensitive learning
- Class-balanced loss
- Focal loss

What is cost-sensitive learning?

A method to handle class imbalance in which the loss function is modified to take into account different misclassification costs depending on the class. The drawback is that the cost matrix must be defined manually and differs from task to task.
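
One common way to write such a loss is as the expected cost of a prediction: for a sample of true class i, sum C[i][j] * P(j | x) over all classes j, where C[i][j] is the cost of classifying class i as class j. The sketch below assumes that formulation; the cost values are invented for illustration.

```python
def cost_sensitive_loss(true_class, predicted_probs, cost_matrix):
    """Expected cost: sum over classes j of C[true_class][j] * P(j | x)."""
    return sum(cost_matrix[true_class][j] * p for j, p in enumerate(predicted_probs))

# Hypothetical cost matrix: rows are true classes, columns are predicted classes.
# Missing a positive (true 1, predicted 0) is assumed to cost 10x a false alarm.
COST = [
    [0, 1],
    [10, 0],
]

loss_missed_positive = cost_sensitive_loss(1, [0.8, 0.2], COST)  # 10 * 0.8 = 8.0
loss_false_alarm = cost_sensitive_loss(0, [0.2, 0.8], COST)      # 1 * 0.8 = 0.8
```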

What is class-balanced loss?

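A method to handle class imbalance by penalizing the model more heavily for wrong predictions on minority classes. In its simplest form, the loss for each class is weighted in inverse proportion to the number of samples in that class.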
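
A minimal sketch of one simple weighting scheme, where the weight of a class is the total number of samples divided by the number of samples in that class. This is only one of several possible schemes, and the function names and label counts are made up for the example.

```python
import math
from collections import Counter

def class_weights(labels):
    """Weight each class by total_samples / samples_in_class (rarer class, larger weight)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / n for cls, n in counts.items()}

def weighted_cross_entropy(true_class, predicted_probs, weights):
    """Cross-entropy for one sample, scaled by the weight of its true class."""
    return -weights[true_class] * math.log(predicted_probs[true_class])

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
labels = [0] * 90 + [1] * 10
weights = class_weights(labels)

# The same predicted probability for the true class costs more on the minority class.
loss_majority = weighted_cross_entropy(0, [0.7, 0.3], weights)
loss_minority = weighted_cross_entropy(1, [0.3, 0.7], weights)
```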

What is focal loss?

A method to handle class imbalance where the loss is adjusted so that if a sample has a lower probability of being right, it will have a higher weight
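
For a single sample, focal loss is usually written as -(1 - p)^gamma * log(p), where p is the predicted probability of the true class and gamma ≥ 0 controls how strongly easy samples are down-weighted. The sketch below assumes that form; gamma = 2 is a commonly used value, not something fixed by the method.

```python
import math

def cross_entropy(p_true):
    """Standard cross-entropy for one sample, given the probability of its true class."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0):
    """Cross-entropy scaled by (1 - p_true)^gamma, so confident correct samples count less."""
    return ((1.0 - p_true) ** gamma) * cross_entropy(p_true)

# Scaling factor applied to cross-entropy for an easy and a hard sample (gamma = 2):
easy_factor = focal_loss(0.9) / cross_entropy(0.9)  # (1 - 0.9)^2, about 0.01
hard_factor = focal_loss(0.1) / cross_entropy(0.1)  # (1 - 0.1)^2, about 0.81
```

With gamma = 0 the scaling factor is 1 and the loss reduces to ordinary cross-entropy.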