ML_SWEngDevOps Flashcards

1
Q

data science experiment/report workflow

A
  • get dataset
  • clean data
  • process data
  • optimize hyperparameters
  • call fit/predict
  • report results
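
A minimal sketch of the workflow above, assuming scikit-learn; the toy dataset, model, and hyperparameter grid are illustrative only (cleaning/processing is trivial here because the toy data is already clean).

# minimal experiment loop: get data, tune, fit/predict, report
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)                      # get dataset
X_train, X_test, y_train, y_test = train_test_split(   # hold out data for reporting
    X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(                                    # optimize hyperparameters
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=5)
grid.fit(X_train, y_train)                              # call fit
y_pred = grid.predict(X_test)                           # call predict
print(classification_report(y_test, y_pred))            # report results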
2
Q

CI/CD - DevOps pipeline

A

iterate through:

  • code and push to VCS (version control)
  • test
  • build
  • deploy
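
A minimal sketch of one iteration of the loop above as a Python driver script; the concrete commands (pytest, docker, deploy.sh) are hypothetical placeholders, since real pipelines usually live in a CI system's own configuration format.

# run the test -> build -> deploy stages, stopping on the first failure
import subprocess

STAGES = [
    ["pytest", "-q"],                                  # test
    ["docker", "build", "-t", "myapp:latest", "."],    # build
    ["./deploy.sh", "staging"],                        # deploy
]

def run_pipeline():
    for cmd in STAGES:
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)  # raise and stop if a stage fails

if __name__ == "__main__":
    run_pipeline()  # triggered after code is pushed to version control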
3
Q

Testing in ML

A
  • applicable in 2 different scenarios:
    • using the dev (validation) set for estimator/hyperparameter optimization
    • using the test set for performance and variance evaluation
  • different from unit and integration testing
  • test data sets must be kept separate from training data
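
A minimal sketch of keeping the dev and test sets separate from the training data, assuming scikit-learn; the dataset, model, and split sizes are illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
# split off the final test set first, then split the remainder into train/dev
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

best_c, best_acc = None, -1.0
for c in [0.1, 1.0, 10.0]:                      # dev set: estimator optimization
    model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    acc = accuracy_score(y_dev, model.predict(X_dev))
    if acc > best_acc:
        best_c, best_acc = c, acc

final = LogisticRegression(C=best_c, max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, final.predict(X_test)))  # test set: final evaluation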
4
Q

ML training and DevOps

A
  • ML training time can be much longer than CI/CD test and build time
  • run training outside the CI/CD cycle, on its own timeline
  • ML training data should not be kept in the same repo as the code (for size and business reasons)
  • given fixed performance targets, retraining a model may not produce the desired outcome even without a bug in the code
5
Q

ML as a separate service

A
Pros:
- clear separation of responsibilities
- ability to use a different programming language & framework suited to the task
Cons:
- unclear boundaries for the ML service
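
A minimal sketch of exposing a trained model as its own service, assuming Flask and a pre-trained scikit-learn model serialized as "model.pkl" (both the framework choice and the path are illustrative, not part of the card).

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:   # hypothetical path to a serialized model
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]      # e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
    return jsonify({"prediction": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(port=8080)  # consumers talk to the model only over HTTP, in any language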
6
Q

std DevOps methods - technical debt

A
  • refactoring
  • increase code coverage with unit tests
  • remove dead code
  • decrease dependencies
  • tighten APIs
  • improve documentation
7
Q

ML related technical debt

A
  • blur system-level abstraction boundaries
  • reuse signals that increase component coupling
  • use glue code for the ML black boxes
  • changes in real-world signals may unexpectedly change ML system behaviour, affecting maintenance cost
8
Q

Changing Anything Changes Everything (CACE) / Entanglement

A
  • machine learning systems mix signals together, creating entanglement that makes isolating any single input effectively impossible
  • no inputs are ever really independent
  • changes in hyperparameters have a similarly entangled effect on behavior
  • the first version of an ML system may be easy; subsequent improvements are difficult
9
Q

CACE - mitigation strategies

A
  1. isolate models and serve ensembles
  2. develop methods that allow deep insight into model prediction behavior
  3. use more sophisticated regularization methods to enforce that changes in prediction performance carry a cost in the training objective
10
Q

CACE - mitigation strategies pros/cons

A

strategy 1.
- may not scale in all situations
- works best when the cost of maintaining separate models is outweighed by the benefits of enhanced modularity
strategy 2.
- use visualization to see effects across different dimensions
- use metrics on a slice-by-slice basis (sketched below)
strategy 3.
- may add more debt by increasing system complexity
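
A minimal sketch of strategy 2: inspecting model behaviour slice by slice, assuming pandas and an evaluation DataFrame with hypothetical columns "country", "label", and "prediction".

import pandas as pd

eval_df = pd.DataFrame({
    "country":    ["US", "US", "DE", "DE", "FR", "FR"],
    "label":      [1, 0, 1, 1, 0, 0],
    "prediction": [1, 0, 0, 1, 0, 1],
})

# accuracy per slice: a global metric can hide a regression in one slice
per_slice = (eval_df.assign(correct=eval_df.label == eval_df.prediction)
                    .groupby("country")["correct"].mean())
print(per_slice)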

11
Q

Hidden feedback loops

A
  • changes from the real world (e.g. user clicks) are fed back into the data over a period longer than the rate at which the events occur (e.g. clicks aggregated weekly)
  • system behaviour may change subtly over a longer period of time (e.g. more than a week) and is therefore not visible in quick experiments
  • remove such loops whenever feasible
12
Q

Undeclared consumers

A
  • aka visibility debt
  • outputs of an ML system (e.g. logs) can be consumed by other, ‘undeclared’ systems, creating unintended dependencies that can be broken by future changes to the ML system
  • the signal is grabbed when available, often under deadline pressure
  • difficult to detect
  • design to effectively guard against it
13
Q

Data dependencies

A
  • contribute to code complexity and technical debt
  • large data-dependency chains build up and are difficult to untangle
14
Q

Data dependency problems

A
  • unstable data dependencies
  • underutilized data dependencies
  • static analysis of data dependencies
  • correction cascades
15
Q

Unstable data dependencies

A
  • input signals that qualitatively change over time
  • e.g. from another model that updates over time
  • e.g. from a data-dependent lookup table (such as TF-IDF weights)
  • e.g. when engineering ownership of the input signal differs from engineering ownership of the ML model that consumes it
  • can be mitigated with ‘data/signal versioning’ (see the sketch below)
    • versioning can itself increase technical debt
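
A minimal sketch of signal versioning: the model pins an explicit, frozen version of an upstream signal instead of always reading the latest one. The paths, version string, and parquet format are hypothetical.

import pandas as pd

SIGNAL_VERSION = "2024-01-15"   # frozen copy; bumping it is a deliberate, reviewed change

def load_signal(name: str, version: str = SIGNAL_VERSION) -> pd.DataFrame:
    # e.g. signals/tfidf_weights/2024-01-15.parquet
    return pd.read_parquet(f"signals/{name}/{version}.parquet")

features = load_signal("tfidf_weights")  # training is unaffected when upstream publishes a new version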
16
Q

Underutilized data dependencies

A
  • similar to YAGNI code
  • legacy features (became redundant over time but were never removed)
  • bundled features (feature bundles may include features with little or no value)
  • epsilon-features (their small performance gain is outweighed by the increase in complexity)
  • can be mitigated by regularly evaluating and removing such features (see the sketch below)
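
A minimal sketch of a regular leave-one-feature-out check, assuming scikit-learn; the dataset and model are illustrative.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
baseline = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()

for i in range(X.shape[1]):
    X_drop = np.delete(X, i, axis=1)   # drop one feature and re-evaluate
    score = cross_val_score(LogisticRegression(max_iter=1000), X_drop, y, cv=5).mean()
    # a tiny gain (epsilon-feature) may not justify keeping the data dependency
    print(f"feature {i}: gain from keeping it = {baseline - score:+.4f}")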
17
Q

Static analysis of data dependencies

A
  • no real equivalent of the static analysis tools available for code
  • in large systems, no one person may know all the features or where they are used
  • can be mitigated by annotating data sources and the code that consumes them - ideally automatically (see the sketch below)
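
A minimal sketch of annotating code with the data sources it consumes so the dependency graph can be queried; the registry and decorator are hypothetical.

from collections import defaultdict

DATA_DEPENDENCIES = defaultdict(set)   # feature name -> set of data sources

def uses_sources(*sources):
    def decorator(fn):
        DATA_DEPENDENCIES[fn.__name__].update(sources)
        return fn
    return decorator

@uses_sources("clickstream.daily", "user_profiles")
def build_engagement_feature(row):
    return row["clicks"] / max(row["sessions"], 1)

# "who uses clickstream.daily?" can now be answered without reading every module
consumers = [f for f, srcs in DATA_DEPENDENCIES.items() if "clickstream.daily" in srcs]
print(consumers)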
18
Q

Correction cascades

A
  • similar to ‘boosting’ done for expedience: use a new model to correct the errors of an existing one that was built for a slightly different problem and is applied to a slightly different test distribution
  • makes improvements difficult over time and may end up in a local optimum
  • can be mitigated by augmenting the original model with new features that help it distinguish between the use-cases (see the sketch below)
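
A minimal sketch of the mitigation: instead of learning a correction model on top of the original model's output, augment the training data with a feature that marks the use case and train a single model. The synthetic data and model are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_a, y_a = rng.normal(size=(200, 3)), rng.integers(0, 2, 200)   # original problem A
X_b, y_b = rng.normal(size=(50, 3)),  rng.integers(0, 2, 50)    # slightly different problem A'

# add an indicator column distinguishing the use cases, then train jointly
X = np.vstack([np.hstack([X_a, np.zeros((len(X_a), 1))]),
               np.hstack([X_b, np.ones((len(X_b), 1))])])
y = np.concatenate([y_a, y_b])
model = LogisticRegression(max_iter=1000).fit(X, y)   # one model, no cascade of corrections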
19
Q

System-level spaghetti code

A

system-design anti-patterns:

  • glue code needed to fit general-purpose, self-contained packages into the system
  • pipeline jungles: data preparation becomes a tangle of scrapes, joins and sampling steps with intermediate files; difficult to maintain, test, and recover from failure
  • dead experimental code paths left over from running alternative experiments as conditional branches in production code; difficult to maintain backward compatibility, can interact in unpredictable ways, increase system complexity
  • configuration debt: large systems have lots of configuration options (features used, how data is selected, algorithm settings, pre- & post-processing, etc.)
20
Q

Spaghetti code - solutions

A
  • glue code:
    • reduce it by re-implementing the general-purpose solution within the system’s problem space, tweaked with problem-specific knowledge
    • hybrid teams: research and engineering
  • pipeline jungle:
    • do a ‘holistic’ design for data collection and feature extraction from the very beginning (may require a restart)
    • hybrid teams: research and engineering
  • dead experimental code paths:
    • evaluate & remove dead code paths
    • isolate experimental code & tighten code APIs
  • configuration debt:
    • visual side-by-side diffs of configs (usually copy&paste files w/ small modifications)
    • assertions about config invariants (must be carefully thought out; see the sketch below)
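
A minimal sketch of asserting configuration invariants; the config keys and rules are hypothetical examples.

config = {
    "features": ["age", "country", "clicks_7d"],
    "train_start": "2024-01-01",
    "train_end": "2024-03-01",
    "learning_rate": 0.05,
}

def validate(cfg: dict) -> None:
    assert cfg["features"], "at least one feature must be selected"
    assert len(set(cfg["features"])) == len(cfg["features"]), "duplicate features"
    assert cfg["train_start"] < cfg["train_end"], "training window is empty or reversed"
    assert 0.0 < cfg["learning_rate"] <= 1.0, "learning rate out of range"

validate(config)   # fail fast at startup instead of silently training a broken model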
21
Q

Changes in the external world

A
  • fixed thresholds (manually set) in dynamic systems: evaluate them with heldout validation data and estimate them with numerical optimization
  • when correlations no longer hold: use ML strategies that can disentangle the correlation effects
  • monitoring and testing: don’t rely only on unit and integration testing but also on live monitoring
    • start the ‘what to monitor?’ analysis from:
      • prediction bias: the distribution of predicted labels should match that of observed labels (with caveats); a useful diagnostic to detect sudden changes in the real world (sketched below)
      • action limits: systems that take actions in the real world should have limits that trigger (non-spurious) notifications to be looked into
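
A minimal sketch of monitoring prediction bias: compare the distribution of predicted labels against observed labels and alert on a large gap. The threshold and the alerting hook are hypothetical.

import numpy as np

def prediction_bias_alert(predicted, observed, threshold=0.05):
    """Alert if the positive rate of predictions drifts away from what is observed."""
    predicted_rate = np.mean(predicted)
    observed_rate = np.mean(observed)
    gap = abs(predicted_rate - observed_rate)
    if gap > threshold:
        # in production this would page someone / write to a monitoring system
        print(f"ALERT: prediction bias {gap:.3f} exceeds {threshold}")
    return gap

prediction_bias_alert(predicted=[1, 1, 0, 1, 1, 1], observed=[1, 0, 0, 1, 0, 1])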