ML_SWEngDevOps Flashcards
Data science report/experiment workflow
- get dataset
- clean data
- process data
- optimize hyperparameters
- call fit/predict
- report results
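A minimal sketch of these steps with scikit-learn; the CSV path, the ‘label’ column and the hyperparameter grid are illustrative assumptions, not part of the card:
```python
# Sketch of the experiment workflow (paths, columns and settings are assumptions).
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# get dataset
df = pd.read_csv("data/raw.csv")

# clean data: drop duplicates and rows with missing values
df = df.drop_duplicates().dropna()

# process data: split features/target and train/test
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# optimize hyperparameters on the training data via cross-validation
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)

# call fit/predict
search.fit(X_train, y_train)
y_pred = search.predict(X_test)

# report results
print("best params:", search.best_params_)
print(classification_report(y_test, y_pred))
```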
CI/CD - DevOps pipeline
iterate through:
- code and push to version control (VCS)
- test
- build
- deploy
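A hedged sketch of the test/build/deploy stages as a plain Python driver script; real pipelines normally live in the CI system's own config, and the pytest/docker/deploy commands here are illustrative assumptions:
```python
# Sketch of the CI/CD stages triggered after a push (commands are illustrative).
import subprocess
import sys

def run(stage: str, cmd: list[str]) -> None:
    print(f"--- {stage} ---")
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"{stage} failed, stopping the pipeline")

run("test",   ["pytest", "-q"])                            # test
run("build",  ["docker", "build", "-t", "myapp:ci", "."])  # build
run("deploy", ["./deploy.sh", "staging"])                  # deploy (assumed script)
```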
Testing in ML
- applicable in 2 different scenarios:
- using the dev (validation) set for estimator optimization
- using the test set for variance and performance evaluation
- different from unit and integration testing
- test data sets must be kept separate from training data (sketch below)
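A minimal sketch of keeping the dev and test sets separate from the training data, assuming scikit-learn and a toy dataset; the split sizes are arbitrary:
```python
# Sketch: separate train / dev (validation) / test sets (split sizes are illustrative).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# test set: used only once, for the final variance/performance evaluation
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# dev (validation) set: used repeatedly for estimator/hyperparameter optimization
X_train, X_dev, y_train, y_dev = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("dev score (tune against this):", model.score(X_dev, y_dev))
print("test score (report once, at the end):", model.score(X_test, y_test))
```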
ML training and DevOps
- ML training time can be much longer than CI/CD test, build time
- run training outside the CI/CD cycle, on its own timeline
- ML training data should not be kept in the same repo as the code (for size and business reasons)
- given fixed performance targets, retraining a model may not produce the desired outcome even without any bug in the code
ML as a separate service
Pros:
- clear separation of responsibilities
- ability to use different programming languages & frameworks suitable for the task
Cons:
- unclear boundaries for the ML service
std DevOps methods - technical debt
- refactoring
- increase code coverage with unit tests
- remove dead code
- decrease dependencies
- tighten APIs
- improve documentation
ML related technical debt
- blurred system-level abstraction boundaries
- signal reuse that increases component coupling
- glue code needed around the ML black boxes
- changes in real-world signals may unexpectedly change ML system behaviour, increasing maintenance cost
Changing Anything Changes Everything (CACE) / Entanglement
- ML systems mix signals together, creating entanglement; isolating the models from their data sources is effectively impossible
- no inputs are ever really independent (see the toy illustration after this list)
- changes in hyperparameters have a similarly entangled effect on behavior
- the first version of an ML system may be easy; subsequent improvements become difficult
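A toy illustration of CACE on synthetic data (the data and model are assumptions): rescaling a single input signal shifts the learned weights of the other features as well.
```python
# Toy CACE demo: an upstream change to one signal changes *all* learned weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
X[:, 1] = 0.7 * X[:, 0] + 0.3 * X[:, 1]   # feature 1 is correlated with feature 0
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

w_before = LogisticRegression().fit(X, y).coef_[0]

X_changed = X.copy()
X_changed[:, 0] *= 3.0                    # an upstream "improvement" to signal 0 only
w_after = LogisticRegression().fit(X_changed, y).coef_[0]

print("weights before:", np.round(w_before, 2))
print("weights after: ", np.round(w_after, 2))  # weights of untouched features shift too
```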
CACE - mitigation strategies
- isolate models and serve ensembles
- develop methods that give deep insight into model prediction behavior
- use more sophisticated regularization methods so that changes in prediction performance carry a cost in the training objective
CACE - mitigation strategies pros/cons
strategy 1.
- this approach may not scale to all situations
- works when the maintenance cost is outweighed by the modularity benefits
strategy 2.
- use visualization to see effects across different dimensions
- use metrics on a slice-by-slice basis (see the sketch after this list)
strategy 3.
- may add more debt by increasing system complexity
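A small sketch of strategy 2's slice-by-slice metrics; the DataFrame, column names and slices are illustrative assumptions:
```python
# Sketch: evaluate a model per slice instead of only globally.
import pandas as pd
from sklearn.metrics import accuracy_score

eval_df = pd.DataFrame({
    "country": ["US", "US", "DE", "DE", "FR", "FR"],
    "label":   [1, 0, 1, 1, 0, 0],
    "pred":    [1, 0, 0, 1, 0, 1],
})

# a change that looks neutral globally may hurt one slice badly
print("global accuracy:", accuracy_score(eval_df["label"], eval_df["pred"]))
for country, g in eval_df.groupby("country"):
    print(country, accuracy_score(g["label"], g["pred"]))
```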
Hidden feedback loops
- real-world feedback (e.g. user clicks) flows back into the training data over a period longer than the rate of occurrence of the events (e.g. clicks aggregated weekly)
- the system may change subtly over a longer period of time (e.g. more than a week), so the change is not visible in quick experiments
- remove such loops whenever feasible
Undeclared consumers
- aka visibility debt
- output of an ML system (e.g. logs, predictions) can be consumed by other, undeclared systems, creating unintended dependencies that future changes of the ML system can break
- signals get grabbed when they become available, often under deadline pressure
- difficult to detect
- design the system to effectively guard against it
Data dependencies
- contribute to code complexity and technical debt
- large data-dependency chains are difficult to untangle
Data dependency problems
- unstable data dependencies
- underutilized data dependencies
- static analysis of data dependencies
- correction cascades
Unstable data dependencies
- input signals that qualitatively change over time
- e.g. produced by another model that updates over time
- e.g. coming from a data-dependent lookup table (such as a TF-IDF table)
- e.g. when engineering ownership of the input signal differs from engineering ownership of the ML model consuming it
- can be mitigated with ‘data/signal versioning’ (see the sketch below)
- versioning itself can also increase technical debt (e.g. staleness, maintaining multiple versions)
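A minimal sketch of data/signal versioning; the signals/ directory layout, parquet files and the load_signal helper are hypothetical, not an existing API:
```python
# Sketch: pin the model to an explicit signal version instead of consuming "latest".
import pandas as pd

SIGNAL_VERSION = "tfidf_v3"  # bumped deliberately, only after validating the new signal

def load_signal(name: str, version: str) -> pd.DataFrame:
    """Load a frozen, versioned copy of an input signal (e.g. a TF-IDF table)."""
    return pd.read_parquet(f"signals/{name}/{version}.parquet")

features = load_signal("tfidf", SIGNAL_VERSION)
# training and serving both read the pinned version, so an upstream recomputation
# of the signal cannot silently change this model's behaviour
```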
Underutilized data dependencies
- similar to YAGNI code
- legacy features (become redundant over time but are never removed)
- bundled features (feature bundles may include features with little or no value)
- ε-features (epsilon features: their small performance gain is outweighed by the added complexity)
- can be mitigated by regularly evaluating and removing such features (leave-one-feature-out sketch below)
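A hedged sketch of flagging underutilized features via leave-one-feature-out cross-validation; the dataset, model and 0.001 threshold are illustrative assumptions:
```python
# Sketch: flag features whose removal barely changes performance.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline = cross_val_score(model, X, y, cv=5).mean()

for i in range(X.shape[1]):
    score = cross_val_score(model, np.delete(X, i, axis=1), y, cv=5).mean()
    if baseline - score < 0.001:
        # dropping this feature barely changes performance: removal candidate
        print(f"feature {i}: removal candidate (delta = {baseline - score:+.4f})")
```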
Static analysis of data dependencies
- no real equivalent exists of the static-analysis tools available for code
- in large systems not everyone may know all features or where they are used
- can be mitigated by annotating data sources and code, ideally automatically (see the sketch below)
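A lightweight sketch of annotating data dependencies in code so they can be queried; the uses_data decorator, registry and signal names are hypothetical, not an existing tool:
```python
# Sketch: record which data sources each feature-building function reads.
from collections import defaultdict

DATA_DEPS: dict[str, set[str]] = defaultdict(set)

def uses_data(*sources: str):
    """Annotate a function with the data sources it depends on."""
    def decorator(fn):
        DATA_DEPS[fn.__name__].update(sources)
        return fn
    return decorator

@uses_data("clickstream.weekly", "catalog.products")
def build_ctr_features():
    ...

@uses_data("catalog.products")
def build_price_features():
    ...

# who depends on the clickstream signal? answerable without grepping the codebase
consumers = [name for name, deps in DATA_DEPS.items() if "clickstream.weekly" in deps]
print(consumers)  # ['build_ctr_features']
```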
Correction cascades
- similar to ‘boosting’ done for expedience: a new model is trained to correct the errors of an existing one that was built for a slightly different problem and is applied to a slightly different test distribution
- makes improvements over time difficult and may end up in a local optimum
- can be mitigated by augmenting the original model with new features that help it distinguish between the use-cases
System-level spaghetti code
system-design anti-patterns:
- glue code needed to use general-purpose, self-contained packages
- pipeline jungles: data preparation becomes a tangle of scrapes, joins and sampling steps with intermediate files, difficult to maintain, test and recover from failure
- dead experimental code paths, left over from running alternative experiments as conditional branches in production code; hard to keep backward compatible, can interact in unpredictable ways, increase system complexity
- configuration debt: large systems have lots of configuration options (features used, how data is selected, algorithm settings, pre- & post-processing, etc.)
Spaghetti code - solutions
glue code:
- reduce it by re-implementing within the system's problem space, tweaked with problem-specific knowledge
- hybrid teams: research and engineering
pipeline jungle:
- do a ‘holistic’ design for data collection and feature extraction from the very beginning (may require a re-start)
- hybrid teams: research and engineering
dead experimental code paths:
- evaluate & remove dead code paths
- isolate experimental code & tighten code APIs
configuration debt:
- visual side-by-side diffs of configs (usually copy&paste files w/ small modifications)
- assertions about config invariants (must be carefully thought out; see the sketch after this list)
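A minimal sketch of asserting configuration invariants at load time; the config keys and rules are illustrative assumptions:
```python
# Sketch: fail fast on invalid configs instead of discovering them in production.
def validate_config(cfg: dict) -> None:
    assert cfg["train_window_days"] > 0, "training window must be positive"
    assert cfg["features"], "at least one feature must be enabled"
    assert not (cfg.get("use_feature_x") and cfg.get("use_feature_x_v2")), \
        "feature_x and feature_x_v2 are mutually exclusive"
    assert 0.0 < cfg["learning_rate"] < 1.0, "learning rate outside sane range"

validate_config({
    "train_window_days": 28,
    "features": ["ctr", "price"],
    "use_feature_x": True,
    "learning_rate": 0.05,
})
```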
Changes in the external world
- fixed thresholds (manually set) in dynamic systems: re-evaluate them with held-out validation data, or estimate them with numerical optimization instead of setting them by hand
- when correlations no longer correlate: use ML strategies that can tell the correlated effects apart
monitoring and testing: don’t rely only on unit and integration testing, but also on live monitoring
- start the ‘what to monitor?’ analysis from:
- prediction bias: the distribution of predicted labels matching that of observed labels (with caveats) is a useful diagnostic to detect sudden changes in the real world (sketch below)
- action limits for systems that take action in the real world: when a limit is hit, trigger a notification (avoiding spurious ones) that should be looked into
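A hedged sketch of monitoring prediction bias by comparing the predicted-positive rate to the observed-positive rate over a window; the tolerance and the example arrays are illustrative assumptions:
```python
# Sketch: alert when the predicted label distribution drifts from the observed one.
import numpy as np

def prediction_bias_alert(predicted: np.ndarray, observed: np.ndarray,
                          tolerance: float = 0.05) -> bool:
    """Return True if the predicted-positive rate drifts from the observed rate."""
    bias = predicted.mean() - observed.mean()
    if abs(bias) > tolerance:
        print(f"ALERT: prediction bias {bias:+.3f} exceeds tolerance {tolerance}")
        return True
    return False

# e.g. last week's binary predictions vs. the labels observed since then
prediction_bias_alert(np.array([1, 0, 1, 1, 0, 1]), np.array([0, 0, 1, 0, 0, 1]))
```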