ML_SWEngDevOps Flashcards
Report/experiment data science workflow
- get dataset
- clean data
- process data
- optimize hyperparameters
- call fit/predict
- report results
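A minimal sketch of this workflow, assuming scikit-learn and a bundled toy dataset; the dataset, model, and hyperparameter grid are only placeholders:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# get dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# clean/process data and the model combined in one pipeline
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# optimize hyperparameters, call fit
grid = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

# call predict, report results
print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```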
CI/CD - DevOps pipeline
iterate through:
- code and push to version control (VCS)
- test
- build
- deploy
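A rough Python sketch of one pipeline iteration; real pipelines are normally declared in the CI system's own configuration, and the build/deploy commands below are hypothetical examples for this project:

```python
import subprocess

# one CI/CD iteration: test -> build -> deploy, stopping at the first failure
STAGES = [
    ["pytest", "-q"],                            # test
    ["docker", "build", "-t", "myapp:ci", "."],  # build (image name is a placeholder)
    ["./deploy.sh", "staging"],                  # deploy (hypothetical deploy script)
]

for cmd in STAGES:
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # check=True raises and stops the pipeline on failure
```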
Testing in ML
- applicable in 2 different scenarios:
- using the dev (validation) set for estimator/hyperparameter optimization
- using the held-out test set for performance/variance evaluation
- different from integration and unit testing
- test data sets must be kept separate from training data
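A minimal sketch of keeping the three data sets separate, assuming scikit-learn and a toy dataset; the split sizes are illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# held-out test set: touched only once, for the final performance/variance estimate
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# dev (validation) set: used repeatedly while optimizing the estimator/hyperparameters
X_train, X_dev, y_train, y_dev = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)
```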
ML training and DevOps
- ML training time can be much longer than CI/CD test, build time
- run training outside the CI/CD cycle, on its own timeline
- ML training data should not be kept in the same repo as the code (size and business reasons)
- also, even without a bug in the code, retraining a model may not reach the performance targets
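A sketch of such a training job run outside the CI/CD cycle on its own schedule, assuming the data lives in external storage rather than the code repo; the path, label column, and model are placeholders:

```python
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

DATA_PATH = "/mnt/ml-data/training_latest.csv"   # hypothetical external storage, not in the repo
MODEL_PATH = "model.joblib"                      # artifact later picked up by the serving side

df = pd.read_csv(DATA_PATH)
X, y = df.drop(columns=["label"]), df["label"]

model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, MODEL_PATH)                   # persist the trained model artifact
```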
ML as a separate service
Pros:
- clear separation of responsibilities
- ability to use different programming languages & frameworks suitable for the task
Cons:
- unclear boundaries for the ML service
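A minimal sketch of ML as a separate service, assuming Flask and a model artifact trained elsewhere; the route and payload shape are illustrative:

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")   # produced by the separate training job

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. {"features": [[...], [...]]}
    return jsonify(predictions=model.predict(features).tolist())

if __name__ == "__main__":
    app.run(port=8080)
```

Because the service is called over HTTP, consumers can be written in any language, which is the pro listed above.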
std DevOps methods - technical debt
- refactoring
- increase code coverage with unit tests
- remove dead code
- decrease dependencies
- tighten APIs
- improve documentation
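A small pytest-style sketch for the 'increase code coverage with unit tests' item above; normalize() is a stand-in for a real project helper, defined inline only so the example is self-contained:

```python
import numpy as np

def normalize(x):
    """Scale a 1-D array to zero mean and unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def test_normalize_has_zero_mean_and_unit_std():
    out = normalize([1.0, 2.0, 3.0, 4.0])
    assert abs(out.mean()) < 1e-9
    assert abs(out.std() - 1.0) < 1e-9
```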
ML related technical debt
- blur system-level abstraction boundaries
- reuse signals that increase component coupling
- use glue code for the ML black boxes
- real-world signals may change unexpectedly, changing ML system behaviour and increasing maintenance cost
Changing Anything Changes Everything (CACE) / Entanglement
- machine learning systems entangle their input signals, making isolation of data sources effectively impossible
- no inputs are ever really independent
- changes in hyperparameters have a similarly entangled effect on behaviour
- the first version of an ML system may be easy; subsequent improvements are difficult
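A toy illustration of the entanglement, assuming scikit-learn (a sketch, not a recipe): only feature x1's scale is changed, yet the fitted weights of the other features shift as well because the regularized fit is joint:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=1000) > 0).astype(int)

w_before = LogisticRegression(C=0.1).fit(X, y).coef_[0]

X_changed = X.copy()
X_changed[:, 0] *= 10.0          # change only x1 (e.g. a unit/scale change upstream)
w_after = LogisticRegression(C=0.1).fit(X_changed, y).coef_[0]

print(w_before)   # the weights for x2 and x3 typically move too,
print(w_after)    # not only the weight for the changed x1
```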
CACE - mitigation strategies
- isolate models and serve ensembles
- develop methods that allow deep insight into model prediction behaviour
- use more sophisticated regularization methods so that changes in prediction behaviour carry a cost in the training objective function
CACE - mitigation strategies pros/cons
strategy 1.
- may not scale to all situations
- useful when the maintenance cost is outweighed by the benefits of modularity
strategy 2.
- use visualization to see effects across different dimensions
- use metrics on a slice-by-slice basis (see the sketch after this list)
strategy 3.
- may add more debt by increasing system complexity
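A minimal sketch of strategy 2's slice-by-slice metrics, assuming pandas and scikit-learn; the 'country' slice and the toy predictions are illustrative:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

results = pd.DataFrame({
    "country": ["US", "US", "DE", "DE", "DE"],
    "y_true":  [1, 0, 1, 1, 0],
    "y_pred":  [1, 0, 0, 0, 0],
})

overall = accuracy_score(results["y_true"], results["y_pred"])
per_slice = results.groupby("country")[["y_true", "y_pred"]].apply(
    lambda g: accuracy_score(g["y_true"], g["y_pred"])
)
print(overall)     # the aggregate metric (0.6) hides...
print(per_slice)   # ...the regression on the DE slice (0.33 vs 1.0 for US)
```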
Hidden feedback loops
- real-world feedback (e.g. user clicks) is folded back into the training data over a period longer than the rate at which the events occur (e.g. clicks aggregated weekly)
- system behaviour may then change subtly over a longer period (e.g. more than a week) and is not visible in quick experiments
- remove such loops whenever feasible
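A purely illustrative toy loop (all numbers made up): the model's score influences how often an item is clicked, the clicks are aggregated weekly before feeding back into the score, so the drift only becomes visible across several weeks:

```python
import numpy as np

rng = np.random.default_rng(0)
score = 0.5                      # model's current estimate for one item
history = []

for week in range(8):
    p_click = min(score, 1.0)                          # exposure/clicks depend on the model itself
    daily_clicks = rng.binomial(100, p_click, size=7)  # one week of daily click counts
    weekly_ctr = daily_clicks.sum() / 700              # feedback enters only as a weekly aggregate
    score = 0.5 * score + 0.5 * (2 * weekly_ctr)       # "retrain" on data the model influenced
    history.append(round(score, 3))

print(history)   # the score drifts week by week; a one-day experiment would not show this
```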
Undeclared consumers
- aka visibility debt
- different outputs of an ML system (e.g. logs, prediction files) can be consumed by other 'undeclared' systems, creating unintended dependencies that future changes of the ML system can break
- signals tend to be grabbed whenever available, often under deadline pressure
- difficult to detect
- design the system to guard against it effectively
Data dependencies
- contribute to code complexity and technical debt
- large data-dependency chains build up and are difficult to untangle
Data dependency problems
- unstable data dependencies
- underutilized data dependencies
- lack of static analysis for data dependencies
- correction cascades
Unstable data dependencies
- input signals qualitatively change over time
- from another model that updates over time
- from a data-dependent lookup table (e.g. a tf-idf calculation)
- from the engineering ownership of the input signal being different from the engineering ownership of the ML model
- can be mitigated with 'data/signal versioning' (sketched below)
- versioning itself can also increase technical debt
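A minimal sketch of such data/signal versioning, with hypothetical names: the consumed signal (here a tiny tf-idf style lookup table) is frozen into a content-hashed snapshot, so later upstream changes cannot silently alter a model trained against this version:

```python
import hashlib
import json

def signal_version(table):
    """Short content hash used as the version id of a frozen signal snapshot."""
    blob = json.dumps(table, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

idf_table = {"the": 0.01, "model": 2.3, "entanglement": 6.1}   # snapshot of the upstream signal
version = signal_version(idf_table)

with open(f"idf_table_{version}.json", "w") as f:              # training code pins this exact file
    json.dump(idf_table, f)

print("training against signal version:", version)
```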