Testing Flashcards
3 Software testing types:
- Unit tests - test the functionality of a single piece of code (like a function)
- Integration tests - test how 2 or more units work together
- End to end tests - tests the entire system
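To make the first card concrete, here is a minimal unit test: one small function (the "unit") and a couple of assertions on its behavior, written in pytest style (plain `test_*` functions with `assert`). The `add` function is just a hypothetical example.

```python
def add(a, b):
    # The unit under test: a single small piece of code.
    return a + b

# Each test checks one behavior of the unit in isolation.
def test_add_positive():
    assert add(2, 3) == 5

def test_add_negative():
    assert add(-1, 1) == 0
```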
Best practices for testing
- Automate your testing
- Make sure the tests are reliable, fast, and go through code review like the rest of the code (buggy tests are the worst thing)
- Enforce that tests must pass before merging
- When you find production bugs convert them to tests
- Follow the testing pyramid
What is the testing pyramid?
Write more unit tests than integration tests, and more integration tests than end-to-end tests
Roughly 70/20/10
Unit tests are faster, more reliable, and better at isolating failures.
Solitary testing
Doesn’t rely on real data from other units, so you make up the data and test with it. It’s good for testing exactly what you want
Sociable testing
Assumes that the other modules are working and tests with their real outputs
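A small sketch of the difference, using a hypothetical `greeting` unit that depends on a data layer: the solitary test fakes the collaborator with a mock and made-up data, the sociable test uses a real (in-memory) collaborator and assumes it works.

```python
from unittest.mock import Mock

# Hypothetical unit under test: formats a name fetched from a data layer.
def greeting(repo, user_id):
    user = repo.get_user(user_id)
    return f"Hello, {user['name']}!"

# Solitary: replace the other unit with made-up data via a mock.
def test_greeting_solitary():
    repo = Mock()
    repo.get_user.return_value = {"name": "Ada"}
    assert greeting(repo, 1) == "Hello, Ada!"

# Sociable: use a real collaborator and assume it works.
class InMemoryRepo:
    def get_user(self, user_id):
        return {"name": "Ada"}

def test_greeting_sociable():
    assert greeting(InMemoryRepo(), 1) == "Hello, Ada!"
```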
Test coverage
Shows how many lines of code are exercised by the tests
Good for finding areas that are not tested.
But it can be misleading, because it doesn’t measure the quality of the tests, which is what we really care about.
Test driven development
You first write your tests, then write just enough code to make the latest test pass, then check against the bigger tests, and iterate.
(Not sure how accurate this is, but the idea is simple)
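A tiny sketch of the cycle, with a hypothetical `slugify` function: the test is written first (it would fail, since the function doesn't exist yet), then just enough code is written to make it pass.

```python
# Step 1: write a failing test for behavior that does not exist yet.
def test_slugify():
    assert slugify("Hello World") == "hello-world"

# Step 2: write just enough code to make that one test pass.
def slugify(text):
    return text.lower().replace(" ", "-")

# Step 3: re-run the suite, then repeat with the next test case.
test_slugify()
```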
Testing in production, why and how?
Why - most bugs will not be caught beforehand anyway, it’s inevitable. So you might as well build a system that surfaces errors fast and clearly so you can fix them once the code is out.
How:
- Canary deployments - roll it out to a small percentage of users (1%…) so not everyone gets the bug
- A/B testing - for more statistical tests, if you know which metric you care about
- Real user monitoring - observe the actual user behavior
- Exploratory testing - not set up in advance
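The canary idea can be sketched in a few lines: route a fixed small percentage of users to the new version, using a stable hash so each user consistently sees the same version. All names and the 1% figure here are illustrative.

```python
import hashlib

CANARY_PERCENT = 1  # roll the new version out to ~1% of users

def bucket(user_id: str) -> int:
    # Stable hash so the same user always lands in the same bucket (0-99).
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def use_canary(user_id: str) -> bool:
    # True -> serve the new (canary) version; False -> serve the stable one.
    return bucket(user_id) < CANARY_PERCENT
```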
CI/CD
Testing run as a cloud job by a SaaS, triggered once the code is pushed.
Best free and easy one is GitHub Actions
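A minimal GitHub Actions workflow that runs the test suite on every push might look like this (file path, Python version, and dependency file are assumptions for a typical Python project):

```yaml
# .github/workflows/tests.yml — run the test suite on every push and PR.
name: tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest
```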
Testing only machine learning model and not the system is not enough, why?
The model itself is just a small piece of the system, which includes:
Training system → model → prediction system → serving system → production data → labeling system → storage and preprocessing system → back to the start
So each one of these steps should be tested and monitored
Infrastructure tests - unit tests for the training code
Goal: avoid bugs in the training pipeline
How:
- Unit test like any other code
- Add single-batch/single-epoch tests that check performance after a short run on a tiny dataset
- Run frequently during development
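The single-batch idea above can be sketched framework-free: run a few gradient steps of a tiny linear model on one small batch and assert the loss goes down. This catches wiring bugs (wrong gradient sign, dead updates) cheaply; in a real pipeline the model and training step would be your own.

```python
import numpy as np

def test_loss_decreases_on_tiny_batch():
    # One tiny, fixed batch (the "tiny dataset").
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 4))
    y = rng.normal(size=8)
    w = np.zeros(4)
    lr = 0.05

    def loss(w):
        return np.mean((X @ w - y) ** 2)

    first = loss(w)
    # A short "training run": a few plain gradient-descent steps.
    for _ in range(50):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    # The cheap sanity check: the training step actually reduces the loss.
    assert loss(w) < first
```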
Integration test - test the step between the data system and the training system
Goal: make sure training is reproducible
How:
Take a fixed piece of the dataset and run a training run :)
Then check that the performance remains consistent
Consider pulling a sliding window of data (e.g. the last week’s data…)
Run it periodically
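A sketch of the consistency check: train on a fixed (seeded) data slice and compare the resulting metric against a stored reference value within a tolerance. The reference value, tolerance, and the closed-form "training" stand-in are all hypothetical.

```python
import numpy as np

EXPECTED_MSE = 0.0   # hypothetical stored reference from a past run
TOLERANCE = 0.05     # hypothetical acceptable drift

def train_and_evaluate(X, y):
    # Closed-form least squares stands in for the real training pipeline.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X @ w - y) ** 2))

def test_training_is_reproducible():
    rng = np.random.default_rng(42)        # fixed seed -> same data slice
    X = rng.normal(size=(100, 3))
    w_true = np.array([1.0, -2.0, 0.5])
    y = X @ w_true                          # noiseless, so MSE should be ~0
    mse = train_and_evaluate(X, y)
    # Performance should stay consistent with the stored reference.
    assert abs(mse - EXPECTED_MSE) < TOLERANCE
```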
Functionality tests - unit test for the prediction code
Goal: avoid bugs in the code that makes up the prediction infra
How:
- Unit test the code like any other
- Load a pretrained model and test predictions on a few examples
- Run frequently during development
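A sketch of that load-and-predict check, with a trivial stand-in model (`SignModel` and `load_pretrained_model` are hypothetical; in practice you would deserialize your real model):

```python
class SignModel:
    # Trivial stand-in for a real pretrained model.
    def predict(self, x):
        return "positive" if x >= 0 else "negative"

def load_pretrained_model():
    # In practice: load weights from disk / a model registry.
    return SignModel()

def test_prediction_on_known_examples():
    model = load_pretrained_model()
    # A few examples with known expected outputs.
    assert model.predict(3.2) == "positive"
    assert model.predict(-1.0) == "negative"
```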
Evaluation tests, goal and how?
Goal:
make sure a new model is ready to go into production
How:
- evaluate your model on all of the metrics, datasets, and slices that you care about
- compare the new model to the old one and your baselines
- understand the model’s performance envelope (how does the model work on different groups? what types of data cause it to perform poorly?)
- run every time you have a new candidate model
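The slice-by-slice comparison can be sketched like this. All the metric values and slice names are made up; the point is that a candidate can win overall yet regress on a slice, which is exactly what checking the performance envelope catches.

```python
def evaluate(candidate_scores, baseline_scores, min_gain=0.0):
    """Candidate is ready only if it is at least as good on every slice."""
    return all(
        candidate_scores[slice_name] >= baseline_scores[slice_name] + min_gain
        for slice_name in baseline_scores
    )

# Hypothetical accuracy per slice for the candidate and production models.
candidate = {"overall": 0.91, "new_users": 0.88, "long_inputs": 0.84}
production = {"overall": 0.90, "new_users": 0.85, "long_inputs": 0.86}

# Better overall, but worse on long inputs -> not ready for production.
ready = evaluate(candidate, production)
```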
Behavioral tests using metrics (part of evaluation tests):
Goal:
Make sure the model has the invariances we expect - meaning, does it behave the way we expect under perturbations of the data (deviations from the datasets)
Types:
Invariance tests: assert that a change in input should not affect the output (if we change a city name in sentiment analysis, the result should stay the same)
Directional tests: assert that a change in input should change the output in a known direction (like changing a negative word to a positive one in sentiment analysis)
Minimum functionality tests: certain inputs should always produce a given result
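All three behavioral test types can be sketched on a toy word-count sentiment scorer (the scorer and word lists are hypothetical, just enough to make the tests run):

```python
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def sentiment(text):
    # Toy scorer: positive words minus negative words.
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Invariance: changing a city name should not change the sentiment.
def test_invariance_to_city():
    assert sentiment("great food in paris") == sentiment("great food in berlin")

# Directional: swapping a negative word for a positive one should raise the score.
def test_directional_negative_to_positive():
    assert sentiment("the service was great") > sentiment("the service was awful")

# Minimum functionality: certain inputs must always give a fixed result.
def test_min_functionality():
    assert sentiment("i love it") == 1
```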