This summarizes part of the talk, “Realizing End to End Reproducible Machine Learning on Kubernetes”, given by Suneeta Mall of Nearmap at KubeCon 2019.
“Good results are not enough, making them easily reproducible also makes them credible” – Yann LeCun, Facebook AI Research
Broadly speaking, machine learning is complex mathematics applied to data to find patterns and associate meaning with them.
It is important that the results are unbiased, verifiable, traceable, and reproducible. We could also include ethical considerations.
Why is reproducibility important?
- Helps in understanding, explaining and debugging the results and deductions
- Helps in correctness
- Helps avoid major failures like Amazon’s recruiting tool, which showed bias against women, and IBM’s cancer detection system, which was deemed “unsafe and incorrect”
What are the challenges of reproducibility?
- Hardware, like different GPU architectures. Even parallel execution may give different results, since the order of floating-point operations is non-deterministic and floating-point addition is not associative!
- Software, such as differing library and driver versions
- Randomness, e.g. weight initialization, data shuffling, and dropout
- Data
  - Data poisoning (think Tay, the Microsoft chat bot)
  - Over- or under-represented data
  - Concept drift & continual learning (what a car looks like now vs. the past vs. the future)
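A common first step toward taming the randomness challenge is pinning every seed you control. A minimal sketch in Python (the `set_seed` helper is illustrative, not from the talk; framework-specific calls like `numpy.random.seed` or `torch.manual_seed` are shown only as comments):

```python
import os
import random

def set_seed(seed: int = 42) -> None:
    """Pin the sources of randomness we control from within the process."""
    random.seed(seed)  # Python's built-in RNG
    # Hash randomization must be pinned *before* the interpreter starts
    # (e.g. PYTHONHASHSEED=42 python train.py); shown here for completeness.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # In a real training job you would also pin the framework RNGs, e.g.:
    # numpy.random.seed(seed)
    # torch.manual_seed(seed)

# Identical seeds give identical draws within the same environment.
set_seed(0)
first = [random.random() for _ in range(3)]
set_seed(0)
second = [random.random() for _ in range(3)]
assert first == second
```

Note that seeding alone does not guarantee bitwise-identical results across different hardware or library versions, which is why the talk treats hardware and software as separate challenges.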
How much reproducibility do we actually need, and how do we achieve it?
- Reproducible code
- Version Control Everything! Not just the code, but the infrastructure, workflow, and data itself. Some tools that can assist with this are Pachyderm (a git-like data repository) and Kubeflow (an ML toolkit for Kubernetes)
- Model Robustness (your model’s results should not change significantly when tested on similar, but independent, data sets)