Reproducibility in Machine Learning

This post summarizes part of the talk “Realizing End to End Reproducible Machine Learning on Kubernetes”, given by Suneeta Mall of Nearmap at KubeCon 2019.

“Good results are not enough, making them easily reproducible also makes them credible” – Yann LeCun, Facebook AI Research

Broadly speaking, machine learning is complex mathematics applied to data to find patterns and assign meaning to them.

It is important that the results are unbiased, verifiable, traceable, and reproducible. We could also add ethical to that list.

Why is reproducibility important?

What are the challenges of reproducibility?

  • Hardware, such as different GPU architectures. Even parallelism can give different results, because floating-point operations may be combined in a different order from run to run!
  • Software
  • Randomness
  • Data
    • Data poisoning (think Tay, the Microsoft chatbot)
    • Over- or under-represented data
    • Concept drift and continual learning (what a car looks like now vs. in the past vs. in the future)
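
To make the hardware and randomness points concrete, here is a minimal sketch using only the Python standard library. It is not from the talk; the variable names are illustrative. It shows that floating-point addition depends on grouping (which is why a parallel reduction can differ from a sequential one), and that a seeded random generator removes one source of run-to-run variation.

```python
import random

# Floating-point addition is not associative: the grouping changes the result.
# A parallel sum groups terms differently from a sequential one, so the two
# can disagree in the last bits.
a, b, c = 0.1, 0.2, 0.3
left_to_right = (a + b) + c
right_to_left = a + (b + c)
print(left_to_right == right_to_left)  # False

# Randomness: two generators created with the same seed yield the same values.
first_run = random.Random(42)
second_run = random.Random(42)
first_values = [first_run.random() for _ in range(5)]
second_values = [second_run.random() for _ in range(5)]
print(first_values == second_values)  # True
```

Differences this small usually do not matter for a single sum, but they can accumulate over many training steps.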

How much reproducibility do we actually need, and how do we achieve it?

  1. Reproducible code
  2. Version control everything! Not just the code, but the infrastructure, the workflow, and the data itself. Some tools that can assist with this are Pachyderm (a git-like data repository) and Kubeflow (an ML toolkit).
  3. Model robustness (your model should not change significantly when tested on similar but independent data sets)
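
As a rough illustration of the first two points, the sketch below records a seed and a content hash of the data alongside a run, so the run can be repeated and any change to the data is noticed. This is an assumption-laden, standard-library stand-in and not how Pachyderm or Kubeflow work internally; `fingerprint`, `train_test_split`, and `run_manifest` are hypothetical helper names.

```python
import hashlib
import json
import random


def fingerprint(records):
    """Return a SHA-256 hex digest that changes whenever the data changes."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def train_test_split(records, seed, test_fraction=0.25):
    """Shuffle with a private, seeded generator so the split is repeatable."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def run_manifest(records, seed):
    """Collect the minimum needed to rerun this experiment (hypothetical format)."""
    return {"seed": seed, "data_sha256": fingerprint(records)}


# Toy data set, made up for this example.
data = [{"x": i, "y": 2 * i} for i in range(8)]

split_a = train_test_split(data, seed=7)
split_b = train_test_split(data, seed=7)
manifest = run_manifest(data, seed=7)

print(split_a == split_b)                            # same seed, same split
print(fingerprint(data) == manifest["data_sha256"])  # data unchanged since the run
```

Storing the manifest in version control next to the code ties a result to a specific seed and data snapshot. A real setup would also pin the library versions and the infrastructure definition.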
