The Importance of Reproducibility
Aarni Koskela / July 11, 2018
Reproducibility and replicability are cornerstones of the scientific method. Every so often there’s a sensationalized news article about a new scientific study with astounding results (for instance, we’re looking forward to seeing what’s hot at ICML 2018 – we’re attending, come say hi!), and it’s not uncommon in these cases that other scientists have no way to verify the results for themselves, be it due to missing or proprietary data or faulty methodologies. This, naturally, casts a shadow over the entire study in question.
When do we need reproducibility?
How does this relate to machine learning, though, especially if you’re not doing it in a public and scientific context? You have probably heard of the fatal Uber self-driving vehicle crash that happened in March this year. The car was powered by an array of sensors and what we can only assume is one heck of a machine learning model. What if the car had been destroyed in the accident as well, leaving no way to determine just what went wrong?
Or in the nascent field of medical applications for machine learning: for instance, consider an ML system that makes treatment decisions based on its observations. Sure, real human doctors will no doubt be in charge of making the final call, as they are at present – but as we all know, humans (even doctors!) tend to be lazy and/or tired from being overworked. When they start trusting the ML systems enough, they’ll likely just start okaying the systems’ decisions, and when things go wrong at that stage, there must be transparency and traceability – i.e. reproducibility – of the system, as it was deployed at that point in time.
Similar issues also arise in more traditional fields, such as chemistry and manufacturing, where we can imagine an ML system controlling a production process. The point is, if these systems are developed and deployed with an ad-hoc process with no true way of keeping track of what went into the magic ML sauce, there’s hell to pay if (or, if we are to believe Mr. Murphy, when) something bad happens.
Keeping track of the experiments is a good start
While working on our machine learning platform and talking to companies and people doing ML, we’ve heard quite a few horror stories about folks keeping track of things in, say, Google Sheets spreadsheets and copy-pasting results to a Slack channel. That’s like sharing code over post-it notes… no, that won’t do!
I won’t lie to you: Valohai isn’t going to automagically make everything about your work reproducible with a snap of your fingers – you’ll need some scientific rigor of your own too – but we do a lot to help you along. Let me count the ways…
The first thing is dependency control. Our platform works with Docker images, so you can choose from hundreds of public Docker images of machine learning libraries – and if there’s some dependency you need that these images don’t have, we can help you build your own additions on top of those images. Being based on immutable Docker images makes it much easier for you to keep track of the library versions being used, and when the inevitable need to get back to an older version of your stack arises, you can be pretty sure the recipe to get that stack running again is already there.
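One caveat worth spelling out: an image tag such as `python:3.6` can be repointed to a newer build, whereas a digest reference of the form `name@sha256:<64 hex characters>` always identifies the same image content. Here is a minimal sketch of a check you could run over the image references in your own configuration. It is not part of Valohai's tooling, and the function name `is_pinned_by_digest` is made up for illustration.

```python
import re

# A digest-pinned reference ends in "@sha256:" followed by 64 hex characters.
_DIGEST_SUFFIX = re.compile(r"@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image_ref: str) -> bool:
    """Return True if the image reference names an exact image by digest."""
    return _DIGEST_SUFFIX.search(image_ref) is not None


if __name__ == "__main__":
    examples = [
        "python:3.6",                 # tag only: may point elsewhere next year
        "python@sha256:" + "0" * 64,  # placeholder digest, for illustration only
    ]
    for ref in examples:
        status = "pinned" if is_pinned_by_digest(ref) else "mutable tag"
        print(ref, "->", status)
```

Recording the digest next to the human-friendly tag keeps the recipe readable while still pointing at one exact image.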
The second thing you’ll want to be sure of is your code. Valohai is tightly integrated with the Git revision control system, and we strongly advise versioning all production code. (However, we’re programmers too, and we recognize that the commit-push dance can be a pain, which is why our command line client also supports ad-hoc executions.) Being sure of your code also covers the command(s), options and switches used to run it, and the Valohai platform naturally keeps track of these.
Third are your input files, e.g. your training data. When you run executions on Valohai, we keep track of the exact input files you’re using. Unless you purposely overwrite or delete your data, you can be sure that you’ll be able to run experiments on the same data you used a year or two ago.
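If you want to convince yourself that two runs really saw "the exact same input files", comparing content hashes is one common way to do it. The sketch below is an illustration under that assumption, not a description of how Valohai tracks inputs internally; the helper names `sha256_of_file` and `fingerprint_inputs` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large training sets need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fingerprint_inputs(paths):
    """Map each input path (as a string) to the digest of its contents."""
    return {str(p): sha256_of_file(p) for p in sorted(Path(p) for p in paths)}
```

If a file is overwritten, even under the same name, its digest changes, so storing the mapping alongside each experiment lets you detect a silent swap of the data later on.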
Fourth is the output of your experiments, e.g. the trained model files (or, if you’re just running feature extraction, the transformed data). These are saved securely either to Valohai-managed storage or to your own Amazon S3 or Azure Blob Storage accounts. (Support for other cloud storage is on the roadmap. Let us know if there’s a provider you absolutely need.) Naturally, if the need arises, you can purge old data too, but at least nothing gets deleted by accident.
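To make the four ingredients concrete, here is a rough sketch of what a per-run record could look like if you were keeping track by hand. The `RunRecord` schema and the `recipe_id` helper are hypothetical and purely illustrative. They are not Valohai's data model, and every value in the usage example is a placeholder.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class RunRecord:
    """One run's environment, code, inputs and outputs (hypothetical schema)."""
    image: str          # Docker image reference
    commit: str         # Git commit hash, e.g. the output of `git rev-parse HEAD`
    command: List[str]  # the exact command line, options and switches included
    inputs: Dict[str, str] = field(default_factory=dict)   # name -> content hash
    outputs: Dict[str, str] = field(default_factory=dict)  # name -> content hash

    def to_json(self) -> str:
        # sort_keys gives a stable serialization that diffs cleanly.
        return json.dumps(asdict(self), sort_keys=True, indent=2)

    def recipe_id(self) -> str:
        """Hash of everything that went into the run (outputs excluded)."""
        recipe = asdict(self)
        recipe.pop("outputs")
        blob = json.dumps(recipe, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()


if __name__ == "__main__":
    record = RunRecord(
        image="example/image:1.0",                     # placeholder
        commit="0" * 40,                               # placeholder
        command=["python", "train.py", "--epochs", "10"],
        inputs={"train.csv": "placeholder-hash"},
        outputs={"model.pb": "placeholder-hash"},
    )
    print(record.to_json())
    print(record.recipe_id())
```

Two runs with the same `recipe_id` were given the same image, code, command and data, so if their outputs differ, the difference came from somewhere else (unseeded randomness, for example), and that narrows down where to look.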
These are things that many companies rolling out their own ML pipeline, be it on-premises or in the cloud, don’t necessarily stop to think about until it’s too late. If we can help out with your ML workflow, get in touch!