Using DVC to version control your ML experiment data

In this blog post we explore how you can use DVC for data version control, and how you can automate data versioning with and without DVC inside the Valohai platform.

DVC is an open-source command-line tool for version controlling your binary data in the same way as you version control code in Git. You hook it up to your data store (e.g. AWS S3 or Azure Blob Storage) and then pull and push files with it, just as you would with Git.

Here we explore the usage of DVC as a version control system for machine learning, how it integrates with the Valohai platform and what benefits you can get from both.

In a nutshell, you would use DVC together with Git in the following way:

git pull   # to pull a specific version of your code
dvc pull   # if properly configured, this will download files used locally
dvc run ... python train.py  # run something and record the results
dvc push   # sends local files to remote storage
git add .
git commit -m 'Did some changes'
git push   # save code changes, allowing others to `dvc pull` after `git pull`

DVC will create meta-files (*.dvc) for:

all datasets and artifacts relating to the project
each dvc run you do, recording the inputs and outputs of the command
metric files that record results from commands; these are saved to Git as-is
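
To make the idea of a meta-file more concrete, here is a minimal Python sketch of the kind of information such a file can record: a content hash, a size and a path for the tracked data. This is not DVC's implementation or its exact file format; the helper name describe_file, the choice of SHA-256 and the dictionary layout are all assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_file(path):
    """Return a small record (content hash, size, name) for a data file.

    Hypothetical helper: it mimics the idea behind a meta-file,
    not DVC's actual on-disk format.
    """
    data = Path(path).read_bytes()
    return {
        # The hash changes whenever the content changes, so it identifies a version.
        "sha256": hashlib.sha256(data).hexdigest(),
        "size": len(data),
        "path": Path(path).name,
    }


if __name__ == "__main__":
    # Demo on a throwaway file so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "dataset.csv"
        sample.write_bytes(b"x,y\n1,2\n")
        print(describe_file(sample))
```

Because a record like this is tiny and plain text, it can live in Git next to your code while the large binary file itself stays in remote storage.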

Integrating DVC with Valohai

Before jumping into the how, you should first ask why you’d like to use DVC. Valohai on its own already automatically version controls all input and output data from every experiment and pre-processing step that you conduct. Thus the only use case where DVC makes sense is when you’ve already been using DVC and, for compliance (or nostalgic?) reasons, don’t want to get rid of it. In that case the instructions below apply.

Using DVC with Valohai is as easy as calling the dvc command-line tool. Because Valohai builds on defining machine learning pipeline steps, you can add the calls to dvc directly inside a pipeline step. In the example below we run a “train-my-model” step on a container using TensorFlow, with a few additional commands for calling dvc. This way dvc is called automatically every time the step is executed, so it stays transparent to the user.

- step:
    name: train-my-model
    image: tensorflow/tensorflow:1.13.1-py3
    command:
      # configure your AWS credentials and DVC remote here if they are not set up in the Git repository
      - dvc pull
      - dvc run (dvc configuration) python train.py

This, however, doesn’t make much sense functionally, as Valohai already provides data management for all inputs and outputs of your training runs.

Valohai’s automatic data management to the rescue!

The core difference in how Valohai and DVC do record keeping is that with DVC you track metadata about your data in your code repository (Git), whereas Valohai automatically tracks this for you in a dedicated database. Below are a few examples of the differences between DVC and Valohai.

Storing files…

…in DVC:

You run dvc add path/to/filename.ext or dvc run -o path/to/filename.ext <YOUR-COMMAND> which generates the metafile filename.ext.dvc
Then you use git add; git commit; git push and dvc push to record the data.

…in Valohai:

In the Valohai runtime, you just write files to /valohai/outputs, and Valohai stores them in your data store and records a reference to each one. (You can also upload files through the web UI by hand.)
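
As a rough illustration of the Valohai side, the sketch below writes a JSON file into the output directory from Python. Only the /valohai/outputs path comes from the description above; the helper name save_output, the file name and the payload are hypothetical, and the demo writes to a temporary directory because /valohai/outputs normally does not exist outside a Valohai execution.

```python
import json
import os
import tempfile


def save_output(filename, payload, output_dir="/valohai/outputs"):
    """Write `payload` as JSON into the output directory and return the file path.

    Hypothetical helper for illustration; any ordinary file write works the same way.
    """
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "w") as f:
        json.dump(payload, f)
    return path


if __name__ == "__main__":
    # Local demo: use a temporary directory instead of /valohai/outputs.
    with tempfile.TemporaryDirectory() as tmp:
        print(save_output("model-summary.json", {"layers": 3}, output_dir=tmp))
```

Inside a Valohai execution you would leave output_dir at its default, so nothing in the training code needs to know where the data store lives.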

Using files…

…in DVC:

you first go to a specific git commit with the file version you want using git checkout
then run dvc pull to download the right version of the data
and finally train on that data

…in Valohai:

you specify the address of the files e.g. s3://my-data/path/to/file.ext and Valohai downloads them automatically before running your training code

Using old files with updated code…

…in DVC:

you pull old code with git and then
cherry pick specific file changes using git to get the old .dvc metafiles
and then run dvc pull to download the right version of the data

…in Valohai:

you select the older dataset from a dropdown in the UI, or if using the CLI you specify the address to the file (e.g. by looking it up in the web UI)

Tracking metadata & metrics while training…

…in DVC:

you write files in the format of your choosing using dvc run -M my-metrics.csv <YOUR-COMMAND>
and you store the data with your code
and you view the metadata as you wish (e.g. in a file editor)

…in Valohai:

you simply print JSON to stdout, e.g. print(json.dumps({'loss': 0.123})), and Valohai will store it and visualize it automatically as graphs
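
A minimal sketch of what that looks like inside a training script follows. The loss values are placeholders standing in for a real training loop, and the helper name format_metrics and the extra "epoch" key are assumptions for illustration rather than anything Valohai requires.

```python
import json


def format_metrics(epoch, loss):
    """Serialize one step's metrics as a single line of JSON."""
    return json.dumps({"epoch": epoch, "loss": loss})


if __name__ == "__main__":
    # Placeholder losses; a real script would compute these during training.
    for epoch, loss in enumerate([0.9, 0.5, 0.123]):
        print(format_metrics(epoch, loss))
```

Printing one JSON object per line keeps each measurement self-contained, so there is no separate metrics file to name, commit or push.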

Build machine learning pipelines…

… in DVC:

you execute multiple dvc run commands with varying -d and -o options, which organically builds a pipeline that you can then rerun with dvc repro.
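
To show how dependencies (-d) and outputs (-o) can add up to a pipeline, here is a small Python sketch that derives an execution order from them. It is a conceptual model only, not how DVC is implemented; the stage names, the file names and the helper execution_order are hypothetical, and it relies on graphlib from the standard library (Python 3.9 or newer).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages: the files each one depends on (-d) and produces (-o).
STAGES = {
    "preprocess": {"deps": ["raw.csv"], "outs": ["clean.csv"]},
    "train": {"deps": ["clean.csv"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "clean.csv"], "outs": ["report.json"]},
}


def execution_order(stages):
    """Order stages so each one runs after the stages that produce its inputs."""
    # Map every output file to the stage that creates it.
    producer = {out: name for name, stage in stages.items() for out in stage["outs"]}
    # For each stage, collect the stages it must wait for.
    # Files with no producer (such as raw.csv) are external inputs and add no edge.
    graph = {
        name: {producer[dep] for dep in stage["deps"] if dep in producer}
        for name, stage in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(execution_order(STAGES))
```

The point of the sketch is that nobody declares the pipeline up front: the order falls out of which stage's outputs are another stage's dependencies.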

… in Valohai:

you specify pipelines with a dynamic syntax and then run them from the web UI

Note that Valohai pipelines are a more full-fledged processing solution where different steps can be run on different hardware or even operating systems, whereas DVC pipelines are run as a sequence on a single machine.

Conclusions

While we have seen that DVC can be used together with Valohai, we have also shown that Valohai automatically takes care of everything that you would do by hand with DVC. In conclusion, the only situation where using DVC with Valohai brings a benefit is when you’re already using DVC and have a specific need to keep versioning your data with it. Otherwise, Valohai does all of the things above automatically.