Product Update: Datum Improvements
Juha Kiili / July 07, 2021
What is a datum anyway?
A datum is a version-controlled file inside the Valohai platform. Every datum is immutable by design: once it is created, it can't be changed. This immutability is essential for reliable reproducibility, but sometimes you want something more dynamic.
We have introduced three new improvements for more flexibility over datums:
The first is the datum alias, which works like a virtual bucket. It is not immutable (the data behind it can change), but on the surface it looks and feels just like a classic datum. Valohai keeps track of the changes, too: if you used a datum alias 6 months ago to train your model, you will get the same result today when you reproduce the experiment.
One scenario where this is useful is creating asynchronous pipelines. Let's say you have one pipeline that preprocesses data and another that does the actual training. Your preprocessing pipeline can now output into an alias, which in turn is consumed by your training pipeline. The two pipelines become completely decoupled: running the training pipeline is no longer tied to running the preprocessing pipeline, or vice versa. It was already possible to do this without datum aliases, but the new built-in approach ensures robust end-to-end versioning and also decouples your pipelines from any specific data store.
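To make the alias idea concrete, here is a minimal toy model in Python. It is not Valohai's API: the `Datum`, `AliasRegistry`, and `Execution` classes, the `run_training` function, and the `train-data` alias name are all hypothetical and exist only for illustration. It sketches how an alias can be re-pointed while each run still records the exact immutable datum it consumed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Datum:
    """Hypothetical stand-in for an immutable, versioned file."""
    datum_id: str
    content: bytes


@dataclass
class AliasRegistry:
    """Toy registry mapping mutable alias names to immutable datums."""
    targets: dict = field(default_factory=dict)

    def point(self, alias: str, datum: Datum) -> None:
        # Re-pointing an alias never modifies the datum it used to point at.
        self.targets[alias] = datum

    def resolve(self, alias: str) -> Datum:
        return self.targets[alias]


@dataclass(frozen=True)
class Execution:
    """Records which concrete datum an alias resolved to at run time."""
    alias: str
    resolved_id: str


def run_training(registry: AliasRegistry, alias: str) -> Execution:
    datum = registry.resolve(alias)
    # Storing the resolved datum ID is what lets a later re-run
    # use the same input even after the alias has moved on.
    return Execution(alias=alias, resolved_id=datum.datum_id)


registry = AliasRegistry()
v1 = Datum("datum-001", b"preprocessed v1")
v2 = Datum("datum-002", b"preprocessed v2")

# The preprocessing pipeline publishes its output under an alias.
registry.point("train-data", v1)
old_run = run_training(registry, "train-data")

# A later preprocessing run re-points the same alias.
registry.point("train-data", v2)
new_run = run_training(registry, "train-data")
```

In this sketch the training side only ever knows the alias name, so neither pipeline needs to know when the other runs, while `old_run` still points at the datum it actually used.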
Another new feature is datum adoption. This simply means that you can now adopt any existing data into the Valohai platform's versioning system. Previously, our customers would start getting the benefit of datum versioning only once they started producing new artifacts inside the platform, but now they can adopt existing files into the versioning system from day one.
The last improvement is datum metadata. Previously in Valohai, your metrics were always linked to the execution that created the model. Valohai can trace that metadata from the model back to the execution, but we felt that users should be able to link metadata directly to the model datum, too.
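The difference between the two lookups can be sketched with another small toy model. Again, this is an illustration and not Valohai's API: `TrainingExecution`, `ModelDatum`, the IDs, and the metadata keys are all hypothetical, and the accuracy value is made up.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingExecution:
    """Hypothetical execution that produced a model and logged metrics."""
    execution_id: str
    metrics: dict


@dataclass
class ModelDatum:
    """Hypothetical model file that can carry its own metadata."""
    datum_id: str
    created_by: TrainingExecution
    metadata: dict = field(default_factory=dict)


run = TrainingExecution("exec-42", {"accuracy": 0.93})
model = ModelDatum("datum-model-7", created_by=run)

# Indirect lookup: go from the model back to the execution that created it.
accuracy_via_execution = model.created_by.metrics["accuracy"]

# Direct lookup: attach the metadata to the model datum itself.
model.metadata["accuracy"] = run.metrics["accuracy"]
model.metadata["approved_for_production"] = True
```

With the metadata on the datum, anything that receives the model can read its properties without first tracing back to the execution.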