Recent patch notes

  • Non-integer number values for parameters and metadata are rounded.
  • Allow ordering executions by parameter.
  • Bugfix: Using Promise.finally() caused some browsers to crash on execution creation.
  • Shell variable syntax ${foo} is now supported within commands.
  • Stores can now be deleted.
  • Improvements to metadata and log views.
  • Bugfix: custom S3 stores no longer require the Multipart IAM Upload Role field.
  • Bugfix: Retry functionality in the frontend now works better.

Project management

You own and collaborate on projects; a project is a digital space for tackling a specific machine learning problem.

A project is linked to one or more remote Git repositories (not just GitHub). These linked repositories define what kind of "runs" or "tasks" can be executed in that project context.

The version control repository must have a valohai.yaml file that defines these execution templates. Fire-and-forget style experimentation is also supported using the command line client. A minimal training step looks like this:

step:
  name: Training
  image: gcr.io/tensorflow/tensorflow:0.12.1-devel-gpu
  command: python train.py {parameters}
  inputs:
    - name: training-set
  parameters:
    - name: learning_rate
      pass-as: --learning_rate={v}
      description: Initial learning rate
      type: float
      default: 0.001
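
To make this concrete, here is a minimal sketch of what the train.py invoked by this step could look like. The script itself is an assumption; only the {parameters} expansion and the /valohai/inputs convention come from the platform.

# train.py - hypothetical script matching the step above.
# Valohai expands {parameters} into command line flags (here,
# --learning_rate=0.001) and downloads each input into
# /valohai/inputs/<input-name>/.
import argparse
import glob

parser = argparse.ArgumentParser()
parser.add_argument('--learning_rate', type=float, default=0.001)
args = parser.parse_args()

# Files of the "training-set" input defined in valohai.yaml:
training_files = glob.glob('/valohai/inputs/training-set/*')

print('learning rate:', args.learning_rate)
print('training files found:', len(training_files))
# ...actual model training would go here...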

Psst, the valohai.yaml specification and tooling are open source, and more in-depth examples can be found through the repository:
github.com/valohai/valohai-yaml

Executing code

We support running virtually anything that runs inside a Docker container; you can use readily available images or provide URLs to your own.

You can work through either the web browser user interface or a separate command line client.

Executed scripts are sourced from version control or uploaded from a local directory using our command line client.

Dependencies should be pre-installed on the Docker image. You can also install them at runtime, but that lengthens the boot-up time of cloud workers.

Valohai records the input files and parameters used by executions automatically for reproducibility.

Workers can download input files from various sources depending on the adapter in use:

  • HTTP endpoint
  • S3 endpoint
  • local directory (for on-premises installations)

Infrastructure

We maintain automatically scaled clusters of worker servers for your use. Currently, we support only AWS and private hardware environments, but adding more providers is on the roadmap. Private hardware is integrated using our installable agent software.

Our supply of worker servers comes in various styles and sizes:

Type             GPU                           vCPUs  RAM
aws.g2.2xlarge   NVIDIA GRID K520, 4GB         8      15GB
aws.p2.xlarge    NVIDIA Tesla K80, 12GB        4      61GB
aws.p3.2xlarge   NVIDIA Tesla V100, 16GB       8      61GB
aws.p3.8xlarge   4x NVIDIA Tesla V100, 64GB    32     244GB
aws.p3.16xlarge  8x NVIDIA Tesla V100, 128GB   64     488GB
aws.c4.2xlarge   -                             8      15GB
aws.c4.4xlarge   -                             16     30GB
aws.m4.2xlarge   -                             8      32GB
aws.m4.4xlarge   -                             16     64GB
aws.t2.xlarge    -                             4      16GB

Gathering results

From a technical point of view, everything written to stdout (e.g. print()) or stderr is logged and visible in real time through our command line client and web app visualizations. You can also access the data via our REST API.
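
For example, a sketch along these lines would list executions over the REST API; the endpoint path and token header are assumptions based on the public REST API, so verify the details against the API documentation.

import requests

# Hedged sketch: fetch executions via the REST API.
# Replace YOUR_API_TOKEN with a token generated for your account.
response = requests.get(
    'https://app.valohai.com/api/v0/executions/',
    headers={'Authorization': 'Token YOUR_API_TOKEN'},
)
print(response.json())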

If a line of log output parses as JSON, we interpret it as visualizable metadata that you can compare between executions.
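
For example, a training loop can print one JSON object per line; the metric names here are arbitrary.

import json

# Each stdout line that parses as JSON is collected as metadata.
for epoch in range(10):
    accuracy = 0.5 + epoch * 0.04  # placeholder for a real metric
    print(json.dumps({'epoch': epoch, 'accuracy': accuracy}))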

Saving files to /valohai/outputs indicates that you want to store them, e.g. to be downloaded locally or used as an input to another execution.
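
For example, at the end of a run you might write artifacts there; the file name and contents below are placeholders.

import os

# Files under /valohai/outputs are uploaded to the configured store
# when the execution finishes.
output_path = os.path.join('/valohai/outputs', 'model_weights.bin')
with open(output_path, 'wb') as f:
    f.write(b'placeholder for serialized model weights')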

The choice of output adapter defines where the file is stored:

  • By default, we save execution outputs to Valohai S3 in the region closest to the computation.
  • You can configure your own S3 bucket for uploads.
  • Private workers can also store files in their local file system.

Collaboration

You can invite other people to your project as collaborators. When you start executions within someone else's project, the project owner pays for the resources; this can be customized for organization accounts.

You can work inside the same linked Git repository using your own Git branches without conflicts, but when you are not actively developing the code of your solution, it makes more sense to work on the same branch.

Projects are private by default but can be made public with various privilege levels.

Hyperparameter optimization

Hyperparameter optimization is achieved using tasks.

Tasks are collections of parallel experiments that can be inspected manually or instructed to automatically feed the best results forward in a pipeline.

Currently, the following sweep types are supported, and more are on the roadmap; a sketch of the values each sweep generates follows the list.

  • List sweeps
  • Linear range sweeps
  • Logspace sweeps
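
As an illustration of the values each sweep type expands to, here is a sketch using NumPy (the ranges are arbitrary; the platform enumerates the values for you):

import numpy as np

list_sweep = [0.1, 0.01, 0.001]                 # explicit values, used as-is
linear_sweep = np.linspace(0.0001, 0.01, num=5) # evenly spaced values
logspace_sweep = np.logspace(-4, -2, num=5)     # evenly spaced in log10 space

print(list(linear_sweep))   # [0.0001, 0.002575, 0.00505, 0.007525, 0.01]
print(list(logspace_sweep)) # approx. [1e-4, 3.2e-4, 1e-3, 3.2e-3, 1e-2]

Each generated value becomes one experiment in the task, with the swept parameter (e.g. learning_rate) set accordingly.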

Do you want to know more?