Project management

You own and collaborate on projects; a project is a digital space for tackling a specific machine learning problem.

A project is linked to one or more remote Git repositories (not just GitHub). These linked repositories define what kinds of "runs" or "tasks" can be executed in that project context.

The version control repository should have a valohai.yaml file that defines these execution templates. Fire-and-forget style experimentation is also supported using the command line client.

- step:
    name: Training
    image: gcr.io/tensorflow/tensorflow:0.12.1-devel-gpu
    command: python train.py {parameters}
    inputs:
      - name: training-set
    parameters:
      - name: learning_rate
        pass-as: --learning_rate={v}
        description: Initial learning rate
        type: float
        default: 0.001
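For illustration, here is a minimal sketch of what the train.py referenced above might look like. The script itself is hypothetical, but it shows how the pass-as template turns the parameter into a --learning_rate flag that the script parses itself:

import argparse

# {parameters} in the step command is rendered using each parameter's
# pass-as template, e.g. --learning_rate=0.001, so a standard argument
# parser is all the script needs.
parser = argparse.ArgumentParser()
parser.add_argument('--learning_rate', type=float, default=0.001)
args = parser.parse_args()

print('Training with learning rate:', args.learning_rate)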

Psst, the YAML specification and tooling are open source, and the repository has more in-depth examples:
github.com/valohai/valohai-yaml

Executing code

We support running virtually anything that runs inside a Docker container; use readily available images or provide a URL to your own.

Code and scripts to be executed are sourced from version control or uploaded from a local directory using our command line client.

Dependencies should be pre-installed in the Docker image. We want to make creating Docker images as easy as possible, and plan to provide a build pipeline for this in the future.

Execution input files and parameters are recorded for reproducibility.

Where the input files come from depends on the adapter in use (a sketch of reading them follows this list):

  • HTTP endpoint
  • S3 endpoint
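Whichever adapter fetches them, input files end up on the worker's local file system before your code runs. As a hedged sketch (the exact mount path /valohai/inputs/<input-name>/ is an assumption here, mirroring the /valohai/outputs convention mentioned later), reading the training-set input defined in the earlier step could look like this:

import glob

# Inputs are downloaded before the execution starts; each named input
# is assumed here to appear under /valohai/inputs/<input-name>/.
for path in glob.glob('/valohai/inputs/training-set/*'):
    print('Found input file:', path)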

Infrastructure

We maintain automatically scaled clusters of worker servers for your use. Currently we support only AWS and private hardware environments, but adding more providers is in the pipeline. Private hardware is integrated using installable agent software.

Getting servers in bulk and for longer periods has its benefits; we can offer them cheaper than you'd believe.

Our supply of worker servers comes in various styles and sizes:

Type            GPU                                         vCPUs  RAM
aws.g2.2xlarge  NVIDIA GRID K520 (4 GB, 1536 CUDA cores)    8      15 GB
aws.p2.xlarge   NVIDIA Tesla K80 (12 GB, 2496 CUDA cores)   4      61 GB
aws.c4.2xlarge  -                                           8      15 GB
aws.c4.4xlarge  -                                           16     30 GB
aws.m4.2xlarge  -                                           8      32 GB
aws.m4.4xlarge  -                                           16     64 GB
aws.t2.xlarge   -                                           4      16 GB

Gathering results

From a technical point of view, everything written to stdout (e.g. print()) or stderr is logged and visible in real time through our command line client and web app visualizations. You can also access the data via our REST API.

If log output looks like JSON, we interpret it as chartable metadata that you can compare between executions.
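For example, a training loop could print one JSON object per line; the key names (step, loss) are arbitrary and just illustrative:

import json

# Lines of valid JSON printed to stdout are picked up as chartable
# metadata; the keys are entirely up to you.
for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in for a real training loss
    print(json.dumps({'step': step, 'loss': loss}))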

Saving files to /valohai/outputs indicates to us that you want to store them, e.g. to be used in another execution.
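For instance, a minimal sketch of persisting an artifact (the file name and contents are placeholders):

# Files written under /valohai/outputs are collected and stored when
# the execution finishes, ready to be used as inputs of later executions.
with open('/valohai/outputs/model_summary.txt', 'w') as f:
    f.write('placeholder for a real artifact such as model weights\n')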

The output adapter defines where the files are stored:

  • By default, we store outputs in your project namespace in our S3 bucket in the region closest to the execution
  • Privately owned workers can store files in their own file system
  • Your own S3 bucket

Collaboration

You invite other people to your project as collaborators. When you start executions within someone else's project, the project owner pays for the resources, but this can be customized for organization accounts.

You can work inside the same linked Git repository on your own Git branches without conflicts, but when you are not actively developing the code of your solution, it makes more sense to share a single branch.

Projects are private by default but can be made public, with various privilege levels controlling which resources are visible.

Hyperparameter optimization

Hyperparameter optimization is achieved using tasks.

Tasks are collections of executions, frequently run in parallel, that can be inspected manually or instructed to automatically feed the best results forward in a pipeline.

Supported optimizations (see the sketch after this list):

  • Use multiple specific values
  • Linear range
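As a rough illustration (not platform code), here is how these two modes might expand into concrete parameter values, one execution per value:

# Illustrative only: how a task's parameter modes could expand into
# one value per execution.

# Multiple specific values: each listed value gets its own execution.
specific_values = [0.1, 0.01, 0.001]

# Linear range: evenly spaced values between a start and an end.
def linear_range(start, end, count):
    step = (end - start) / (count - 1)
    return [start + i * step for i in range(count)]

for lr in specific_values + linear_range(0.001, 0.01, 5):
    print('would run one execution with --learning_rate=%s' % lr)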