Project management

You own and collaborate on projects; a project is a digital space for tackling a specific machine learning problem.

A project is linked to one or more remote Git repositories (not just GitHub). These linked repositories define what kinds of "runs" or "tasks" can be executed in that project's context.

The version control repository should have a valohai.yaml file that defines these execution templates. Fire-and-forget style experimentation is also supported using the command line client.

step:
  name: Training
  image: gcr.io/tensorflow/tensorflow:0.12.1-devel-gpu
  command: python train.py {parameters}
  inputs:
    - name: training-set
  parameters:
    - name: learning_rate
      pass-as: --learning_rate={v}
      description: Initial learning rate
      type: float
      default: 0.001
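
As a rough illustration of the script side of this step, here is a minimal sketch of what a train.py could look like. It assumes that {parameters} expands to flags shaped by each parameter's pass-as template (here, --learning_rate=<value>); the body of the script is hypothetical and not taken from any real project.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the step's command line passes to train.py."""
    parser = argparse.ArgumentParser(description="Training step entry point")
    # Mirrors the learning_rate parameter in the step definition above:
    # same type (float) and same default (0.001).
    parser.add_argument(
        "--learning_rate",
        type=float,
        default=0.001,
        help="Initial learning rate",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # Placeholder for the actual training code.
    print(f"Training with learning rate {args.learning_rate}")
```

Keeping the default in the script equal to the default in the YAML means the script behaves the same when run by hand without any flags.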

Here are some more in-depth examples:

Psst, the YAML standard and tooling are open source:
https://github.com/valohai/valohai-yaml

Executing code

We support running virtually anything that runs inside a Docker container; use readily available images or provide a URL to your own.

Code and scripts to be executed are sourced from version control or uploaded from a local directory using our command line client.

Dependencies should be pre-installed in the Docker image. In the future, we plan to provide an image build pipeline to make creating Docker images as easy as possible.

Execution input files and parameters are recorded for reproducibility.

Where the input files come from depends on the adapter in use:

  • HTTP endpoint
  • public S3 endpoint
  • private S3 endpoint with credential management (coming soon)
  • adapters for specific databases (coming soon)

Coming soon: support for custom dependencies defined in version control (e.g. requirements.txt) and an automated Docker image build pipeline for them.

Infrastructure

We maintain automatically scaled clusters of worker servers for your use. Currently we only support AWS and private hardware environments, but adding more providers is in the pipeline. Private hardware is integrated using installable agent software.

Reserving servers in bulk and for longer periods lowers their cost, so we can offer them to you at a lower price.

Our supply of worker servers comes in various styles and sizes:

Type                           GPUs  GPU RAM  CUDA Cores  vCPUs  RAM
aws.g2.2xlarge                 1     4 GB     1,536       8      15 GB
aws.g2.8xlarge (coming soon)   4     16 GB    6,144       32     60 GB
aws.p2.xlarge (coming soon)    1     12 GB    2,496       4      61 GB
aws.p2.8xlarge (coming soon)   8     96 GB    4,992       32     488 GB
aws.p2.16xlarge (coming soon)  16    192 GB   9,984       64     732 GB
aws.m3.medium (coming soon)    -     -        -           1      3.75 GB
aws.t2.large (coming soon)     -     -        -           2      8 GB
aws.c4.xlarge (coming soon)    -     -        -           4      7.5 GB
aws.r3.xlarge (coming soon)    -     -        -           4      30.5 GB

Gathering results

From a technical point of view, everything written to stdout (e.g. with print()) or stderr is logged and visible in real time through our command line client and web app. You can also access the data via our REST API.

Supported visualizations for metadata:

  • Line charts
  • Images as metadata (coming soon)
  • 2D plots of neuron activations and layer weights (coming soon)

If log output looks like JSON, we interpret it as chartable metadata that you can compare between executions.
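
To make the JSON convention concrete, here is a minimal sketch of a training loop that prints one JSON object per line to stdout. The key names (epoch, loss) and the helper functions are hypothetical, and the exact rules for what counts as "looks like JSON" are not specified here, so treat the one-object-per-line layout as an assumption.

```python
import json


def format_metadata(**metrics):
    """Serialize metrics as a single-line JSON object."""
    return json.dumps(metrics, sort_keys=True)


def log_metadata(**metrics):
    """Print one JSON object per line to stdout, where log output is collected."""
    # flush=True so each line shows up in the log as soon as it is written.
    print(format_metadata(**metrics), flush=True)


if __name__ == "__main__":
    for epoch in range(3):
        loss = 1.0 / (epoch + 1)  # placeholder value standing in for a real loss
        log_metadata(epoch=epoch, loss=loss)
```

Using the same keys on every line keeps the series comparable between executions.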

Saving files to /valohai/outputs tells us that you want to store them, e.g. for use in another execution.
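
A minimal sketch of writing a result file into the outputs directory follows. The helper name save_output and the file name in the usage note are hypothetical; only the /valohai/outputs path comes from the text above. The directory is a parameter so the same code can be tried locally against any writable folder.

```python
import os


def save_output(filename, data, output_dir="/valohai/outputs"):
    """Write bytes to a file inside the outputs directory and return its path."""
    # Create the directory if it is missing, e.g. when running locally.
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "wb") as f:
        f.write(data)
    return path
```

Inside an execution, a call such as save_output("model.bin", model_bytes) would place the file where it gets picked up for storage.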

The output adapter defines where the file is stored:

  • By default, we store output in your project namespace in our S3 in the region closest to the execution
  • Privately owned workers can store files in their own file system
  • Your S3 bucket (coming soon)
  • Other output destinations such as FTP (coming soon)

Collaboration

You can invite other people to your project as collaborators. When collaborators start executions in someone else's project, the project owner pays for the resources, though this can be customized for organization accounts.

You can work inside the same linked Git repository using your own Git branches without conflicts, but when not actively developing the code of your solution, it makes more sense to work using the same branch.

Projects are private by default but can be made public, with various privilege levels that control which resources are accessible.

Coming soon: for teams, knowledge sharing is important, so we will offer email notifications and commenting on most entities, from data files to executions.

Hyperparameter optimization

Hyperparameter optimizations are achieved with the use of tasks.

A task is a collection of executions, frequently run in parallel, that can be inspected manually or instructed to automatically feed the best output results forward in a pipeline.

Supported optimizations:

  • Use multiple specific values
  • Linear range
  • Custom function (coming soon)
  • Logarithmic scale (coming soon)
  • Grid search (coming soon)
  • Gradient descent (coming soon)
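
The two currently supported options, multiple specific values and a linear range, can be pictured as generating one parameter set per execution. The sketch below is an illustration only, not how tasks are implemented: the helper names are hypothetical, and treating the end of the range as inclusive is an assumption.

```python
def linear_range(start, stop, step):
    """Evenly spaced values from start to stop (stop assumed inclusive)."""
    count = int(round((stop - start) / step)) + 1
    # Round to suppress floating-point noise such as 0.30000000000000004.
    return [round(start + i * step, 10) for i in range(count)]


def parameter_sets(name, values, fixed=None):
    """Build one parameter dict per value; all other parameters stay fixed."""
    fixed = dict(fixed or {})
    return [{**fixed, name: value} for value in values]


if __name__ == "__main__":
    # Multiple specific values for a single parameter.
    for params in parameter_sets("learning_rate", [0.001, 0.01, 0.1]):
        print(params)
    # A linear range for the same parameter.
    for params in parameter_sets("learning_rate", linear_range(0.001, 0.005, 0.001)):
        print(params)
```

Each resulting dict corresponds to one execution in the task, so the executions are independent and can run in parallel.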

Pipelines for advanced users

Pipelines allow creating more complex machine learning solutions that require multiple parallelizable steps, such as feature extraction, various forms of training, validation actions, and possibly assembling a multi-model from multiple standalone models (coming soon).

Pipelines create a graph that parallelizes every part it can; you just need to split your job definitions into smaller pieces (coming soon).
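
The idea of a graph that parallelizes whatever it can may be easier to see in code. The sketch below is a generic illustration using Python's standard graphlib module (Python 3.9+), not the pipeline implementation itself; the step names and the dependency format are hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_stages(dependencies):
    """Group steps into stages; steps within one stage can run in parallel.

    dependencies maps each step name to the set of steps it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        # Every step whose dependencies are all finished is ready now.
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


if __name__ == "__main__":
    steps = {
        "extract-features": set(),
        "train-a": {"extract-features"},
        "train-b": {"extract-features"},
        "validate": {"train-a", "train-b"},
    }
    for number, stage in enumerate(parallel_stages(steps), start=1):
        print(number, stage)
```

Here the two training steps land in the same stage because neither depends on the other, which is why splitting a job into smaller pieces exposes more parallelism.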

Pipeline triggers allow running pipelines on a set schedule or when a specific event happens, e.g. when enough new labeled data arrives in a data source (coming soon).