If you're running on-premises machines, you've likely dealt with the frustration of GPUs sitting idle while you wait for a single big job to clear. Valohai's new dynamic GPU allocation feature is about one thing: doing more with less. Instead of letting one job hog the machine while expensive GPUs sit around, you specify exactly how many GPUs each job needs, so you can stack up tasks and keep those GPUs cranking.
But can't I already do that with GPU sharing?
Sure, you can split GPUs across multiple virtual machines, or use something like NVIDIA's Multi-Instance GPU (MIG) to partition them. But here's the catch: with those setups, the split is fixed. If you've divided an 8-GPU machine into two 4-GPU VMs, the data scientists running ML jobs on those VMs can't change that split on the fly. Someone who needs only one GPU ties up a 4-GPU VM, and someone who needs five GPUs is out of luck.
With Valohai's dynamic GPU allocation, data scientists can adjust GPU usage per job. Each time they create a new ML job, whether through the UI, the CLI, or valohai.yaml, they can specify the exact number of GPUs they need on the spot. Lightweight experiment? Allocate one GPU. Major training run? Go ahead, grab all of them. Each job gets only the resources it requires, no more, no less.
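To give a sense of what this looks like in a config file, here's a sketch of a valohai.yaml step requesting a fixed number of GPUs. It follows the `resources.devices` syntax Valohai documents for Kubernetes-backed environments; treat the step name, image, and exact keys as illustrative assumptions and check the documentation for the syntax your setup uses.

```yaml
- step:
    name: train-model
    image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
    command: python train.py
    resources:
      devices:
        # Request exactly two GPUs for this run (illustrative values;
        # see Valohai's docs for your environment's exact syntax).
        nvidia.com/gpu: 2
```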
How Valohai keeps it all in check
Behind the scenes, Valohai's queuing system is purpose-built to order runs in a way that makes the most of your GPU resources. Each job is assigned to GPUs based on the resources it requested. Jobs that require more GPUs than are currently free are queued up, but they don't block the queue: if a smaller job comes along that fits the available GPUs, it gets a priority boost and runs right away. This keeps the GPUs busy and cuts idle time.
Occasionally, though, you'll have big jobs that need all the horsepower they can get. To make sure these don't get permanently sidelined by a stream of smaller jobs, Valohai includes a timeout setting, so large jobs aren't stuck in the queue indefinitely, waiting for that last GPU to free up. It's fair sharing without the bottlenecks.
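Valohai doesn't publish its scheduler internals in this post, so here's a minimal Python sketch of the general idea rather than the actual implementation: backfill-style picking (small jobs slip past a big job that doesn't fit yet) plus a starvation timeout that stops backfilling once the head of the queue has waited too long. All names here (`Job`, `pick_next_job`, the timeout parameter) are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Job:
    name: str
    gpus_needed: int
    submitted_at: float = field(default_factory=time.time)

def pick_next_job(queue: list[Job],
                  free_gpus: int,
                  starvation_timeout_s: float) -> Optional[Job]:
    """Choose which queued job to start, given the GPUs currently free.

    Backfill: run the oldest queued job that fits in the free GPUs, so
    small jobs don't wait behind a big one. Starvation guard: once the
    job at the head of the queue has waited past the timeout, stop
    backfilling and let GPUs drain until it can start.
    """
    if not queue:
        return None
    head = queue[0]
    if time.time() - head.submitted_at > starvation_timeout_s:
        # Head job has waited too long: no more jumping ahead of it.
        return head if head.gpus_needed <= free_gpus else None
    # Normal case: the first job (FIFO order) that fits right now.
    for job in queue:
        if job.gpus_needed <= free_gpus:
            return job
    return None

# Example: an 8-GPU machine with 2 GPUs free. The 8-GPU retrain keeps
# waiting, but the 1-GPU evaluation starts immediately instead of
# queuing behind it.
queue = [Job("full-retrain", gpus_needed=8), Job("quick-eval", gpus_needed=1)]
print(pick_next_job(queue, free_gpus=2, starvation_timeout_s=3600.0).name)
# -> quick-eval
```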
The bottom line
With Valohai's dynamic GPU allocation, you get:
- Optimized GPU usage: Choose the exact GPU count per job, freeing up resources to run more jobs simultaneously.
- Higher throughput, shorter waits: By prioritizing smaller tasks, Valohai ensures multiple jobs can run side by side.
- Flexible control: Allocate GPUs on-demand per job and set timeouts to prevent queue bottlenecks.
- Cost efficiency: Make the most of your GPU investment by keeping every GPU busy.
Check out our detailed documentation or book a meeting with our Customer Team!