Half Your Team Is Waiting for the Other Half (And It's Killing You)

by Rasmus Bentsen | on December 15, 2025

This is Part 1 of the AI Factory Series, a collection of posts on how Lean principles apply to scaling AI teams. We'll cover the eight classic wastes and how they show up in ML/AI work.

Your smartest people are stuck waiting. Not because they're lazy, but because your systems are set up to waste their time.

Waiting for a GPU. Waiting for data. Waiting for Bob to explain how he ran that thing last week.

When this happens to every person every day, you're bleeding money and good vibes. Your CFO sees the cloud bill. Your team sees the calendar full of bs meetings. Everyone's miserable.

Here's the thing: this has been solved before, about 30 years ago. Not in AI, but in manufacturing. Lean came to fame as a way to scale manufacturing efficiently. (TL;DR: when stupid things happen, stop, fix, continue.)

Each post in this series focuses on one of the eight major sources of waste identified in the Lean framework. Today we tackle WAITING.

You might be thinking: "We've already solved the obvious stuff. We have orchestration, we have pipelines, we're not amateurs."

Good. But the mature version of this problem is more subtle:

Spending 10 minutes figuring out how much memory or GPU you actually need. Or just YOLO'ing it and picking a big machine "to be safe" because you'll scale it down later (you won't).
The Slack message "which dataset version is the right one?"
The mental math of whether it's faster to figure out how someone else did it, or just redo it yourself from scratch. Usually you pick scratch.
The hesitation to try a new approach because you know setup will be a pain.

Three Ways Your Team Wastes Time Every Single Day

1. Waiting for Compute

What it looks like:

Watching a job queue hoping a GPU frees up.
Avoiding machines where you'd have to set up everything again.
Waiting for someone to spin up a different compute instance type.
Spending 10 minutes trying to figure out what specs you actually need, then just picking something big "to be safe."

Why it's expensive: You have compute. But only some machines feel "usable" because everything else requires painful manual setup. So people wait.

You can obviously buy more hardware. Hope your CFO doesn't ask about utilization.

2. Waiting for Data

What it looks like:

Downloading the same dataset to yet another machine.
Sitting on an expensive GPU while files transfer.
Making a personal copy on a local disk because it's faster than syncing with cloud storage every time.
The weekly Slack thread: "which version of this dataset is the right one?"

Why it's expensive: People stop waiting and start making their own local copies. Over time, you end up with different versions of supposedly the same data, experiments no one can reproduce, and models trained on outdated stuff.

You've traded waiting time for a mess nobody can untangle later.

3. Waiting for People (everyone's favorite)

What it looks like:

A job fails and only two people know how to debug it.
A useful script exists but only Bob knows which settings are safe.
The constant calculation: is it faster to find how someone else did this, or just redo it myself? Usually you pick redo.

Why it's expensive: As one customer put it: "The problem isn't one meeting to 'sync up', it's that it repeats every single week, and there are several blocks like that."

Senior people spend half their time explaining basic steps to others instead of doing actual work. The more you scale, the worse this gets. Don't scale your problems.

Get the AI Factory Playbook

This post is part of a series based on the AI Factory Playbook—a practical guide to building an operational backbone that stops teams from drifting, duplicating work, and repeating avoidable mistakes. Download it free.

Why Smart Teams Still Have This Problem

Here's the uncomfortable part: good people hide structural problems longer.

Your best engineer built a workaround. Someone else figured out caching. Another person automated their pipeline with a script in their home directory.

None of this is incompetence. It's smart people moving fast in a system that makes them wait.

Local fixes feel efficient. But they don't scale. And by the time you notice the problem, you've got 15 people with 15 different workarounds, and nobody can use anyone else's work without a 30-minute call.

Why Nobody Finishes Anything Anymore

Nobody likes sitting idle. When forced to wait, people start doing something else.

"I'm not going to sit around, so I start another task. Then Bob is ready, or the GPU frees up, so I drop what I started and jump back."

You can't bundle all the waiting into one clean block. It's constant interruptions: Bob has a question, a GPU freed up, someone needs help.

Five minutes of waiting turns into an hour of lost time because the system constantly forces people to juggle half-finished work.
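
To make that arithmetic concrete, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the names `WaitEvent` and `lost_minutes` are hypothetical, and the 15-minute refocus penalty per task switch is a made-up figure, not a measurement. Swap in numbers from your own team.

```python
from dataclasses import dataclass


@dataclass
class WaitEvent:
    wait_minutes: float    # time spent actually blocked
    context_switches: int  # task switches the wait or interruption caused


def lost_minutes(events, refocus_minutes=15.0):
    """Sum the waits plus an assumed refocus penalty for every task switch."""
    return sum(e.wait_minutes + e.context_switches * refocus_minutes for e in events)


# One illustrative afternoon (made-up numbers, not measurements).
afternoon = [
    WaitEvent(wait_minutes=5, context_switches=2),  # GPU wait: start a side task, then jump back
    WaitEvent(wait_minutes=0, context_switches=1),  # Bob has a question
    WaitEvent(wait_minutes=0, context_switches=1),  # picking the side task back up
]

print(f"Blocked: 5 min, lost: {lost_minutes(afternoon):.0f} min")
```

Under these assumed numbers, 5 minutes of real waiting costs 65 minutes, almost all of it in the switching rather than the wait itself.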

What This Looks Like From Your Desk

This doesn't show up on dashboards as "waiting." It shows up as:

You have a calendar full of meetings that make you want to cry. All that context switching creates busywork and frustration. Nobody likes waiting. Nobody likes being interrupted. All of this turns into meetings nobody wants, held to paper over a problem that shouldn't exist.

You either accept everything takes longer, or you throw money at the problem. Then you explain to your CFO why your cloud bill doubled. Good luck with that conversation.

Adding people makes things slower, not faster. Scaling something that already has problems just means you also scaled your problems. More people means more "how did you do this?" sessions and more waiting for someone to be free.

You can't reliably estimate timelines. When half the work is "waiting for someone" or "figuring out the setup," delivery becomes unpredictable.

Your best people become bottlenecks. Senior people are constantly pulled into basic operations. They burn out. Or they leave. And historical know-how walks out the door with them.

Stop Lying to Yourself

Now a lot of people might say, "That's not true, we have systems for versioning and other stuff that works."

Here comes the real question: do you really?

What you say you have: Versioning

What you actually have:

Code: versioned (in Git)
Data: sometimes versioned (when someone remembers)
Environment: rarely versioned (Docker tags that mean nothing)
Parameters: never versioned (living in a Google Doc)
Pipeline dependencies: tribal (Bob knows)
One-off scripts living in Slack threads
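
One way to close the gaps in the list above is to record everything a run depends on in a single place. The sketch below is a minimal illustration, not a description of any particular tool: `RunManifest`, `sha256_hex`, and every value in the example are hypothetical. It assumes you can get a Git commit SHA, a content hash of the dataset, and an immutable image digest for each run.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def sha256_hex(data: bytes) -> str:
    """Content hash, so 'which dataset version?' has one checkable answer."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class RunManifest:
    code_commit: str   # Git commit SHA the run was launched from
    data_sha256: str   # content hash of the exact dataset used
    image_digest: str  # immutable image digest rather than a mutable tag
    params: dict = field(default_factory=dict)  # hyperparameters and settings

    def run_id(self) -> str:
        """Deterministic ID: identical inputs always give the identical ID."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Placeholder values for illustration only.
manifest = RunManifest(
    code_commit="0123abc",
    data_sha256=sha256_hex(b"example dataset bytes"),
    image_digest="sha256:feedface",
    params={"learning_rate": 0.001, "epochs": 10},
)
print(manifest.run_id())
```

Because the ID is derived from code, data, environment, and parameters together, two people who get the same ID really did run the same thing, and nobody has to ask Bob.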

What you say you have: Reusable components

What you actually have:

Pipelines that technically exist
That only Bob can run
Because the docs are outdated
And half the config is hardcoded
So everyone just asks Bob

What you say you have: Debugging tools

What you actually have:

Logs that exist somewhere
That only two people know how to read
So every failure becomes a Slack thread
And those two people are always in meetings

Here are examples of better systems:

Shared job queues: Engineers submit work; they don't hunt for machines.

Reusable runs: If Bob can run it, you can run it, and you actually do.

Self-service debugging: When jobs fail, people see logs and metrics without pulling in specialists.

Systems, not meetings: If you need syncs to explain pipelines, your workflows aren't that discoverable.

Visibility into waste: Can you easily look up resource utilization, failure rates, and bottlenecks before they become complaints?
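
As a toy illustration of the first and last items in this list, here is a sketch of a shared queue that also records how long each job waited. It is deliberately simplistic and built on assumptions: `Job`, `SharedQueue`, and the job names are hypothetical, GPUs are treated as one interchangeable pool, and scheduling is strictly in order with no backfilling. A real scheduler does far more.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from statistics import mean


@dataclass(order=True)
class Job:
    priority: int  # lower number runs first
    seq: int       # tie-breaker: submission order
    name: str = field(compare=False)
    gpus: int = field(compare=False)
    submitted_at: float = field(compare=False)


class SharedQueue:
    """Engineers submit work; the queue decides when it starts."""

    def __init__(self, total_gpus: int):
        self.free_gpus = total_gpus
        self._heap = []
        self._seq = itertools.count()
        self.wait_times = []  # how long each started job sat in the queue

    def submit(self, name, gpus, now, priority=0):
        heapq.heappush(self._heap, Job(priority, next(self._seq), name, gpus, now))

    def schedule(self, now):
        """Start jobs in order while the job at the head of the queue fits."""
        started = []
        while self._heap and self._heap[0].gpus <= self.free_gpus:
            job = heapq.heappop(self._heap)
            self.free_gpus -= job.gpus
            self.wait_times.append(now - job.submitted_at)
            started.append(job.name)
        return started

    def release(self, gpus):
        """Return GPUs to the pool when a job finishes."""
        self.free_gpus += gpus

    def report(self):
        """The visibility part: waiting as a number, not a complaint."""
        return {
            "queued": len(self._heap),
            "mean_wait": mean(self.wait_times) if self.wait_times else 0.0,
        }
```

The point of the design is that waiting gets measured as a side effect of normal work. Nobody has to file a complaint for the number to exist.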

How To Fix

Here's another core Lean concept: whenever something dumb happens, stop, huddle, and ask "why on earth are we doing it this way?" Then make it better and get back to work.

Don't just accept these situations as normal. If people are waiting every day, that's not "just how it is." That's a design problem you can fix.

Start With What Bothers You The Most

If any of this sounds familiar, your problem isn't "not enough smart people." You have plenty.

What you don't have yet is a system that treats waiting as a problem worth fixing.

Here's what to actually do:

Ask yourself: what part of this do I hate the most? Then ask your team the same question.
Build a list. Write down everything everyone hates.
Prioritize. Pick the one that's causing the most pain right now.
Fix it. Make it everyone's problem and get it fixed. There's no magic third option.

This is what the AI Factory Playbook is for: turning "this sucks" into a concrete checklist you can act on.

The plan is only half the work; the other half is rallying everyone around it. Congratulations, we also have a guide for that: the MLOps Fiiliskierros Facilitator’s Guide.

Next in the series: We'll cover another Lean waste that kills AI teams: non-utilized talent. What happens when your best people spend their days on work that doesn't need their brain.
