MLOps Confessions - Things We've Heard (and Maybe Lived)

These are real confessions from real ML practitioners — collected from Slack DMs, calls, onboarding sessions, and the occasional broken pipeline. Some are funny. Some are painful. All are familiar.

Read them, nod along, cringe a little. And if any of them sound like you? You're not alone — and we've got resources to help.

Feeling seen? Run your own team checkup — grab the guide.

Download the guide

I still deploy models with a cron job and a prayer.

It wasn't meant to be *the* solution—it was just the quickest way to get that one model into production. I figured we'd clean it up after… but then a new fire popped up, and suddenly this hack became "our deployment strategy."

I accidentally overwrote our best model… because I reran the training script.

I didn't have a versioning system, just UUIDs in filenames and a loose hope that 'final_model_v2_actual_best.pkl' would survive. Turns out, my own script defaulted to saving in the same path. We still don't know which version got shipped.
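
A minimal sketch of one way to make a rerun unable to clobber an earlier artifact. The helper name `save_model_bytes` and the filename scheme are hypothetical, not something from the confession. It assumes the model has already been serialized to bytes, names each file by UTC timestamp plus a short content hash, and opens it in exclusive-create mode.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def save_model_bytes(payload: bytes, out_dir: str, prefix: str = "model") -> Path:
    """Write serialized model bytes to a new, uniquely named file.

    Hypothetical helper: the name and the filename scheme are illustrative.
    """
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)

    # Short content hash, so different bytes get a different suffix.
    digest = hashlib.sha256(payload).hexdigest()[:12]
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    path = directory / f"{prefix}-{stamp}-{digest}.pkl"

    # Mode "xb" raises FileExistsError instead of overwriting an existing file.
    with open(path, "xb") as fh:
        fh.write(payload)
    return path
```

Calling it twice with different payloads yields two separate files, and the returned path is the thing to record wherever "which version got shipped" needs answering.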

Our infra was basically Jaakko's brain.

Jaakko built a script to handle uploading models to GCS, so we all started using it. Then we realized it didn't handle access tokens correctly if you weren't using his laptop setup. But Jaakko had already moved to another team. Now it's a legacy script we all rely on but don't fully understand.

We promote models by copy-pasting S3 links from Slack messages.

Technically we have automation—it posts updates to a channel when jobs finish. Then someone finds the message, opens the experiment dashboard, copies the S3 link, pastes it into a new job config, and so on. Human-in-the-loop ML at its finest.

We track experiment metadata in a shared Google Sheet with 12 tabs.

We *were* going to build something more robust. But the sheet just grew. Now it has color-coded cells, six slightly different column formats, and one person who knows how to use the filters. Thankfully, Sheets has good versioning.

I use ChatGPT to debug and iterate on my code instead of the IDE.

Yeah, I know there are AI features in VS Code, but I'm faster pasting the whole error into ChatGPT and saying 'what did I do wrong?' It's like having a rubber duck that explains obscure Python packaging issues.

Everyone runs experiments inside their own Docker container on on-prem machines… and we pray nobody kills the process.

The containers just sit there running. People SSH in and connect VS Code to them. If anyone reboots the machine or wipes the container, that work is gone. Pushing to Git is slow and painful, so most of it never makes it out.

All our notebooks end up in a folder called 'project-2023-retrospective' even when they're not.

It's supposed to be temporary. We move fast and name things later. The cleanup script? It just broke half the references, so we gave up. Now we rely on intuition and commit timestamps to know which notebook is the good one.

We retrained a model on the wrong dataset for two weeks—and I looked like an idiot in the review meeting.

The dataset name hadn't changed, but the contents had. I thought I was optimizing a breakthrough, but really I was just curve-fitting on outdated labels. Everyone was excited… until we deployed and saw accuracy tank.
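
A name that stays the same while the contents change can be caught by fingerprinting the file itself. The sketch below assumes the dataset is a single local file and that a known-good digest was recorded somewhere earlier. Both function names are hypothetical.

```python
import hashlib


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_dataset_unchanged(path: str, expected_sha256: str) -> None:
    """Raise ValueError if the file no longer matches the recorded digest."""
    actual = file_sha256(path)
    if actual != expected_sha256:
        raise ValueError(
            f"{path} changed: expected {expected_sha256[:12]}, got {actual[:12]}"
        )
```

Running `assert_dataset_unchanged` at the top of a training script turns "same name, different labels" into an immediate failure rather than a surprise after deployment.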

We once forgot to retrain a model for 6 months… because the pipeline didn't alert anyone when it failed.

It was running in Airflow, and the DAG looked green. But one silent failure early on meant we kept shipping stale predictions. Nobody noticed until a stakeholder asked why the model still referenced data from *last quarter*. I totally looked like I was on top of things. *facepalm*
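
A green task status says the task exited cleanly, not that a fresh model exists. One hedge is to check the outcome directly. The sketch below is scheduler-agnostic and uses no Airflow APIs. `check_freshness` is a hypothetical name, and it assumes the last successful training time is stored somewhere as a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def check_freshness(
    last_trained: datetime,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> timedelta:
    """Return the model's age, or raise RuntimeError if it exceeds max_age.

    Both datetimes must be timezone-aware so the subtraction is valid.
    """
    now = now or datetime.now(timezone.utc)
    age = now - last_trained
    if age > max_age:
        raise RuntimeError(
            f"model is stale: last trained {age.days} days ago, "
            f"limit is {max_age.days} days"
        )
    return age
```

Run as its own scheduled step, a raised error here fails loudly even when every upstream task reports success.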

We had to refactor our training script just to make the platform run it.

It didn't like how we handled arguments, so we had to wrap everything in a main function, switch to their logger, and flatten our directory structure. It works now—but it feels like pointless work.

I ran a multi-day training job on the wrong dataset… and didn't notice.

The path was printed in the logs, but I didn't check. There's no clear way to see what data version was used unless you dig. I only realized it when results didn't match up. So I reran it manually, hoping I got it right this time.

We built a custom UI to make our ML tools easier to use… and now I'm stuck supporting it.

It started as a quick frontend so people could trigger jobs. Now people ask for features, report bugs, and treat me like the product team. Did we get more hands on this? No.

We use three different ML tracking tools across projects… and tracing anything is a mess.

Most teams use MLflow—just on different instances. One team uses W&B. Models don't move between teams often, but sometimes the data does… and then it's anyone's guess where to look.

A compliance request turned into a massive Slack thread that no one wanted to join.

They asked how the model was trained. That experiment wasn't fully tracked. So we started guessing, sharing commits, digging through logs, and hoping someone remembered the details.

Our pipeline broke months ago… so now we just trigger jobs manually.

It's in the backlog. We'll get to it. For now, we SSH into the runner, change a few lines, and run with it. There's a script called final_working_run.sh. That's our CI.

I once faked a lineage graph for a leadership presentation.

We didn't have full tracking for the experiment chain, so I opened draw.io, dragged some boxes around, and said, "Here's how it works." It got approved.

We adopted a new tool after a cool demo… and now we can't get rid of it.

Someone saw it at a meetup, gave it a shot, and it kind of worked. Then we wrote wrappers, added jobs, and used it in customer work. Now it's woven in deep—and we're stuck.

I can't run half our pipelines because I'm stuck in a maze of access approvals.

One role controls compute, another storage, another secrets. But I found a way to get a local copy. So, whatever. I'll run with that.

Our 'production pipeline' is just a notebook exported to Python and thrown in Airflow.

There are still markdown headers in the code. It technically runs end to end, but it's a copy-paste job wrapped in retries. Nobody wants to refactor it because it "works."

I spend more time debugging YAML files than building models.

My job title says "Machine Learning Engineer" but I'm really a "Configuration File Wrangler." Every pipeline change requires updates to multiple YAML configs across different systems. I've memorized the exact whitespace needed for each file. A single misplaced tab can waste hours of compute time with cryptic errors.
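
YAML does not allow tab characters for indentation, so a cheap pre-flight check can catch that class of mistake before any compute is spent. This is a crude text scan, not a YAML parser: the function name is hypothetical, and it may also flag tabs that are legitimately part of block scalar content.

```python
from typing import List, Tuple


def find_tab_indentation(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs whose leading whitespace contains a tab."""
    offenders = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Slice off everything after the leading run of spaces and tabs.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            offenders.append((number, line))
    return offenders


config = "train:\n  epochs: 10\n\tbatch_size: 32\n"
for number, line in find_tab_indentation(config):
    print(f"line {number}: tab in indentation: {line!r}")
```

It fits naturally in a pre-commit hook or as the first step of a pipeline, ahead of the real config loader.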

None of our models are actually reproducible; we just pretend they are.

We have all this versioning for code and data, but something's always slightly different between runs. Different hardware. Different library versions. Subtle randomness we can't control. So when someone asks "can we reproduce the exact results from last month's experiment?" I smile and say "of course" while silently panicking.
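
Some of that drift can at least be recorded. The sketch below fixes a seed and captures the interpreter, platform, and selected package versions for a run. `snapshot_run` is a hypothetical helper. It only seeds Python's built-in `random` module, so NumPy or framework seeds and GPU nondeterminism are out of its scope.

```python
import json
import platform
import random
import sys
from importlib import metadata
from typing import Dict, List, Optional


def snapshot_run(seed: int, packages: List[str]) -> Dict[str, object]:
    """Seed the stdlib RNG and return a JSON-serializable environment record."""
    random.seed(seed)

    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            versions[name] = None

    return {
        "seed": seed,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


print(json.dumps(snapshot_run(42, ["pip"]), indent=2))
```

Saving this record next to each experiment's outputs will not make runs identical, but it does make "what was different last month?" answerable.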

We have one GPU server that we fight over through passive-aggressive Slack messages.

It started with a "courtesy" system of announcing when you're using the GPU. Now people book it days in advance. Someone created an unofficial signup sheet that half the team ignores. I've started running jobs at 2 AM to avoid the politics.

Our dataset versioning strategy is basically adding "v2", "final", and "actually_final" to filenames.

Think "data_v1.csv", "data_v1_fixed.csv", "data_v1_final.csv", "data_v1_final_ACTUALLY_FINAL.csv"... I've lost track of which one we used for which experiment. When someone asks which version produced a specific result, I have to check file creation dates and make an educated guess.

Our onboarding doc says "just ask Mark"—but Mark left 18 months ago.

Every time we onboard someone, we tell them to read the wiki. Then we immediately say "but that part's not up to date" for literally every section they ask about. At this point, tribal knowledge is our only documentation. We keep saying we'll fix it after this sprint, but that's been the plan for the past 6 quarters.

Our production fallback strategy is "text Daniel, he'll know what to do."

We have monitors, alerts, and dashboards—but when things really break, it always needs human intervention. Daniel is the only one who really understands how all the pieces connect. We've been meaning to document the recovery process, but Daniel keeps solving problems faster than we can write them down.

I do all the real work locally, then recreate it on the platform for show.

The "official" platform is so cumbersome that I develop everything on my laptop first. Once it works, I carefully replicate each step in the platform so it looks like I followed the proper workflow. It doubles my work, but the alternative is fighting with the platform's opinions about how my code should be structured.

Our ML deployment process requires 18 manual steps across 5 different systems.

We've tried to automate it three different times. Each attempt added more complexity without removing manual steps. Now we have a Google Doc titled "Extremely Important: Production Deployment Procedure" that's 7 pages long with screenshots. The last step is literally "Say a prayer and click deploy."

Our "temporary" proof-of-concept has been running in production for 3 years.

We built a quick proof of concept to show the business what was possible. They loved it and started using it immediately. We kept saying we'd replace it with a "proper" solution, but the business didn't see why we needed to fix something that worked. Now it's mission-critical infrastructure that everyone's afraid to touch.

We avoid updating our ML libraries because no one wants to debug the inevitable pipeline breakage.

We're running libraries that are 2-3 years out of date because every time we've tried to upgrade, something breaks in a non-obvious way. Our requirements.txt has specific version pins with comments like "DO NOT UPGRADE - breaks model serialization" and "version 2.5+ causes GPU memory leak."

Ready to Run Your Own Vibe Check?

If any of these confessions hit a little too close to home, it might be time for a fiiliskierros (Finnish for a "vibe round," a quick check-in on how everyone is feeling). Our free facilitator's guide has everything you need to unpack your own ML workflows, with less cringe and more clarity.
