Makefile: the secret weapon for ML project management
Juha Kiili
If you are a machine learning pioneer, asking the computer entirely novel questions, there is a good chance that one-click tooling will not be handed to you on a silver platter. The likely scenario is gluing together technologies with duct tape. And in the digital world, the digital duct tape lives in the command line (CLI).
CLI usage is common when the technology is still nascent, and companies haven't had time to build nice buttons and sliders. Written commands are not as convenient, but they compensate by being flexible and powerful. Functionality is often available only via CLI or API first. Designing and building a UI is very labor-intensive, after all. This is especially true for integrations, as the number of potential combinations between products grows exponentially.
But with great power comes great responsibility. There is no record-keeping or documentation by default. The shell has a short-term history, but that doesn't help your team or even future you. You keep googling for those magic commands and relearning their particular parameters again and again. In an extreme case, a typo in a manually typed command could even take down your entire production cluster.
What if I told you there is a simple, free, lightweight tool - invented almost 50 years ago - for weaponizing any CLI-based ML project? It can document and re-parameterize all essential commands and the dependencies between them, relieving the poor human brain from memorizing everything. All the commands are nicely wrapped and accessible via shortcut aliases, only a TAB keypress away. This tool is easy to install and super robust on all operating systems! All the legendary UNIX tools have a pretty short name, and this one is no exception. It is called Make.
What is Make?
Make was originally a tool for compiling source code into binary executables. If you have ever had to build a library straight from source code, you have probably used it. In a modern machine learning project, we rarely compile any source code. Instead, we propose repurposing Make to power up the CLI usage of an ML project.
Benefits for an ML project:
- Document commands into a single human-readable text file.
- Execute commands with short aliases and TAB completion.
- Avoid re-running commands unnecessarily.
The default way to use Make is to create a text file called Makefile in the root directory of your project.
This file contains all the commands you'd call in the context of your ML project. You should always commit this file to your version control system.
Let's look at what a Makefile looks like for a typical ML project.
DOCKER_IMAGE := mycompany/myproject
VERSION := $(shell git describe --always --dirty --long)

default:
	echo "See readme"

init:
	pip install -r requirements.txt
	pip install -r requirements-dev.txt
	cp -u .env.template .env

build-image:
	docker build . -f ./Dockerfile -t $(DOCKER_IMAGE):$(VERSION)

push-image:
	docker push $(DOCKER_IMAGE):$(VERSION)

pin-dependencies:
	pip install -U pip-tools
	pip-compile requirements.in
	pip-compile requirements-dev.in

upgrade-dependencies:
	pip install -U pip pip-tools
	pip-compile -U requirements.in
	pip-compile -U requirements-dev.in

data:
	mkdir -p data

data/training.csv: data
	curl https://filesamples.com/samples/document/csv/sample1.csv > data/training.csv
	echo "Downloaded data/training.csv"

train: data/training.csv
	python train.py --learning-rate 0.0001 --dataset data/training.csv
In this example, we can see a few typical commands for an ML project. There is some dependency management, building and pushing Docker images to a repository, and downloading data for a local training run. All the good stuff.
The example relies on three core features of Make:
- Define targets (commands)
- Define dependencies between targets
- Define variables
Let's go through them one by one.
Targets
The commands are actually called "targets" in Make-speak. This terminology comes from the original use case of compiling programs.
The syntax for running a command (or compiling a target) is:
$ make <target>
$ make build-image
This runs the target called build-image from our example.
Defining and running targets is the fundamental functionality we are after. It lets us document all the important commands in a single readable text file. Everyone on the team can execute any command by typing "make" followed by a space and hitting the TAB key for a list of targets (this assumes your shell has completion for Make set up, which is common but not universal). It sounds so simple, yet it is hard to overstate the warm feeling of cloning a new repository with a Makefile. If the Readme is the repository check-in, the Makefile is the VIP lounge.
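Where TAB completion is not available, a self-documenting help target is a common fallback. The following is only a sketch of that widely used convention, not part of the example Makefile above: the `help` target and the `##` description comments are assumptions introduced here, and `$(MAKEFILE_LIST)` is a GNU Make variable.

```makefile
# Sketch: each target carries a "## description" comment on its rule line.
# Make itself ignores everything after "#"; only the help recipe reads it.
.PHONY: help build-image

help: ## List the documented targets
	@grep -hE '^[a-zA-Z0-9_-]+:.*## ' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*## "}; {printf "%-20s %s\n", $$1, $$2}'

build-image: ## Build the Docker image
	docker build . -f ./Dockerfile -t mycompany/myproject
```

Running `make help` would then print one line per annotated target with its description. Placing `help` first in the file also makes it the default target, so a bare `make` shows the list.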
Dependencies
We are not talking about Python dependencies here. By dependencies, we mean dependencies between the targets.
For example, if you run a local training run, you depend on having the training data on your hard disk. With Make, you can indicate that a target depends on another target.
Before understanding target dependency syntax, we must realize that Make sees each target as a file (or a folder) on your disk. For example, if you executed a target called "hello", Make searches for "hello" on your disk to see if it is already there. It runs your target only if it doesn't find that file, or if one of the target's dependencies is newer than the file. Otherwise, there is no need to re-run the target. In the ML project context, this can prevent us from downloading the same training data repeatedly, for example.
In our example:
data:
	mkdir -p data

data/training.csv: data
	curl https://filesamples.com/samples/document/csv/sample1.csv > data/training.csv
	echo "Downloaded data/training.csv"

train: data/training.csv
	python train.py --learning-rate 0.0001 --dataset data/training.csv
Here we have three targets: data, data/training.csv, and train.
Make sees everything as a target, but in reality:
- data is a folder.
- data/training.csv is a file.
- train is a command.
The dependencies between them are:
- train has a dependency on data/training.csv.
- data/training.csv has a dependency on data.
Now, when we run our train target, Make will check if we already have the data/training.csv around. If not, it will download it for us. Before downloading, it will also check if the data folder already exists and create it if necessary. You can think of target dependencies as file system dependencies.
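Two refinements are worth considering for these three targets. This is a hedged sketch that assumes GNU Make, since both `.PHONY` and order-only prerequisites (the `|` syntax) are GNU Make features. First, train never creates a file called train, so declaring it phony keeps it runnable even if such a file or folder appears. Second, a folder's modification time changes whenever files are added to or removed from it, so a plain dependency on data can make the CSV look out of date and trigger a new download.

```makefile
# train is a command, not a file: always run it when asked.
.PHONY: train

data:
	mkdir -p data

# "| data" is an order-only prerequisite: the folder must exist before
# the download, but its timestamp never forces a re-download.
data/training.csv: | data
	curl https://filesamples.com/samples/document/csv/sample1.csv > data/training.csv
	echo "Downloaded data/training.csv"

train: data/training.csv
	python train.py --learning-rate 0.0001 --dataset data/training.csv
```

With this version, `make train` downloads the data only when data/training.csv is missing, and runs the training script every time.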
Variables
As in all programming, you want to avoid repetition. Make has a simple variable system to do just that. You can get far without variables, but they will make your Makefile much cleaner.
In our example Makefile, we used variables for two things. Firstly, we define our Docker image name once instead of repeating it needlessly. Secondly, we use git to generate a version tag for our images. We could also propagate this tag to other places that refer to it, such as YAML configuration files. This would prevent us from using the notorious latest tag, which you should never use in production. Read more about the "perils of the latest" in our eBook.
VERSION := $(shell git describe --always --dirty --long)
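This generates a version string from the current state of your local git repository: the abbreviated commit hash, prefixed by the nearest annotated tag and the number of commits since it if the repository has such tags, and suffixed with -dirty if there are uncommitted changes. The cool thing is that using VERSION as your Docker image tag will make your life much easier. The Docker image tag now identifies the git commit it was built from. This way, as long as the tag has no -dirty suffix, you can reliably reproduce the exact state of the code repository that generated that image. This is a critical detail when debugging those hard-to-reproduce bugs in production.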
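Variables are also what makes commands re-parameterizable. As a minimal sketch, the train rule from the example could take its learning rate from a variable. The name `LEARNING_RATE` is an assumption introduced here, and the fragment is meant to replace the train rule in the example Makefile, which already knows how to produce data/training.csv.

```makefile
# Hypothetical variable. "?=" assigns the default only if LEARNING_RATE
# is not already set, for example through the environment.
LEARNING_RATE ?= 0.0001

train: data/training.csv
	python train.py --learning-rate $(LEARNING_RATE) --dataset data/training.csv
```

Running `make train` uses the default, while `make train LEARNING_RATE=0.001` overrides it for a single run, because variables given on the command line take precedence over assignments in the Makefile.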
I've been building all my recent ML projects around a Makefile, and it has been a great decision. The rule of thumb is that whenever I find myself running the same shell command twice, it goes into the Makefile, no questions asked. My future self is always delighted to find all these magic spells well documented. It is like an interactive version of the Readme file.
If you want to read more tips and tricks like this for managing your ML projects, we recommend checking out the Engineering Practices for Data Scientists eBook.