What every data scientist should know about Python dependencies
Juha Kiili
What is dependency management anyway?
Dependency management is the act of managing all the external pieces that your project relies on. It has the risk profile of a sewage system. When it works, you don’t even know it’s there, but it becomes excruciating and almost impossible to ignore when it fails.
Every project is built on someone else's sweat and tears. Those days when an engineer woke up, made coffee, and started a new project by writing a bootloader – the program that boots up your computer from scratch – are history. There are massive stacks of software and libraries beneath us. We are simply sprinkling our own thin layer of sugar on top.
My computer has a different stack of software than yours. Not only are the stacks different, but they are forever changing. It is amazing how anything works, but it does. All thanks to the sewage system of dependency management and a lot of smart people abstracting the layers so that we can just call our favorite pandas function and get predictable results.
Basics of Python dependency management
Let's make one thing clear. Simply installing and upgrading Python packages is not dependency management. Dependency management is documenting the required environment for your project and making it easy and deterministic for others to reproduce it.
You could write installation instructions on a piece of paper. You could write them in your source code comments. You could even hardcode the install commands straight into the program. Dependency management? Yes. Recommended? Nope.
The recommended way is to decouple the dependency information from the code in a standardized, reproducible, widely accepted format. This allows version pinning and easy deterministic installation. There are many options, but in this article we'll describe the classic combination of pip and a requirements.txt file.
But before we go there, let's first introduce the atomic unit of Python dependency: the package.
What is a package?
"Package" is a well-defined term in Python. Terms like library, framework, toolkit are not. We will use the term "package" for the remainder of this article, even for the things that some refer to as libraries, frameworks, or toolkits.
A module is everything defined in a single Python file (classes, functions, etc.).
A package is a collection of modules.
Pandas is a package, Matplotlib is a package, but the print() function is not a package. The purpose of a package is to be an easily distributable, reusable, and versioned collection of modules with well-defined dependencies on other packages.
You are probably working with packages every day by referring to them in your code with the Python import statement.
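To make the module/package distinction concrete, here is a minimal sketch that builds a tiny package on disk and imports it. The package name bubbletools and its two modules are made up for this illustration; a real package would also carry metadata such as its own dependencies.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Create a hypothetical package called 'bubbletools' with two modules."""
    package_dir = root / "bubbletools"
    package_dir.mkdir()
    # __init__.py marks the directory as a regular package.
    (package_dir / "__init__.py").write_text('__version__ = "0.1.0"\n')
    # Each .py file inside the directory is one module of the package.
    (package_dir / "greetings.py").write_text(
        'def hello(name):\n    return f"Hello, {name}!"\n'
    )
    (package_dir / "maths.py").write_text("def double(x):\n    return 2 * x\n")


with tempfile.TemporaryDirectory() as tmp:
    build_demo_package(Path(tmp))
    sys.path.insert(0, tmp)  # make the new package importable
    try:
        importlib.invalidate_caches()
        package = importlib.import_module("bubbletools")
        greetings = importlib.import_module("bubbletools.greetings")
        maths = importlib.import_module("bubbletools.maths")
        message = greetings.hello("pandas")
        doubled = maths.double(21)
        version = package.__version__
    finally:
        sys.path.remove(tmp)

print(message, doubled, version)
```

The directory is the package, and each file inside it is a module. That is all import needs to find your code.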
The art of installing packages
While you could install packages by simply downloading them manually to your project, the most common way to install a package is from PyPI (the Python Package Index) using the famous pip install command.
Note: Never use sudo pip install. Never. It is like running a virus. The results are unpredictable and will cause your future self major pain.
Never install Python packages globally either. Always use virtual environments.
What are virtual environments?
A Python virtual environment is a safe bubble. You should create a protective bubble around all the projects on your local computer. If you don't, the projects will hurt each other. Don't let the sewage system leak!
If you call pip install pandas outside the bubble, it will be installed globally. This is bad. The world moves forward and so do packages. One project needs the Matplotlib of 2019 and the other wants the 2021 version. A single global installation can't serve both projects. So protective bubbles are a necessity. Let's look at how to use them.
Go to your project root directory and create a virtual environment:
python3 -m venv mybubble
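The same bubble can also be created from Python with the standard library's venv module. This is only a sketch of what the command above does: it builds the environment in a temporary directory and skips installing pip (with_pip=False) so that it runs quickly and offline, whereas the command-line version installs pip by default.

```python
import tempfile
import venv
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    bubble = Path(tmp) / "mybubble"
    # Create the environment without pip to keep the sketch fast and offline.
    venv.create(bubble, with_pip=False)
    # Every virtual environment has a pyvenv.cfg file at its root that
    # records which Python installation it was created from.
    config_file = bubble / "pyvenv.cfg"
    has_config = config_file.is_file()
    config_text = config_file.read_text()

print("pyvenv.cfg present:", has_config)
print(config_text)
```

The printed pyvenv.cfg shows where the bubble's Python comes from, which is why the Python version inside a bubble is tied to the Python that created it.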
Now we have a bubble, but we are not in it yet. Let's go in! On Linux and macOS, activate it with:
source mybubble/bin/activate
Now we are in the bubble. Your terminal should show the virtual environment name in parentheses at the start of the prompt, like this:
(mybubble) $
Now that we are in the bubble, installing packages is safe. From now on, any pip install command will only have effects inside the virtual environment. Any code you run will only use the packages inside the bubble.
If you list the installed packages you should see a very short list of currently installed default packages (like pip itself).
pip list
Package       Version
------------- -------
pip           20.0.2
pkg-resources 0.0.0
setuptools    44.0.0
This listing no longer covers all the Python packages on your machine, only the Python packages inside your virtual environment. Also, note that the Python version used inside the bubble is the Python version you used to create the bubble.
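If you ever wonder whether your code is really running inside a bubble, you can ask Python itself. The sketch below uses only the standard library; the helper names in_virtual_environment and installed_packages are made up for this example, and the package listing is a rough stand-in for pip list, not a replacement for it.

```python
import sys
from importlib import metadata


def in_virtual_environment() -> bool:
    # Inside a venv, sys.prefix points at the bubble, while
    # sys.base_prefix still points at the Python that created it.
    return sys.prefix != sys.base_prefix


def installed_packages() -> dict:
    """Map each installed distribution name to its version."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken or missing metadata
            packages[name] = dist.version
    return packages


print("Inside a virtual environment:", in_virtual_environment())
print("Python version:", ".".join(str(part) for part in sys.version_info[:3]))
for name, version in sorted(installed_packages().items(), key=lambda item: item[0].lower()):
    print(f"{name:<24} {version}")
```

Run it once outside and once inside the bubble to see both the flag and the package list change.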
To leave the bubble, simply call deactivate.
Always create virtual environments for all your local projects and run your code inside those bubbles. The pain from conflicting package versions between projects is the kind of pain that makes people quit their jobs. Don't be one of those people.
What is version pinning?
Imagine you have a project that depends on the Pandas package and you want to communicate that to the rest of the world (and your future self). Should be easy, right?
First of all, it is risky to just say: "You need Pandas".
The less risky option is "You need Pandas 1.2.1", but even that is not always enough.
Let's say you are correctly pinning the Pandas version to 1.2.1. Pandas itself depends on numpy, but unfortunately it doesn't pin that dependency to an exact version. It just says "You need numpy".
At first, everything is fine, but after six months, a new numpy version 1.19.6 is released with a showstopper bug.
Now if someone installs your project, they'll get pandas 1.2.1 with buggy numpy 1.19.6, and probably a few gray hairs as your software spits weird errors. The sewage system is leaking. The installation process was not deterministic!
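The scenario can be played out with a toy resolver. This is not how pip actually resolves dependencies; it is a deliberately simplified sketch in which an unpinned requirement always gets the newest version available at install time. Apart from pandas 1.2.1 and numpy 1.19.6 from the story above, the version numbers in the two made-up indexes are purely illustrative.

```python
# Hypothetical snapshots of a package index at two points in time.
INDEX_TODAY = {
    "pandas": ["1.2.0", "1.2.1"],
    "numpy": ["1.19.4", "1.19.5"],
}
INDEX_IN_SIX_MONTHS = {
    "pandas": ["1.2.0", "1.2.1", "1.3.0"],
    "numpy": ["1.19.4", "1.19.5", "1.19.6"],
}


def version_key(version: str) -> tuple:
    """Turn '1.19.6' into (1, 19, 6) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve(requirements: dict, index: dict) -> dict:
    """Pick the pinned version if there is one, otherwise the newest available."""
    resolved = {}
    for name, pin in requirements.items():
        if pin is None:
            resolved[name] = max(index[name], key=version_key)
        elif pin in index[name]:
            resolved[name] = pin
        else:
            raise LookupError(f"{name}=={pin} is not available")
    return resolved


partially_pinned = {"pandas": "1.2.1", "numpy": None}
fully_pinned = {"pandas": "1.2.1", "numpy": "1.19.5"}

print(resolve(partially_pinned, INDEX_TODAY))
print(resolve(partially_pinned, INDEX_IN_SIX_MONTHS))
print(resolve(fully_pinned, INDEX_TODAY))
print(resolve(fully_pinned, INDEX_IN_SIX_MONTHS))
```

The partially pinned project installs a different numpy depending on the day you install it. The fully pinned one gives the same result both times.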
The most reliable way is to pin everything. Pin the dependencies of the dependencies of the dependencies of the dependencies, of the… You get the point. Pin 'em as deep as the rabbit hole goes. Luckily there are tools that make this happen for you.
Note: If you are building a reusable package and not a typical project, you should not pin so aggressively (this is why Pandas doesn't pin to an exact Numpy version). It is considered best practice to let the end user of the package decide what to pin and how aggressively. If you as a package creator pin everything, you close that door for the end user.
How do I pin Python dependencies?
Whenever you call pip install to get some hot new package into your project, you should stop and think for a second. This will create a new dependency for your project. How do I document this?
You should write down new packages and their version numbers in a requirements.txt file. It is a format understood by pip, which lets you install multiple packages in one go.
pip install -r requirements.txt
This is already much better than most data science projects that one encounters, but we can still do better. Remember the recursive dependency rabbit hole from the previous section about version pinning? How do we make the installation more deterministic?
The answer is the pip-compile command and a requirements.in text file.
# Auto-generate requirements.txt
pip-compile requirements.in
# Generated requirements.txt
cycler==0.11.0
    # via matplotlib
kiwisolver==1.3.2
    # via matplotlib
matplotlib==3.4.0
    # via -r requirements.in
numpy==1.22.0
    # via
    #   matplotlib
    #   pandas
pandas==1.2.1
    # via -r requirements.in
pillow==9.0.0
    # via matplotlib
pyparsing==3.0.6
    # via matplotlib
python-dateutil==2.8.2
    # via
    #   matplotlib
    #   pandas
pytz==2021.3
    # via pandas
six==1.16.0
    # via python-dateutil
In requirements.in you should only put your direct dependencies.
pip-compile will then generate exact pins for all the packages, including the nested ones, in requirements.txt, which provides all the information needed for a deterministic installation. Easy peasy! Remember to commit both files to your git repository, too.
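A quick sanity check for any requirements.txt is to confirm that every line really is an exact pin. The sketch below is a hypothetical helper, not part of pip or pip-tools, and it only understands plain name==version lines. Comments and pip options are skipped, and anything else is reported as unpinned. The sample text is made up for the demonstration.

```python
def parse_pins(requirements_text: str) -> dict:
    """Map each requirement to its pinned version, or None if it is not pinned."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        name, separator, version = line.partition("==")
        pins[name.strip().lower()] = version.strip() if separator else None
    return pins


def unpinned(pins: dict) -> list:
    """Return the requirements that have no exact version."""
    return sorted(name for name, version in pins.items() if version is None)


SAMPLE = """\
pandas==1.2.1
    # via -r requirements.in
numpy==1.22.0  # via pandas
matplotlib
"""

pins = parse_pins(SAMPLE)
print(pins)
print("Unpinned:", unpinned(pins))
```

A file generated by pip-compile should produce an empty unpinned list. A hand-written one often does not.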
How to pin the Python version?
Pinning the Python version is tricky. There is no straightforward way to pin the version of Python itself (without using e.g. conda).
You could make a Python package out of your project, which lets you define the Python version in setup.cfg with the python_requires key (for example, python_requires = >=3.9), but that is overkill for a typical data science project, which usually doesn't have the characteristics of a reusable package anyway.
If you are really serious about pinning to a specific Python version, you could also do something like this in your code:
import sys

if sys.version_info < (3, 9):
    sys.exit("Python >= 3.9 required.")
The most bullet-proof way to force the Python version is to use Docker containers, which we will talk about in the next chapter!
Don't avoid dependency management - Your future self will appreciate the documented dependencies when you pour coffee all over your MacBook.
Always use virtual environments on your local computer - Trying out that esoteric Python library with 2 GitHub stars is no big deal when you are safely inside the protective bubble.
Pinning versions is better than not pinning - Version pinning protects from packages moving forward when your project is not.
Packages change a lot, Python not so much - Even a single package can have dozens of nested dependencies and they are constantly changing, but Python is relatively stable and future-proof.
What about the cloud?
When your project matures enough and elevates into the cloud and into production, you should look into pinning the entire environment and not just the Python stuff.
This is where Docker containers are your best friend as they not only let you pin the Python version but anything inside the operating system. It is like a virtual environment but on a bigger scale.
Our next chapter will cover everything you need to know about Docker so stay tuned!