A Python Packaging Carol

January 5, 2020 Django, Programming, Python

I have endeavoured in this Ghostly little book, to raise the Ghost of an Idea, which shall not put my readers out of humour with themselves, with each other, with the season, or with me. May it haunt their houses pleasantly, and no one wish to lay it.

Every year around Christmas, I make a point of re-reading Charles Dickens’ A Christmas Carol. If you’ve only ever been exposed to the story through adaptations into other media, I highly recommend getting a copy of Dickens’ original text (it’s public domain and you can read it for free from Project Gutenberg). It’s a good story, and Dickens has a wonderful way with the language of his era.

This year’s read of Dickens inspired the framing for a post about Python packaging, but a combination of end-of-year work and, well, Christmas means that this post may just barely be published within the liturgical season of Christmas, and only if I hurry to finish it (I’m writing much of this on Twelfth Night itself); otherwise, it may not appear until Epiphany or even later. Hopefully you will, in keeping with the season, forgive me that lateness.

Packaging’s Ghost

Quite often, I see people being wrong on the internet about Python packaging. But the way in which they’re wrong is subtle, and often passes unnoticed. The issue with much of the discussion is in conflating multiple different things under the term “packaging”, and failing to be clear exactly which of them is being discussed or criticized. In the spirit of Dickens, I’d like to break up the concept of “packaging” into at least three different topics, each associated with a specific use case, and then talk a bit about each of them:

The Spirit of Distributable Artifacts
The Spirit of Installing Packaged Code
The Spirit of Managing Environments

Each of these is part of “packaging”, but each deserves its own independent treatment, and especially each deserves to be praised or criticized on its own merits or flaws, and not those of the others. So let us, like Ebenezer Scrooge, await the visit of…

The First of the Three Spirits

I’ve chosen to call this one “Distributable Artifacts”, which is a bit of a mouthful, but at least is accurate. The use case here is: you have some Python code you’ve written. It works, and does the thing(s) you want it to do. Can you produce, from it, some type of (likely metadata-bearing) artifact which could be used to get the same code running somewhere else?

The answer is that of course you can. And doing so is not difficult. The Python packaging ecosystem provides you with a set of default, if sometimes a bit low-level, tools for handling the key use cases of packaging, and for this one the answer is: use setuptools to write a setup.py file, then run that setup.py file with arguments to produce the type of artifact you want (for example, to get a .tar.gz source distribution, you’d run setup.py sdist, while for a .whl package you’d run setup.py bdist_wheel).
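To make that concrete, here’s a minimal sketch of what such a setup.py might look like. The project name, version, description, and dependency range are all placeholders I’ve made up for illustration, not taken from any real project, and a real file would usually carry more metadata (author, license, classifiers, and so on):

```python
# setup.py: a minimal sketch. Every value below is a placeholder
# chosen for illustration, not taken from a real project.
from setuptools import find_packages, setup

setup(
    name="carol-example",  # hypothetical distribution name
    version="0.1.0",
    description="An illustrative package",
    # Collect every importable package under this directory, skipping tests.
    packages=find_packages(exclude=["tests", "tests.*"]),
    python_requires=">=3.6",
    install_requires=[
        "Django>=2.2,<3.1",  # a broad compatibility range, not an exact pin
    ],
)
```

With a file like this sitting next to the code, python setup.py sdist should leave a .tar.gz in a dist/ directory, and python setup.py bdist_wheel a .whl (the latter assumes the separate wheel package is installed alongside setuptools).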

But what, I hear you saying, about all those blog posts and forum comments bemoaning hundreds of tools and thousands of problems and the utter brokenness and impossibility of doing this?

I wish I knew. I suspect a lot of it is just people repeating and amplifying other comments they’ve seen, without knowing of their own knowledge of any problems. Writing a setup.py file and using it to produce a distributable artifact simply isn’t a terribly complex task most of the time; if you’re someone who already knows Python the programming language well enough to be able to produce working code you’d like to distribute, you can probably learn to write, and then write, a working setup.py file for it in an hour or so.

Of course, I did put a caveat in there: “most of the time”. But the cases where producing a distributable artifact from working Python code is difficult are the cases that were already going to be difficult no matter what Python or its ecosystem did. Most often, they’re cases where the thing to be distributed includes not just Python source code, but also compiled extensions written in at least one and perhaps multiple other languages, each with its own dependencies.

Let’s consider the specific example of SciPy, which is a very popular package and one of the foundations of Python’s scientific-computing stack. Building SciPy manually from source requires up to five different interpreters/compilers (a Python interpreter, a C compiler, a C++ compiler, a Fortran compiler, and depending on how you build possibly also the Cython compiler), plus a variety of headers/libraries for the C/C++ bits to link against, and a pre-existing installation of the (Fortran) BLAS and LAPACK linear-algebra packages.

I don’t know of any language packaging system which makes it easy to produce a single package that seamlessly ties together code in the host language plus three others, each potentially with its own language-specific dependencies. I also don’t think it’s particularly reasonable to expect a language packaging system to make this easy; just making it possible at all is pretty good. Python’s default packaging toolchain does make this possible, and that’s all I feel I can reasonably ask of it.

So now let us climb back into our warm snug beds to await the visit of…

The Second of the Three Spirits

I’ve called this one “Installing Packaged Code”, and that’s a somewhat more self-explanatory name. The idea here is: someone has written some code and, following a visit from the Spirit of Distributable Artifacts, has produced a package from it. You would now like to install that package somewhere and make use of the code in it.

And again, you can do this, and do it quite easily. The standard tool in the Python packaging ecosystem for this is pip, which is the package installer. Want to install Django? pip install Django will do it for you. Want a specific version? pip install "Django==3.0.2" (that’s the latest release as I’m writing this). There are also a couple variant options if you want to have Django install optional dependencies for some of its add-ons.

But what about a terribly complex package like SciPy? You can pip install scipy. Because the Spirit of Distributable Artifacts has made this possible, SciPy provides pre-compiled packages for a variety of Python versions, operating systems, and processor architectures. And it works! I once had the misfortune of needing to build SciPy from source, and it was an hours-long process. But while writing this paragraph I paused, spun up a virtualenv, and ran pip install scipy in it. The time utility tells me it took 57.924 seconds, most of which was spent on downloading the pre-built packages for SciPy and its dependency NumPy.
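How does pip know which pre-compiled package fits your machine? The wheel format encodes the Python version, ABI, and platform in the filename itself. Here’s a rough sketch of pulling those fields apart; the helper name and the two filenames are my own illustrative choices, and pip’s real logic for matching tags against your interpreter is considerably more involved:

```python
from typing import NamedTuple, Optional


class WheelTags(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into the dash-separated fields of the wheel format."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        # name-version-python-abi-platform (no optional build tag)
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        # name-version-build-python-abi-platform
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return WheelTags(name, version, build, python_tag, abi_tag, platform_tag)


# Illustrative filenames: one compiled wheel, one pure-Python wheel.
compiled = parse_wheel_filename("scipy-1.4.1-cp38-cp38-manylinux1_x86_64.whl")
pure = parse_wheel_filename("Django-3.0.2-py3-none-any.whl")

print(compiled.python_tag, compiled.abi_tag, compiled.platform_tag)
print(pure.python_tag, pure.abi_tag, pure.platform_tag)
```

The compiled wheel is tied to one interpreter version and platform, which is why a project like SciPy has to publish many of them, while the pure-Python wheel declares that it runs anywhere.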

The most common complication in this use case is pinning dependencies. Typically, Python packages specify their dependencies broadly, with a range of compatible versions. For example, I’m working right now on updating all my personal open-source Django apps following last month’s release of Django 3.0, and in my setup.py I’ll be specifying something like Django>=2.2,<3.1 to declare that I test on and support a range of versions of Django. But when deploying this site, for example, I currently pin an exact version of Django (right now, Django 2.2.9).
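The difference between a range and a pin can be sketched in a few lines. This is a deliberately simplified illustration with helper names of my own invention: it only understands plain dotted numeric versions, whereas real tools follow the full PEP 440 rules (pre-releases, post-releases, and so on):

```python
from typing import Tuple


def parse_version(text: str, width: int = 3) -> Tuple[int, ...]:
    """Turn "2.2" into (2, 2, 0) so that versions compare as tuples."""
    parts = tuple(int(piece) for piece in text.split("."))
    # Pad short versions with zeros so "2.2" and "2.2.0" compare as equal.
    return parts + (0,) * (width - len(parts))


def in_range(version: str, lower: str, upper: str) -> bool:
    """Check lower <= version < upper, as in a range like ">=2.2,<3.1"."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# A package declares the broad range; a deployment picks one version inside it.
for candidate in ("2.1.15", "2.2.9", "3.0.2", "3.1"):
    print(candidate, in_range(candidate, "2.2", "3.1"))
```

Any version for which this prints True would satisfy the package’s declared range; pinning just means choosing exactly one of them for a given deployment.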

This is useful for a few reasons: it lets you take advantage of pip’s ability to verify package hashes when installing an exactly-specified version of something, and it lets you be sure that what you’re deploying to your production environment contains exactly the same code you were running and testing in your development and staging environment.

But it does require a bit of extra work. This is what a requirements file is for: that’s a file, usually named requirements.txt, which is passed to pip install with the -r command-line flag. When you do that the exact-pinned versions in the requirements file are used to narrow the broad compatibility ranges specified by the various packages. Hence the common refrain that requirements files aren’t for specifying dependencies; they’re for reproducing a specific known environment.
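To illustrate what “reproducing a specific known environment” means, here’s a small sketch that lists whatever is installed for the current interpreter as exact name==version pins, the style of line a requirements file typically holds. The function names are mine, it assumes Python 3.8 or later for importlib.metadata, and it’s only an approximation of what pip freeze does for you:

```python
from importlib import metadata  # in the standard library since Python 3.8
from typing import Dict, List


def format_pins(versions: Dict[str, str]) -> List[str]:
    """Render {name: version} as sorted "name==version" lines."""
    ordered = sorted(versions.items(), key=lambda item: item[0].lower())
    return [f"{name}=={version}" for name, version in ordered]


def installed_versions() -> Dict[str, str]:
    """Collect the name and version of each distribution visible to this interpreter."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.version
        if name and version:  # skip entries whose metadata is incomplete
            found[name] = version
    return found


if __name__ == "__main__":
    print("\n".join(format_pins(installed_versions())))
```

Saving output like this as requirements.txt and later running pip install -r requirements.txt in a fresh environment is the usual way to get the same set of versions back.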

Once again, though, this is making simple things easy, and making slightly more complex things at least possible (and for most people, not that much more difficult than the base case of installing a single package via plain pip install).

So where is the horrid complex awfulness of Python packaging? Well, as in Dickens’ work, we now must tremble in fear as we await the visit of a grim, hooded figure:

The Last of the Spirits

I called this one “Managing Environments”. And I’ll be honest: it is by far the most complex of the three, and the one most likely to produce frustration. Although, again, much of the complexity scales according to what you’re doing, and the complexity isn’t necessarily due to bad decisions made by the authors of the tools (at least, not decisions that were bad at the time they were made).

When I say “managing environments”, what I mean is: at some point as a Python developer you will write some Python code, and you will need to be able to run it and its tests and have access to all its dependencies. You may also need to be able to deploy it somewhere else, and again have it run with all its dependencies. And you may need to do this for:

Multiple different projects at once
Multiple different versions of Python
Multiple different versions of dependencies
Multiple different versions of the code itself
Conflicting or incompatible dependencies

Or all of the above. Simultaneously. And you often will need them all to be able to work, on the same machine (either your local development machine, or the environment you’re deploying to, or both).

And this… gets complicated, because different versions of Python, and different or conflicting or incompatible sets of dependencies, and different versions of your code, all have a tendency not to get along when you try to use them all simultaneously in a single environment.

Some of the difficulty here comes from Python’s age. While it’s not as old as some of the languages mentioned further up in this post, it is older than many people seem to realize — older than Java! — and is a product of its time. In the early 1990s, and even for some time afterward, the standard for Unix-y languages was to use system-wide shared locations for libraries, or at best a system-wide shared location plus a per-user location, to keep different users’ work from interfering with each other.

A brief aside on what a Python installation looks like

It’s worth pausing for a moment here to understand what actually makes up an installation of Python (assuming you’ve installed the CPython interpreter from python.org — system package managers may lay it out differently!). Effectively, it’s a set of directories and files whose locations are determined relative to a prefix the Python interpreter knows about. The important ones for our purposes are:

A directory named bin/ which gets added to your $PATH, and contains the Python interpreter. Often, executable Python scripts like pip also go in here.
A directory named lib/ containing a subdirectory named pythonX.Y after the version of Python. For example, on Python 3.8, this is lib/python3.8/.
Inside that directory will be a copy of the Python standard library, along with another directory named site-packages/.
Inside the site-packages/ directory are packages you’ve installed using this Python installation; this directory is automatically included on Python’s import path (sys.path).

This means that packages you install go into… a system-wide (or, at least, Python-installation-wide) shared location. Which makes it awkward to have two different versions of a package installed simultaneously, or to install two different packages whose dependencies conflict. Python does support a per-user location as well, but this doesn’t solve the problem.
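If you’d like to see this layout for your own interpreter, the standard library will tell you. A small sketch (the function name and the labels are my own; the exact paths it prints will differ by platform and by how Python was installed):

```python
import sys
import sysconfig


def describe_installation() -> dict:
    """Report where the running interpreter keeps the pieces described above."""
    paths = sysconfig.get_paths()
    return {
        "prefix": sys.prefix,              # the root everything else hangs off
        "interpreter": sys.executable,     # normally lives in the bin/ directory
        "scripts": paths["scripts"],       # where executables like pip are installed
        "stdlib": paths["stdlib"],         # lib/pythonX.Y/ on Unix-like systems
        "site-packages": paths["purelib"], # where installed packages end up
    }


for label, location in describe_installation().items():
    print(f"{label:>14}: {location}")
```

Run it once with your system Python and once from inside a virtual environment, and the prefix and site-packages lines should change.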

The Last of the Spirits (continued)

The standard answer here is the virtual environment, also known as “virtualenv” or “venv” (provided, now, in the Python standard library by the module appropriately named venv). A virtual environment is a copy of the above layout (bin/ with its files, lib/ with its site-packages/, etc.) with its own prefix set, which tells the Python interpreter to use that copy of the layout instead of a system-wide one. And that’s just about it: the only other major thing in a virtualenv is a little convenience script (technically, several scripts, for different shells and platforms) that can “activate” and “deactivate” the virtualenv, which mostly consists of changing environment variables to ensure your shell finds the virtualenv’s Python interpreter (which is usually just a symlink to the “parent” interpreter that created the virtualenv) ahead of any others. You can install things to your heart’s content, and they’ll go into the site-packages of the virtualenv, visible only to that virtualenv’s interpreter (unless you go and manually set $PYTHONPATH to point there).
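The prefix trick is visible from inside Python. As a small sketch (the function name is mine, and the check assumes an environment created by the standard-library venv module; older releases of the third-party virtualenv tool signalled this differently):

```python
import sys


def in_virtual_environment() -> bool:
    """True when the interpreter's prefix differs from its parent installation's."""
    # Inside a venv, sys.prefix points at the environment's directory while
    # sys.base_prefix still points at the installation that created it.
    return sys.prefix != sys.base_prefix


print("prefix:     ", sys.prefix)
print("base prefix:", sys.base_prefix)
print("in a virtual environment:", in_virtual_environment())
```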

Want to try it out? Type this in a shell, and it will create a virtualenv named “my-test-venv” in your current directory:

python -m venv my-test-venv

You can poke around in it and see the various bits, including a symlink to the “parent” interpreter which created it, the activation scripts, and its isolated lib/ with site-packages/. Want to use it? Type source bin/activate inside the virtualenv (or invoke the appropriate script for your shell; by default, it generates activation scripts for bash-compatible, csh, fish, and PowerShell). Then you can do whatever you like in it without affecting your system-wide Python. And when you’re done, type deactivate, and you can delete the my-test-venv directory.
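The same thing can be done from Python itself, which is handy for scripting. This is only a sketch: the helper name is my own, it builds the environment in a throwaway temporary directory, and it passes with_pip=False to stay fast and offline (the python -m venv command, by contrast, installs pip into the new environment by default):

```python
import tempfile
import venv
from pathlib import Path


def create_demo_venv(parent: Path) -> Path:
    """Create a virtual environment named my-test-venv under the given directory."""
    env_dir = parent / "my-test-venv"
    # with_pip=False skips bootstrapping pip into the new environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


with tempfile.TemporaryDirectory() as scratch:
    env_dir = create_demo_venv(Path(scratch))
    # List the top-level pieces: the bin/ (or Scripts/) directory, lib/, pyvenv.cfg.
    created = sorted(entry.name for entry in env_dir.iterdir())
    has_config = (env_dir / "pyvenv.cfg").is_file()
    print(created)
```

The pyvenv.cfg file it leaves behind is how the environment’s interpreter learns which “parent” installation it came from; the whole directory disappears again when the with block exits.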

So virtual environments are a handy solution to this problem. They’re lightweight, fast and easy to create, easy to get rid of when you don’t need them later on, and they provide clean isolation of different sets of installed packages from each other.

This is, incidentally, how more recently-created languages mostly do things: rather than have a shared system-wide location for dependencies they’ll import or link against, the more common approach now is for each project to default to its own isolated environment and pull and build against its own copies and versions of dependencies.

And the virtualenv solution extends not just to different sets of libraries, but to completely different Python interpreters. I currently have ten different Python interpreters, for various releases of both CPython and PyPy, sitting on my personal laptop, and I use virtualenvs tied to particular interpreters for most of the open-source work I do.

There is a problem here, though: virtual environments are very much a low-level interface. So too are setup.py files and pip and requirements files, but while people generally don’t mind writing a setup.py file for the first use-case, or writing out pip commands and/or simple requirements files for the second, many people do seem to balk at manually creating and managing a bunch of virtual environments and all their installed packages.

There is, currently, no single standard high-level tool in the Python ecosystem to do this for you. There are multiple competing projects which either target this use case specifically, or solve it as part of trying to provide a unified high-level workflow for all the packaging use-cases.

The end of it

If you’re still with me after all that, I will now finally come to the point: when people complain about “Python packaging”, in my experience they are almost always complaining not about the distributable-artifact use-case, or the installing-packaged-code use-case, but rather about the managing-environments use-case. As I’ve explained above, this is the one that’s most complicated, and for which people seem most desperately to want a high-level tool to manage that complexity.

I am not going to recommend, in this post, any particular high-level tool. I’m not sold on any of them, and for the most part I manage my virtualenvs manually. I also don’t use that many virtualenvs directly: for my open-source projects, I use tox as the test runner, and it manages the matrix of environments automatically during the test run. And at my day job, my dev environment is provided by a set of Docker containers which have to account for far more than just Python code.

But I will recommend that, next time you see someone ranting about “Python packaging”, you think back to what I’ve said here, and investigate a bit on your own to see just which part of it they’re complaining about. The moral of this story is that actual packaging — producing, distributing, and installing metadata-bearing artifacts which reproduce working code in a new location, as covered in the first two use-cases above — is pretty good, as people have been pointing out for a while now. There is complexity, but as I hope I’ve demonstrated, the complexity is present when the thing you need to do genuinely is complicated all on its own.

The hard part — the part that probably does deserve criticism (some of the stronger rants go a bit too far, in my opinion, in attacking particular people/projects or misrepresenting which of the use-cases they’re complaining about) — is that third use-case of managing environments. I don’t know what the solution will be there, but I’d very much like to see a good solution to it that mostly unifies the community.

And yes, I’m aware that there is another sort of debate, less publicized and largely coming from the scientific-Python world, which consists more of “why don’t all you other Python users scrap all that pip stuff and just use conda instead” (conda being a separate and much more system-package-manager-like toolchain, heavily used in the scientific Python world with its pile of complex multi-language dependencies though not so well-known outside of it), but that’s a debate for another day.

In the meantime, I belatedly wish you a merry Christmas (or appropriate seasonal holiday of your choice), a happy new year, and a solemn moment of respect for the passing of Python 2. And as Dickens might say were he a Python programmer: May Guido bless us, every one.