Let’s talk about packages

April 25, 2018 Python, Security

Recently the Python community got to celebrate the triumphant launch of the new Python Package Index. It’s been a long time coming, and the folks who made it happen all deserve a round of applause.

PyPI is an incredibly important piece of infrastructure for the Python community. If you use Python, you probably use PyPI, directly or indirectly, on a daily basis, and many sites, services, and companies rely heavily on it.

But: if you rely on PyPI, how often do you think about that reliance? What would you do if an incompatible version of one of your dependencies, or one of your dependencies’ dependencies, suddenly appeared? What if one of them disappeared? Would your applications or your deployments break? What would you do if PyPI had extended downtime? Would you still be able to deploy your applications? What if PyPI got hacked? If someone replaced one or more packages with malicious versions, would you be safe, or would you install the malicious packages and end up running them on your servers?

The people who run PyPI are smart and conscientious, and do what they can to keep it running 24/7 and free of compromise, but these are still serious questions you should be thinking about. They are certainly thinking about some of the scarier cases, and working on ways to make them less scary, but there are also things you can do.

So let’s talk a bit about that.

Background

Before diving in, though, I want to make sure everyone’s up to speed on an important and often-misunderstood thing: how to specify dependencies in Python.

Suppose you’re writing a Django application that you want to deploy on a server somewhere; you probably want to have Django available on that server, right? And also have it available in your development environment? So Django is a dependency of your application. How do you say that in a machine-readable way? There are two answers in the Python world, depending on exactly what it is you want to say about your dependencies.

One option is to use the install_requires argument in the setup() call inside your code’s setup.py file (assuming, of course, that you’re building your code to work as an installable Python package). I do this in the Django applications I distribute. Right now, I support Django 1.11 and 2.0, which are the two latest releases supported by the Django team, so my setup.py contains a line that says install_requires=['Django>=1.11',] (I could restrict that further to something like Django>=1.11,<2.1 to be safe when Django 2.1 comes out, and I suppose I might in the future, but some changes in 2.x to the way Django’s deprecation cycles work make me feel like it’s not urgently necessary).

Another option is to use a requirements file. Often, this will be named requirements.txt, and you’ll feed it to pip to install things: pip install -r requirements.txt. You might even have multiple requirements files; there might also be a dev_requirements.txt that installs some nice extra stuff for a dev environment, a test_requirements.txt that installs the tools your test suite needs to run, and so on.

If it seems like there should be one and only one way to do this, remember what I said a minute ago: it depends on what you want to say about your dependencies.

Using a dependency list in the install_requires of setup.py makes two statements:

These are the additional packages you’ll need in order to use my package, and
These are the versions of those packages I’m committed to supporting.

So when I say install_requires=['Django>=1.11',] in one of my Django apps, I’m telling you that you need Django, and that I’m OK with you using Django 1.11, or any newer version.

Using a dependency list in a requirements file makes a very different statement. It says: this is an environment I want to reproduce. While you can use loose version specifications in a requirements file, it’s generally a bad practice to do so. Instead, you should be listing packages at specific versions, because the utility of a requirements file is that it lets you specify exactly what you want to have installed.

Take this site, for example. It uses several of my personal apps, which all have that install_requires=['Django>=1.11',] line in their setup.py files. But it also provides a requirements file which says: Django==1.11.12. As I’m writing this, that’s the latest 1.11 release of Django, and it’s one that I’ve personally installed on my laptop and tested with, so that’s the version I want to have running on the server. That satisfies the dependencies listed by my apps (which don’t care what specific version I use as long as it passes the test of >=1.11), so when they get installed they see an appropriate version of Django is already present, and don’t try to install or reinstall Django.

While there are other use cases for requirements files, this distinction — of stating a general set of dependencies (install_requires) versus describing a specific environment to reproduce (requirements file) — is the biggest one, and the most important to be aware of for what follows below.
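To make the relationship between those two statements concrete, here is a minimal Python sketch of the check that matters: does the exact version pinned in a requirements file meet the lower bound declared in install_requires? The function names are my own, and the comparison is deliberately simplified to plain dotted numbers; pip applies the fuller PEP 440 rules, which also cover pre-releases and other suffixes.

```python
def parse_release(version):
    """Turn a plain dotted release such as '1.11.12' into (1, 11, 12)."""
    return tuple(int(part) for part in version.split("."))


def satisfies_minimum(pinned, minimum):
    """Return True if `pinned` meets a '>=minimum' lower bound."""
    pinned_parts = parse_release(pinned)
    minimum_parts = parse_release(minimum)
    # Pad the shorter tuple with zeros so that 1.11 compares as 1.11.0.
    width = max(len(pinned_parts), len(minimum_parts))
    pinned_parts += (0,) * (width - len(pinned_parts))
    minimum_parts += (0,) * (width - len(minimum_parts))
    return pinned_parts >= minimum_parts


# The pin from this site's requirements file, checked against the
# 'Django>=1.11' bound declared by my apps.
print(satisfies_minimum("1.11.12", "1.11"))
```

This prints True: the pinned 1.11.12 satisfies the looser >=1.11 statement, which is why installing my apps doesn't touch the Django that's already present.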

Also, just for completeness’ sake, it’s worth noting there are two other files that can sometimes show up in a Python package. The first is setup.cfg, which can supply much of the same information as setup.py but in a static way: setup.py is an executable script and may vary its behavior depending on the execution environment, and if you don’t supply the full set of arguments to its setup() function, it can read setup.cfg to find them. setup.cfg can also supply configuration data for other tools like coverage frameworks and linters.

There’s also the much newer pyproject.toml, which is defined by PEP 518. This solves a chicken-and-egg problem for package creators: setup.py can specify the tools and libraries needed to build a package, but that causes the uncomfortable situation where you need to execute setup.py in order to find out what’s required to execute setup.py. So anything specified in pyproject.toml is not a dependency for installation; it’s a dependency for working with (including building) the package in the first place.

For the rest of this post I’ll mostly be talking about dependencies specified via requirements files, since that’s typically how you’ll want to handle dependencies for deploying things with Python.

First step: use a requirements file

The most obvious way to deploy something is just to write out a requirements file listing your dependencies. For the sake of a not-too-complex example, here’s a requirements file that has only two dependencies, Django and the psycopg2 driver for a PostgreSQL database:

Django
psycopg2-binary

(Using psycopg2-binary is recommended; it provides the driver already compiled, while psycopg2 is a source-only version that has to compile on your system.)

You could just put this in a file named requirements.txt, then have your deployment process do pip install -r requirements.txt, and it would work… for now. But what happens as newer versions of Django and psycopg are released? What if there’s a backwards-incompatible change in one of their new releases?

A better approach is to specify the version you want of each package. Here’s an updated list, using the latest versions as of the time I’m writing this (April 2018):

Django==2.0.4
psycopg2-binary==2.7.4

You could use looser version specifications, but there’s always a risk of something changing even between minor versions, so specifying exact versions is preferred.
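If you want a quick guard against loose specifications creeping in, something like the following rough sketch could run as part of your checks. The function name is my own invention, and it assumes simple name-and-version lines; it does not try to understand URLs, editable installs, or environment markers.

```python
def unpinned_requirements(text):
    """Return requirement lines that do not pin an exact '==' version."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if "==" not in line:
            loose.append(line)
    return loose


sample = "Django==2.0.4\npsycopg2-binary\n"
print(unpinned_requirements(sample))
```

Here only psycopg2-binary is reported, since the Django line already names an exact version.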

But there’s still something missing! This requirements file lists two packages, but if you were to pip install from it you’d get three packages. That’s because Django has a dependency on pytz, specified in Django’s setup.py. So installing Django will also install pytz.

You could add it (pytz==2018.4 being the latest version as I write this) to the list, but this is getting a bit tedious. An easier approach is to pip install packages you’ll want, and then run pip freeze to print out a list of all the packages, including their exact versions, that pip has installed. You’ll want to do this from within a Python virtual environment to keep it isolated from any other packages you might have installed previously.

If you run pip freeze > requirements.txt, you’ll end up with:

Django==2.0.4
psycopg2-binary==2.7.4
pytz==2018.4
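If you're curious what that amounts to, here is a rough approximation using the standard library's importlib.metadata module (available in Python 3.8 and later, so newer than the versions discussed here). This is only a sketch of the idea, not how pip implements it; the real pip freeze also handles editable installs and leaves out pip's own tooling by default.

```python
from importlib import metadata


def frozen_requirements():
    """Return sorted 'name==version' lines for the installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip anything with missing metadata
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


print("\n".join(frozen_requirements()))
```

Run inside the virtual environment, this lists whatever is installed there, one exact pin per line.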

And now you have a complete list of the packages you’re using, at the exact versions you tested with as you were developing. You’re done now, right?

Next step: hash checking

Well, what if something went horribly wrong and the Python Package Index got hacked? What if someone were able to replace PyPI’s copy of Django 2.0.4 with something malicious? Would you end up installing it on your servers, and running whatever was inside?

The solution here is to use the hash-checking mode of pip. This lets you include the expected hash of the package in your requirements file; pip will download it, compute the hash, compare to what you supplied, and refuse to install if the hashes don’t match. You’ll get this mode automatically by adding some hashes to your requirements file, but you can also force it on with the --require-hashes command-line flag to pip. I recommend using that flag, since it will fail your deployment if you somehow wind up with a requirements file that doesn’t include hashes, and that’s probably the behavior you want.

Aside: if you want to, you can generate PGP signatures for your packages and upload the signatures to PyPI, but the tooling doesn’t really do anything with them at this point. For a public, anyone-can-publish repository like PyPI, PGP-signed packages are mostly not the panacea people often think they are. Linux distros declare a certain small set of signing keys to be trusted, and strictly control who can put packages in their repositories, while a service like PyPI doesn’t and mostly can’t.

Of course, this requires you to figure out the hash you expect from the package. You can do this manually: run pip download and pass a list of packages/versions, or pip download -r requirements.txt, to get a local copy of each package, then run pip hash on each one. For example, using the Django 2.0.4 package:

$ pip download "Django==2.0.4"
# … download messages omitted here …
$ pip hash Django-2.0.4-py3-none-any.whl
Django-2.0.4-py3-none-any.whl:
--hash=sha256:2d8b9eed8815f172a8e898678ae4289a5e9176bc08295676eff4228dd638ea61
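There's nothing exotic about that value: it's a SHA-256 digest of the downloaded file's bytes. As a minimal sketch (the function name is mine, not part of pip), you could compute the same kind of line with Python's hashlib:

```python
import hashlib


def pip_style_hash(path, chunk_size=65536):
    """Return a '--hash=sha256:<hex digest>' line for the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large packages don't need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"--hash=sha256:{digest.hexdigest()}"
```

Calling pip_style_hash("Django-2.0.4-py3-none-any.whl") on the file you just downloaded should give you a line in the same format as the pip hash output above.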

There is also a tool you can use to automate this process: hashin. It can be run on a single package at a time — like hashin "Django==2.0.4" — or you can have it process all your packages by using a Python (or even bash) for loop over the items in your requirements file, feeding them to hashin.
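A Python version of that loop might look something like this sketch. The helper names are hypothetical, and it assumes hashin is installed and on your PATH; as far as I know hashin writes its results into requirements.txt in the current directory by default, but check hashin --help before relying on that.

```python
import subprocess


def pinned_specs(text):
    """Collect 'name==version' specifiers from requirements-file text."""
    specs = []
    for raw in text.splitlines():
        tokens = raw.split("#", 1)[0].split()
        if tokens and "==" in tokens[0] and not tokens[0].startswith("-"):
            specs.append(tokens[0])
    return specs


def run_hashin(specs):
    """Invoke the hashin command once per specifier, stopping on the first failure."""
    for spec in specs:
        subprocess.run(["hashin", spec], check=True)
```

To use it, read your requirements file into a string, pass it through pinned_specs(), and hand the result to run_hashin().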

After doing this, your requirements file will contain hashes for each package, though be aware that hashin will generate a hash for every variant of the package. For example, here’s what you’ll end up with for Django 2.0.4:

Django==2.0.4 \
    --hash=sha256:2d8b9eed8815f172a8e898678ae4289a5e9176bc08295676eff4228dd638ea61 \
    --hash=sha256:d81a1652963c81488e709729a80b510394050e312f386037f26b54912a3a10d0

This is because Django 2.0.4 is on PyPI in two forms, a “source distribution” (.tar.gz file) and a “wheel” distribution (.whl). The distinction is that a source distribution has a “build” substep during installation, which may compile extensions, gather files to figure out where to put them, etc., while a wheel contains everything pre-built, and only needs to unpack the files and put them in the right locations. A wheel that includes compiled code is built for a specific Python version and operating system; a pure-Python wheel like Django’s works anywhere. The name “wheel” is a bit of a tortured reference: Python the language got its name from its creator being a fan of Monty Python, so the community uses a lot of Monty Python references. The original name for PyPI was the Python Cheese Shop, after a famous Monty Python comedy sketch, and cheese comes in big round blocks called wheels.
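You can tell the two forms apart, and see what a wheel was built for, from the filename alone. Wheel filenames follow the pattern name-version-pythontag-abitag-platformtag.whl (with an optional build tag after the version). Here is a small sketch that pulls those pieces out; the function name is mine, and the psycopg2_binary filename is just an illustration of the pattern for a compiled wheel.

```python
def describe_distribution(filename):
    """Classify a package file as a wheel or source distribution by name."""
    if filename.endswith(".whl"):
        parts = filename[: -len(".whl")].split("-")
        # The last three dash-separated fields are always the tags.
        python_tag, abi_tag, platform_tag = parts[-3:]
        return {
            "kind": "wheel",
            "name": parts[0],
            "version": parts[1],
            "python": python_tag,
            "abi": abi_tag,
            "platform": platform_tag,
        }
    if filename.endswith((".tar.gz", ".zip")):
        return {"kind": "sdist"}
    raise ValueError(f"unrecognized package file: {filename}")


print(describe_distribution("Django-2.0.4-py3-none-any.whl")["platform"])
print(describe_distribution("psycopg2_binary-2.7.4-cp36-cp36m-manylinux1_x86_64.whl")["platform"])
```

The first line printed is any, meaning the Django wheel is not tied to a platform; the second is manylinux1_x86_64, a platform-specific build.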

Anyway, the result is that Django uploads two packages for each release. And the listing for psycopg2-binary will be even worse, since it will have one hash for each compiled .whl variant, and there are twenty-seven of those representing different platforms it’s pre-compiled for. That makes the --verbose flag to hashin pretty useful; it will show you the hashes it’s calculating on a per-package basis, letting you spot things like the platform-specific precompiled .whl you’re going to install, and cut out all the others from your requirements file.

Also, be aware that the hash-checking mode of pip is all-or-nothing. You’ll have to specify your entire set of packages, including their dependencies, their dependencies’ dependencies, and so on, along with hashes for all of them. If you leave out even one package, or one hash, the entire install will bail out and yell at you for not having all the packages and hashes listed. This is why you probably want to automate as much of this as possible with tools like pip freeze and hashin.
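Since a single missing hash fails the whole install, it can be handy to catch that before deployment rather than during it. The following is a simplified sketch of such a check, not a reimplementation of pip's parser; the function name is hypothetical and it only understands plain name==version entries with backslash continuations.

```python
def missing_hashes(text):
    """Return requirement entries that lack an exact pin or a --hash option."""
    logical = text.replace("\\\n", " ")  # join backslash-continued lines
    problems = []
    for raw in logical.splitlines():
        entry = raw.split("#", 1)[0].strip()
        if not entry or entry.startswith("-"):
            continue
        if "==" not in entry or "--hash=" not in entry:
            problems.append(entry.split()[0])
    return problems


sample = (
    "Django==2.0.4 \\\n"
    "    --hash=sha256:2d8b9eed8815f172a8e898678ae4289a5e9176bc08295676eff4228dd638ea61 \\\n"
    "    --hash=sha256:d81a1652963c81488e709729a80b510394050e312f386037f26b54912a3a10d0\n"
    "pytz==2018.4\n"
)
print(missing_hashes(sample))
```

In this sample, pytz==2018.4 is reported because it has a pin but no hash, which is exactly the situation that would make pip bail out.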

A side note: pipenv

It’s worth pausing here to mention pipenv, which is a tool that aims to take most of the tedium out of this. It’s a wrapper around both pip and Python’s virtual-environment tools, and will automatically create a virtual environment for you if you’re not already in one, will track every package you install along with all their dependencies, and will maintain a record of the exact versions of those packages and their hashes. It also lets you maintain a base list of packages plus extra dev-only dependencies, and integrates checks against lists of packages with known security issues, to warn you when you need to upgrade.

There’s a lot of traction around pipenv, and the Python Packaging User Guide now officially recommends it for managing application dependencies.

However, I will mention two caveats about pipenv:

It does not use requirements files to track packages. Instead it uses its own pair of files (Pipfile to store the “source” information in the form of packages you explicitly installed, and Pipfile.lock to list every package in the resulting dependency tree at the exact installed version, along with hashes). Which means you will end up with your dependencies specified in a way that, currently, only pipenv understands. You can “import” from a requirements file into pipenv’s way of doing things, and “export” back to a requirements file, but you will lose package-hash information when you do that. So if you want hash-checking (and you probably do…), you’ll need to fill that in yourself.
Currently, pipenv bundles copies of many of its own dependencies, including a copy of pip. In the case of pip, this is necessary because pipenv uses internal APIs from pip that aren’t guaranteed-stable (and in fact have changed in more recent versions). This isn’t necessarily bad, but is something I’d like to see remedied before I personally start relying on it too heavily.
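On the first caveat: one way to fill the hashes back in yourself is to read them straight out of Pipfile.lock, which is a JSON file. The sketch below assumes the layout as I understand it, with a top-level "default" section mapping each package name to a "version" string such as "==2.0.4" and a list of "hashes" such as "sha256:…"; the function names are mine, so verify against your own lock file before trusting the output.

```python
import json


def lock_to_requirements(lock_data, section="default"):
    """Render one section of parsed Pipfile.lock data as requirements text."""
    lines = []
    for name, info in sorted(lock_data.get(section, {}).items()):
        version = info.get("version")
        if not version:
            continue  # entries without an exact version are skipped
        entry = f"{name}{version}"
        for digest in info.get("hashes", []):
            entry += f" \\\n    --hash={digest}"
        lines.append(entry)
    return "\n".join(lines)


def requirements_from_lock_file(path="Pipfile.lock", section="default"):
    """Load a lock file from disk and render it with hashes included."""
    with open(path) as handle:
        return lock_to_requirements(json.load(handle), section)
```

The output uses the same backslash-continued --hash format shown earlier, so it can be fed to pip in hash-checking mode.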

Next step: private repository

While most people — including companies I’ve worked for — tend to use the public Python Package Index to get most of their packages, it’s not always possible to put your own packages on PyPI. If you’ve got something that you can’t or aren’t ready to release open-source, for example, but still need to use, it’s convenient to be able to pip install it during deployment.

There are some workarounds for this, but increasingly I’ve seen companies use private package repositories as a solution. The idea is that you still get all your public, open-source packages from PyPI, but your internal-only packages get downloaded and installed from a package index only you can access. There are multiple services and pieces of software which can do this. Gemfury offers private package repositories as a service for several programming languages’ package systems and a couple flavors of Linux package managers. Artifactory offers package repositories as a service, or a self-hosted version, again for a variety of packaging systems, including Python’s. And if you just want Python, devpi is a self-hosted and open-source private Python package repository, while pip2pi can convert a requirements file into a static directory layout containing all the packages, suitable for running behind a web server as a static package index. And I’m sure there are other services and products out there doing substantially similar things.

All of which is very handy. But it also addresses a potential problem: what would you do if PyPI went down?

Several of the private package repository solutions have the ability not just to host your own packages, but also to mirror and cache packages from the public PyPI. Which means that even if PyPI is down or inaccessible, you can still pip install everything you rely on. That’s a pretty nice thing to have (in fact, I run a devpi instance on my laptop, in part so that I can still install and work with things even when I don’t have an internet connection).

Using a private repository that mirrors/caches PyPI also lets you tighten things up even more: most PyPI mirroring solutions let you explicitly decide which packages to mirror, meaning you can — so long as you hit your private repository — be certain your installs will work, and get only and exactly what you expect, every time you pip install.

There are some other useful features in several of the private-repository approaches, like multi-index support; I’ve used this with devpi before to set up a “production” package index along with “test” and “staging” indexes where I was trying out variations of packages I wasn’t yet ready to fully release. I’m personally a big fan of devpi, but most of the services and packages out there seem to be pretty good and have converged on similar feature sets.

There is one risk you need to be aware of when using a private repository, though.

A common trap I see when configuring a private repository — even, sometimes, in the documentation for the software or service providing the private repo — is to use the --extra-index-url argument to pip, or other static configuration to set the private repository as an “extra” index to use. When you do this, and both PyPI and your private package repository have a package of the same name, pip can end up installing the one from PyPI. This can happen accidentally, or maliciously. If you’re being careful about specifying exact versions and hashes you won’t end up actually installing and executing the code from the conflicting public package, but you can make that situation completely impossible. Using only your private repo, and having it mirror PyPI for the public packages you want, is the most straightforward approach to add some additional defense against this.
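In practice that means pointing pip at your private index with --index-url, which replaces the default index, rather than --extra-index-url, which adds to it. Here is a small sketch of building such a command in a deployment script; the function name and the URL are placeholders of my own, not a real service.

```python
def pip_install_command(index_url, requirements="requirements.txt"):
    """Build a pip command that uses one index only and enforces hashes."""
    return [
        "pip", "install",
        "--index-url", index_url,  # replaces PyPI instead of adding to it
        "--require-hashes",
        "-r", requirements,
    ]


# Hypothetical private index URL, for illustration only.
command = pip_install_command("https://pypi.internal.example/simple/")
print(" ".join(command))
```

You could then hand the list to subprocess.run(command, check=True) so that a failed install fails the deployment.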

Recommendations

If the wall of text above seems a bit intimidating, don’t worry — I’ve tried to throw as much useful information as I can into a single blog post, and I know that can end up being a bit dense.

So to wrap up, here’s what I’d recommend as a process for anyone who builds things that depend on Python packages and/or the Python Package Index:

Do use a requirements file to track all of your dependencies, and use either pip freeze from a known-good environment, or pipenv, to put together the list of packages.
Do make sure you specify every package at the exact version you want to use (that is, the == version-comparison operator).
Do use pip with the hash-checking mode, which means including hashes in your requirements file. I know it seems like a pain when you first set it up, but it’s worth it.
Consider using a private package repository, even if it’s just to mirror PyPI. Most of them are easy to set up, configure and use, and save you from potential headaches if something ever does go wrong with PyPI.

Finally, one thing I saved for the very end because it doesn’t really fit anywhere else and is really important:

Do use something that monitors your dependencies and can alert you if there’s a security issue in a package you’re using. You can do this manually using the safety library (and, as mentioned, pipenv can do this for you automatically), or get a nice dashboard with constant updates, and even pull-request status checks, by using a service like pyup.io or requires.io.
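Conceptually, what these tools do is compare your pinned versions against a database of known-vulnerable releases. The sketch below illustrates only that idea; it is not how safety or any of those services is implemented, the function names are mine, and the package names and advisory data in the example are made up.

```python
def parse_pins(text):
    """Map each pinned package name (lowercased) to its exact version."""
    pins = {}
    for raw in text.splitlines():
        tokens = raw.split("#", 1)[0].split()
        if tokens and "==" in tokens[0] and not tokens[0].startswith("-"):
            name, version = tokens[0].split("==", 1)
            pins[name.lower()] = version
    return pins


def flagged_packages(pins, advisories):
    """Return sorted (name, version) pairs whose pinned version is listed as affected."""
    return sorted(
        (name, version)
        for name, version in pins.items()
        if version in advisories.get(name, set())
    )


# Entirely hypothetical packages and advisory data.
pins = parse_pins("examplepkg==1.0.0\notherpkg==2.3.1\n")
advisories = {"examplepkg": {"1.0.0", "1.0.1"}}
print(flagged_packages(pins, advisories))
```

The real tools are worth using precisely because somebody else maintains the advisory data and keeps it current.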

If you only do one of these things, security monitoring is probably the most important one; you don’t want to get caught using an old insecure version of something. As for the rest, do them as you’re able; if you’re starting out a new project, that’s a great opportunity to get things set up nicely from the beginning, but even if you’re on an existing project that’ll take some time to switch over, I think all the recommendations above are worth doing, both for peace of mind and for easier deployments and maintenance going forward.