Boring Python: code quality

December 19, 2022 Django, Python

This is the second in a series of posts I intend to write about how to build, deploy, and manage Python applications in as boring a way as possible. In the first post in the series I gave a definition of what I mean by “boring”, and it’s worth revisiting:

I don’t mean “reliable” or “bug-free” or “no incidents”. While there is some overlap, and some of the things I’ll be recommending can help to reduce bugs, I also want to be clear: there will be bugs. There will be incidents where a feature or maybe an entire service is down. “Boring”, to me, is about the sources of those incidents.

In the first post I talked about dependency management, and laid out an approach to it designed to ensure (as much as possible) it doesn’t become a source of unexpected bugs. Today I want to talk about what’s generally called “code quality” — tools to help you identify potential bugs, and other problems, as early as possible, ideally before they ever get merged into your codebase, let alone deployed to production.

But a quick reminder before we go much further: I’m writing from the perspective of someone who primarily uses Python as a backend language for web applications and other networked services. That’s what I know best, and there will be a strong web-oriented flavor to this post. If you use Python for something else, hopefully there will still be some useful takeaways from this post, but I can’t promise everything here will be relevant to you.

This also is not a “list all the tools and let you make up your mind” post — there are plenty of those out there. This is an opinionated set of things I personally recommend, along with explanations of why I recommend them. If you don’t like my recommendations, there are other ways to achieve a lot of the goals I’ve set out here, and I encourage you to look into them.

Finally, if you’re just interested in the end result, scroll down to the section titled “putting it all together”. It’s a pretty simple setup once everything’s been introduced, but introducing the tools and techniques, and why I recommend them, takes a lot more words.

The basics

Before I mention anything else, I have to start with the absolute basics: version control, tests, continuous integration (CI). If you don’t have these, there is nothing I can recommend that will have a larger impact than adding them. Without version control you don’t have a way to track what’s changed in your code over time; without tests you don’t have a way to know whether your code works; without CI you don’t have a way to automatically find out when your code has been broken by recent changes. All of these are huge.

For version control: probably just use git like everyone else.

For testing: pick either the unittest module in the standard library, or the third-party pytest test framework. I have a preference for unittest, and if you’re doing Django its built-in testing tools are all built around unittest, but pytest has plenty of fans and there are third-party plugins to re-create some of the nice Django testing tools on top of pytest.

For CI: I’ve used a lot of different CI systems over the years, and they all had things I liked and things I disliked. So I won’t recommend any particular CI tool other than to say that it’s better to have any CI than to wait around to find the perfect CI, and so if you’re having trouble deciding it’s probably best to just stop worrying and go with whatever’s built in to your code hosting platform (GitHub Actions, GitLab CI, etc.).

There also are tools in the Python ecosystem that will help you organize and orchestrate automated checks for both local and CI use. Historically, this is the spot where I would have mentioned and recommended tox, but at the moment I’m exploring alternatives to it, such as nox, and as a result I don’t currently have a specific recommendation for such a tool. I’ll update this post if that changes.

On coverage

Test coverage — automatically measuring which, and what percentage, of lines of code in your actual codebase (your application, or library, or whatever it is you’re building) get executed during your test runs — is a controversial topic. The short summary is:

In favor: Coverage measurement points out parts of your codebase that aren’t being tested, and if you know everything in your codebase gets executed during test runs, you should have confidence that the code does what it’s supposed to.
Against: Coverage measurements are too easy to “game” — you can get to 100% coverage without meaningfully testing all or even most of your code, and so it’s yet another example of Goodhart’s Law (“When a measure becomes a target, it ceases to be a good measure”).

Personally, I don’t think 100% coverage as a target is a good idea, but I still think you should be measuring and reporting coverage during your test runs. Instead of treating it as a target, I like to treat it as a warning: I want to know if my coverage suddenly drops, because that’s likely a sign something else has gone wrong in either the main codebase or the test suite.

For Python testing, the coverage module lets you gather and report coverage metrics. There’s also a plugin for pytest.
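One way to treat coverage as a warning rather than a target is to compare each run's total against the previous run's. The sketch below assumes you have already pulled the two percentages out of your coverage reports yourself; the function name and the one-point tolerance are made up for illustration and are not part of the coverage module.

```python
from typing import Optional


def coverage_warning(previous: float, current: float, tolerance: float = 1.0) -> Optional[str]:
    """Return a warning message if coverage fell by more than `tolerance` points."""
    drop = previous - current
    if drop > tolerance:
        return f"coverage dropped {drop:.1f} points ({previous:.1f}% -> {current:.1f}%)"
    # Small fluctuations (or improvements) are not worth flagging.
    return None
```

A CI step could print the message without failing the build, which keeps the number informative without turning it into a target.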

Non-Python-specific tools

Next, there are a few things that aren’t specifically tied to Python or any other programming language, but are still extremely useful to have.

The first is the ignore file for your version control system. This file is called .gitignore if you’re using git, .hgignore if you’re using Mercurial, and other names for other version control tools — look up the right one for the tool you use.

GitHub maintains a good starter .gitignore for Python that you can just copy and use, and it will keep you from accidentally committing a lot of things that you probably didn’t want to commit, like the cache directories used by a lot of common Python tools and processes.

If your project builds a Docker container, also create a .dockerignore file to specify files and directories that should be excluded from the container.

The next tool is EditorConfig — this is set up by creating a file named .editorconfig in the root of your repository, which can be used to tell lots of different popular IDEs/text editors some basics about how to work with your project. For example, you can use the .editorconfig to specify whether you want files indented with spaces or with tabs (on a per-language basis), how much indentation should be used at each level, what newline style to use, and a lot more. I usually start with a copy of Django’s .editorconfig file and remove the bits I don’t need. I strongly recommend doing this, since adding the .editorconfig file to your project will avoid at least some manual configuration of IDEs/editors and get a bit of automatic consistency across everyone who works on your project, even if they don’t use the same IDE/editor you do.

One final language-agnostic tool that’s good to set up is pre-commit, which — if you’re using git as your version-control system — makes it easy to plug in various checks to automatically run each time you try to make a commit, and which can either automatically fix problems for you, or just reject the commit and tell you what went wrong. Several important tools I’ll be recommending here have pre-commit hooks available, but to start with just set up a pre-commit config file and add a few of the built-in hooks. I recommend using at least the following:

check-added-large-files
check-ast
check-byte-order-marker
check-case-conflict
check-docstring-first
check-executables-have-shebangs
check-json
check-merge-conflict
check-toml
check-yaml
debug-statements
detect-aws-credentials
detect-private-key
end-of-file-fixer
requirements-txt-fixer
trailing-whitespace

These will catch/fix a lot of common problems, from simple syntax errors all the way up to accidentally committing sensitive values.

For projects with multiple people working on them via a branching workflow, no-commit-to-branch is also useful — it lets you specify a list of branches that can’t be directly committed to (default set is main and master), so that people don’t accidentally commit locally to the primary branch of the repository instead of their own working branch. This hook regularly saves me from having to undo an accidental local commit to main because I forgot which branch I was on.
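For a sense of what hooks like these do under the hood: pre-commit runs a hook's command with the staged filenames as arguments and treats a non-zero exit status as a failure. Here is a minimal, hypothetical checker in that shape. It is not the code of the real trailing-whitespace hook, and find_trailing_whitespace is a name invented for this sketch.

```python
import sys
from pathlib import Path


def find_trailing_whitespace(path):
    """Return the 1-based numbers of lines in `path` that end in spaces or tabs."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [n for n, line in enumerate(lines, start=1) if line != line.rstrip(" \t")]


def main(filenames):
    """Report offending lines and return the exit status the hook should use."""
    status = 0
    for name in filenames:
        for lineno in find_trailing_whitespace(name):
            print(f"{name}:{lineno}: trailing whitespace")
            status = 1
    return status


# As a standalone script this would end with:
#     raise SystemExit(main(sys.argv[1:]))
```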

Code style and formatting

This may seem like a weird thing to treat as “quality”, but one of the most important factors in code quality, in my experience, is readability. You’re going to need to read your colleagues’ code, and they will need to read yours. You’ll also need to read your own code that you wrote months or years ago. And you should make that job easier through consistent style and formatting.

If you followed the advice in the last section, you’ve already got a head start on consistency of formatting from your .editorconfig and pre-commit hooks handling things like indenting, newline styles, and so on, but now it’s time to get into Python-specific formatting. I recommend using two tools together: Black and isort.

Black can fix up all sorts of Python syntax constructs into a consistent style. And the great thing about it is that it ends arguments about code formatting, because Black deliberately does not provide much ability to configure its style. It just is what it is. Meanwhile, isort will reformat blocks of Python import statements to follow a consistent style and to place them in a consistent order: standard-library imports first, then third-party modules, then your own code’s modules.

For configuration, the only option I recommend setting for Black is the target Python version, which should match the version of Python you intend to deploy on (for libraries that target multiple Python versions it’s more complex, but for an application you intend to deploy on your own servers, you should be using one Python version and targeting it). For isort, I recommend setting the “profile” option to "black" to tell isort to match Black’s preferences. If it has trouble recognizing which modules are part of your codebase and which aren’t, consider setting known_first_party and/or known_third_party to help it out.

In both cases, I recommend putting the configuration in a top-level pyproject.toml file in your repository. In the case of Black this is the only supported configuration file, and most other tools support using pyproject.toml as a centralized configuration file now — there’s only one I’m going to recommend (flake8) that doesn’t.

As far as running Black and isort, here’s a set of recommendations:

Any editor/IDE that can be set up to run Black and isort to reformat code automatically every time a file is saved should be set up to do that.
But just in case someone on your team doesn’t do this, set up the pre-commit hooks for both Black and isort; these will automatically reformat files at commit time.
In CI, Black and isort should both run, but not reformat code; both tools support a mode where they’ll just exit non-zero and print out the things that need reformatting, and that’s what you should use here.
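The CI side of those recommendations might look like the following sketch. The --check flag for Black and the --check-only flag for isort are the tools' non-modifying modes, and --diff makes each print what it would change; the helper names and the default path of "." are assumptions for illustration.

```python
import subprocess
import sys


def formatter_checks(paths=(".",), python=sys.executable):
    """Build the check-only command lines for Black and isort."""
    return [
        [python, "-m", "black", "--check", "--diff", *paths],
        [python, "-m", "isort", "--check-only", "--diff", *paths],
    ]


def run_checks(commands):
    """Run every command (no short-circuiting) and return 1 if any exited non-zero."""
    failed = [cmd for cmd in commands if subprocess.run(cmd).returncode != 0]
    return 1 if failed else 0
```

Running every command before reporting means a single CI run shows both the Black and the isort complaints, rather than stopping at the first.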

This can’t fix all the code readability problems you might run into, but at least it’s a good start.

Linting

Once you’ve got consistently-formatted code, it’s time to start thinking about linting. The two most popular linters in the Python world are flake8 and Pylint, and while there’s some overlap in what they check for, there are also some pretty significant differences.

A massively oversimplified version of the differences between the two would be:

flake8 is generally faster and will raise fewer false positives, but checks/enforces fewer things.
Pylint is generally slower and will have more false positives, but checks/enforces a lot more things.

Two other things to be aware of about Pylint are that it requires everything, including all of your dependencies, to be importable during the run (in order to check for things like correct usage of imported modules/functions/classes), and that if you use a library or framework that does significant metaprogramming you’ll probably need a plugin to help Pylint understand that. For example, if you use Django you’ll pretty much always also use pylint-django.

My advice is to pick and run at least one of flake8 or Pylint. If you choose Pylint and you already have a large existing codebase, turn on its various checks gradually; Pylint’s own documentation recommends and explains how to do this. A nice hybrid option is to run Pylint as part of your CI suite, and flake8 in pre-commit: Pylint is kind of tricky to use from a pre-commit hook and can be slow, while flake8 has an official pre-commit hook and is speedy.

Also, if you use flake8 in any capacity, I recommend including the flake8-bugbear plugin, which adds a bunch of useful checks on top of the normal flake8 suite.
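As one example of the kind of thing flake8-bugbear catches, its B006 check flags mutable default arguments. The functions below are invented for illustration; the first shows the bug, the second the usual fix.

```python
def add_tag_buggy(tag, tags=[]):
    """The default list is created once, at definition time, and shared by every call."""
    tags.append(tag)
    return tags


def add_tag(tag, tags=None):
    """Create a fresh list on each call unless the caller supplies one."""
    if tags is None:
        tags = []
    tags.append(tag)
    return tags
```

Calling add_tag_buggy twice without a second argument returns the same list both times, with both tags in it, which is rarely what the author of such a function intended.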

One more linter I strongly recommend is Bandit, which is focused mainly on potential security issues, and knows how to check for a lot of them. Just make sure to disable check B101 (which by default forbids use of assert) when running it over unit-test code.
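The reason Bandit objects to assert outside of tests is that assert statements are removed when Python runs with the -O flag, so a safety check written as an assert can silently vanish. A small sketch, with hypothetical function names:

```python
def withdraw_unsafe(balance, amount):
    # Under `python -O` this line is stripped and the check disappears.
    assert amount <= balance, "insufficient funds"
    return balance - amount


def withdraw(balance, amount):
    # An explicit check behaves the same with or without -O.
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount
```

In test code, assert is exactly the right tool, which is why B101 is worth disabling there.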

Documentation checks

Most programmers have thankfully come to understand the value of version control and automated testing. Unfortunately, there are many who are still, for whatever reason, skeptical of documentation. I don’t know why; I love documentation and treat it as an essential part of all my projects. I also love that Python has great tooling for it.

At an absolute minimum, you should be making use of Python’s ability to embed documentation alongside code via docstrings. To enforce that, I recommend you use interrogate, which will tell you if you have any modules, classes, methods, or functions that don’t have docstrings.
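To make the idea concrete, here is a rough standard-library sketch of a docstring-presence check. It is not how interrogate is implemented and it ignores everything interrogate lets you configure; missing_docstrings is a name invented for this example.

```python
import ast


def missing_docstrings(source):
    """Return the names of the module, classes, and functions that lack a docstring."""
    tree = ast.parse(source)
    missing = []
    if ast.get_docstring(tree) is None:
        missing.append("<module>")
    for node in ast.walk(tree):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            if ast.get_docstring(node) is None:
                missing.append(node.name)
    return missing
```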

But really you should be writing more documentation than just docstrings (though you still should have docstrings), and you should be using a proper documentation tool. In the Python world that’s Sphinx. If you’re someone who absolutely refuses ever to use anything other than Markdown for writing, I highly recommend you still at least give Sphinx’s native reStructuredText a try. And if you still hate it you can use Sphinx with Markdown documents via plugins.

One huge benefit of Sphinx is its rich annotation and cross-reference support, which lets you refer not just to other parts of your own codebase, but to other Sphinx-documented codebases, including Python itself (the core language and the standard library) and many popular frameworks and packages like Django.

You can also use the Sphinx autodoc plugin to pull in docstrings from your codebase, but keep in mind that your documentation should be more than just an auto-generated API reference.

My recommendation for documentation checks is:

In pre-commit and CI, run interrogate (already mentioned) to enforce the presence of docstrings.
In CI, run a Sphinx build to prove the documentation builds without errors.
In CI, use sphinxcontrib-spelling to spell-check your documentation.
Mark code samples in your documentation with the doctest directive and, in CI, run the Sphinx doctest plugin to ensure those code samples work.

That last one may be a bit controversial, but let me reiterate: this is not saying you should try to embed the full test suite of your codebase into the docstrings. It’s saying that any examples you put in your documentation to help people learn how to use your codebase need to be checked for correctness, and that is what doctest is for.
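The same idea can be tried out with the standard library's doctest module, which checks interactive examples embedded in docstrings. This sketch uses that module directly rather than the Sphinx plugin, and slugify and run_examples are hypothetical names chosen for the example.

```python
import doctest


def slugify(title: str) -> str:
    """Turn a title into a URL slug.

    >>> slugify("Boring Python")
    'boring-python'
    """
    return "-".join(title.lower().split())


def run_examples(func) -> doctest.TestResults:
    """Run the doctest examples in one function's docstring and summarize them."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    for test in finder.find(func, name=func.__name__, globs={func.__name__: func}):
        runner.run(test)
    return runner.summarize(verbose=False)
```

If someone later changes slugify to use underscores, the example in its docstring fails, and the documentation can't quietly drift out of date.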

Packaging checks

If you’re working on something that’s intended to be built into a package and distributed, there are a few more tools that are useful to run.

First, you should install the build package and run python -m build to prove your packaging works; build is a frontend that understands many different package-building backends and can generate the standard Python package formats (.tar.gz and .whl). If python -m build fails, it’s a strong sign something’s wrong with your packaging.

Then I recommend running a few more checks:

check-manifest builds your package, compares its contents to the files tracked by your version control system, and raises an error if the file lists don’t match. This is incredibly useful for catching when you’ve added files in version control but forgotten to update your packaging configuration to match, and can also attempt to fix the package manifest if you tell it to.
check-wheel-contents can alert you to a number of common problems with your package, such as empty packages, accidentally including Python bytecode files, and several other issues.
pyroma will give your package a rating based on presence or absence of key package metadata values like project classifiers/keywords, license, supported Python versions, etc.
The twine check utility of the twine package uploader can warn you if your package’s description won’t render properly on the Python Package Index.
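To illustrate one of those checks: a wheel is a zip archive, so spotting stray bytecode files comes down to inspecting the archive's file list. The sketch below is a standard-library approximation, not the implementation of check-wheel-contents, and the "wheel" in it is a stand-in built in memory.

```python
import io
import zipfile


def bytecode_files(wheel):
    """List compiled-bytecode entries in a wheel (a path or file-like object)."""
    with zipfile.ZipFile(wheel) as archive:
        return [name for name in archive.namelist() if name.endswith((".pyc", ".pyo"))]


def make_fake_wheel(names):
    """Build an in-memory zip of empty files, standing in for a real wheel."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, "")
    buffer.seek(0)
    return buffer
```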

If you’re on GitHub Actions as your CI, consider using Hynek’s build-and-inspect action to apply several of these automatically.

The elep-hint in the room

You may have noticed that none of the sections above talked about type annotations and static type checkers. That’s deliberate, because I’m not going to be recommending that you run a static type checker over your Python code. There are multiple such checkers available right now, in various states of maturity, and you should feel free to use one if you want to.

My own personal approach is to add type annotations to code whenever possible: they’re useful as documentation (but they should not be the only documentation you have!), and both documentation tools and many editors/IDEs will automatically pick up on them. But I don’t run a static type checker like mypy, for several reasons.

First and most importantly, idiomatic Python code tends to be more toward the structural-typed, or perhaps “interface-ly typed”, end of the spectrum. For example, you basically never care whether something is exactly of type list, you care about things like whether you can iterate over it or index into it. Yet the Python type-annotation ecosystem was strongly oriented around nominal typing (i.e., caring that something is an instance/subclass of a particular named type) from the beginning. Things have started to get better there, but support for Protocol (the official name for an interface) didn’t land in the standard library’s typing module until Python 3.8, and until then you either needed to use third-party implementations or, worse, tell the type checker “just trust me when I say this implements the necessary interface”, because there was no way to define what the interface looked like and have the type checker understand it.

Edit: A lot of people are having trouble with the above paragraph and think it’s just about whether or not things like typing.Iterable existed. It isn’t! For much of the early history of Python type checkers, even if you specified something was a “protocol type” like Iterable or Mapping, there was still no way to tell a type checker what that actually meant, and so type checkers could only enforce things like “is a subclass of typing.Iterable” (which is a nominal-typing approach) rather than things like “implements the required methods to actually be iterable”. So, for example, you could write a subclass of Iterable, fail to implement the necessary interface, and the type checker would still pass because it only cared about seeing Iterable somewhere in your parent types. It wasn’t until Python 3.8 and the implementation of PEP 544 that Python’s type-annotation ecosystem gained a standardized way to specify interfaces and have them actually enforced structurally (rather than just nominally) by a type checker. If you’re still having trouble getting the difference: imagine if Java had initially launched with “interface” checking that only looked for an implements clause containing the right name, and not for the corresponding set of required members in the class itself. That’s more or less what Python did, and is what I’m complaining about.
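Here is a small sketch of what structural checking looks like with typing.Protocol on Python 3.8 and later. Closeable and Connection are hypothetical names; the point is that Connection never inherits from Closeable, yet satisfies it by having the right method.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Closeable(Protocol):
    """Anything with a close() method satisfies this interface."""

    def close(self) -> None: ...


class Connection:
    """Deliberately does not inherit from Closeable."""

    def __init__(self) -> None:
        self.open = True

    def close(self) -> None:
        self.open = False


def shut_down(resource: Closeable) -> None:
    """Accept any object that structurally matches Closeable."""
    resource.close()
```

A static checker that implements PEP 544 accepts shut_down(Connection()) because the methods match. The runtime_checkable decorator additionally allows isinstance checks, though those only look for the presence of the method, not its signature.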

And even with progress on things like protocols, I think type hints are still not really all the way there. For example, annotating higher-order functions — which are another crucial part of idiomatic Python, not least in the form of decorators — is still a pretty rough experience, and although there have been proposals to make it better, it’s going to be a while before that gets good enough for me to want to use.

And I already pointed out that I’m primarily a web developer. I rely heavily on libraries and frameworks — particularly ORMs like Django’s ORM, or SQLAlchemy — which do lots of runtime metaprogramming and as a result still don’t have great type-checking stories. There’s work underway, of course, and every so often I check out what progress has been made, but in my opinion the experience just isn’t good enough yet.

So, again, I encourage you to add type annotations to your code — I do! — but once they’re in place I personally use them only as documentation, and do not currently use a static type checker. You should feel free to use a static checker if you want one.

Safely invoking tools

For some of the tools I’ve recommended, and for some situations in which they’re run, you don’t really get to control how they’re invoked. For example, projects that provide their own pre-commit hooks decide how they should be invoked by the hook, and all the code to do that lives in their repository.

But at least in your CI, you do have control over this, and I recommend taking some care in how you do so.

First: although most of the things I’ve recommended do install standalone command-line entry points — like black for the Black code formatter — they also support being run as Python modules, like: python -m black. And I strongly recommend you do this whenever possible. I mentioned this in the last post, and Python core developer Brett Cannon has a good explanation of it for the specific case of pip, but the general idea is: when you’re working with lots of Python installs/virtual environments, it’s a good idea to be as specific as possible in the invocation, to make sure you get the Python you think you’re getting. So, for example, if you want to run something using Python 3.9, python3.9 -m <name of thing> is preferable. And I personally like to extend that to all my CI tools.

Another advantage of this is that you can pass additional command-line flags to Python. One that’s a really good idea for CI — especially on public/open-source projects — is the -I flag. That’s “I” as in “Isolated”, which is what it stands for. This flag removes some implicit directories from the import path and ignores all environment variables that configure/change Python’s behavior, which is useful when you don’t necessarily trust everything you might run.

One example of a problem this will prevent: the current working directory is, normally, implicitly on the import path, which means a malicious pull request might, say, drop a file with the same name as a standard-library module into the directory you invoke your CI tooling from, and that module would then be imported and run by anything that tried to import the standard-library module of the same name. Running Python in isolated mode prevents this and other similar tricks.
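The shadowing problem can be demonstrated with a harmless experiment: drop a file named after a standard-library module into a temporary directory, then ask a child interpreter where it imported that module from, with and without -I. The helper names are made up, and colorsys was picked only because it is a small standard-library module that isn't loaded at startup.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

PROBE = "import colorsys; print(colorsys.__file__)"


def imported_from(workdir, *flags):
    """Run a child interpreter in `workdir` and return the path colorsys loaded from."""
    result = subprocess.run(
        [sys.executable, *flags, "-c", PROBE],
        cwd=workdir, capture_output=True, text=True, check=True,
    )
    return Path(result.stdout.strip()).resolve()


def shadowing_demo():
    """Return (shadowed_normally, shadowed_in_isolated_mode)."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp).resolve()
        # Stand-in for a file a malicious pull request might add.
        (workdir / "colorsys.py").write_text("# not the real colorsys\n")
        normal = imported_from(workdir)
        isolated = imported_from(workdir, "-I")
        return normal.parent == workdir, isolated.parent == workdir
```

Without -I the child picks up the local file; with -I it gets the real standard-library module.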

Finally, at least for your test suite, I recommend adding one more command-line flag to the invocation of Python. Python supports warnings as a way to signal things that aren’t errors, or maybe aren’t yet errors, but that might still be useful to know about. One of the most important is DeprecationWarning, which is raised by Python and many third-party libraries to signal that particular APIs are in the process of being deprecated, and will one day be removed. But by default, Python does not display most deprecation warnings, which means it’s easy to miss them. Since you probably want to know about deprecations sooner rather than later, I recommend using the -W command-line flag to change this behavior. Specifically, passing -Wonce::DeprecationWarning will show you deprecation warnings, but only the first occurrence of each distinct warning, regardless of where in the code it’s raised (so that you don’t get spammed with huge numbers of them if your code is frequently calling a deprecated API).

So if, say, you use pytest to run your tests, my recommendation is not to do this:

pytest # pytest arguments here…

But instead this:

python -Wonce::DeprecationWarning -Im pytest # pytest arguments here…
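To see what the once action does, the same kind of filter can be applied from inside Python with the standard warnings module. In this sketch old_api is a hypothetical deprecated function, and shown_count reports how many warnings were actually emitted across three calls under a given filter action.

```python
import warnings


def old_api():
    """A stand-in for a deprecated function in some library."""
    warnings.warn("old_api() is deprecated", DeprecationWarning, stacklevel=2)


def shown_count(action):
    """Call old_api() three times under `action` and count the emitted warnings."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter(action, DeprecationWarning)
        for _ in range(3):
            old_api()
    return len(caught)
```

With "ignore" nothing is emitted, with "always" every call warns, and with "once" the warning appears a single time, which is the balance the command line above aims for.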

If you’re using an automation tool like tox or nox, there usually is some method of getting the specific Python interpreter for the current environment — in tox the {envpython} substitution, and in nox the session.python attribute (which returns the version number as a string, so you can construct the correct interpreter invocation as f"python{session.python}").
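Putting those pieces together, a small helper can build the full invocation from a version string like the one nox provides. The helper name and its warn_once parameter are assumptions for illustration; only the resulting command line reflects the recommendations above.

```python
def isolated_module_command(python_version, module, *args, warn_once=False):
    """Build argv for running `module` with a specific interpreter in isolated mode."""
    command = [f"python{python_version}"]
    if warn_once:
        command.append("-Wonce::DeprecationWarning")
    command += ["-I", "-m", module, *args]
    return command


# In a noxfile this could be used as (assuming nox's `session` object):
#     session.run(*isolated_module_command(session.python, "pytest", warn_once=True))
```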

Putting it all together

If you just want a summary of all the recommendations, here you go.

Must-have: version control, unit tests, automated test runs/continuous integration. Turn on coverage reports, but use them as a warning (if the coverage level suddenly drops) rather than as a target.

Set up ignore files for both your version-control system and Docker (if you’re using it).

Set up EditorConfig and pre-commit.

Run at least one of flake8 (with flake8-bugbear plugin) or Pylint as a linter, and run the Bandit security linter.

Document your code with Sphinx. Use its autodoc plugin to pull docstrings from your code. Use intersphinx to cross-reference other projects (like Python, or libraries/frameworks you use), sphinxcontrib-spelling to spell-check your documentation, and the Sphinx doctest plugin to check the correctness of any example code in your documentation. Enforce the presence of docstrings in all code with Interrogate.

If you’re building a package to be distributed/installed, run build, twine check, check-manifest, check-wheel-contents, and pyroma.

Add type annotations to code, if for no other reason than to use as documentation. Whether to run a static type checker, and which one, is up to you.

As for where to use each tool:

Have developers on your team configure their editor/IDE (if possible) to automatically run the Black and isort code formatters every time they save a file.

In pre-commit, run:

Black and isort, letting them reformat code
The flake8 linter
The Bandit security linter
Interrogate
The built-in pre-commit hooks listed in the full section above

In CI, invoke tools by explicitly naming the Python interpreter to use and invoking the tool as a module with the -m flag (i.e., python3.10 -m pytest instead of just pytest), putting the interpreter in “isolated” mode with the -I flag, and for your main test suite run with deprecation warnings enabled via -Wonce::DeprecationWarning. And run these:

Black and isort, but don’t let them reformat; instead have them error if they would reformat
At least one of flake8/Pylint
Bandit
Interrogate
Documentation build
Documentation spell-check
Test any example code in the documentation with Sphinx’s doctest plugin
If you’re building a package, run the package checks: python -m build, twine check, check-manifest, check-wheel-contents, pyroma

Until next time

As with last time, that was a lot of words to cover what turns out to be not a particularly complex set of recommendations; it’s just explaining the “why” of everything that takes a while. And hopefully now you have an idea of a “boring” Python code-quality regimen; this won’t prevent every bug or problem you might introduce into your code, but it will help to catch a lot of potential issues.

And as before, even if you don’t adopt the recommendations I’ve given, I’d like to think that seeing them laid out and explained will at least be helpful to you, and that you’ll learn something you can take away and put to use in whatever setup you do choose to adopt.