Python packaging for developers in a hurry
08 January 2023
Python packaging is hard (I will spare you that xkcd). But it can be manageable if you’ve hit your head on the wall enough times to know what to do and what to avoid.
“Wait a minute, is this guy about to tell me my problems are imaginary and point me to yet another magical tool that does it all?” you ask yourself. No, of course not! That would be psychological abuse.
Sadly, this is a long mountainous road that you’ll mostly have to walk by yourself. I hope that my experience gives you a head start in figuring out what is best for you.
Even though the title says “Python packaging”, I will cover four main topics:
- Managing Python environments
- Managing dependencies
- Structuring a Python project
- Packaging and publishing
If you already have a well-rounded workflow for any of those, feel free to skip some sections (you don’t have a lot of time, after all). I know the entire post is long, but I hope that you can just navigate the sections, find what you need, and go back to work.
If you’re reaaaally in a hurry and just want the code, here’s the repo: giovannipcarvalho/sample-python-project
“Is this for me?”
If you’re overwhelmed by the amount of information on these subjects, this article strives to be a comprehensive summary of most things you’ll need to consider. It also demonstrates a tried and tested workflow that may work well for you.
However, if you have the time and patience, then reading the official resources from PyPA (Python Packaging Authority), the relevant PEPs (Python Enhancement Proposals), and at least one of the many guides on managing Python versions will be a more thorough – albeit longer – approach.
Managing Python environments
In this section I assume we agree that your system’s Python is for your system, not for you. If you rely on it for your project, you had better be doing something fully compatible with the way your system works.
“I need no dependencies, just a Python interpreter”
Then fine, you can use your system’s pre-installed Python.
If you need different (perhaps multiple) Python versions than what comes with your system, or if you need to install new dependencies (which might interfere with your system’s own dependencies and break it), you are better off isolating them with virtual environments.
“My system’s version is okay, but I need to add some dependencies”
Use Python’s builtin venv to create virtual environments. On some distros it might not come pre-installed, so you’ll need to install python3-venv to get it.
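On Debian/Ubuntu, for example, that is (the package name may differ on other distros):
# Debian/Ubuntu; adjust the package name for your distro
sudo apt install python3-venv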
Usage:
python -m venv .venv # create the environment
. .venv/bin/activate # activate it
deactivate # deactivate when you're done
On Windows, I recommend you use git-bash (shipped with git-for-windows) if WSL (Windows Subsystem for Linux) is too slow on your machine.
# on git-bash for windows
python -m venv .venv # same command
. .venv/Scripts/activate # slightly different
deactivate # deactivate when you're done
“venv is too slow or too easy to break”
The greatest advantage of Python’s builtin venv is that you don’t need any extra dependencies. If you have problems with it, use virtualenv instead. More often than not you can just install it from your distro’s repository and be done with it.
If not, follow the official instructions. Worst-case scenario, you should be able to materialize it out of thin air using their zipapp, as long as you have Python.
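A sketch of the zipapp route (assuming you’ve downloaded virtualenv.pyz as described in their docs):
python virtualenv.pyz .venv # the zipapp needs no installation, only a Python interpreter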
Once installed, the workflow is pretty much the same:
virtualenv .venv # slightly different than venv
. .venv/bin/activate # same activation command as venv
deactivate # deactivate when you're done
“I’m on Windows. HELP!”, or
“My system does not have the Python version I want – or any Python version at all”, or
“I need compiled dependencies (or non-Python dependencies) which are hard to install”, or
“I want a consistent workflow across Linux, Windows and macOS”
Short answer: conda (actually mamba, using the mambaforge distribution)
Long answer:
If you’re disappointed with my answer, I am sorry, but our paths have diverged and I am no longer a better guide for you than yourself. If you’re not on Windows, or don’t need compiled and other non-Python dependencies, you might still want to check out pyenv or asdf.
Anaconda is an entire Python distribution that comes with MANY packages, including the conda package and virtual environment manager. I am not recommending Anaconda – avoid it entirely if you can. Use conda, or rather mamba, the compatible C++ reimplementation that is faster.
The mambaforge distribution is a lightweight Python distribution that comes with mamba and conda-forge pre-configured as the default repository. Pick the installer from here according to your platform and architecture and follow the instructions for unix-like systems or Windows.
note: While virtualenv does support multiple Python versions, you need to get them installed first; it merely finds and uses them to create new environments, but won’t install them for you.
Without spending too much more time justifying my choice:
- mamba (the package manager) officially supports Linux, Windows and macOS (all of which I need to use)
- the conda-forge repository has multiple Python versions, from Python 2.x (God help you) to the more recent 3.x versions
- it’s the only sane way to install MKL/BLAS-accelerated NumPy and CUDA-accelerated PyTorch/Tensorflow that I know of (Gohlke’s builds are not a thing anymore)
- it allows me to rootlessly install non-Python dependencies, such as gcc and other languages’ toolsets (node, rust, go-lang, etc.)
Here’s how to use it:
mamba create -n myproj # create environment
mamba activate myproj # activate it
# get to work
mamba deactivate # deactivate when you're done
An added bonus is that you can create an environment.yml at the root of your project and simply run mamba env create or mamba env update.
# file: environment.yml
name: myproj # set this to the actual environment name you want to use
channels:
- conda-forge
dependencies:
- python=3.10
I recommend that you install from conda-forge only the packages that you cannot obtain from PyPI (the official Python package index). That is because mamba is usually slower than pip in resolving dependencies, and because pip has better ways to separately declare direct and transitive dependencies (conda-lock is not as good, IMO). We’ll get to why this is important in the next section.
If you are not going to need to upgrade dependencies (a short-term or one-off project), it’s better to have your environment.yml include all your dependencies so that you can reproduce it later, if needed (use conda export for that).
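For example, a minimal snapshot might look like this (standard conda options; pick whichever level of detail you prefer):
# full export, including transitive dependencies and exact builds
conda env export > environment.yml
# or: only the packages you explicitly requested
conda env export --from-history > environment.yml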
If you are going to maintain the project for long enough and expect to upgrade some dependencies, make your environment.yml contain your non-pip dependencies (such as Python itself) and manage your pip-installable dependencies with a better tool.
note: I think venv and virtualenv still have their place in this setup: for short-lived or disposable environments where you just want to make some small tests (e.g. test if a newer version of a library does what you want, without having to upgrade or mess with your current mamba environment for the project), since they will usually be faster than mamba.
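For instance, a throwaway environment to try a newer version of a library (the package name below is just a placeholder):
python -m venv /tmp/try-upgrade # disposable environment
. /tmp/try-upgrade/bin/activate
pip install 'somelib==2.0' # hypothetical newer version you want to test
python -c "import somelib; print(somelib.__version__)"
deactivate
rm -rf /tmp/try-upgrade # throw it away when done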
Troubleshooting your Python environment
Nine out of ten times, you’re using the wrong python, the wrong pip, or the wrong environment altogether.
Take note of the output of the following commands:
which python
which pip
which whichever-other-command-youre-running # e.g. pytest or jupyter
Compare their base paths and identify which virtual environment each one is coming from. Are they different? They shouldn’t be.
Are they the same? Check your $PATH environment variable for anything Python-related.
# use git-bash if on Windows
echo $PATH | tr ':' '\n'
A very common one is Windows 10+’s default Python (somewhere under %APPDATA%/Microsoft/WindowsApps) taking priority over your desired virtual environment. Get rid of it.
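Another quick sanity check is asking the interpreter itself where it lives and which pip it is bound to:
python -c "import sys; print(sys.executable)" # should point inside your virtual environment
python -m pip --version # shows which environment pip is installed in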
Managing dependencies
I assume you value having a consistent way to reproduce your environment. In essence, you want to:
- Not be susceptible to hard-to-track bugs that only happen with some mysterious combination of dependencies
  - i.e. ensure you’re running your application with dependencies you have tested against
- Track your primary dependencies and transitive dependencies (the dependencies of your dependencies)
- Easily and selectively upgrade dependencies, rather than always upgrading all at once
Poetry actually does it all and a bit more (including automatically creating virtual environments for you), but then again, you miss out on packages that are only (or more easily) installable from conda repositories, and in my experience it is very fragile on Windows.
important: If you’re going with Poetry, remember it should live outside your environment – i.e. do not include it in your environment.yml, if you plan to use it along with mamba.
I won’t go into detail about the other disadvantages of Poetry, but I will emphasize that it has way too many dependencies (it currently pulls in a total of 44 dependencies, according to my test in a clean environment).
“What should I use?”
If you’re developing a tool or application, use pip-tools.
If you’re developing a library, use nothing. Let your users decide, and give them as much flexibility as possible to maximize compatibility (don’t pin, unless to exclude some known-to-fail version ranges). In other words, only declare your direct dependencies under install_requires in your setup.cfg and let pip do the rest.
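A minimal sketch of what that looks like for a library (the package names and excluded range below are made up for illustration):
# file: setup.cfg (library)
# names and ranges are illustrative only; keep floors loose, exclude only known-bad releases
[options]
install_requires =
    requests>=2.20
    somelib!=1.4.*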
pip-tools is actually a combination of two tools: pip-compile and pip-sync. It is very lightweight and only pulls in 6 dependencies (tested in a clean environment). It also produces reasonably human-readable requirements.txt files, with pinned dependencies followed by a comment showing their parent package:
asgiref==3.2.3
# via django
django==3.0.3
# via -r requirements.in
This makes identifying why some dependency was installed much easier than with Poetry’s lock file.
Here are the steps:
echo einops >> requirements.in # declares a new dependency
# generate or update requirements.txt with pinned versions
# by default, it pins them to the latest available and compatible versions
pip-compile
# at a later point in time, when you need to selectively upgrade a package
pip-compile --upgrade-package einops
# will upgrade the package to the latest available and compatible version
# sync environment with requirements.txt
pip-sync requirements.txt
To summarize:
- Keep your non-PyPI dependencies in environment.yml
- Keep your PyPI dependencies in a requirements.in (or setup.cfg’s install_requires section)
- Use pip-tools to lock your dependencies and sync your environment
- Check all of them into your version control system (environment.yml, requirements.in and requirements.txt)
I have only scratched the surface of what pip-tools is capable of, and I highly recommend reading their documentation.
Structuring a Python project
A well-structured project is easier to maintain, package and publish/deploy. There are many resources on this subject, but here’s a simple layout that works well for me:
$ tree -F --dirsfirst
./
├── src/
│ └── __init__.py
├── tests/
│ └── __init__.py
├── environment.yml
├── README.md
├── setup.cfg
└── setup.py
I write all my imports as absolute imports:
from src.subpackage.module import something
unless I’m exposing something in an __init__.py for nicer imports:
# file: src/subpkg/__init__.py
from ._internal_subpkg_module import something_else
__all__ = ["something_else"]
# so that I can:
# from src.subpkg import something_else
# instead of:
# from src.subpkg._internal_subpkg_module import something_else
Note that the src folder is the actual package. Rename it to something meaningful if you’re developing a library, because that’s what your users are going to write in their imports, no matter what you say the name attribute is in setup.cfg, setup.py or pyproject.toml (actually, you can change the import name – but I find it more error-prone than just using a proper folder name).
I usually don’t bother doing it for my applications, but do it for my libraries (otherwise all my libraries would be imported with a conflicting import src). That’s because I’d rather write import src.whatever than import some_longer_name.whatever – especially since some_longer_name varies per project.
You can use a meaningful name for both applications and libraries (and probably should, as it’s more easily identifiable by setuptools’ auto-discovery – more on this later).
There are also recommendations to follow a “src layout”, which is basically having a meaningful package name and stuffing it inside a folder named src anyway.
I also don’t bother, but if you need a quick overview, there’s a very good and short video by Anthony Sottile on the subject, so that at least you’re making an informed decision about it.
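For reference, a src-layout version of the earlier tree would look roughly like this (mypkg being whatever meaningful name you pick):
$ tree -F --dirsfirst
./
├── src/
│   └── mypkg/
│       └── __init__.py
├── tests/
│   └── __init__.py
├── environment.yml
├── README.md
├── setup.cfg
└── setup.py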
I also like tests as a separate package, which makes it easier not to accidentally include them when packaging a final source or wheel distribution.
Moving along. We’ve already covered what the contents of environment.yml should look like. But what about setup.py and setup.cfg – and why both of them? And why not pyproject.toml?
To get the questions out of the way:
- First of all, pip-tools supports all of them; you’re free to choose whichever you prefer.
- I like setuptools and it works fine for me. It does not support pyproject.toml yet and I don’t want to use flit, poetry or whatever else supports it (no real reason, just preference), so I use the supported setup.py and setup.cfg.
- setup.cfg is an ini file with plain data, which is easier to parse and manipulate; setup.py is code, and may contain complex logic that is not easy to update programmatically.
  - That said, you’ll still need a dummy setup.py file, because setup.cfg alone is meaningless, unless you’re already using a PEP 517-style build with pyproject.toml.
- If you’re doing anything more complicated (such as compiling non-Python dependencies of your own as part of your package), then you’ll need to stick with just setup.py.
# file: setup.cfg
[metadata]
name = mypkg
version = 0.1.0
[options]
packages = find:
install_requires =
flask
[options.extras_require]
dev =
pytest
coverage
[options.packages.find]
exclude =
tests*
- Declare your dependencies under install_requires
- Declare development-only dependencies under dev
  - You may create additional dependency groups. Just dev is enough for me.
- Let pip-tools do the pinning, unless you want to restrict some known-to-fail versions
You can use find_namespace: instead of find: if you don’t want to add multiple __init__.py files to explicitly turn folders into Python packages. I prefer to make packages explicit.
You may also entirely omit the packages attribute and the [options.packages.find] section if you use a properly-named flat-layout package (not named src) or a src layout (with properly-named packages inside a src folder); setuptools’ auto-discovery feature will handle those for you. Beware that this feature is in beta (at the time of writing) and may be subject to change in the future.
And the dummy setup.py:
# file: setup.py
from setuptools import setup
setup()
Now that you’re up to speed with this declarative way of defining your package, let’s see how to use it in combination with pip-tools:
pip-compile setup.cfg --resolver backtracking -o requirements.txt
pip-compile setup.cfg --resolver backtracking -o requirements-dev.txt --extra dev
This generates two lock files: requirements.txt and requirements-dev.txt (both of which should be checked into version control).
“Wait! What are those new options?”
I just want to be explicit about where to get the dependency list from (setup.cfg), which dependency group to consider (the base dependencies from install_requires, or the dev group) and where to save it (requirements*.txt).
The backtracking resolver is slower, but will resolve in some situations where legacy (the current default) will not. It’s still not the default in pip-tools at the time of writing, but will be eventually.
“But how do I single-source my package version?”
Good thing you asked. I have dabbled with setuptools_scm to extract the package version directly from source control (usually git tags), but eventually settled on just having a single location and source of truth for the package version: setup.cfg. This minimizes the chances of forgetting to update some locations, and also makes it easier to do so automatically with a tool such as bump2version or bumpver (I just do it manually – it’s a single location, after all – but you’re free to try them out and see if they work for you).
Just remember to:
- Only set the version number in setup.cfg
- Use importlib.metadata to fetch the version from the package name (Python 3.8+ only)
  - You’ll need a separate dependency for Python 3.7 and under: importlib-metadata
  - If you decide to support versions below 3.8, conditionally declare importlib-metadata and add safeguards based on the interpreter version before importing one or the other (here’s a reference, and see the sketch after the next code block)
# file: mypkg/__init__.py
import importlib.metadata
__version__ = importlib.metadata.version("mypkg")
# "mypkg" above must be the same as metadata.name in setup.cfg
Packaging and Publishing
This process is different for libraries and applications. Whereas with libraries you want to package source and wheel distributions for publishing in a package index (a public one such as PyPI, or a private one under your control), with applications it depends on what kind of application and where it’s going to run.
My most common use case is packaging stateless Python applications as Docker images to be run in a remote host, exposing some functionality via an HTTP API. Docker images themselves are also usually published to a container registry, but that’s well-covered by better resources and outside the scope of this post.
If you need to package a desktop or mobile Python application, the remainder of this article won’t be of much use to you.
If you’re using version control (you should), remember to tag your versions. For example, in git you can:
# create a git tag for the current version, prefixed by `v`
git tag -a v`python setup.py --version`
This command will open $EDITOR, where you should include a brief title and description of your release. Upon saving, the git tag will be created.
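Tags are not pushed by default, so remember to push them explicitly if your remote (or your CI) needs them:
git push origin --tags # push all local tags
# or push just the one, e.g.: git push origin v0.1.0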
Packaging & publishing Python Libraries
Fortunately, you don’t need a lot here. All you need is build and twine – respectively, to build your source and wheel distributions, and to publish your package. You don’t really need build, but it helps you get around many common mistakes.
pip install --upgrade build twine
Now is also probably a good time to add both build and twine to your extras_require under the dev dependency group. Then it’s just:
# build
python -m build
# generates:
# mypkg-version.tar.gz (source distribution)
# mypkg-version-py3-none-any.whl (wheel distribution)
# publish
python -m twine upload dist/mypkg-version* # upload both sdist and wheel of `version`
# use pypi user:pass or __token__:token
The published name in PyPI will be a normalized version of your name attribute in setup.cfg.
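If you want a dry run before touching the real index, TestPyPI works well for that (the testpypi repository name is pre-configured in twine, per the official packaging tutorial):
python -m twine upload --repository testpypi dist/mypkg-version*
# then install from TestPyPI to check that everything works:
pip install --index-url https://test.pypi.org/simple/ mypkg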
“But wait! I want reproducible builds”
You can get that for wheels by setting SOURCE_DATE_EPOCH:
SOURCE_DATE_EPOCH=0 python -m build
# other sources of non-determinism might affect your build's reproducibility
You can check that the md5sum of the generated wheel does not change.
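A quick way to verify, assuming nothing else about your build is non-deterministic:
SOURCE_DATE_EPOCH=0 python -m build && md5sum dist/*.whl
# build again (ideally from a clean checkout) and compare the checksums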
I am not sure if it’s possible to get reproducible source distributions, but I haven’t looked too hard.
Packaging & publishing Python Applications (Docker)
Here I aim for a reasonably lean image (~150MB for a simple Flask app; not great, not too bad either) that is fast to build and cache-friendly. We want fast iteration times, and not having to rebuild the entire virtual environment layer every time a line of code changes (even if no dependencies were added or removed) is crucial.
This is achieved by separating the virtual environment creation from the copying of the application code. Shipping the application source and building it in place (with pip install --use-pep517) is less ideal than building the wheel in a build stage and then copying it over to the final runtime stage, but in my use case it’s often faster and simpler to do it this way, and for me there are no major drawbacks.
Using a virtual environment inside a Docker container is perhaps over the top, but it provides extra isolation from the base image’s own Python dependencies, and the entire /venv folder can be copied over between stages if you need.
Remember that if your project contains dependencies from conda’s repositories, you’ll need to create a conda environment instead of a regular Python environment made with venv or virtualenv. Using the continuumio/miniconda3 base image will get you rolling much faster than setting it all up by yourself. You might still want to install mamba to improve build times if you have many conda-only dependencies.
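A rough sketch of what that conda-based variant could look like (the image tag, environment name and paths here are assumptions – adjust them to your setup):
# file: Dockerfile.conda -- sketch only
FROM continuumio/miniconda3 as base
COPY environment.yml .
RUN conda install -n base -c conda-forge mamba && \
    mamba env create -f environment.yml && \
    conda clean -afy
# the environment name must match the name: field in environment.yml
ENV PATH=/opt/conda/envs/myproj/bin:$PATH
With that aside, here is the pip-based Dockerfile I use for the common case: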
# file: Dockerfile
# --- base image -----------------------------------------------------------------------
FROM python:3.10-slim-bullseye as base
ENV \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PYTHONFAULTHANDLER=1 \
PYTHONHASHSEED=random \
PIP_DEFAULT_TIMEOUT=100 \
PIP_DISABLE_PIP_VERSION_CHECK=1 \
PIP_NO_CACHE_DIR=1
# venv
RUN python -m venv /venv
ENV PATH=/venv/bin:$PATH
# base dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
RUN pip install build
WORKDIR /src
# --- dev stage ------------------------------------------------------------------------
FROM base as dev
# dev-only dependencies
COPY requirements-dev.txt .
RUN pip install -r requirements-dev.txt
COPY myapp /src/myapp
COPY setup.* /src/
RUN pip install . --use-pep517 --no-deps
# --- runtime stage --------------------------------------------------------------------
FROM base as runtime
COPY myapp /src/myapp
COPY setup.* /src/
RUN pip install . --use-pep517 --no-deps
CMD ["python", "-m", "myapp"]
Which you can test with:
docker build . -t $(basename `pwd`)
docker run --rm -it $(basename `pwd`)
Remember to update the Dockerfile to:
- Use the correct Python version for your project as a base image
- Update the copied paths to reflect your actual package name (rather than myapp)
Also remember to create a .dockerignore to tighten the set of directories considered in the build context, which should also improve build times:
# file: .dockerignore
# vcs
.git/
# python
.venv/
*.pyc
*.egg-info/
.coverage
.mypy_cache/
.pytest_cache/
# other
notebooks/
tmp/
.env
Conclusion
And that’s a wrap. You’ve climbed halfway up the mountain of Python packaging, and learned to create and manage your environments, properly manage your dependencies, and structure your project in a way that facilitates development, packaging and publishing – either as a library or as a containerized application.
I originally intended this post to be a quick read, but it turned out at least twice as long as I had expected. That might be a testament to how complex the current state of Python packaging is – certainly manageable, but it takes a lot of digging over the years. Even though this article is not really a quick read, I hope it serves as a helpful resource with one single, consistent workflow that covers most of what you’ll need.
Thank you for reading. If something I wrote is inaccurate or you have a better alternative, please do let me know!