Python packaging for developers in a hurry
08 January 2023
Python packaging is hard (I will spare you that xkcd). But it can be manageable if you’ve hit your head on the wall enough times to know what to do and what to avoid.
“Wait a minute, is this guy about to tell me my problems are imaginary and point me to yet another magical tool that does it all?” you ask yourself. No, of course not! That would be psychological abuse.
Sadly, this is a long mountainous road that you’ll mostly have to walk by yourself. I hope that my experience gives you a head start in figuring out what is best for you.
Even though the title says “Python packaging”, I will cover four main topics:
- Managing Python environments
- Managing dependencies
- Structuring a Python project
- Packaging and publishing
If you already have a well-rounded workflow for any of those, feel free to skip some sections (you don’t have a lot of time, after all). I know the entire post is long, but I hope that you can just navigate the sections, find what you need, and go back to work.
If you’re reaaaally in a hurry and just want the code, here’s the repo: giovannipcarvalho/sample-python-project
“Is this for me?”
If you’re overwhelmed by the amount of information on these subjects, this article strives to be a comprehensive summary of most things you’ll need to consider. It also demonstrates a tried and tested workflow that may work well for you.
However, if you have the time and patience, then reading the official resources from PyPA (Python Packaging Authority), the relevant PEPs (Python Enhancement Proposals), and at least one of the many guides on managing Python versions will be a more thorough – albeit longer – approach.
Managing Python environments
In this section I assume we agree that your system’s Python is for your system, not for you. If you rely on it for your project, you had better be doing something fully compatible with the way your system works.
“I need no dependencies, just a Python interpreter”
Then fine, you can use your system’s pre-installed Python.
If you need different (perhaps multiple) Python versions than what comes with your system, or if you need to install new dependencies (which might interfere with your system’s own dependencies and break it), you are better off isolating them with virtual environments.
“My system’s version is okay, but I need to add some dependencies”
Use Python’s builtin venv to create virtual environments. On some distros it might not come pre-installed, so you’ll need to install python3-venv to get it.
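On Debian/Ubuntu, for example, that is (the package name may differ on other distros):
# Debian/Ubuntu; adjust the package name for your distro
sudo apt install python3-venv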
Usage:
python -m venv .venv # create the environment
. .venv/bin/activate # activate it
deactivate # deactivate when you're done
On Windows, I recommend you use git-bash (shipped with git-for-windows) if WSL (Windows Subsystem for Linux) is too slow on your machine.
# on git-bash for windows
python -m venv .venv # same command
. .venv/Scripts/activate # slightly different
deactivate # deactivate when you're done
“venv is too slow or too easy to break”
The greatest advantage of Python’s builtin venv is that you don’t need any extra dependencies. If you have problems with it, use virtualenv instead. More often than not you can just install it from your distro’s repository and be done with it.
If not, follow the official instructions. Worst-case scenario, you should be able to materialize it out of thin air using their zipapp, as long as you have Python.
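A sketch of the zipapp route (assuming you’ve downloaded virtualenv.pyz as described in their docs):
python virtualenv.pyz .venv # the zipapp needs no installation, only a Python interpreter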
Once installed, the workflow is pretty much the same:
virtualenv .venv # slightly different than venv
. .venv/bin/activate # same activation command as venv
deactivate # deactivate when you're done
“I’m on Windows. HELP!”, or
“My system does not have the Python version I want – or any Python version at all”, or
“I need compiled dependencies (or non-Python dependencies) which are hard to install”, or
“I want a consistent workflow across Linux, Windows and macOS”
Short answer: conda (actually mamba, using the mambaforge distribution)
Long answer:
If you’re disappointed with my answer, I am sorry, but our paths have diverged and I am no longer a better guide for you than yourself. If you’re not on Windows, or don’t need compiled and other non-Python dependencies, you might still want to check out pyenv or asdf.
Anaconda is an entire Python distribution that comes with MANY packages, including the conda package and virtual environment manager. I am not recommending Anaconda – avoid it entirely if you can. Use conda, or rather mamba, the compatible C++ reimplementation that is faster.
The mambaforge distribution is a lightweight Python distribution that comes with mamba and conda-forge pre-configured as the default repository. Pick the installer from here according to your platform and architecture and follow the instructions for unix-like systems or Windows.
note: While virtualenv does support multiple Python versions, you need to get them installed first; it merely finds and uses them to create new environments, but won’t install them for you.
Without spending too much more time justifying my choice:
- mamba (the package manager) officially supports Linux, Windows and macOS (all of which I need to use)
- the conda-forge repository has multiple Python versions, from Python 2.x (God help you) to the more recent 3.x versions
- it’s the only sane way to install MKL/BLAS-accelerated NumPy and CUDA-accelerated PyTorch/Tensorflow that I know of (Gohlke’s builds are not a thing anymore)
- it allows me to rootlessly install non-Python dependencies, such as gcc and other languages’ toolsets (node, rust, go-lang, etc.)
Here’s how to use it:
mamba create -n myproj # create environment
mamba activate myproj # activate it
# get to work
mamba deactivate # deactivate when you're done
An added bonus is that you can create an environment.yml at the root of your project and simply run mamba env create or mamba env update.
# file: environment.yml
name: myproj # set this to the actual environment name you want to use
channels:
- conda-forge
dependencies:
- python=3.10
I recommend that you install from conda-forge only the packages that you cannot obtain from PyPI (the official Python package index). That is because mamba is usually slower than pip in resolving dependencies, and because pip has better ways to separately declare direct and transitive dependencies (conda-lock is not as good, IMO). We’ll get to why this is important in the next section.
If you are not going to need to upgrade dependencies (a short-term or one-off project), it’s better to have your environment.yml include all your dependencies so that you can reproduce it later, if needed (use conda export for that).
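For example, a minimal snapshot might look like this (standard conda options; pick whichever level of detail you prefer):
# full export, including transitive dependencies and exact builds
conda env export > environment.yml
# or: only the packages you explicitly requested
conda env export --from-history > environment.yml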
If you are going to maintain the project for long enough and expect to upgrade some dependencies, make your environment.yml contain your non-pip dependencies (such as Python itself) and manage your pip-installable dependencies with a better tool.
note: I think venv and virtualenv still have their place in this setup: for short-lived or disposable environments where you just want to make some small tests (e.g. test if a newer version of a library does what you want, without having to upgrade or mess with your current mamba environment for the project), since they will usually be faster than mamba.
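For instance, a throwaway environment to try a newer version of a library (the package name below is just a placeholder):
python -m venv /tmp/try-upgrade # disposable environment
. /tmp/try-upgrade/bin/activate
pip install 'somelib==2.0' # hypothetical newer version you want to test
python -c "import somelib; print(somelib.__version__)"
deactivate
rm -rf /tmp/try-upgrade # throw it away when done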
Troubleshooting your Python environment
Nine out of ten times, you’re using the wrong python, the wrong pip, or the wrong environment altogether.
Take note of the output of the following commands:
which python
which pip
which whichever-other-command-youre-running # e.g. pytest or jupyter
Compare their base paths and identify which virtual environment each one is coming from. Are they different? They shouldn’t be.
Are they the same? Check your $PATH environment variable for anything Python-related.
# use git-bash if on Windows
echo $PATH | tr ':' '\n'
A very common one is Windows 10+’s default Python (somewhere under %APPDATA%/Microsoft/WindowsApps) taking priority over your desired virtual environment. Get rid of it.
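Another quick sanity check is asking the interpreter itself where it lives and which pip it is bound to:
python -c "import sys; print(sys.executable)" # should point inside your virtual environment
python -m pip --version # shows which environment pip is installed in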
Managing dependencies
I assume you value having a consistent way to reproduce your environment. In essence, you want to:
- Not be susceptible to hard-to-track bugs that only happen with some mysterious combination of dependencies
  - i.e. ensure you’re running your application with dependencies you have tested against
- Track your primary dependencies and transitive dependencies (the dependencies of your dependencies)
- Easily and selectively upgrade dependencies, rather than always upgrading all at once
Poetry actually does it all and a bit more (including automatically creating virtual environments for you), but then again, you miss out on packages that are only (or more easily) installable from conda repositories, and in my experience it is very fragile on Windows.
important: If you’re going with Poetry, remember it should live outside your environment – i.e. do not include it in your environment.yml, if you plan to use it along with mamba.
I won’t go into detail about the other disadvantages of Poetry, but I will emphasize that it has way too many dependencies (it currently pulls in a total of 44 dependencies, according to my test in a clean environment).
“What should I use?”
If you’re developing a tool or application, use pip-tools.
If you’re developing a library, use nothing. Let your users decide, and give them as much flexibility as possible to maximize compatibility (don’t pin, unless to exclude some known-to-fail version ranges). In other words, only declare your direct dependencies under install_requires in your setup.cfg and let pip do the rest.
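A minimal sketch of what that looks like for a library (the package names and excluded range below are made up for illustration):
# file: setup.cfg (library)
# names and ranges are illustrative only; keep floors loose, exclude only known-bad releases
[options]
install_requires =
    requests>=2.20
    somelib!=1.4.*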
pip-tools is actually a combination of two tools: pip-compile and pip-sync. It is very lightweight and only pulls in 6 dependencies (tested in a clean environment). It also produces reasonably human-readable requirements.txt files, with pinned dependencies followed by a comment showing their parent package:
asgiref==3.2.3
# via django
django==3.0.3
# via -r requirements.in
This makes identifying why some dependency was installed much easier than with Poetry’s lock file.
Here are the steps:
echo einops >> requirements.in # declares a new dependency
# generate or update requirements.txt with pinned versions
# by default, it pins them to the latest available and compatible versions
pip-compile
# at a later point in time, when you need to selectively upgrade a package
pip-compile --upgrade-package einops
# will upgrade the package to the latest available and compatible version
# sync environment with requirements.txt
pip-sync requirements.txt
To summarize:
- Keep your non-PyPI dependencies in environment.yml
- Keep your PyPI dependencies in a requirements.in (or setup.cfg’s install_requires section)
- Use pip-tools to lock your dependencies and sync your environment
- Check all of them into your version control system (environment.yml, requirements.in and requirements.txt)
I have only scratched the surface of what pip-tools is capable of, and I highly recommend reading their documentation.
Structuring a Python project
A well-structured project is easier to maintain, package and publish/deploy. There are many resources on this subject, but here’s a simple layout that works well for me:
$ tree -F --dirsfirst
./
├── src/
│ └── __init__.py
├── tests/
│ └── __init__.py
├── environment.yml
├── README.md
├── setup.cfg
└── setup.py
I write all my imports as absolute imports:
from src.subpackage.module import something
unless I’m exposing something in an __init__.py for nicer imports:
# file: src/subpkg/__init__.py
from ._internal_subpkg_module import something_else
__all__ = ["something_else"]
# so that I can:
# from src.subpkg import something_else
# instead of:
# from src.subpkg._internal_subpkg_module import something_else
Note that the src folder is the actual package. Rename it to something meaningful if you’re developing a library, because that’s what your users are going to write in their imports, no matter what you say the name attribute is in setup.cfg, setup.py or pyproject.toml (actually, you can change the import name – but I find it more error-prone than just using a proper folder name).
I usually don’t bother doing it for my applications, but do it for my libraries (otherwise all my libraries would be imported with a conflicting import src). That’s because I’d rather write import src.whatever than import some_longer_name.whatever – especially since some_longer_name varies per project.
You can use a meaningful name for both applications and libraries (and probably should, as it’s more easily identifiable by setuptools’ auto-discovery – more on this later).
There are also recommendations to follow a “src layout”, which is basically having a meaningful package name and stuffing it inside a folder named src anyway.
I also don’t bother, but if you need a quick overview, there’s a very good and short video by Anthony Sottile on the subject, so that at least you’re making an informed decision about it.
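For reference, a src-layout version of the earlier tree would look roughly like this (mypkg being whatever meaningful name you pick):
$ tree -F --dirsfirst
./
├── src/
│   └── mypkg/
│       └── __init__.py
├── tests/
│   └── __init__.py
├── environment.yml
├── README.md
├── setup.cfg
└── setup.py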
I also like tests as a separate package, which makes it easier not to accidentally include them when packaging a final source or wheel distribution.
Moving along. We’ve already covered what the contents of environment.yml should look like. But what about setup.py and setup.cfg – and why both of them? And why not pyproject.toml?
To get the questions out of the way:
- First of all, pip-tools supports all of them; you’re free to choose whichever you prefer.
- I like setuptools and it works fine for me. It does not support pyproject.toml yet and I don’t want to use flit, poetry or whatever else supports it (no real reason, just preference), so I use the supported setup.py and setup.cfg.
- setup.cfg is an ini file with plain data, which is easier to parse and manipulate; setup.py is code, and may contain complex logic that is not easy to update programmatically.
  - That said, you’ll still need a dummy setup.py file, because setup.cfg alone is meaningless, unless you’re already using a PEP 517-style build with pyproject.toml.
- If you’re doing anything more complicated (such as compiling non-Python dependencies of your own as part of your package), then you’ll need to stick with just setup.py.
# file: setup.cfg
[metadata]
name = mypkg
version = 0.1.0
[options]
packages = find:
install_requires =
flask
[options.extras_require]
dev =
pytest
coverage
[options.packages.find]
exclude =
tests*
- Declare your dependencies under install_requires
- Declare development-only dependencies under dev
  - You may create additional dependency groups. Just dev is enough for me.
- Let pip-tools do the pinning, unless you want to restrict some known-to-fail versions
You can use find_namespace: instead of find: if you don’t want to add multiple __init__.py files to explicitly turn folders into Python packages. I prefer to make packages explicit.
You may also entirely omit the packages attribute and the [options.packages.find] section if you use a properly-named flat-layout package (not named src) or a src layout (with properly-named packages inside a src folder); setuptools’ auto-discovery feature will handle those for you. Beware that this feature is in beta (at the time of writing) and may be subject to change in the future.
And the dummy setup.py:
# file: setup.py
from setuptools import setup
setup()
Now that you’re up to speed with this declarative way of defining your package, let’s see how to use it in combination with pip-tools:
pip-compile setup.cfg --resolver backtracking -o requirements.txt
pip-compile setup.cfg --resolver backtracking -o requirements-dev.txt --extra dev
This generates two lock files: requirements.txt and requirements-dev.txt (both of which should be checked into version control).
“Wait! What are those new options?”
I just want to be explicit about where to get the dependency list from (setup.cfg), which dependency group to consider (the base dependencies from install_requires, or the dev group) and where to save it (requirements*.txt).
The backtracking resolver is slower, but will resolve in some situations where legacy (the current default) will not. It’s still not the default in pip-tools at the time of writing, but will be eventually.
“But how do I single-source my package version?”
Good thing you asked. I have dabbled with setuptools_scm to extract the package version directly from source control (usually git tags), but eventually settled on just having a single location and source of truth for the package version: setup.cfg. This minimizes the chances of forgetting to update some locations, and also makes it easier to do so automatically with a tool such as bump2version or bumpver (I just do it manually – it’s a single location, after all – but you’re free to try them out and see if they work for you).
Just remember to:
- Only set the version number in setup.cfg
- Use importlib.metadata to fetch the version from the package name (Python 3.8+ only)
  - You’ll need a separate dependency for Python 3.7 and under: importlib-metadata
  - If you decide to support versions below 3.8, conditionally declare importlib-metadata and add safeguards based on the interpreter version before importing one or the other (here’s a reference, and see the sketch after the next code block)
# file: mypkg/__init__.py
import importlib.metadata
__version__ = importlib.metadata.version("mypkg")
# "mypkg" above must be the same as metadata.name in setup.cfg
Packaging and Publishing
This process is different for libraries and applications. Whereas with libraries you want to package source and wheel distributions for publishing in a package index (a public one such as PyPI, or a private one under your control), with applications it depends on what kind of application and where it’s going to run.
My most common use case is packaging stateless Python applications as Docker images to be run in a remote host, exposing some functionality via an HTTP API. Docker images themselves are also usually published to a container registry, but that’s well-covered by better resources and outside the scope of this post.
If you need to package a desktop or mobile Python application, the remainder of this article won’t be of much use to you.
If you’re using version control (you should), remember to tag your versions. For example, in git you can:
# create a git tag for the current version, prefixed by `v`
git tag -a v`python setup.py --version`
This command will open $EDITOR, where you should include a brief title and description of your release. Upon saving, the git tag will be created.
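Tags are not pushed by default, so remember to push them explicitly if your remote (or your CI) needs them:
git push origin --tags # push all local tags
# or push just the one, e.g.: git push origin v0.1.0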
Packaging & publishing Python Libraries
Fortunately, you don’t need a lot here. All you need is build and twine – respectively, to build your source and wheel distributions, and to publish your package. You don’t really need build, but it helps you get around many common mistakes.
pip install --upgrade build twine
Now is also probably a good time to add both build and twine to your extras_require under the dev dependency group. Then it’s just:
# build
python -m build
# generates:
# mypkg-version.tar.gz (source distribution)
# mypkg-version-py3-none-any.whl (wheel distribution)
# publish
python -m twine upload dist/mypkg-version* # upload both sdist and wheel of `version`
# use pypi user:pass or __token__:token
The published name in PyPI will be a normalized version of your name attribute in setup.cfg.
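If you want a dry run before touching the real index, TestPyPI works well for that (the testpypi repository name is pre-configured in twine, per the official packaging tutorial):
python -m twine upload --repository testpypi dist/mypkg-version*
# then install from TestPyPI to check that everything works:
pip install --index-url https://test.pypi.org/simple/ mypkg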
“But wait! I want reproducible builds”
You can get that for wheels by setting SOURCE_DATE_EPOCH:
SOURCE_DATE_EPOCH=0 python -m build
# other sources of non-determinism might affect your build's reproducibility
You can check that the md5sum of the generated wheel does not change.
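A quick way to verify, assuming nothing else about your build is non-deterministic:
SOURCE_DATE_EPOCH=0 python -m build && md5sum dist/*.whl
# build again (ideally from a clean checkout) and compare the checksums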
I am not sure if it’s possible to get reproducible source distributions, but I haven’t looked too hard.
Packaging & publishing Python Applications (Docker)
Here I aim for a reasonably lean image (~150MB for a simple Flask app; not great, not too bad either) that is fast to build and cache-friendly. We want fast iteration times, and not having to rebuild the entire virtual environment layer every time a line of code changes (even if no dependencies were added or removed) is crucial.
This is achieved by separating the virtual environment creation from the copying of the application code. Shipping the application source and building it in place (with pip install --use-pep517) is less ideal than building the wheel in a build stage and then copying it over to the final runtime stage, but in my use case it’s often faster and simpler to do it this way, and for me there are no major drawbacks.
Using a virtual environment inside a Docker container is perhaps over the top, but it provides extra isolation from the base image’s own Python dependencies, and the entire /venv folder can be copied over between stages if you need.
Remember that if your project contains dependencies from conda’s repositories, you’ll need to create a conda environment instead of a regular Python environment made with venv or virtualenv. Using the continuumio/miniconda3 base image will get you rolling much faster than setting it all up by yourself. You might still want to install mamba to improve build times if you have many conda-only dependencies.
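A rough sketch of what that conda-based variant could look like (the image tag, environment name and paths here are assumptions – adjust them to your setup):
# file: Dockerfile.conda -- sketch only
FROM continuumio/miniconda3 as base
COPY environment.yml .
RUN conda install -n base -c conda-forge mamba && \
    mamba env create -f environment.yml && \
    conda clean -afy
# the environment name must match the name: field in environment.yml
ENV PATH=/opt/conda/envs/myproj/bin:$PATH
With that aside, here is the pip-based Dockerfile I use for the common case: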
# file: Dockerfile
# --- base image -----------------------------------------------------------------------
FROM python:3.10-slim-bullseye as base
ENV \
PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PYTHONFAULTHANDLER=1 \
PYTHONHASHSEED=random \
PIP_DEFAULT_TIMEOUT=100 \
PIP_DISABLE_PIP_VERSION_CHECK=1 \
PIP_NO_CACHE_DIR=1
# venv
RUN python -m venv /venv
ENV PATH=/venv/bin:$PATH
# base dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
RUN pip install build
WORKDIR /src
# --- dev stage ------------------------------------------------------------------------
FROM base as dev
# dev-only dependencies
COPY requirements-dev.txt .
RUN pip install -r requirements-dev.txt
COPY myapp /src/myapp
COPY setup.* /src/
RUN pip install . --use-pep517 --no-deps
# --- runtime stage --------------------------------------------------------------------
FROM base as runtime
COPY myapp /src/myapp
COPY setup.* /src/
RUN pip install . --use-pep517 --no-deps
CMD ["python", "-m", "myapp"]
Which you can test with:
docker build . -t $(basename `pwd`)
docker run --rm -it $(basename `pwd`)
Remember to update the Dockerfile to:
- Use the correct Python version for your project as a base image
- Update the copied paths to reflect your actual package name (rather than myapp)
Also remember to create a .dockerignore to tighten the set of directories considered in the build context, which should also improve build times:
# file: .dockerignore
# vcs
.git/
# python
.venv/
*.pyc
*.egg-info/
.coverage
.mypy_cache/
.pytest_cache/
# other
notebooks/
tmp/
.env
Conclusion
And that’s a wrap. You’ve climbed halfway up the mountain of Python packaging, and learned to create and manage your environments, properly manage your dependencies, and structure your project in a way that facilitates development, packaging and publishing – either as a library or as a containerized application.
I originally intended this post to be a quick read, but it turned out at least twice as long as I had expected. That might be a testament to how complex the current state of Python packaging is – certainly manageable, but it takes a lot of digging over the years. Even though this article is not really a quick read, I hope it serves as a helpful resource with one single, consistent workflow that covers most of what you’ll need.
Thank you for reading. If something I wrote is inaccurate or you have a better alternative, please do let me know!