
@peterwwillis
Created December 2, 2020 02:07

Package Management is Inherently Dumb

All packaged software is really just a random person's guess at how to install and run some random piece of software. The package has to declare what packages it depends on, and what it conflicts with.

The only way for a package to have the correct 'depends' and 'conflicts' is for the original software to ship with an explicit map of all its dependencies and conflicts. No software does this, in part because every Linux distribution ships different packages, and thus has different dependencies and conflicts. And so, we have to build packages by hand. A human (who isn't the software developer) has to determine the correct dependencies and conflicts (based on other packages that this human also did not create). Then they need to build the package and test it.

A package manager (dpkg) is a dumb program that does whatever you tell it to do. A package encodes its own dependencies, and the package manager fulfills the requirements as stated, or fails if it's impossible. There's no extra logic to work around or repair things if something goes wrong. dpkg -i foo.deb will either work or it won't, and it may even work when it shouldn't. It's even dumber than that, because it may execute arbitrary commands every time the package is installed or removed, regardless of what dependencies are or aren't installed.
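
To make this concrete, here is a minimal sketch of building and installing a .deb by hand. The package name, the 'bar' dependency, and the script contents are all made up for illustration; the point is that dpkg only checks the declared Depends and runs whatever the maintainer scripts say.

```sh
mkdir -p foo/DEBIAN
cat > foo/DEBIAN/control <<'EOF'
Package: foo
Version: 1.0
Architecture: all
Maintainer: Nobody <nobody@example.com>
Depends: bar (>= 2.0)
Description: demo package
EOF
cat > foo/DEBIAN/postinst <<'EOF'
#!/bin/sh
echo "postinst can run whatever it wants at install time"
EOF
chmod 0755 foo/DEBIAN/postinst

dpkg-deb --build foo foo.deb   # build the package
sudo dpkg -i foo.deb           # either configures foo (running postinst), or
                               # fails on the unmet 'Depends: bar'; it never
                               # goes and fetches 'bar' for you
```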

The package manager frontend (apt) tries to figure out, based on the index of all package dependencies, how to fulfill a general requirement. apt-get install apache2 will try to figure out what packages will fulfill this requirement (and all their requirements) and try to install them. It requires the package repository maintainer (a human) to make sure there are no conflicting dependencies.
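
For example, you can watch the frontend do this resolution (the simulate flag just prints the plan without installing anything):

```sh
apt-cache depends apache2      # the dependencies declared in the repo index
apt-get install -s apache2     # -s / --simulate: the set of packages apt
                               # would add or remove to satisfy the request
sudo apt-get install apache2   # actually resolve the whole tree and install it
```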

In a stable repository, a lot of preprocessing with custom tools is done before adding a new package. Assuming the packages declare their dependencies correctly, and the maintainer prevents conflicts from existing, the frontend should always be able to resolve dependencies properly. The frontend will add and remove whatever packages are necessary to fulfill your immediate ask (apt-get install apache2).

Whether your package manager frontend succeeds or fails, and whether your application runs right or doesn't, is based on whether a human screwed up a package or didn't catch a conflict when adding it to a repo. And there is no solution to all this, because the software didn't ship its explicit dependency map.

With the software we have today, humans have to package it, so packaging will always be buggy.

Aside: Trying to Solve Package Management 15 Years Ago

We didn't always have package manager frontends. I remember when Yum was a weird new experimental wrapper around RPM. Later I ended up maintaining multiple corporate CentOS releases and package repositories. Later still, we acquired another company that had a gigantic custom framework for managing internal software builds + deployments, using Yum/RPM.

They built the packages with custom versioned file directory hierarchies, and would switch symlinks around to load the right dependencies for a given app at run-time. You could install multiple versions of the same library, and multiple versions of the same app, in a single distro, using Yum/RPM. But it couldn't solve every conflict between incompatible versions of apps and libraries. To do that, they needed containers.
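
Roughly, the scheme looked like this. The paths and names below are invented for illustration, not their actual layout:

```sh
# Every library and app version gets its own prefix...
mkdir -p /opt/libfoo/1.2/lib /opt/libfoo/2.0/lib /opt/app1/3.4/deps

# ...and a symlink selects which library version app1 3.4 loads at run time.
ln -sfn /opt/libfoo/1.2 /opt/app1/3.4/deps/libfoo

# The app's launcher resolves libraries through the symlink, so repointing the
# link (or shipping a new one in an RPM) switches the dependency.
LD_LIBRARY_PATH=/opt/app1/3.4/deps/libfoo/lib \
    exec /opt/app1/3.4/bin/app1 "$@"
```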

(Aside to the aside: they also did away with Configuration Management for OSes and app deployments. They used build pipelines to create RPMs of the software, and of the configuration for each environment it would run in. So there were immutable versioned artifacts for every change. Reliable rollbacks, no state drift. Immutable Software + Infrastructure circa 2004. It was complicated and clunky, but resulted in extremely reliable and simple operations, all automated.)

Conflicts are Hard to Solve

Containers were invented for multiple reasons, but their benefit for packaging is their unique filesystem namespace and chroot environment. This solves a dozen different problems, including consolidating dependencies, simplifying the execution environment, wrapping up install-time tasks, providing additional integration instructions, and removing the need for individual packages to have crazy hacks to load different versions of dependencies on the fly - to say nothing of how to execute it all.

The reason package managers can't resolve dependency conflicts is the way dependencies are defined and used by software, both at build-time and run-time. Statically compiling the apps can solve the build-time requirements of some apps, but does not solve the run-time conflicts.

Build-time conflicts

Let's start with a sample app. At build time, app A1(v1) depends on libraries L1(v1) and L2(v1). Now say that after a while, A1 also wants to use L2(v2), but only for some code. It can't, because the functions, structs, etc. of L2(v1) and L2(v2) conflict. There are workarounds, such as using dlopen() to load L2(v2) at runtime and have specific code use (v2)-specific symbols. But for that to solve our packaging problems, you'd have to modify literally every app to do that, for all its code, anywhere a library is used in an app.
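
Here is a tiny sketch of why this bites. 'L2' is a stand-in library I made up; both versions export the same symbol name:

```sh
cat > l2_v1.c <<'EOF'
int l2_do_thing(void) { return 1; }   /* L2(v1) behaviour */
EOF
cat > l2_v2.c <<'EOF'
int l2_do_thing(void) { return 2; }   /* L2(v2) behaviour, same symbol name */
EOF
gcc -shared -fPIC -o libl2v1.so l2_v1.c
gcc -shared -fPIC -o libl2v2.so l2_v2.c

cat > a1.c <<'EOF'
int l2_do_thing(void);
int main(void) { return l2_do_thing(); }   /* which version is this calling? */
EOF
# Linking against both "succeeds", but only one definition of l2_do_thing()
# can win; the app cannot use v1 in one place and v2 in another without
# dlopen()-style tricks at every affected call site.
gcc a1.c -o a1 -L. -ll2v1 -ll2v2 -Wl,-rpath,'$ORIGIN'
./a1; echo "a1 got L2 version $?"   # prints 1: v2 is silently ignored
```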

Static linking allows you to have A1 and A2, each with different versions of L1 and L2, and basically ignore packaging dependencies. But it can't solve A1 wanting to use both L2(v1) and L2(v2). It also doesn't solve dependence on things outside the binary, such as application data on the filesystem that is not compiled into the application.
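
Continuing the made-up example above: static linking bakes one copy of L2 into A1 and removes the install-time dependency, but that one copy is still the only version A1 can use.

```sh
gcc -c l2_v1.c -o l2_v1.o
ar rcs libl2v1.a l2_v1.o                  # static archive of L2(v1)
gcc -static a1.c -o a1_static -L. -ll2v1  # L2(v1) is now inside the binary
ldd ./a1_static                           # "not a dynamic executable": nothing
                                          # for a package manager to resolve,
                                          # but still exactly one L2 version
```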

Run-time conflicts

Say you have SQLite libraries L2(v1) and L2(v2) and keep them in different file paths. You run different apps that use these libraries. App A1 uses L2(v2) to create a database, and app A2 uses L2(v1) to read it. It doesn't work, because A1 and A2 are using conflicting versions of the library. A workaround might be to build A1(v1), A1(v2), A2(v1), A2(v2), and have the user somehow know which combination of apps to run for which functions. But really you just need the apps to use functionally-equivalent dependencies.
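
In shell terms the failure mode looks roughly like this; the paths and binaries are hypothetical:

```sh
# Each app is pointed at its own copy of the library, so they stop agreeing
# on what the shared data on disk looks like.
LD_LIBRARY_PATH=/opt/sqlite/v2/lib ./a1 create shared.db   # A1 writes via L2(v2)
LD_LIBRARY_PATH=/opt/sqlite/v1/lib ./a2 read   shared.db   # A2 reads via L2(v1)
                                                           # and can't parse it
```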

If you change a dependency for A2 and don't also then test A1's functionality, you can end up with either A1 not working, or A1 and A2 no longer working together. Statically-linking A1 and A2 can allow them to continue functioning separately in spite of dependency changes, but it doesn't resolve their mutual interoperability issues.

Containers [Mostly] Solve It

In order to solve both the build and run-time conflicts, you make an enclosed environment for each application. Each environment has all its own dependencies. You run each app in its own environment and connect them at their environmental boundaries. And that's how we get containers: they 'statically compile' both the build and runtime into a single immutable environment with no internal conflicts.
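
As a hedged sketch with Docker (the base image, the libsqlite3-0 package, and the 'a1' binary are just examples): the app and the exact library versions it was tested with get frozen into one image.

```sh
cat > Dockerfile <<'EOF'
FROM debian:bullseye-slim
# The app's dependencies, at the versions this image was built and tested with.
RUN apt-get update && apt-get install -y --no-install-recommends libsqlite3-0 \
    && rm -rf /var/lib/apt/lists/*
COPY a1 /usr/local/bin/a1
ENTRYPOINT ["/usr/local/bin/a1"]
EOF
docker build -t a1:1.0 .
docker run --rm a1:1.0    # runs against the libraries baked into the image,
                          # no matter what is installed on the host
```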

Within a container, we can be sure any changes will be compatible. Outside of the container, we still have to ensure their functional boundaries are compatible. This is still an unsolved problem.

'The cloud' runs on containers because of how complex it is to manage randomly-interacting software dependencies of lots of applications at scale. Without them, any change to one application's dependency might need to be tested on every other application. Statically-compiled apps may still have build- and run-time dependency issues; containers contain more of the dependencies by capturing the whole environment. The dependencies' versions are also captured in the container, making them easier to reason about.

So to recap why I think "The only practical improvement you can make to package management is containers":

  1. All software is released without complete dependency maps, so packaging has to be done by humans, so it ends up sucking.
  2. Package managers cannot solve all the conflicts at hand, but containers (or something like them) mostly can.

The only alternative I can see is to completely reinvent how all the software works by modifying the compilers and linkers and source code, and provide explicit mappings between each version of functions, data models, ABIs, etc. across all software components. In that way, every block of code could be explicitly version-mapped to its dependent block of code/data/etc. But current software does not work like that. And it's really out of scope of mere package management.

@Animeshz

Hey, great POV, including various historical stuff.

What's your take on nix's approach? It does address some of what has been a pain point for such a long time.
