Container Native Package System

A lot of exciting things happened at DockerCon 2015 last week. For me the most exciting was the announcement of the Open Container Project foundation. Not only is it great to see the community come together under one banner, it is also a chance to entertain new ideas in this space.

These are some thoughts on how to improve what we consider a “container image.” I’m looking at both the container format itself and what goes on inside of it. This obviously builds on ideas in other systems and I’ve tried to call those out. These thoughts are still early so I’m hoping to find others of like mind and start a good discussion.

History and Context

The thing I’ve been exploring most recently is the intersection between the container image format and package management. While there has been plenty of attention on base OSs to host the container (CoreOS, RancherOS, Project Atomic, Snappy Ubuntu) and efforts to coordinate a cluster of hosts (Kubernetes, Mesosphere, Docker Swarm) we haven’t paid as much attention as we could to what is going on inside the container.

Docker Images are great. Images are pretty efficient to push and pull and, with new focus on security, it is getting easier and easier to be sure that what you want in the image is actually what you are running.

Dockerfiles are also great. They are a purpose-built makefile analog that is super easy to understand and that builds logically on the layered nature of Docker images. Like most of the Docker project, they are much more approachable than other efforts in this area and solve real customer needs. When constructed appropriately, they allow for an efficient dev flow where many of the time-consuming steps can be reused.

One of the best innovations of Docker is actually a bit of an awesome hack. It leverages the package managers for existing Linux distributions. Reusing the package manager means that users can read any number of guides on how to get software installed and easily translate it into a Dockerfile.

Think of it this way: a typical Linux distribution is 2 things. First is a bunch of stuff to get the box booted. Second is a package manager to install and manage software on that box. Docker images typically only need the second one. The first one is along for the ride even if the user never needs it. There are package managers out there that are cleanly factored from the underlying OS (Homebrew, Nix) but they aren’t typically used in Docker images.

Problems

This all mostly works okay. There is some cruft in the images that can easily be ignored and is “cheap” as the download and storage cost is amortized because of layer sharing for Docker images.

But we can do better.

Problems with the current state of the world:

No package introspection. When the next security issue comes along it is difficult to easily see which images are vulnerable. Furthermore, it is hard to write automated policy to prevent those images from running.
No easy sharing of packages. If 2 images install the same package, the bits for that package are downloaded twice. It isn’t uncommon for users to construct complicated “inheritance” chains to help work around this issue¹.
No surgical package updating. Updating a package requires recreating an image and re-running all downstream actions in the Dockerfile. If users are good about tracking which sources go into which image², it should be possible to just update the package but that is difficult and error prone.
Order dependent image builds. Order matters in a Dockerfile — even when it doesn’t have to. Often times two actions have zero interaction with each other. But Docker has no way of knowing that so must assume that every action depends on all preceding actions.
Package manager cruft. Most well built Dockerfiles have something like:
```
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
    build-essential \
  && rm -rf /var/lib/apt/lists/*
```
This boilerplate helps to minimize the size of the layer on disk, but it is confusing and is likely just cargo-culted by many users.
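The order-dependence problem can be illustrated with a deliberately simplified model of a layered build cache. This is only a sketch, not Docker's actual implementation: the `chained_keys` function and the instruction strings are made up for illustration, and the only assumption is that each step's cache key is derived from its parent's key plus the step itself.

```python
import hashlib


def chained_keys(steps):
    """Return one cache key per step in a simplified layered build.

    Each key hashes the previous key together with the current step,
    so a step's key depends on every step that came before it.
    """
    keys = []
    parent = ""
    for step in steps:
        parent = hashlib.sha256(f"{parent}\n{step}".encode()).hexdigest()
        keys.append(parent)
    return keys


# Two builds that install the same two (independent) tools in a different order.
build_a = chained_keys(["FROM debian:jessie", "RUN install curl", "RUN install git"])
build_b = chained_keys(["FROM debian:jessie", "RUN install git", "RUN install curl"])

# Only the shared first step can be reused; everything after it is rebuilt.
reusable = [a == b for a, b in zip(build_a, build_b)]
print(reusable)  # [True, False, False]
```

In this model the two builds end up with the same files, yet nothing after the first step can be shared between them, because the chain has no way to express that the two installs are independent.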

Ideas for Solutions

While I don’t have a fully formed solution to all of these problems, I do have a bunch of ideas. Imagine that we take the idea of a container image and break it down a little.

The first thing we define is a FileSet. A FileSet is a named, versioned and verified set of files. Google has a system internally called the “Midas Package Manager” (MPM) that does this³. Dinah McNutt gave a great talk on MPM at a 2013 USENIX conference. A further tweak would allow a FileSet to import other FileSets into the file namespace of the host. This allows for a FileSet to have multiple “parents” – unlike the current Docker layered image format.
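As a rough illustration, a FileSet could be modeled along these lines. Everything here is hypothetical: the `FileSet` class, its field names, the mount-point keys and the choice of SHA-256 are assumptions made for the sketch, and none of it is taken from MPM or any existing image format.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FileSet:
    """Hypothetical model: a named, versioned, verifiable set of files."""
    name: str                    # e.g. "xyz.com/openssl" (illustrative)
    version: str
    files: dict[str, bytes]      # relative path -> file contents
    # mount point -> imported FileSet; several entries mean several "parents"
    imports: dict[str, "FileSet"] = field(default_factory=dict)

    def digest(self) -> str:
        """Content hash over name, version, files and imports.

        Paths and mount points are visited in sorted order, so the result
        does not depend on the order in which they were added.
        """
        h = hashlib.sha256()
        h.update(f"{self.name}@{self.version}\n".encode())
        for path in sorted(self.files):
            file_hash = hashlib.sha256(self.files[path]).hexdigest()
            h.update(f"file {path} {file_hash}\n".encode())
        for mount in sorted(self.imports):
            h.update(f"import {mount} {self.imports[mount].digest()}\n".encode())
        return h.hexdigest()

    def verify(self, expected_digest: str) -> bool:
        """Check the contents against a digest obtained from a trusted source."""
        return self.digest() == expected_digest


# Illustrative names, versions and contents only.
openssl = FileSet("xyz.com/openssl", "1.0.2", {"lib/libssl.so": b"ssl"})
zlib = FileSet("xyz.com/zlib", "1.2.8", {"lib/libz.so": b"z"})
app = FileSet(
    "example.com/app", "1", {"bin/app": b"app"},
    imports={"/packages/openssl": openssl, "/packages/zlib": zlib},
)
```

Because a FileSet refers to its imports by digest rather than stacking on a single parent, the same `openssl` FileSet can be imported by any number of other FileSets without being copied.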

Second, we define a Package as a type of FileSet. It would have a standard directory structure and include metadata on other packages required along with simple instructions to “install” the package⁴. Ideally, these packages would be built from verified sources and a verified tool chain to enable true provenance for every bit. This would be optional.

Finally, we would redefine a ContainerImage also as a type of FileSet that has metadata necessary to make it runnable. The definition of this metadata is a big part of what the Docker Image format and the ACI format are.

A ContainerImage that is using this container native package system would define a set of read-only imports (using the FileSet import feature described above) of all required package FileSets. Image construction tools would verify that all package dependencies are satisfied. Furthermore, the install steps would be run to symlink⁵ all of the packages into the appropriate places⁶.
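The two jobs of an image construction tool described above (checking dependencies and working out the symlinks) might look roughly like this. The `Package` fields, the `/packages/<name>/<version>` layout and both function names are assumptions for the sake of the sketch, not part of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Package:
    """Hypothetical package metadata; the field names are illustrative."""
    name: str
    version: str
    requires: tuple[str, ...] = ()   # names of required packages
    # link path inside the container -> path relative to the package root
    links: dict[str, str] = field(default_factory=dict)


def missing_dependencies(packages):
    """Return {package name: [unsatisfied requirements]} for an image."""
    present = {p.name for p in packages}
    missing = {}
    for p in packages:
        unmet = [r for r in p.requires if r not in present]
        if unmet:
            missing[p.name] = unmet
    return missing


def symlink_plan(packages, root="/packages"):
    """Map each link path to its target under the read-only import root.

    Raises ValueError if two packages claim the same link path.
    """
    plan = {}
    for p in packages:
        for link, rel in p.links.items():
            target = f"{root}/{p.name}/{p.version}/{rel}"
            if link in plan:
                raise ValueError(f"{link} claimed by both {plan[link]} and {target}")
            plan[link] = target
    return plan


# Illustrative packages; the version numbers are placeholders.
openssl = Package("xyz.com/openssl", "1.0.2",
                  links={"/lib/libssl.so": "lib/libssl.so"})
apache = Package("xyz.com/apache", "2.4",
                 requires=("xyz.com/openssl",),
                 links={"/bin/httpd": "bin/httpd"})

print(missing_dependencies([apache]))           # {'xyz.com/apache': ['xyz.com/openssl']}
print(missing_dependencies([apache, openssl]))  # {}
```

The plan is pure data, so it could be computed when the image is built or when the container starts; nothing in the sketch touches the filesystem.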

User code could either be packed up as a Package or just inserted directly into the ContainerImage.

Analysis

By creating a container-friendly packaging system and expanding the idea of what a container image is, we can solve most of the issues outlined above.

The list of FileSets imported into, say, /packages would be the list of all package versions that are included in that image.
Individual FileSets could be cached by hosts and easily and safely shared between disparate images.
A package could be updated in a straightforward way. The toolset would have to make sure that all dependencies are satisfied and that the install steps are run as necessary.
Image build tools would list the packages necessary and order wouldn’t matter. Because there are multiple “parents” to an image, order cannot matter.
The package install cruft (archived version of the package) would be handled on the host side similar to images themselves. The only thing the container would see would be the actual files – and they would be symlinked in.
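The introspection point can be sketched in a few lines. The sketch assumes a hypothetical layout in which every package import is mounted at `/packages/<name>/<version>`; the function names, image names and version strings are all illustrative.

```python
def packages_from_imports(import_paths, root="/packages"):
    """Derive {package name: version} from an image's import mount points.

    Assumes a hypothetical layout of <root>/<name>/<version>, where the
    name may itself contain slashes (for example "xyz.com/openssl").
    """
    packages = {}
    prefix = root.rstrip("/") + "/"
    for path in import_paths:
        if not path.startswith(prefix):
            continue  # not a package import
        name, _, version = path[len(prefix):].rpartition("/")
        if name and version:
            packages[name] = version
    return packages


def vulnerable_images(images, name, bad_versions):
    """Return the sorted names of images that import a known-bad version.

    `images` maps an image name to its list of import mount points.
    """
    hits = []
    for image, import_paths in images.items():
        version = packages_from_imports(import_paths).get(name)
        if version in bad_versions:
            hits.append(image)
    return sorted(hits)


images = {
    "example.com/web": ["/packages/xyz.com/apache/2.4.12",
                        "/packages/xyz.com/openssl/1.0.1f"],
    "example.com/api": ["/packages/xyz.com/openssl/1.0.2a"],
}
print(vulnerable_images(images, "xyz.com/openssl", {"1.0.1f"}))  # ['example.com/web']
```

The same lookup could back an automated policy that refuses to start any image returned by `vulnerable_images`.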

There are some missing and underdefined parts to this story.

How are packages created? I’m thinking that we could do that by running a container with build time packages that writes its output files into a specific directory. Files in that directory are then used to create a package. As part of this the inputs into the build container could be included in the package metadata and signed.
What does a package distribution look like? I imagine we’d have curated sets of packages that are known to work well together. For instance, xyz.com could create xyz.com/apache that depends on xyz.com/openssl.
How do users override packages? Perhaps abc.com/openssl could specify that it can be used in place of xyz.com/openssl. Any guarantees by xyz.com would be void but it would be a way to do custom versions and carry patches.
Opportunity: Kernel and capability requirements. Packages could specify their requirements in a way that would be visible to the host. This would provide a more direct requirement chain between the host and the code running in the container.
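One possible shape for the override question is sketched below. It assumes that a replacement package declares what it can stand in for and that the user opts in explicitly; the `PackageInfo` class, the `replaces` field and the `resolve` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackageInfo:
    """Hypothetical catalog entry for a package."""
    name: str
    replaces: frozenset[str] = frozenset()   # packages this one may stand in for


def resolve(required, catalog, overrides):
    """Return the name of the package that should satisfy `required`.

    `overrides` is the user's explicit opt-in: required name -> replacement.
    A replacement is accepted only if it declares that it replaces `required`.
    """
    chosen = overrides.get(required, required)
    if chosen not in catalog:
        raise LookupError(f"{chosen} is not available")
    if chosen != required and required not in catalog[chosen].replaces:
        raise ValueError(f"{chosen} does not declare that it replaces {required}")
    return chosen


catalog = {
    "xyz.com/openssl": PackageInfo("xyz.com/openssl"),
    "abc.com/openssl": PackageInfo("abc.com/openssl",
                                   replaces=frozenset({"xyz.com/openssl"})),
}

# Without an override the original package is used.
print(resolve("xyz.com/openssl", catalog, {}))
# With an explicit override the patched build is substituted.
print(resolve("xyz.com/openssl", catalog, {"xyz.com/openssl": "abc.com/openssl"}))
```

Requiring both the declaration and the opt-in keeps a third-party package from silently replacing a curated one.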

This solution obviously borrows from both Homebrew and Nix. What I think is new here is the idea of expanding the definition of a container image with FileSets and making the system fundamentally decentralized. We also need to ensure that the easy-to-approach spirit of the Dockerfile isn’t lost. If we do this right we can make images much easier to efficiently create, verify, update and manage.

Ping back to me on Twitter (@jbeda) or we can talk over on Hacker News.

(Thanks to Kelsey Hightower for reviewing an earlier version of this post.)

¹ The standard golang image is a great example of this: golang:1.4.2 → buildpack-deps:jessie-scm → buildpack-deps:jessie-curl → debian:jessie. Most of this is done to enable efficient sharing of installed packages.
² Best practice should be to track every single input into your Dockerfile. That means that if you are pushing sources you should know which git commit, for example, those sources come from. My guess is that this is rarely done.
³ Actually, we need our system to be decentralized. MPM, like many package management systems (including Homebrew and Nix), has a single central repository/database of all packages. Whatever is used here must be distributed, probably in a namespace rooted with DNS. Something like Docker Notary would play a role in signing and verifying packages. Something like the Nix archive format (NAR) will help make this more stable and predictable.
⁴ Package install should consist of simply symlinking files into some common directories (/bin, /lib). This would all be done via a declarative manifest. There are probably going to be cases where an “install” is a little bit more complex and a script is necessary. I’d love to see how far we could get before that becomes absolutely necessary. It is also assumed that the package directories themselves are only ever mounted read-only.
⁵ Josh Wood (@joshixisjosh9), via Twitter, points out some issues with using symlinks. An alternative here would be to use bind mounts. But it is unclear how many bind mounts Linux can handle (100 containers with 100 bind mounts each = 10,000 bind mounts), and setting them up requires root privileges.
⁶ There is a choice on when the package install happens. It could happen early as the container is created. Or it could happen late as part of the container start process. I’d prefer late binding as it makes surgical package updating simpler. The directories that store the symlinks could be tmpfs directories to keep this all very speedy.