Container Native Package System

Wed, Jul 1, 2015

A lot of exciting things happened at DockerCon 2015 last week. For me the most exciting was the announcement of the Open Container Project foundation. Not only is it great to see the community come together under one banner, but it is also a chance to entertain new ideas in this space.

These are some thoughts on how to improve what we consider a “container image.” I’m looking at both the container format itself and what goes on inside of it. This obviously builds on ideas in other systems and I’ve tried to call those out. These thoughts are still early so I’m hoping to find others of like mind and start a good discussion.

History and Context

The thing I’ve been exploring most recently is the intersection between the container image format and package management. While there has been plenty of attention on base OSs to host the container (CoreOS, RancherOS, Project Atomic, Snappy Ubuntu) and efforts to coordinate a cluster of hosts (Kubernetes, Mesosphere, Docker Swarm) we haven’t paid as much attention as we could to what is going on inside the container.

Docker Images are great. Images are pretty efficient to push and pull and, with new focus on security, it is getting easier and easier to be sure that what you want in the image is actually what you are running.

Dockerfiles are also great. They are a purpose-built makefile analog that is super easy to understand and that logically builds on the layered nature of Docker images. Like most of the Docker project, they are much more approachable than other efforts in this area and solve real customer needs. When constructed appropriately, they allow for an efficient dev flow where many of the time-consuming steps can be reused.

One of the best innovations of Docker is actually a bit of an awesome hack. It leverages the package managers for existing Linux distributions. Reusing the package manager means that users can read any number of guides on how to get software installed and easily translate it into a Dockerfile.

Think of it this way: a typical Linux distribution is 2 things. First is a bunch of stuff to get the box booted. Second is a package manager to install and manage software on that box. Docker images typically only need the second one. The first one is along for the ride even if the user never needs it. There are package managers out there that are cleanly factored from the underlying OS (Homebrew, Nix) but they aren’t typically used in Docker images.

Problems

This all mostly works okay. There is some cruft in the images that can easily be ignored and is “cheap” as the download and storage cost is amortized because of layer sharing for Docker images.

But we can do better.

Problems with the current state of the world:

  • No package introspection. When the next security issue comes along it is difficult to easily see which images are vulnerable. Furthermore, it is hard to write automated policy to prevent those images from running.
  • No easy sharing of packages. If 2 images install the same package, the bits for that package are downloaded twice. It isn’t uncommon for users to construct complicated “inheritance” chains to help work around this issue1.
  • No surgical package updating. Updating a package requires recreating an image and re-running all downstream actions in the Dockerfile. If users are good about tracking which sources go into which image2, it should be possible to just update the package, but that is difficult and error prone.
  • Order dependent image builds. Order matters in a Dockerfile — even when it doesn’t have to. Oftentimes two actions have zero interaction with each other. But Docker has no way of knowing that, so it must assume that every action depends on everything that came before it.
  • Package manager cruft. Most well built Dockerfiles have something like:

    RUN apt-get update \
      && apt-get install -y --no-install-recommends \
        build-essential \
      && rm -rf /var/lib/apt/lists/*
    

    This helps to minimize the size of the layer on disk. This is confusing boilerplate that is likely just cargo-culted by many users.

Ideas for Solutions

While I don’t have a fully formed solution to all of these problems, I do have a bunch of ideas.

First, imagine that we take the idea of a container image and break it down a little. The first thing we define is a FileSet. A FileSet is a named, versioned and verified set of files. Google has a system internally called the “Midas Package Manager” (MPM) that does this3. Dinah McNutt gave a great talk on MPM at a 2013 USENIX conference. A further tweak would allow a FileSet to import other FileSets into the file namespace of the host. This allows for a FileSet to have multiple “parents” – unlike the current Docker image format.
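
To make “named, versioned and verified” concrete, here is a rough sketch of what a FileSet could look like in a host-side cache. Every path and name here is hypothetical; the point is just that the file tree is content-addressed and the whole thing can be checked before use.

    # Hypothetical host-side FileSet cache layout (names are made up):
    #
    #   /var/lib/filesets/xyz.com/openssl/1.0.2c/
    #     files/...        # the actual file tree for this FileSet
    #     manifest         # one "sha256  path" line per file under files/
    #     manifest.sig     # signature over the manifest (tooling left abstract)
    #
    # Verifying a FileSet is then just a hash walk plus a signature check:
    cd /var/lib/filesets/xyz.com/openssl/1.0.2c/files
    sha256sum --quiet --check ../manifest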

Second, we define a Package as a type of FileSet. It would have a standard directory structure and include metadata on other required packages, along with simple instructions to “install” the package4. Ideally, these packages would be built from verified sources with a verified tool chain. This would enable true provenance for every bit. This would be optional.

Finally, we would redefine a ContainerImage also as a type of FileSet that has metadata necessary to make it runnable. The definition of this metadata is a big part of what the Docker Image format and the ACI format are.

A ContainerImage that is using this container native package system would define a set of read-only imports of all required package FileSets. Image construction tools would verify that all dependencies are satisfied. Furthermore, the install steps would be run to symlink all of the packages into the appropriate places5.
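
As a sketch of what that could mean in practice, the install step at container start might be little more than symlinking the imported, read-only packages into the usual locations. The paths and package layout here are made up for illustration:

    # Packages are imported read-only under /packages; "install" is just symlinks.
    # Hypothetical layout: /packages/<publisher>/<package>-<version>/{bin,lib}
    for pkg in /packages/*/*; do
        for f in "$pkg"/bin/* "$pkg"/lib/*; do
            [ -e "$f" ] || continue
            case "$f" in
                */bin/*) ln -sf "$f" /bin/ ;;
                */lib/*) ln -sf "$f" /lib/ ;;
            esac
        done
    done

Doing this late (at container start) into tmpfs-backed directories, as suggested in the footnotes, is what would keep surgical package updates cheap.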

User code could either be packed up as a Package or just inserted directly into the ContainerImage.

Analysis

By creating a container-friendly packaging system and expanding the idea of what a container image is, we can solve most of the issues outlined above.

  • The list of FileSets imported into, say, /packages would be the list of all package versions that are included in that image (see the introspection sketch after this list).
  • Individual FileSets could be cached by hosts and easily and safely shared between disparate images.
  • A package could be updated in a straightforward way. The toolset would have to make sure that all dependencies are satisfied and that the install steps are run as necessary.
  • Image build tools would list the packages necessary and order wouldn’t matter. Because there are multiple “parents” to an image, order cannot matter.
  • The package install cruft (archived version of the package) would be handled on the host side similar to images themselves. The only thing the container would see would be the actual files – and they would be symlinked in.
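
To make the first point concrete: answering “which images contain a vulnerable openssl?” could become a simple walk over host-side metadata instead of a dig through opaque image layers. The layout below (per-image imports files under /var/lib/images) is made up for illustration, continuing the hypothetical cache from the earlier sketch.

    # Which locally cached images import a vulnerable openssl package?
    # /var/lib/images/<image>/imports is a hypothetical per-image list of
    # imported FileSets, one "name/version" per line.
    for img in /var/lib/images/*; do
        if grep -q '^xyz\.com/openssl/1\.0\.2[a-c]$' "$img/imports" 2>/dev/null; then
            echo "vulnerable: $img"
        fi
    done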

There are some missing and underdefined parts to this story.

  • How are packages created? I’m thinking that we could do that by running a container with build-time packages that produces output files into a specific directory. Files in that directory are then used to create a package. As part of this, the inputs into the build container could be included in the package metadata and signed. (A sketch of this follows the list.)
  • What does a package distribution look like? I imagine we’d have curated sets of packages that are known to work well together. For instance, xyz.com could create xyz.com/apache that depends on xyz.com/openssl.
  • How do users override packages? Perhaps abc.com/openssl could specify that it can be used in place of xyz.com/openssl. Any guarantees by xyz.com would be void but it would be a way to do custom versions and carry patches.
  • Opportunity: Kernel and capability requirements. Packages could specify their requirements in a way that would be visible to the host. This would provide a more direct requirement chain between the host and the code running in the container.
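
To sketch the first question above: a build could run in a throwaway container whose only output is a directory that then gets hashed into a package. The image name, the /out convention and the build commands are all invented for illustration; signing is left out.

    # Run a throwaway build container; whatever it writes to /out becomes the package.
    docker run --rm \
        -v "$PWD/src:/src:ro" \
        -v "$PWD/out:/out" \
        xyz.com/build-essential \
        sh -c 'cd /src && ./configure --prefix=/out && make && make install'

    # Hash the output into a manifest so the package is verifiable.
    # Recording and signing the build inputs (sources, build image) is left out here.
    (cd out && find . -type f -exec sha256sum {} + > ../manifest)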

This solution obviously borrows from both Homebrew and Nix. What I think is new here is the idea of expanding the definition of a container image with FileSets and making it fundamentally decentralized. We also need to ensure that the easy-to-approach spirit of the Dockerfile isn’t lost. If we do this right we can make images much easier to efficiently create, verify, update and manage.

Ping back to me on Twitter (@jbeda) or we can talk over on Hacker News.

(Thanks to Kelsey Hightower for reviewing an earlier version of this post.)


  1. The standard golang image is a great example of this: golang:1.4.2 builds on buildpack-deps:jessie-scm, which builds on buildpack-deps:jessie-curl, which builds on debian:jessie. Most of this is done to enable efficient sharing of installed packages.
  2. Best practice should be to track every single input into your Dockerfile. That means that if you are pushing sources you should know which git commit, for example, those sources come from. My guess is that this is rarely done.
  3. Actually, we need our system to be decentralized. MPM, like many package management systems (including Homebrew and Nix), has a single central repository/database of all packages. Whatever is used here must be distributed — probably in a namespace rooted with DNS. Something like Docker Notary would play a role in signing and verifying packages. Something like the Nix archive format (NAR) will help make this more stable and predictable.
  4. Package install should consist of simply symlinking files into some common directories (/bin, /lib). This would all be done via a declarative manifest. There are probably going to be cases where an “install” is a little bit more complex and a script is necessary. I’d love to see how far we could get before that becomes absolutely necessary. It is also assumed that the package directories themselves are only ever mounted read-only.
  5. There is a choice of when the package install happens. It could happen early, as the container is created, or late, as part of the container start process. I’d prefer late binding as it makes surgical package updating simpler. The directories that store the symlinks could be tmpfs directories to keep this all very speedy.

A New Beginning...

Wed, Jun 10, 2015

Welcome to the new 80%!

I’ve rewritten my ancient blog on top of Hugo using Bootstrap and Google Fonts. The whole thing is hosted on GitHub if you want to check out the source.

For headlines, I’m using Economica. I picked it because it is narrow and I tend to write long headlines. The body text is Lora. I like a nice serif font for the body text and it has a bit of style while still being very readable. Code is in Roboto Mono, and the logo and navigation are in Coda.

I’m not in love with the color theme I have now (dark gray background with red and blue highlights) but it’ll do for now. I played with some CSS gradients to show a preview on the main index page. I’m still not sure – it may be a little too cute.

I’ve done the work of importing my old blog posts and making Hugo create compatible URLs. My original system used XML files for each day so I wrote a short Go program to convert from the XML to something Hugo can consume (markdown files with a JSON header). I had to introduce a verbatim HTML shortcode to make Hugo pass through the HTML from the old blog directly.

The other tricky thing was to generate a page per day instead of a page per post for these old posts. Rolling things up on a daily basis was the way things were done back when I started the blog in January 2003. Doing this daily roll up used the taxonomy feature of Hugo where the taxonomy was “archive” and the term was the day. This was easy to generate from the Go conversion program.

I’m hosting this on GCE under Docker and NGINX. I’m going to write up a post on how I’m doing that and automatically syncing it to the git repo on each submit.
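
The basic shape of the serving side is just the stock nginx image with the generated site mounted in read-only; a minimal sketch (the host path and container name here are illustrative):

    # Serve the Hugo-generated public/ directory with the stock nginx image.
    docker run -d --name blog -p 80:80 \
        -v /srv/blog/public:/usr/share/nginx/html:ro \
        nginx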

Now that I have it all set up, we’ll see if I actually have anything to say…

HOWTO: Installing Highpoint Rocketraid 222x on Ubuntu Dapper (6.06 LTS)

Tue, Sep 18, 2007

I haven't updated my blog in forever and I'm probably going to abandon my homebrew static client generated site for something like Mephisto at some point but I haven't had time to make the transition.

In the meantime, I'd like to save people some pain and document the steps I go through to upgrade my Highpoint RocketRaid 2220 on Linux.  I installed the driver a while ago and don't remember the exact steps for that, so this is just what I do to upgrade.  I wrote a little shell script:

#! /bin/sh -v

# Update this version every time you upgrade.
VER=2.6.15-29-686

# To update the highpoint driver:
# 1) Download latest highpoint driver from: 
#
# http://www.highpoint-tech.com/BIOS_Driver/rr222x/Linux/
#
# 2) Patch driver by changing wrong kernel #ifdefs in osm/linux/os_linux.c  
#   KERNEL_VERSION(2,6,15) -> KERNEL_VERSION(2,6,16)

# Disable sata_mv.ko by moving it to a new directory.  This driver
# conflicts with highpoint driver.  I don't know if this is the
# "right" way to do this, but it works.
sudo mkdir -p /lib/modules/${VER}.disabled
sudo mv /lib/modules/${VER}/kernel/drivers/scsi/sata_mv.ko /lib/modules/${VER}.disabled/

# Make sure kernel headers are installed
sudo apt-get install linux-headers-${VER}

# Make new hpt driver:
cd ~jbeda/sources/rr222x-linux-src-1.07/product/rr2220/linux/
make KERNELDIR=/lib/modules/${VER}/build
sudo make install KERNELDIR=/lib/modules/${VER}/build

# make a new ramfs
# mkinitramfs -o /boot/initrd.img-2.6.15-27-686 /lib/modules/2.6.15-27-686
sudo dpkg-reconfigure linux-image-${VER}

Good luck. I hope this helps people out there that are stuck with this thing. I'm still looking for a good cheap solution to host lots of SATA drives on Linux.  Port multipliers are out there but aren't as cheap as they should be.  The driver situation is pretty dire and there aren't that many non-RAID (fake or otherwise) cards with more than 4 ports out there.  I haven't tried any SAS cards though -- perhaps the situation there is better.  I'm also running an LSI MegaRAID SATA 300-8XLP with the megaraid driver.  It wasn't as cheap but at least it works with a true open driver.

"Avalon marks the end of the American Dream"

Thu, Aug 3, 2006

Miguel de Icaza says "Avalon marks the end of the American Dream."  He also compares it to J2EE -- apparently implying that it is overly complex and overarchitected.

Ouch.

While I wouldn't put it that way, I can't disagree.  I left Microsoft almost two years ago and Avalon still hasn't shipped.  A 5+ year ship cycle for a project can't be seen as anything but a sign that something is horribly wrong.  When I was on Avalon we kept talking about building an API for the next 10 years.  Apparently, when Avalon ships there will be 5 years left on that clock.

I take partial responsibility for this.  When we were first starting Avalon, I was all about "Go big or go home" and "We should build something only Microsoft can build."  In retrospect, the project and the company might have been better served by starting with a much smaller team, aiming lower to start and shipping 5 times over those 5 years.  Version 1 might not have been that impressive, but relentless improvement would have built something better factored, simpler, and more in tune with what users actually need.

I named this blog "eightypercent" in honor of the 80% rule.  It just so happens that there are lots of 80% rules to apply.  In this case, a simpler system that only solved 80% of the problem would have been good enough and would have shipped multiple times already.

It looks like the WPF/E project is an effort to strip Avalon down to something much more approachable.  Cross platform, no full CLR, lower memory footprint -- sounds a lot like Flash/Flex.  I know some of the guys working on the project and I have high hopes that it will be something interesting.  The only question, when will it ship?

Seattle's Homeless Alcoholics on NPR

Wed, Jul 19, 2006

Coming in this morning, I heard a segment on NPR covering a unique program that King County is running to provide rooms for homeless alcoholics in Seattle.  The unique and controversial part of this program is that the residents can continue to drink.

My wife, Rachel, has first hand experience with this problem from her work at the Harborview ER.  Some of these "frequent fliers" are indeed part of the community at the ER. In fact, one of the patients that Rachel had lots of interactions with (ever since she was a med student!) recently died and it really shook her up.  Anecdotally, Rachel has seen this program provide a positive impact.

In any case, this novel program seems to me to be a unique way to approach a very difficult problem.  It reduces the cost to the taxpayers and provides a safe place for these individuals.  Obviously we would all like to see these problems solved, but, failing that, at least the county is trying to manage it.

1 to 30

Fri, Jul 14, 2006

Check this out.  Get to 30.  Your day is now shot :)

(I just finished.)

Custom Weighted Vests

Tue, Jul 4, 2006

Happy fourth!

I just wanted to post a note bragging about my sister, Jill.  The Chicago Tribune just ran an article about her and her daughter Ellie.  Ellie has what they are calling "Sensory Integration Disorder."  She is basically really hyper and needs to jump, spin, rock, swing, etc.  I think I probably had a touch of something similar when I was small.  One of the things that has helped Ellie is a weighted vest.  Jill couldn't find any she and Ellie liked and so made one herself.  It turned out so well, she decided to start making them for others.

I've helped her get a web site up and running to show off what she has done.  Check it out at www.customweightedvests.com. You can also check out some other custom sewing she does at www.stitchessosweet.com

Link-Backup v 0.6

Mon, Jun 19, 2006

Scott Ludwig and I have released a new version of his python based backup script, Link-Backup.  This new version ignores broken symlinks and has an option to ensure that only one backup at a time is going on.

Also with this release is a cgi script (viewlb.cgi) for exploring the backups.  This makes it easy to keep tabs on what is going on.  Scott wrote this script a while ago and I updated and improved it.

Details here.

Annie loves flowers

Sat, Jun 3, 2006

I haven't been posting anything lately because I've been so busy with work and Annie.  It is funny how life gets in the way.

In any case, she really loves flowers.  We try and go on a walk in the Washington Park Arboretum every day.  She loves to hold and pick little flowers.  Here is a quick photo I snapped of her today:

Speaking of the Arboretum, almost every time I go there, I see the official bird of Seattle, the Great Blue Heron.  Here is a snap of one of those guys from a couple of weeks ago.  He has just caught himself some dinner.

Backing up data

Sat, Apr 15, 2006

Between photos of Annie, other photography and music, I have quite a bit of data.  Probably over 300GB now and growing fast.

Backing up this much data is a challenge.  I've set some requirements for what I needed in a backup system:

  • Automatic.  A backup system is useless if you don't actually use it.  I know that if I'm required to do anything manually it probably won't get done.
  • Off site.  There are too many things that can happen if you have an on site backup.  Your house can burn down, you can get robbed, etc.
  • Virus resistant. If somehow a virus were to try and wipe everything out, I'd like my data to survive.
  • Versioned.  In case I delete something and don't realize right away, I'd like to keep versions of my data.
  • Big. I have lots of data and I don't imagine that I'm going to stop accumulating it.  I'd like to plan for 800GB-1TB.
  • On line.  I want to be able to get to my backup without mounting/moving anything.

This is a pretty tall order.  Here are some things that I considered but decided not to do:

  • RAID. Mirroring is pretty secure in the face of hardware failure.  RAID 5 sounds great, but the stories I hear say that if one drive goes there is a serious chance another will go during the rebuild.  Simply doing RAID also fails to satisfy many of the requirements above.
  • DVD.  I started backing some stuff up on DVDs.  I got read errors a month later.  Too manual also.
  • External Harddrives. I thought about having a couple of really big external drives and rotating them.  This solves a lot of the problems above but it is pretty manual.  A friend of mine is going to do this and keep a copy at the office.  He said he planned to backup/move every couple of months.  That is just too long for me.
  • NT Volume Shadow Copy. This is a cool technology that can keep snapshots through time on a drive.  It looks like the Linux Volume Manager (LVM) can do some similar things.  This isn't really a backup solution as much as a versioning solution.  This plus RAID 0 is probably pretty good except it isn't off site.

I ended up going with a more brute force solution.  I bought and set up two Linux servers.  The first is a home server.  The second is in a datacenter in downtown Seattle.

The home Linux server is a dual core Pentium running Ubuntu 5.10 server.  I have a HW/SW RAID card driving 5 250GB drives in RAID 5.  It serves a data share over Samba.  I back up that share to another directory every night.  That archive directory is shared out via Samba also, but as read-only.  This makes the situation fairly virus proof.  Since I have two copies of the data (one that is r/w and the other that is a backup) I might have to add more drives in the future.  I got a big honking case so I'll have room.  I'm running slimserver on this among other things.  I also use it to do long running batch enblend stitching jobs for panoramas.

The data center machine is a 1U Tyan server with 4x320GB drives in Linux SW RAID 5.  It is also running Ubuntu 5.10 server.  (Getting Ubuntu installed on a RAID5 drive array was a challenge.  I don't remember the exact steps I took or I would document them here.)  The cost of doing this can be high unless you have a friend that can get you hooked up.  Even if you can't, it might be worthwhile.  I'm backing up to this every night over my cable modem.  The cable modem upload speed is okay since I upgraded to Comcast's higher level of service.  I now have ~768kbps up.

There are a couple of choices for software for doing the backups.  The key is that it has to be bandwidth smart (only send diffs), handle versioning gracefully and be able to do partial copies based on a timer.  The last requirement is so that if I have a ton of data to upload it can go over a couple of nights.  The most obvious candidate for this is rsync.  rsync is an amazing tool for these types of things.  My MS friends think that robocopy is cool -- it can't hold a candle to rsync.  It generally operates over ssh, so it is also secure.  It can also build version trees where unchanged files are hardlinked to previous versions.  The only things it can't do are handle files that have moved but haven't changed, and stop itself after a certain number of minutes.
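
For reference, the hardlinked version trees look roughly like this with plain rsync (the host name and paths here are illustrative):

    # Nightly push over ssh: anything unchanged since the previous snapshot is
    # hardlinked into today's snapshot instead of being copied or re-sent.
    rsync -az -e ssh --delete \
        --link-dest=../2006-04-14 \
        /data/ \
        backup@colo.example.com:/backups/2006-04-15/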

To solve this problem, Scott Ludwig (a coworker of mine who has a similar setup) developed a python script that does much of what rsync does but solves these last two problems.  It is called "link backup" and you can get it here.  Every night starting at 11:30pm, I run this backup script to back up my working directory to the backup directory on my home server.  This is usually pretty quick.  I then back up the latest snapshot of that backup set to the server in the datacenter.  This can take a little longer but at least I don't have to think about it.
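
The schedule itself is just cron plus a tiny script; here is a sketch of the two stages (the script name, the link-backup invocation and all paths and hosts are made up for illustration):

    # crontab entry: kick off the backup at 11:30pm every night
    #   30 23 * * * /home/jbeda/bin/nightly-backup.sh
    #
    # nightly-backup.sh, stage 1: versioned local backup of the working share
    /home/jbeda/bin/link-backup /data /backup
    # stage 2: push the newest snapshot off site to the datacenter box
    rsync -az -e ssh /backup/latest/ backup@colo.example.com:/backups/latest/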

Since I implemented this, Amazon's S3 has come on the scene.  While it might be interesting to back up using S3, I'm not sure how the economics work out.  At my current level (300GB) and Amazon's pricing model ($0.15 per GB per month) I would be paying $45 per month.  If I grow to 600GB, I'm up to $90 per month.  Bandwidth is extra, but I don't use much of that.  It should be easy to find colocation hosting for that amount of coin.  There is the one time cost of the hardware and the work of keeping it up to date also (I think my server came in around $1600 but I probably overbought).  The advantage is that you can run and do other things with your own server in a datacenter.

[April 16, 2006: Edited to fix Ubuntu version number.]