80%

Another Leap: Heptio

Wed, Nov 16, 2016

As I write this it is Monday evening and I’m actually in an office. Craig McLuckie and I have been hanging around the Madrona Venture Group offices here in Seattle. We started a new company together named Heptio.

We are announcing that we have taken Series A funding from Accel and Madrona. I want to thank Ping and everyone else at Accel for the support through the EIR journey over the past year. It has been an amazing opportunity to meet a ton of interesting people and explore this space in depth.

While Heptio is new, the goals and vision for what we are doing are part of a longer arc. For a long time now both Craig and I have been focused on bringing new experiences and efficiencies to the wider IT world. We did this through our work together building and launching Google Compute Engine (GCE) and then through our work creating Kubernetes. Heptio is the next chapter.

When I set out as an EIR over a year ago I had three things I wanted to line up before I started a company. Those were (1) preparing for the marathon, (2) finding the right co-founder and (3) have a compelling vision. I think we are in good shape on all three of these points.

First, I’ve taken the time over the last year and half (Google COBRA runs out this month) to take care of myself and recharge. I’ve spent lots of quality time with my kids and tried to pull my weight as my wife, Rachel, builds a panel at her small clinic here in Seattle.

On the co-founder front, I’m lucky to have worked with Craig for years. While I did a bit of “cofounder dating” over the last year and met a lot of great people, I’m most comfortable doing this with someone that I know I work well with. There will be ups and downs ahead and I know that Craig and I will work well together through it all.

The third point was vision. It has been incredibly exciting to see the Kubernetes community grow over the last several years. The core project itself is proving itself out across many dimensions. In addition, there is the start of a vibrant ecosystem of people building products, projects and services on top of and around Kubernetes. A recurring point of joy for me is to see new folks “get it” as they understand the unlimited potential of the world we are building together. The vision for Heptio is to accelerate and amplify by bringing Kubernetes to everyone.

Additionally, back in May 2015 I tweeted this:

I've decided my next job will be someplace that pushes me toward healthier interactions. Empathy and diversity.
— Joe Beda (@jbeda) May 12, 2015

Right now Heptio is, essentially, two middle aged white dudes. We need to change that. Both Craig and I are committed to building a company that welcomes everyone. I can’t say we have cracked the code here (has anyone?) but we will do our best to build the company we want vs accepting the industry default. We will also welcome ideas, input and criticism as we continue to learn.

Finally, it almost goes without saying that we are hiring. Right now we are looking to build a team in Seattle but are seriously considering what it takes to go fully distributed. If you want to join us in bringing the joys of “cloud native” infrastructure to everyone by working on open source software, please reach out to jobs@heptio.com and say that I sent you.

Recipe: Docker Logs → Google Stackdriver Logging

Sun, May 29, 2016

Docker logs are a pain to deal with. If you don’t do anything they’ll fill up your disk¹. In addition logs are easily lost if you need to delete and recreate a container.

Google Cloud Platform has a debug log service that has evolved from the (excellent) logging system built in to App Engine. With the acquisition and integration of Stackdriver it looks like this has been rebranded/merged.

Recently a new gcplogs Docker log driver was merged (by Mike Danese) and it looks like it is available in Docker 1.11. Getting it up and working is pretty easy but there are couple of gotchas.

Service Account Scopes

First, I did this on a GCE instance that already had the right service account scopes set up. It looks like you need the https://www.googleapis.com/auth/logging.write or https://www.googleapis.com/auth/cloud-platform² scopes.

My instance already had the logging.write scope so I was all set. I was able to tell by looking at the instance in the cloud console. Or you can use gcloud (as below). If you don’t see the right scope here you’ll need to restart your instance or pretend you aren’t on GCE.

$ gcloud compute instances describe --format='yaml(serviceAccounts[].scopes[])' my-instance
serviceAccounts:
- scopes:
  - https://www.googleapis.com/auth/compute
  - https://www.googleapis.com/auth/devstorage.read_write
  - https://www.googleapis.com/auth/logging.write
  - https://www.googleapis.com/auth/userinfo.email

If you aren’t on GCE, you’ll want to get a service account key and use that. I haven’t done this but it is documented as part of the Google Cloud Application Default Credentials system.

Activate the Logging API

You need to “enable” the logging API before you can use it. The fact that you have to manually toggle APIs is probably a historical artifact at GCP. Just go here and flip it on. Make sure you are pointed to the right project.

If you don’t do this first you may tickle a bug in Docker/aufs that will make the container un-deletable.

Launch a container

Now just launch a container and have it log to GCP.

$ docker run -d --name my-container \
  --log-driver=gcplogs \
  --log-opt gcp-log-cmd=true \
  [...]

You can also set gcplogs as the default for all containers in the daemon but I haven’t tried that yet.

View Logs on Cloud Console

You can now view these logs on the cloud console. Again, make sure you are pointed at the right project.

To be honest, there is a lot of structure here and the web UI isn’t as smooth as it could be. You can filter on these fields but you can’t control what is shown.

You can filter which events are shown. This bit is a bit confusing. It looks like there are 2 filter UIs. I was using the “advanced” filter UI that you can get to from the little down arrow icon to the right of the filter box. The easiest thing to filter on is the container name like so:

structPayload.container.name="/my-container"

In addition, there is a v1 syntax and a v2 syntax. I think this box uses the v1 syntax. It isn’t clear how deep the differences are.

View logs from the command line

Because of the lack of control on which fields are show in the log viewer, it is probably best to view logs from the command line. Note that this is beta level in gcloud so the examples here may break in the future.

First, install the beta gcloud tools if you don’t already have them.

gcloud components install beta

Now you can list the logs easily using gcloud beta logging read. I bundled this up into a shell function that is in my .bashrc.

function gcloud-docker-logs {
  local dayago
  if [ $(uname -s) = "Darwin" ]; then
    dayago=$(date -v-1d -u '+%Y-%m-%dT%H:%M:%SZ')
  else
    dayago=$(date  --date="-1 day" -Isec -u)
  fi

  gcloud beta logging read \
    --order ASC \
    --format='value(timestamp, jsonPayload.data)' \
    "jsonPayload.container.name=\"/$1\" AND timestamp > \"${dayago}\""
}

Decoder ring:

--order ASC makes it look like you are cating a log file.
--format='value(timestamp, jsonPayload.data)' picks the fields we want to show. This is just the timestamp and the log line. For some reason in the web UI this is called structPayload but here it is jsonPayload. Perhaps this is v1 vs v2 of the API and filter syntax?
"jsonPayload.container.name=\"/$1\"" is the filter to apply. Here we just slap in the name.
We limit this to the last day or logs. This happens by default when order is set to DESC but we want ascending order. Because of this we have to explicitly limit the date range. date is different on BSD systems (MacOS) vs. GNU systems (Linux). Ug.

Feedback for product groups

Docker logging should have rotation by default. The fact that out of the box it fills up your disk is horrible.
I’m sad that you still can’t change the scopes on a GCE instance while it is running. This has been in the bug database since scopes were first added.
There is a bunch of structure in that isn’t making it into the logging system. For instance, severity isn’t reflected (at least put stderr as WARN?). It also might be good to hoist the instance name and container name into the primary and secondary keys for logging. custom.googleapis.com/primary_key: "primary_key" looks like something was left stubbed out.
The logging UI should let you select what to show in addition to filtering the events. As more structure gets added the simple JSON serialization that is there now won’t work. In addition, that JSON serialization seems to be non-deterministic with respect to field order.
gcloud beta logging read should have a “follow” mode so that you can tail the logs. This mode exists in the UI.
The GCP logging filter syntax is confusing and inconsistent. Simple vs advanced. v1 vs v2.
Why am I the one writing this guide? I don’t work for Google anymore.

It isn’t that hard to set up log rotation but you have to do it. And after you do, docker logs can’t see old logs. Example on how to do that here. ↩
This scope is new since from when I worked at Google. I think it is a “catch all” scope across many GCP services. Kind of like root. I can’t find any documentation on it specifically. ↩

Taking a Leap: Entrepreneur in Residence at Accel

Wed, Nov 4, 2015

I’m excited to share that I’m going to be joining Accel as an Entrepreneur in Residence. I’m honored to be invited to work with Accel to explore ideas that will hopefully result in founding a new company.

I’m committed to staying in Seattle. While I’ll be traveling down to the Bay Area quite a bit to make the most out of this opportunity with Accel, my heart and family are in Seattle. Seattle has an amazing legacy and talent pool. The startup community is growing rapidly. I’m excited to explore joining it as a founder. In addition, I can be a bit of a bridge for Accel to the unique opportunities that are developing here in the Pacific Northwest. However, while I plan to stay in Seattle, I’m open to working with strong people wherever they may be based.

So what’s the story behind this move? After 10 years at Google, I went on leave back in February. I was a bit burned out and my goal was to relax and actually spend time with my family. After three months on leave, I was having so much fun I decided not to return to work at Google. My time at Google had been amazing – both in exposure to amazing technology and amazing people. I’ve learned an enormous amount and grown both as an engineer and how I view our industry. Over the summer our family did some traveling together: a camping road trip down the coast and a visit with family in Chicago. My wife and I also had a chance to celebrate our 15th wedding anniversary with a trip to London together. I am very fortunate to have been able to take this time off.

As the summer wore on, I knew I’d want to figure out what the next arc of my professional career was going to look like. To this end, even though I’ve been “funemployed”, I’ve been spending a ton of time meeting all sorts of people from around our industry. Talking with new people and hearing about what they are building has been invigorating.

One of the people that I’ve had a lot of great conversations with is Ping Li from Accel. I’ve known Ping now for over a year (time flies!) and I’ve recently started meeting some of the other folks at Accel. I’ve been impressed with the depth of thought that Accel brings when evaluating opportunities along with the courage to be the first investor in.

My first task is to figure out how to spell “entrepreneur” reliably. Past that, I’m want to do three things during my time with Accel:

Prepare myself for the marathon. I’ve never built or worked at a startup before. In this instance, I’m pretty out of the ordinary for an EIR. But I do know that it is a grueling process with very high highs and very low lows. I need to make sure that I’m ready.
Find the right co-founders. I’ve met lots of great people during my 17+ career in industry. There are plenty that I’d love to start a company with. All of them are currently employed. The EIR is a great way to get started while making sure the timing is right for any co-founders.
Put together a compelling vision. A technology is not a product and a product is not a business. While it is possible to start a company without much more than a general direction, that isn’t good enough for me. I want to have a good coherent vision for how whatever I’m going to do will hang together as a business. This will take time and research. I’m planning to tap Accel’s network and portfolio companies to validate ideas and plans.

Throughout my career, I’ve continually built platforms. I like building systems that help others do things that they couldn’t do before. My recent work to help start the Kubernetes project speaks to this. There is a good chance that wherever this leads will involve Kubernetes in some capacity.

I’m also taking personal responsibility to make sure that the workplace that I help to craft is empathetic and diverse. We have huge problems in our industry and I want to do my part to create some sanity.

Wherever this road leads, I’m sure it won’t be dull. I’ve enjoyed myself most when working with a small team to create something from nothing and I’m really looking forward to doing that again. If you’d like to chat in depth about this please don’t be afraid to reach out. I’m active on twitter as @jbeda. I’d be happy to throw ideas around there or over coffee.

Anatomy of a Modern Production Stack

Tue, Sep 8, 2015

(I’m updating this post as folks comment. You can look at the history on github.)

I was chatting on an Xoogler message board the other day and Dennis Ordanov (@daodennis) was asking about the basic moving parts of a production stack. I just started enumerating them from memory and thought it might be a good blog post¹. So, here is a mostly stream-of-consciousness dump of the parts a modern (container based) production environment².

A note on the term “modern”: This is my view, based on experiences at Google, for a stack that delivers what I’d want for a major production system. You can do this without containers, but I think it is hard to meet my criteria that way. The full stack here probably isn’t necessary for small applications and, as of today, is way too hard to get up and running. The qualities that I’d look for in a “modern” stack:

Self healing and self managing. If a machine fails, I don’t want to have to think about it. The system should just work.
Supports microservices. The idea of breaking your app into smaller components (regardless of the name) can help you to scale your engineering organization by keeping the dev team for each µs small enough that a 2 pizza team can own it.
Efficient. I want a stack that doesn’t require a lot of hand holding to make sure I’m not wasting a ton of resources.
Debuggable. Complex applications can be hard to debug. Having good strategies for application specific monitoring and log collection/aggregation can really help to provide insights into the stack.

So, with that, here is a brain dump of the parts that make up a “modern” stack:

Production Host OS. This is a simplified and manageable Linux distribution. Usually it is just enough to get a container engine up and running.
- Examples include CoreOS, Red Hat Project Atomic, Ubuntu Snappy, and Rancher OS.
Bootstrapping system. Assuming you are starting with a generic VM image or bare metal hardware, something has to be able to bootstrap those machines and get them running as productive members of the cluster. This becomes very important as you are dealing with lots machines that come and go as hardware fails.
- Cloud Foundry BOSH was created to do this for Cloud Foundry but is seeing new life as an independent product.
- The standard config tools (Puppet, Chef, Ansible, Salt) can serve this role.
- CoreOS Fleet is a lightweight clustering system that can also be used to bootstrap more comprehensive solutions.
Container Engine. This is the system for setting up and managing containers. It is the primary management agent on the node.
- Examples include Docker Engine, CoreOS rkt, and LXC and systemd-nspawn.
- Some of these systems are more amenable to being directly controlled remotely than others.
- The Open Container Initiative is working to standardize the input into these systems – basically the root filesystem for the container along with some common parameters in a JSON file.
Container Image Packaging and Distribution. A Container Image is a named and cloneable chroot that can be used to create container instances. It is pretty much an efficient way to capture, name and distribute the set of files that make up a container at runtime.
- Both Docker and CoreOS rkt solve this problem. It is built into the Docker Engine but is broken out for rkt as a separate tool set call acbuild.
- Inside of Google this was done slightly differently with a file package distribution system called MPM.
- Personally, I’m hoping that we can define a widely adopted spec for this, hopefully as part of the OCI.
Container Image Registry/Repository. This is a central place to store and load Container Images.
- Hosted versions of this include the Docker Hub, Quay.io (owned by CoreOS), and Google Container Registry.
- Docker also has an open source registry called Docker Distribution.
- Personally, I’m hoping that the state of the art will evolve past centralized solutions with specialized APIs to solutions that are simpler by working regular HTTP and more transport agnostic so that protocols like BitTorrent can be used to distribute images.
Container Distribution. This is the system for structuring what is running inside of a container. Many people don’t talk about this as a separate thing but instead reuse OS distributions such as Ubuntu, Debian, or CentOS.
- Many folks are working to build minimal container distributions by either using distributions based in the embedded world (BusyBox or Alpine) or by building static binaries and not needing anything else.
- Personally, I’d love to see the idea of a Container Distribution be further developed and take advantage of features only available in the container world. I wrote a blog post on this.
Container Orchestration System. Once you have containers running on a single host, you need to get them running across multiple hosts.
- This is a super hot area of interest with lots of innovation.
- Open source deployable examples include Kubernetes, Docker Swarm, and Apache Mesos.
- Hosted systems include Google Container Engine (GKE) (based on Kubernetes), Mesosphere DCOS and Amazon EC2 Container Service (ECS). Recently announced is the Microsoft Azure Container Service based on Mesosphere DCOS and Docker Swarm.
Orchestration Config. Many of the orchestration systems have small granular objects. Creating and parameterizing these by hand can be difficult. In this context, an orchestration config system can take higher level description and compile them down to the nuts and bolts that the orchestration systems works with.
- The Google solutions to this problem have never been made public (to my knowledge).
- AWS CloudFormation and Google Cloud Deployment Manager play this role for their respective cloud ecosystems (only).
- Hashicorp Terraform and Flabbergast look like they could be applied to container orchestration systems but haven’t yet.
- Docker Compose is a start to a more comprehensive config system.
- The Kubernetes team (Brian Grant especially) have lots of ideas and plans for this area. There is a Kubernetes SIG being formed.
Network Virtualization. While not strictly necessary, clustered container systems are much easier to use if each container has full presence on the cluster network. This has been referred to as “IP per Container”.
- Without a networking solution, orchestration systems must allocate and enforce port assignment as ports per host are a shared resource.
- Examples here include CoreOS Flannel, Weave, Project Calico, and Docker libnetwork (not ready for production yet). I’ve also been pointed to OpenContrail but haven’t looked deeply.
Container Storage Systems. As users move past special “pet” hosts storage becomes more difficult.
- I have more to say on this that I’ll probably put into a blog post at some point in the future.
- ClusterHQ Flocker deals with migrating data between hosts (among other things).
- I know there are other folks (someone pointed me at Blockbridge) that are working on software defined storage systems that can work well in this world.
Discovery Service. Discovery is a fancy term for naming. Once you launch a bunch of containers, you need to figure out where they are so you can talk to them.
- DNS is often used as a solution here but can cause issues in highly dynamic environments due to aggressive caching. Java, in particular, is troublesome as it doesn’t honor DNS TTLs by default.
- Many people build on top of highly consistent stores (lock servers) for this. Examples include: Apache Zookeeper, CoreOS etcd, Hashicorp Consul.
- Kubernetes supports service definition and discovery (with a stable virtual IP with load balanced proxy).
- Weave has a built in DNS server that stores data locally so that TTLs can be minimal.
- Related is a system to configure wider facing load balancer to manage the interface between the cluster and the wider network.
Production Identity and Authentication. As clustered deployments grow, an identity system becomes necessary. When microservice A calls microservice B, microservice B needs some way to verify that it is actually microservice A calling. Note that this is for computer to computer communication within the cluster
- This is not a well understood component of the stack. I expect it to be an active area of development in the near future. Ideally the orchestration system would automatically configure the identity for each running container in a secure way.
- Related areas include secret storage and authorization.
- I’ve used the term “Authentity” to describe this area. Please use it as I’m hoping it’ll catch on.
- conjur.net is a commercial offering that can help out in this situation.
Monitoring. A modern production application has deep monitoring. Not only should the operations folks make sure that the binaries continue to run, they should also be monitoring application specific metrics that are thrown off by each microservice that makes up that application.
- A modern monitoring solution is characterized by its ability to deal with a wide set of metrics along with flexible vector math necessary to do complex aggregations and tests. Time series data should be sampled frequently (30-60s or less), stored for a long time and be easily explored and graphed. A good monitoring system not only lets you know when things are down but also is a critical debugging tool to know where and how things are broken.
- For open systems, Prometheus looks very interesting in this area.
- There are also ways to break this apart into smaller parts such as Grafana as a frontend backed by a dedicated time series database like InfluxDB or OpenTSDB. I’ve also had multiple people point to Graphite in this space also.
- Heapster is a container specific monitoring system that surfaces data collected by cAdvisor.
- There are hosted systems such as Google Cloud Monitoring and SignalFx.
- I’m not an expert here so I’m sure I’m missing some of the awesome stuff going on in this area.
Logging. Logging can generally be broken into two types: unstructured debug logs and structured application logs. The debug logs are used to figure out what is going on with the system while structured logs are usually used to capture important events for later processing and aggregation. One might use structured logs for ad impressions that are critical for reconciling real money.
- Systems such as fluentd and logstash are agents that collect and upload logs.
- There are a ton of systems for storing and indexing logs. This includes elasticsearch along with more traditional databases (MySQL, Mongo, etc.).
- Hosted systems include Google Cloud Logging.
- Logging can also throw off monitoring signals. For instance, while processing saved logs the local agent can count 500s and feed those into a monitoring system.
- Systems like Apache Flume³ can be used to collect and reliably save structured logs for processing in the Hadoop ecosystem. Google BigQuery and Google Cloud Dataflow are also well suited to ingesting and analyzing structured log data.
Deep Inspection and Tracing There are a class of tools that help to do deep debugging.
- Inside of Google, Dapper is a great example of tracing a user request across many microservices. Appdash and Zipkin are open source system inspired by Dapper.
- Startups like Sysdig in this category too by allowing deep inspection and capture of what is going on with a machine.

PaaS systems often help to bring this all together in an easy way. Systems like OpenShift 3, Deis, or Flynn build on top of some of the independent systems above. Other PaaS such as Heroku, Google App Engine or Cloud Foundry are more vertically integrated without the component layers being broken out in a well supported way.

Next on the list would be to talk about continuous integration/continuous deployment (CI/CD) systems and systems for communicating between microservices (RPC and queues). But I think I’ll stop here. If this is useful (or if you think I’m missing anything huge) please let me know via twitter. Or you can comment on the Hacker News thread.

Brandon Philips from CoreOS points me to a similar post from Barak Michener. I go into more minutia here and don’t try and define a strict stack. ↩
Some caveats:
- I’m sure I’m missing some parts of the stack.
- The way I break this problem down is based on my experiences at Google. There are many ways to skin this cat.
- I’ve listed example projects/products/companies/systems at different levels but this isn’t meant to be exhaustive.
- The fact that I’ve listed a system here doesn’t mean that I’ve run it in production and it has my stamp of approval.
↩
Don’t confuse Apache Flume with Google FlumeJava. I guess once you start processing logs some names are just obvious. Also see Google Sawzall and Google Dremel. ↩

Operations Lock-in vs. Development Lock-in

Tue, Sep 1, 2015

I’m not a large enterprise developer. My experiences are at the unique environments of Google and Microsoft¹. Over my career I’ve helped build platforms and I’ve learned that to build a great developer platform (or any good product at all) requires you to talk to many customers and put yourself in their mindset. This extends from the day to day usability of the product to the decision of a customer to bet a part of his/her business and career on the product in the first place.

Lock-in is obviously one of the critical things that many customers take into account. But lock-in isn’t absolute. Different types of lock-in are defined by the friction and cost incurred in switching.

One interesting way to view lock-in is by looking at who will bear the brunt that switching cost. You can look at this from the point of view of Ops vs. Dev².

With Operations Lock-in the operations team is primarily involved in a migration from one place to another. The operations playbooks, policies, and procedures may have to change but the code will be substantially unchanged. Ideally, nothing needs to be recompiled or rebuilt.
With Developer Lock-in someone has to actually go in and crack the code. This could be something minor (swapping in a new library with a standard language interface) to more architectural changes (changing your database).

Different organizations have different skill sets and one type of lock-in might be more serious than another. While someplace like Google or Microsoft may have high quality software engineers to spare, most large enterprises are severly constrained here. Often projects were written by developers that are long gone – they were independent contractors, left the company, or have moved on to other projects. In this case developer lock-in can be much more costly than operations lock-in.

Think about it this way – a user has a large system built on open software packages running on on-prem hardware. Switching to a VM/IaaS offering is a relatively cheap project as it can be an operations focused project with little to no developer involvement. As IaaS offerings evolved to be more and more similar to real hardware (persistent network block stores, familiar networking) this trend really accelerated.

On the other hand, if a user builds a software product around a very proprietary set of APIs (either language APIs or service APIs) switching can be much more expensive. Many PaaS offerings would be in this category. For example, moving a large application away from AWS Lambda would require a rewrite. Google App Engine has similar issues if the native datastore is being used.

This is all critical to keep in mind as we create new ways to build and run systems. Many times betting on open APIs³ and open source systems helps to fight lock-in by making switching an operations project vs. a development project. This is why, despite the eye-rolls, I’m excited about the emergance of the Open Container Initiative and the Cloud Native Computing Foundation.

What other ways do you think about lock-in? Tweet at me (@jbeda) and let’s start a discussion.

I started my career on Internet Explorer and moved on to create graphics APIs (including XAML) for Windows Presentation Foundation (WPF). At Google I started on Google Talk building out XMPP specifications for call signaling and eventually started Google Compute Engine and helped to start Kubernetes. ↩
There are, of course, other types of lock-in. One that is really interesting is data locality lock-in. As data gets bigger and bigger, simply moving that data around can be complicated and costly. This gets even more complicated as that data continues to grow while it is being moved. ↩
With the advent of cloud APIs, we’ve actually taken a bit of a step back in terms of open APIs. While there are some companies (such as Slack) that encourage alternate implementations of their APIs, many companies are silent or hostile to others implementing those APIs. I’d love to see a pattern of companies releasing their APIs with an open license or donating the IP to a foundation. This is made only more critical in light of the Oracle API copyright decision. ↩

Container Native Package System

Wed, Jul 1, 2015

A lot of exciting things happened at Dockercon 2015 last week. For me the most exciting was the announcement of the Open Container Project foundation. Not only is it great to see the community come together under one banner but also as a chance to entertain new ideas in this space.

These are some thoughts on how to improve what we consider a “container image.” I’m looking at both the container format itself and what goes on inside of it. This obviously builds on ideas in other systems and I’ve tried to call those out. These thoughts are still early so I’m hoping to find others of like mind and start a good discussion.

History and Context

The thing I’ve been exploring most recently is the intersection between the container image format and package management. While there has been plenty of attention on base OSs to host the container (CoreOS, RancherOS, Project Atomic, Snappy Ubuntu) and efforts to coordinate a cluster of hosts (Kubernetes, Mesosphere, Docker Swarm) we haven’t paid as much attention as we could to what is going on inside the container.

Docker Images are great. Images are pretty efficient to push and pull and, with new focus on security, it is getting easier and easier to be sure that what you want in the image is actually what you are running.

Dockerfiles are also great. They are a purpose built makefile analog that are super easy to understand and logically build on the layered nature of Docker images. Like most of the Docker project, they are much more approachable than other efforts in this area and solve real customer needs. When constructed appropriately, they allow for an efficient dev flow where many of the time consuming steps can be reused.

One of the best innovations of Docker is actually a bit of an awesome hack. It leverages the package managers for existing Linux distributions. Reusing the package manager means that users can read any number of guides on how to get software installed and easily translate it into a Dockerfile.

Think of it this way: a typical Linux distribution is 2 things. First is a bunch of stuff to get the box booted. Second is a package manager to install and manage software on that box. Docker images typically only need the second one. The first one is along for the ride even if the user never needs it. There are package managers out there that are cleanly factored from the underlying OS (Homebrew, Nix) but they aren’t typically used in Docker images.

Problems

This all mostly works okay. There is some cruft in the images that can easily be ignored and is “cheap” as the download and storage cost is amortized because of layer sharing for Docker images.

But we can do better.

Problems with the current state of the world:

No package introspection. When the next security issue comes along it is difficult to easily see which images are vulnerable. Furthermore, it is hard to write automated policy to prevent those images from running.
No easy sharing of packages. If 2 images install the same package, the bits for that package are downloaded twice. It isn’t uncommon for users to construct complicated “inheritence” chains to help work around this issue¹.
No surgical package updating. Updating a package requires recreating an image and re-running all downstream actions in the Dockerfile. If users are good about tracking which sources go into which image², it should be possible to just update the package but that is difficult and error prone.
Order dependent image builds. Order matters in a Dockerfile — even when it doesn’t have to. Often times two actions have zero interaction with each other. But Docker has no way of knowing that so must assume that every action depends on all preceding actions.
Package manager cruft. Most well built Dockerfiles have something like:
```
RUN apt-get update \
  && apt-get install -y --no-install-recommends \
    build-essential \
  && rm -rf /var/lib/apt/lists/*
```
This helps to minimize the size of the layer on disk. This is confusing boilerplate that is likely just cargo-culted by many users.

Ideas for Solutions

While I don’t have a fully formed solution to all of these problems, I do have a bunch of ideas.Imagine that we take the idea of a container image and break it down a little.

The first thing we define is a FileSet. A FileSet is a named, versioned and verified set of files. Google has a system internally called the “Midas Package Manager” (MPM) that does this³. Dinah McNutt gave a great talk on MPM at a 2013 USENIX conference. A further tweak would allow a FileSet to import other FileSets into the file namespace of the host. This allows for a FileSet to have multiple “parents” – unlike the current Docker layered image format.

Second, we define a Package as a type of FileSet. It would have a standard directory structure and include metadata on other packages required along with simple instructions to “install” the package⁴. Ideally, these packages would be built from verified sources and a verified tool chain to enable true provenance for every bit. This would be optional.

Finally, we would redefine a ContainerImage also as a type of FileSet that has metadata necessary to make it runnable. The definition of this metadata is a big part of what the Docker Image format and the ACI format are.

A ContainerImage that is using this container native package system would define a set of read-only imports (using the FileSet import feature described above) of all required packages FileSets. Image construction tools would verify that all package dependencies are satisfied. Furthermore, the install steps would be run to symlink⁵ all of the packages into the appropriate places⁶.

User code could either be packed up as a Package or just inserted directly into the ContainerImage.

Analysis

By creating a container friendly packaging system and expanding the idea of what an container image is, we can solve most of the issues outlined again.

The list of FileSets imported into, say, /packages would be the list of all packages versions that are included in that image.
Individual FileSets could be cached by hosts and easily and safely shared between disparate images.
A package could be updated in a straightforward way. The toolset would have to make sure that all dependencies are satisfied and that the install steps are run as necessary.
Image build tools would list the packages necessary and order wouldn’t matter. Because there are multiple “parents” to an image, order cannot matter.
The package install cruft (archived version of the package) would be handled on the host side similar to images themselves. The only thing the container would see would be the actual files – and they would be symlinked in.

There are some missing and underdefined parts to this story.

How are packages created? I’m thinking that we could do that by running a container with build time packages that produces output file into a specific directory. Files in that directory are then used to create a package. As part of this the inputs into the build container could be included in the package metadata and signed.
What does a package distribution look like? I imagine we’d have curated sets of packages that are known to work well together. For instance, xyz.com could create xyz.com/apache that depends on xyz.com/openssl.
How do users override packages? Perhaps abc.com/openssl could specify that it can be used in place of xyz.com/openssl. Any guarantees by xyz.com would be void but it would be a way to do custom versions and carry patches.
Opportunity: Kernel and capability requirements. Packages could specify their requirements in a way that would be visible to the host. This would provide a more direct requirement chain between the host and the code running in the container.

This solution obviously borrows from both Homebrew and Nix. What I think is new here is the idea of expanding the definition of an container image with FileSets and making this be fundamentally decentralized. We also need to ensure that the easy to approach spirit of Dockerfile isn’t lost. If we do this right we can make images much easier to efficiently create, verify, update and manage.

Ping back to me on twitter (@jbeda) or we can talk over on Hacker News

(Thanks to Kelsey Hightower for reviewing an earlier version of this post.)

The standard golang image is a great example of this. golang:1.4.2 → buildpack-deps:jessie-scm → buildpack-deps:jessie-curl → debian:jessie. Most of this is done to enable efficient sharing of installed packages. ↩
Best practices should be to track every single input into your docker file. That means that if you are pushing sources you should know which git commit, for example, those sources come from. My guess is that this is rarely done. ↩
Actually, we need our system to be decentralized. MPM, like may package management systems (including Homebrew and Nix) has a single central repository/database of all packages. Whatever is used here must be distributed — probably in a namespace rooted with DNS. Something like Docker Notary would play a role in signing and verifying packages. Something like the Nix archive format (NAR) will help make this more stable and predictable. ↩
Package install should consist of simply symlinking files into some common directories (/bin, /lib). This would all be done via a declarative manifest. There are probably going to be cases where an “install” is a little bit more complex and a script is necessary. I’d love to see how far we could get before that becomes absolutely necessary. It is also assumed that the package directories themselves are only ever mounted read only. ↩
Josh Wood (@joshixisjosh9), via twitter, points out some issues with using symlinks. An alternative here would be to use bind mounts. But it is unclear how many bind mounts Linux can handle (100 containers with 100 bind mounts = 10,000 bind mounts) and setting them up requires root privledges. ↩
There is a choice on when the package install happens. It could happen early as the container is created. Or it could happen late as part of the container start process. I’d prefer late binding as it makes surgical package updating simpler. The directories that store the symlinks could be tmpfs directories to keep this all very speedy. ↩

A New Beginning...

Wed, Jun 10, 2015

Welcome to the new 80%!

I’ve rewritten my ancient blog on top of Hugo using Bootstrap and Google Fonts. The whole thing is hosted on GitHub if you want to check out the source.

For headlines, I’m using Economica. I picked it because it is narrow and I tend to write long headlines. The body text is Lora. I like a nice Serif font for the body text and it has a bit of style while still being very readable. Code is in Roboto Mono and the Logo and Navigation is in Coda.

I’m not in love with the color theme I have now (dark gray background with red and blue highlights) but it’ll do for now. I played with some CSS gradients to show a preview on the main index page. I’m still not sure – it may be a little too cute.

I’ve done the work of importing my old blog posts and making Hugo create compatible URLs. My original system used XML files for each day so I wrote a short Go program to convert from the XML to something Hugo can consume (markdown files with a JSON header). I had to introduce a verbatim HTML shortcode to make Hugo pass through the HTML from the old blog directly.

The other tricky thing was to generate a page per day instead of a page per post for these old posts. Rolling things up on a daily basis was the way things were done back when I started the blog in January 2003. Doing this daily roll up used the taxonomy feature of Hugo where the taxonomy was “archive” and the term was the day. This was easy to generate from the Go conversion program.

I’m hosting this on GCE under Docker and NGINX. I’m going to write up a post on how I’m doing that and automatically syncing it to the git repro on each submit.

Now that I have it all set up, we’ll see if I actually have anything to say…

HOWTO: Installing Highpoint Rocketraid 222x on Ubuntu Dapper (6.06 LTS)

Tue, Sep 18, 2007

I haven't updated my blog in forever and I'm probably going to abandon my homebrew static client generated site for something like Mephisto at some point but I haven't had time to make the transition.

In the meantime, I'd like to save people some pain and document the steps I have go through to upgrade my Highpoint RocketRaid 2220 on Linux. I installed the driver a while ago and don't remember the exact steps for that, so this is just what I do to upgrade. I wrote a little shell script:

#! /bin/sh -v

# Update this version every time you upgrade.
VER=2.6.15-29-686

# To update the highpoint driver:
# 1) Download latest highpoint driver from: 
#
# http://www.highpoint-tech.com/BIOS_Driver/rr222x/Linux/
#
# 2) Patch driver by changing wrong kernel #ifdefs in osm/linux/os_linux.c  
#   KERNEL_VERSION(2,6,15) -> KERNEL_VERSION(2,6,16)

# Disable sata_mv.ko by moving it to a new directory.  This driver
# conflicts with highpoint driver.  I don't know if this is the
# "right" way to do this, but it works.
sudo mkdir -p /lib/modules/${VER}.disabled
sudo mv /lib/modules/${VER}/kernel/drivers/scsi/sata_mv.ko /lib/modules/${VER}.disabled/

# Make sure kernel headers are installed
sudo apt-get install linux-headers-${VER}

# Make new hpt driver:
cd ~jbeda/sources/rr222x-linux-src-1.07/product/rr2220/linux/
make KERNELDIR=/lib/modules/${VER}/build
sudo make install KERNELDIR=/lib/modules/${VER}/build

# make a new ramfs
# mkinitramfs -o /boot/initrd.img-2.6.15-27-686 /lib/modules/2.6.15-27-686
sudo dpkg-reconfigure linux-image-${VER}

Good luck. I hope this helps people out there that are stuck with this thing. I'm still looking for a good cheap solution to host lots of SATA drives on Linux. Port multipliers are there and aren't as cheap as they should be. The driver situation is pretty dire and there aren't that many non RAID (fake or not) >4 ports out there. I haven't tried any SAS cards though -- perhaps the situation there is better. I'm also running an LSI MegaRAID SATA 300-8XLP with the megaraid driver. It wasn't as cheap but at least it works with a true open driver.

"Avalon marks the end of the American Dream"

Thu, Aug 3, 2006

Miguel de Icaza says "Avalon marks the end of the American Dream." He also compares it to J2EE -- apparently implying that it is overly complex and overarchitected.

Ouch.

While I wouldn't put it that way, I can't disagree. I left Microsoft almost two years ago and Avalon still hasn't shipped. A 5+ year ship cycle for a project can't be seen as anything but a sign that something is horribly wrong. When I was on Avalon we kept talking about building an API for the next 10 years. Apparently, when Avalon ships there will be 5 years left on that clock.

I take partial responsibility for this. When we were first starting Avalon, I was all about "Go big or go home" and "We should build something only Microsoft can build." In retrospect, the project and the company might have been better served by starting with a much smaller team, aiming lower to start and shipping 5 times over those 5 years. Version 1 might not have been that impressive, but relentless improvement would have built something better factored, simpler, and more in tune with what users actually need.

I named this blog "eightypercent" in honor of the 80% rule. It just so happens there there are lots of 80% rules to apply. In this case, a simpler system that only solved 80% of the problem would have been good enough and would have shipped multiple times already.

It looks like the WPF/E project is an effort to strip Avalon down to something much more approachable. Cross platform, no full CLR, lower memory footprint -- sounds a lot like Flash/Flex. I know some of the guys working on the project and I have high hopes that it will be something interesting. The only question, when will it ship?

Seattle's Homeless Alcoholics on NPR

Wed, Jul 19, 2006

Coming in this morning, I heard a segment on NPR covering a unique program that King Country is running to provide rooms for homeless alcoholics in Seattle. The unique and controversial part of this program is that the residents can continue to drink.

My wife, Rachel, has first hand experience with this problem from her work at the Harborview ER. Some of these "frequent fliers" are indeed part of the community at the ER. In fact, one of the patients that Rachel had lots of interactions with (ever since she was a med student!) recently died and it really shook her up. Anecdotally, Rachel has seen this program provide a positive impact.

In any case, this novel program seems to me to be a unique way to approach a very difficult problem. It reduces the cost to the taxpayers and provides a safe place for these individuals. Obviously we would all like to see these problems solved, but, failing that, at least the county is trying to manage it.