Docker’s momentum has been increasing by the week, which suggests it’s touching on real problems. However, for many production users today, the pros do not outweigh the cons. Docker has done fantastically well at making containers appeal to developers for development, testing and CI environments, but it has yet to disrupt production. In light of DockerCon 2015’s “Docker in Production” theme, I’d like to discuss publicly the challenges Docker has yet to overcome to see wide adoption for the production use case.
None of the issues mentioned here are new; they all exist on GitHub in some form. Most I’ve already discussed in conference talks or with the Docker team. This post deliberately leaves out what is no longer an issue: for instance, the new registry overcomes many shortcomings of the old. Many areas that remain problematic are not mentioned here either, but I believe that what follows are the most important issues to address in the short term to enable more organizations to take the leap to running containers in production. The list is heavily biased by my experience of running Docker at Shopify, where we’ve been running the core platform on containers at scale for more than a year. With a technology moving as fast as Docker, it’s impossible to keep everything current. Please reach out if you spot inaccuracies.
Building container images for large applications is still a challenge. If we are to rely on container images for testing, CI, and emergency deploys, we need to have an image ready in less than a minute. Dockerfiles make this almost impossible for large applications. While easy to use, they sit at an abstraction layer too high to enable complex use-cases:
- Out-of-band caching for particularly heavy-weight and application-specific dependencies
- Accessing secrets at build time without committing them to the image
- Full control over layers in the final image
- Parallelization of building layers
Most people do not need these features, but for large applications many of them are prerequisites for fast builds. Configuration management software like Chef and Puppet is widespread, but feels too heavy-handed for image building. I bet such systems will be phased out of existence in their current form within the next decade with containers. However, many applications rely on them for provisioning, deployment and orchestration. Dockerfiles cannot realistically capture the complexity now managed by config management, but this complexity needs to be managed somewhere. At Shopify we ended up creating our own system from scratch using the docker commit API. This is painful. I wish this on nobody and I am eager to throw it out, but we had to do it to unblock ourselves. Few will go to this length to wrangle containers to production.
What is going to emerge in this space is unclear, and currently it’s not an area where much exploration is being done (one example is dockramp, another is packer). The Docker Engine will undergo work in the future to split the building primitives (adding files, setting entrypoints, and so on) from the client (Dockerfile). Work merged for 1.8 will already make this easier, opening the field for experimentation by configuration management vendors, hobbyists, and companies. Given the history of provisioning systems, it’s unrealistic to believe the community will settle on a standard for this problem, like it has for the runtime. The horizon for scalable image building is quite unclear. To my knowledge nobody is actively iterating, and unfortunately it’s been this way for over a year.
Every major deployment of Docker ends up writing a garbage collector to remove old images from hosts. Various heuristics are used, such as removing images older than x days, and enforcing at most y images present on the host. Spotify recently open-sourced theirs. We wrote our own a long time ago as well. I can understand how it can be tough to design a predictable UI for this, but it’s absolutely needed in core. Most people discover their need by accident when their production boxes scream for space. Eventually you’ll run into the same issue with the Docker registry overflowing with large images; however, that problem is on the distribution roadmap.
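To make the two heuristics above concrete, here is a minimal sketch of the selection step only. It is not Spotify's or Shopify's collector: the `Image` record, the `select_for_removal` function and the thresholds are all hypothetical names made up for illustration, and the sketch assumes you have already listed the images on the host and know which ones back running containers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Image:
    """Hypothetical record for an image on a host; not a Docker API type."""
    image_id: str
    created: datetime
    in_use: bool = False  # True if a running container references the image


def select_for_removal(images, now, max_age_days, max_images):
    """Return the IDs of the images the two heuristics would remove.

    Images in use are never selected. Among the rest, anything older than
    max_age_days is removed, and then the oldest survivors are removed until
    at most max_images unused images remain.
    """
    cutoff = now - timedelta(days=max_age_days)
    unused = sorted(
        (img for img in images if not img.in_use),
        key=lambda img: img.created,
        reverse=True,  # newest first
    )
    fresh = [img for img in unused if img.created >= cutoff]
    keep = {img.image_id for img in fresh[:max_images]}
    return [img.image_id for img in unused if img.image_id not in keep]
```

The function only decides what to delete; actually removing the images (for example with `docker rmi`) and handling images shared between tags is left out on purpose, and that is where most of the real-world complexity lives.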
Iteration speed and state of core
Docker Engine has focused on stability in the 1.x releases. Pre-1.5, little work was done to lower the barrier to entry for production uptake. Developing the public mental model of containers is integral to Docker’s success and they’re rightly terrified of damaging it. Iteration speed suffers when each UX change goes through excessive process. As of 1.7, Docker features experimental releases spearheaded by networking and storage plugins. These features are explicitly marked as “not ready for production” and may be pulled out of core or undergo major changes anytime. For companies already betting on Docker this is great news: it allows the core team to iterate faster on new features and not worry about breaking backwards compatibility between minor versions in pursuit of the best design. It’s still difficult for companies to modify Docker core, as that requires either a fork (a slippery slope and a maintenance burden) or getting accepted upstream, which for interesting patches is often laborious. As of 1.7, with the announcement of plugins, the strategy for this problem is clear: make every opinionated component pluggable, finally showing the fruits of the “batteries swappable, but included” philosophy first introduced (although rather vaguely) at DockerCon Europe 2014. At DockerCon in June it was great to hear this articulated under the umbrella of Plumbing as a top priority of the team (most importantly for me personally because plumbing was mascotted by my favorite marine mammal, the walrus). While the future finally looks promising, this remains a pain point today as it has been for the past two years.
One example of an area that could’ve profited from change earlier is logging. Hardly a glamorous problem but nonetheless a universal one. There’s currently no great, generic solution. In the wild, approaches are all over the map: tail log files, log inside the container, log to the host through a mount, log to the host’s syslog, expose logs via something like fluentd, log directly to the network from the application, or log to a file and have another process send the logs to Kafka. In 1.6, support for logging drivers was merged into core; however, drivers have to be accepted in core (which is hardly easy). In 1.7, experimental support for out-of-process plugins was merged, but, to my disappointment, it didn’t ship with a logging driver. I believe this is planned for 1.8, but couldn’t find that on official record. At that point, vendors will be able to write their own logging drivers. Sharing within the community will be trivial and larger applications will no longer have to resort to engineering a custom solution.
In the same category of less than captivating but widespread pickles, we find secrets. Most people migrating to containers rely on configuration management to provision secrets on machines securely; however, continuing down the path of configuration management for secrets in containers is clunky. An alternative is distributing them with the image, but that poses security risks and makes it difficult to securely recycle images between development, CI, and production. The purest solution is to access secrets over the network, keeping the filesystem of containers stateless. Until recently nothing container-oriented existed in this space, but two compelling secret brokers, Vault and Keywhiz, have now been open-sourced. At Shopify we developed ejson a year and a half ago to solve this problem by managing asymmetrically encrypted secrets files in JSON; however, it makes some assumptions about the environment it runs in that make it less ideal as a general solution compared to secret brokers (read this post if you’re curious).
Docker relies on CoW (Copy on Write) from the filesystem (great LWN series on union filesystems, which enable CoW). This ensures that if you have 100 containers running from an image, you don’t need 100 times the size of the image in disk space. Instead, each container creates a CoW layer on top of the image and only uses disk space when it changes a file from the original image. Good container citizens have a minimal impact on the filesystem inside the container, as such changes mean the container takes on state, which is a no-no. Such state should be stored on a volume that maps to the host or over the network. Additionally, layering saves space between deployments as images are often similar and have layers in common. The problem with file systems that support CoW on Linux is that they’re all somewhat new. Here is our experience with a handful of them at Shopify, on a couple hundred hosts under significant load:
- AUFS. Seen entire partitions lock up where we had to remount it. Sluggish and uses a lot of memory. The code-base is large and difficult to read, which is likely why it hasn’t been accepted into upstream and thus requires a custom kernel.
- BTRFS. Has a learning curve through a new set of tools as du and ls don’t work. As with AUFS, we’ve seen partitions freeze and kernels lock up despite playing cat and mouse with kernel versions to stay up to date. When nearing disk space capacity, BTRFS acts unpredictably, and the same goes if you have 1000s of these CoW layers (subvolumes in BTRFS-terminology). BTRFS uses a lot of memory.
- OverlayFS. This was merged into the Linux kernel in 3.18, and has been quite stable and fast for us. It uses a lot less memory as it manages to share the page cache between inodes. Unfortunately it requires you run a recent kernel not adopted by most distributions, which often means building your own.
Luckily for Docker, Overlay will soon be ubiquitous, but in our experience the default of AUFS is still quite unsafe for production when running a large number of nodes. It’s hard to say what to do here, though, since most distributions don’t ship with a kernel that’s ready for Overlay either (it’s been proposed and rejected as the default for that reason), although this is definitely where the space is heading. It seems we just have to wait.
Reliance on edgy kernel features
Just as Docker relies on the frontier of file systems, it also leverages a large number of recent additions to the kernel, namely namespaces and (not-so-recent, but also not too commonly used) cgroups. These features (especially namespaces) are not yet battle-hardened from wide adoption in the industry. We run into obscure bugs with these once in a while. We run with the network namespace disabled in production because we’ve experienced a fair amount of soft-lockups that we’ve traced to the implementation, but haven’t had the resources to fix upstream. The memory cgroup uses a fair amount of memory, and I’ve heard unreliable reports from the wild. As containers see more and more use, it’s likely the larger companies that will pioneer this stability work.
An example of an area in need of hardening that we’ve run into in production is zombie processes. A container runs in a PID namespace, which means that the first process inside the container has PID 1. The init in the container needs to perform the special duty of acknowledging dead children.
When a process dies, it doesn’t immediately disappear from the kernel process data structure but rather becomes a zombie process. This ensures that its parent can detect its death via wait(2). However, if a child process is orphaned, its parent is set to init. When that process then dies, it’s init’s job to acknowledge the death of the child with wait(2); otherwise the zombie sticks around forever. This way you can exhaust the kernel process data structure with zombie processes, and from there on you’re on your own. This is a fairly common scenario for process-based master/worker models. If a worker shells out and the command takes a long time, the master might kill the waiting worker with SIGKILL (unless you’re using process groups and killing the entire group at once, which most don’t). The process that was shelled out to will then be inherited by init. When it finally finishes, init has to wait(2) on it. Docker Engine could solve this problem by acknowledging zombies within the containers itself with PR_SET_CHILD_SUBREAPER, as described in #11529.
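The reaping duty described above can be sketched in a few lines. This is an illustration of what an init process (or a subreaper) has to do, not the approach proposed in #11529: it does not call prctl at all, and the `reap_children` name is made up for this example. It assumes a POSIX system and relies only on `os.waitpid` from the Python standard library.

```python
import os


def reap_children():
    """Collect every child that has already exited, without blocking.

    Returns a list of (pid, exit_code) pairs, where a negative exit_code
    means the child was killed by that signal number. An init process would
    call this from a SIGCHLD handler or from a periodic loop.
    """
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG returns immediately instead of
            # waiting for a child to exit.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none of them has exited yet
        if os.WIFEXITED(status):
            code = os.WEXITSTATUS(status)
        else:
            code = -os.WTERMSIG(status)
        reaped.append((pid, code))
    return reaped
```

If PID 1 in the container is an application that never runs a loop like this, every orphan it inherits stays in the process table as a zombie after it exits.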
Runtime security is still somewhat of a question mark for containers, and getting it production hardened is a classic chicken-and-egg problem. In our case, we don’t rely on containers providing any additional security guarantees. However, many use cases do. For this reason most vendors still run containers in virtual machines, which have battle-tested security. I hope to see VMs die within the next decade as operating system virtualization wins the battle. As someone once said on the Linux mailing list: “I once heard that hypervisors are the living proof of operating system’s incompetence”. Containers provide the perfect middle ground between virtual machines (hardware-level virtualization) and PaaS (application level). I know that more work is being done for the runtime, such as being able to blacklist system calls. Security around images has been cause for concern, but Docker is actively working on improving this with libtrust and notary, which will be part of the new distribution layer.
Image layers and transportation
The first iteration of Docker took a clever shortcut for image builds, transportation and runtime. Instead of choosing the right tool for each problem, it chose one that worked OK for all cases: filesystem layers. This abstraction leaks all the way down to running the container in production. This is perfectly acceptable minimum viable product pragmatism, but each problem can be solved much more efficiently:
- Image builds could be represented as a directed graph of work. This allows figuring out caching and parallelization for fast, predictable builds.
- Image transportation could perform binary diffing instead of shipping image layers. This is a topic that has been studied for decades. The distribution and runtime layers are getting more and more separated, opening up for this sort of optimization.
- Runtime should just do a single CoW layer rather than using the arbitrary image layer abstraction again. If you’re using a union filesystem such as AUFS, on the first read you’re traversing a linked list of files to assemble the final file. This is slow and completely unnecessary.
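As a rough illustration of the first point, a build expressed as a dependency graph makes it easy to see which steps can run side by side. The sketch below is not how Docker or any existing builder works; the `build_waves` function and the step names in the usage note are hypothetical. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or newer).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_waves(dependencies):
    """Group build steps into waves that could run in parallel.

    dependencies maps each step to the set of steps it depends on. Every
    step in a wave depends only on steps from earlier waves, so all steps
    within one wave are independent of each other.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

For a made-up application where `gems` and `assets` both depend on `base` and `app` depends on both, `build_waves` returns `[["base"], ["assets", "gems"], ["app"]]`: the two middle steps can be built concurrently, and each step's output can be cached on the hash of its inputs.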
The layer model is a problem for transportation (and for building, as covered earlier). It means that you have to be extremely careful about what is in each layer of your image, as otherwise you easily end up transporting 100s of MBs of data for a large application. If you have large links within your own datacenter this is less of a problem, but if you wish to use a registry service such as Docker Hub, this data is transferred over the open Internet. Image distribution is actively being worked on. There’s a lot of incentive for Docker Inc to make this solid, secure and fast. Just as for building, I hope that this will be opened for plugins to allow a great solution to surface. As opposed to the builder, this is somewhere people can generally agree on a sane default, with specialized mechanisms such as bittorrent distribution.
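A back-of-the-envelope sketch shows why layer contents matter so much for transfer size. All layer names and sizes below are invented for illustration, and the model is deliberately simplified: it assumes a host only downloads layers whose IDs it does not already have, and that any change to a layer gives it a new ID.

```python
MB = 1_000_000


def bytes_to_transfer(image_layers, cached_layer_ids):
    """Sum the sizes of the layers a host does not already have.

    image_layers is an ordered list of (layer_id, size_in_bytes) pairs and
    cached_layer_ids is the set of layer IDs already present on the host.
    """
    return sum(
        size for layer_id, size in image_layers
        if layer_id not in cached_layer_ids
    )


# Hypothetical deploy of version 2 to a host that already ran version 1.
cached = {"os", "deps", "code:v1", "deps+code:v1"}

# Dependencies and application code baked into one layer: a small code
# change invalidates the whole layer.
combined = [("os", 200 * MB), ("deps+code:v2", 510 * MB)]

# Dependencies and code in separate layers: only the code layer changes.
split = [("os", 200 * MB), ("deps", 500 * MB), ("code:v2", 10 * MB)]

print(bytes_to_transfer(combined, cached) // MB)  # 510
print(bytes_to_transfer(split, cached) // MB)     # 10
```

The same code change costs 510 MB in one layout and 10 MB in the other, which is the kind of care the layer model demands from whoever writes the build.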
Many other topics haven’t been discussed on purpose, such as storage, networking, multi-tenancy, orchestration and service discovery. What Docker needs today is more people going to production with containers alone at scale. Unfortunately, many companies are trying to overcompensate for their current stack by shooting for the stars of a PaaS from the get-go. This approach only works if you’re small or planning on doing greenfield deployments with Docker, which rarely run into all the obscurities of production. To see more widespread production usage, we need to tip the pro/con scale in favour of Docker by resolving some of the issues highlighted above.
Docker is putting itself in an exciting place as the interface to PaaS, be it networking or service discovery, with applications not having to care about the underlying infrastructure. This is great news because, as Solomon says, the best thing about Docker is that it gets people to agree on something. We’re finally starting to agree on more than just images and the runtime.
All of the topics above I’ve discussed at length with the great people at Docker Inc, and GitHub Issues exist in some capacity for all of them. What I’ve attempted to do here is simply provide an opinionated view of the most important areas for lowering the barrier to entry. I’m excited for the future, but we’ve still got a lot of work left to make production more accessible.
My talk at DockerCon EU 2014 on Docker in production at Shopify
Talk at DockerCon 2015 on Resilient Routing and Discovery.