The Overlay Filesystem

The overlay filesystem (formally known as overlayfs) was merged into the mainline Linux kernel at version 3.18 in December 2014. Whilst other, similar union mount filesystems have been around for many years (notably, aufs), overlay is the first to become integrated into the Linux kernel.

An overlay sits on top of an existing filesystem, and combines an upper and a lower directory tree (which can be from different filesystems) in order to present a unified representation of both. Where objects with the same name exist in both directory trees, their treatment depends on the object type:

  • File: the object in the upper directory tree appears in the overlay, whilst the object in the lower directory tree is hidden
  • Directory: the contents of each directory object are merged to create a combined directory object in the overlay

The lower directory can be read-only, and could be an overlay itself, whilst the upper directory is normally writeable. In order to create an overlay of two directories, dir1 and dir2, we can use the following mount command:

mount -t overlay -o lowerdir=./dir1,upperdir=./dir2,workdir=./work overlay ./dir3  

A union of the two directories is created as an overlay in the dir3 directory. The workdir option is required, and used to prepare files before they are switched to the overlay destination in an atomic action (the workdir needs to be on the same filesystem as the upperdir). The following illustrates a simple example of the overlay mount above:

Overlay Mount

When a file or directory that originates in the upper directory is removed from the overlay, it's also removed from the upper directory. If a file or directory that originates in the lower directory is removed from the overlay, it remains in the lower directory, but a 'whiteout' is created in the upper directory. A whiteout takes the form of a character device with device number 0/0, and a name identical to the removed object. The result of the whiteout creation means that the object in the lower directory is ignored, whilst the whiteout itself is not visible in the overlay. The following illustrates the creation of a whiteout in the upperdir on removal of the file mango:

$ ls -l ./dir3/fruit
total 72  
-rw-rw-r-- 1 bill bill  1320 May 20 12:39 apple
-rw-rw-r-- 1 bill bill    92 May 20 11:53 grape
-rw-rw-r-- 1 bill bill 63456 May 20 11:53 mango
$ rm ./dir3/fruit/mango
$ ls -l ./dir3/fruit
total 8  
-rw-rw-r-- 1 bill bill  1320 May 20 12:39 apple
-rw-rw-r-- 1 bill bill    92 May 20 11:53 grape
$ ls -l ./dir2/fruit
total 4  
-rw-rw-r-- 1 bill bill 1320 May 20 12:39 apple
c--------- 1 bill bill 0, 0 May 20 17:38 mango  
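This whole sequence can be reproduced with a short, self-contained script. The paths below are illustrative, and since mounting requires root (and kernel overlay support), the mount is only attempted when both are available:

```shell
#!/bin/sh
# Prepare lower and upper trees for an overlay mount (illustrative paths)
base=$(mktemp -d)
mkdir -p "$base/dir1/fruit" "$base/dir2/fruit" "$base/work" "$base/dir3"
echo "from the lower tree" > "$base/dir1/fruit/mango"
echo "from the upper tree" > "$base/dir2/fruit/apple"

# Mounting needs CAP_SYS_ADMIN and overlay support, so attempt it conditionally
if [ "$(id -u)" -eq 0 ] && mount -t overlay \
    -o "lowerdir=$base/dir1,upperdir=$base/dir2,workdir=$base/work" \
    overlay "$base/dir3" 2>/dev/null; then
    ls "$base/dir3/fruit"        # merged view: apple and mango
    rm "$base/dir3/fruit/mango"  # removing a lower object creates a whiteout
    ls -l "$base/dir2/fruit"     # upper tree now holds the 0/0 char device
    umount "$base/dir3"
else
    echo "overlay mount skipped (requires root and overlay support)"
fi
```

On a suitably privileged host, the final `ls -l` shows the whiteout for mango appearing in the upper tree, just as in the listing above; either way, the lower tree is left untouched.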

Linux kernel 4.0 further extends the overlay capabilities, to enable multiple lower directories to be specified, separated by a :, with the rightmost lower directory on the bottom, and the leftmost lower directory on the top of the union. For example:

mount -t overlay -o lowerdir=./dir3:./dir2:./dir1 overlay ./dir4  

In this extended version, the upperdir is optional, and if it is omitted, then the workdir option is also optional, and will be ignored in any case. In this scenario, the overlay will be read-only.

At the time of writing, Linux kernel version 4.0 is very new, and will not have found its way into many Linux distributions.

Use Cases

Union filesystems are often used for Live CD creation, where a read-only image is augmented with a writeable layer in tmpfs, thereby enabling a dynamic, but ephemeral session.

Effectively, this is 'copy-on-write', where read-only data is used until such time as the data requires changing, whereupon it is copied and altered in the read-write layer. This copy-on-write mechanism is used in the creation of filesystems for Linux containers, used by container runtime environments like Docker or rkt. It's not the only option for assembling container filesystems, but it is one of the more performant, because it allows pages in the kernel's page cache to be shared between containers - an option which is not available with block device copy-on-write mechanisms, such as the device mapper framework with the thinp target, or the btrfs filesystem. Docker's overlay graphdriver is currently the last in the queue for automatic selection (vfs is used for testing), behind aufs, btrfs and the devicemapper graphdrivers, but as the remaining issues are closed out, I expect it to become the default.

What's in a name?

For those unfamiliar with Docker, it's a platform for creating and running isolated workloads in containers, based on a template image. Images are stored in repositories on registries, and are retrieved using a fully qualified image name (FQIN) as part of a Docker CLI command or API call. Docker's method of addressing repositories all seems very logical, but just recently a situation arose which resulted in some disagreement between project contributors.

Everyone knows that Red Hat is a major platform and cloud provider, with aspirations in the embryonic market for container workloads; it is a partner of Docker Inc, and a significant contributor to the Docker project. It's not unusual for Red Hat to adopt open source technologies and make a value-added product available to its customers as a downstream distribution; this is what it did with OpenStack, for example. However, when Red Hat provided its customers with access to Docker 1.5 via its repositories, it came with some 'experimental' code changes not available in the upstream Docker release. This in itself may not have been a big issue, but the nature of the changes caused Solomon Hykes, Docker Inc's CTO and the Docker project's Chief Architect, to take issue. In his view, the changes broke one of the project's architectural principles: that a user should expect the same result when using a CLI command or API call to retrieve images, wherever they might issue the command or call.

In short, Red Hat's code changes provided two new command line options for the Docker engine, enabling an administrator to add and/or block registries that are accessed when a user searches for, or 'pulls' a Docker image. The specific change that causes the issue relates to how short image names are used. In the official upstream version of Docker, use of a short name references a repository located on the public Docker Hub Registry, whereas in Red Hat's downstream version, the repository is enumerated based on a pre-defined order of registries, which may not include the Docker Hub Registry at all. These changes may result in entirely different images being referenced by different Docker users issuing the same CLI command or API call, but using Docker hosts that are configured differently.

Red Hat's rationale for the code changes is predicated on feedback received from its customers. Large corporate users, for example, may have a requirement to block access to public registries in order to maintain their security perimeter, or to safeguard their intellectual property or corporate data. Without the changes, a user might inadvertently make use of a public Docker image for a container against corporate policy, or 'push' an image containing sensitive corporate data to the public Docker Hub Registry.

So, who is right in this disagreement? The answer, of course, is that both are right, and the differences of opinion are manifestly characteristic of open source projects. It's simply not possible to uphold the ideal of every community member's requirements being met by a project, because there will always be conflicts of interest and requirements between community members. But at the same time, a project's value and ultimate success rests on its suitability for meeting the needs of its users - in this case, corporations whose aim is to become more agile through expeditious software application delivery. What is important, is how disagreements like these are resolved. The open source realm is littered with unresolved conflict, which ultimately leads to a dichotomy, as projects get forked into two distinct entities. Sometimes this can happen because of perceived undue corporate influence (e.g. Node.js/io.js), as a result of ideological differences (Debian/Devuan), and probably for many other reasons in between.

Docker and Red Hat, however, need each other in order to succeed in the container world. The two organisations are already formal partners, and Red Hat contributes significantly to the project by way of code, as well as membership of the Docker Governance and Advisory Board (DGAB). Whilst this won't be the first or last commercially inspired dilemma to befall the Docker project, it is arguably one of the more significant, and is currently being resolved behind closed doors. In the meantime, a new Red Hat inspired namespace proposal is being discussed for potential inclusion into the Docker Registry 2.0 code.

Referencing Docker Images

Docker images are the templates that determine the nature and behaviour of a container, and Docker stores these images in repositories on hosted registries. An official registry, called the Docker Hub Registry, is hosted by Docker Inc., and contains:

  • A set of official, certified Docker repositories, which are curated by the Docker community
  • Publicly accessible repositories, provided by any individual or organisation with a Docker Hub account
  • Private repositories for individuals and organisations who purchase one of the available plans provided by Docker Inc

The Docker Hub Registry is an incredibly valuable resource, with over 89,000 publicly available repositories of Docker images. But what if you're a security-conscious corporation that wants to keep its intellectual property behind a corporate firewall? Or a third party wanting to provide a value-added service to your customers? You have a choice: you can either buy a subscription for the commercially supported Docker Hub Enterprise, or you can deploy your own instance of the open source registry inside your firewall.

All of these options, however, pose a serious question - how do I address the correct image that I need for my container? For example, how do I make sure that the MySQL image I use for my application is the one that has been carefully crafted by the Database Administrators inside my organisation, rather than the official MySQL image on the public Docker Hub Registry, or even some other random MySQL image provided by an unknown entity on the Docker Hub Registry? This all comes down to specifying the correct image name when you retrieve an image or invoke a container using the Docker CLI or API, and there is a format that needs to be adhered to. A fully qualified image name (FQIN) consists of three main components: a registry location (with an optional port specification), a username, and a repository name (with an optional tag specification):

registry_host[:port]/username/repository[:tag]
The hostname and optional port specify the location of the registry; if these are omitted, Docker defaults to the Docker Hub Registry. The next element in the image name is a username, and once again, if this is omitted, it corresponds to a special username called library. In the Docker Hub Registry, the library username is reserved for the official, curated Docker images. Finally, a repository name needs to be specified, and optionally an image tag to identify the specific image amongst its related images in the repository (if the tag is omitted, Docker assumes the tag latest).
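The defaulting rules can be sketched as a small shell function. This is an illustrative approximation, not the engine's actual parsing logic, and the hostname docker.io is used here simply to stand in for the Docker Hub Registry:

```shell
#!/bin/sh
# Expand a short image name into a fully qualified form, applying the
# defaults described above (docker.io stands in for the Docker Hub Registry)
expand_image_name() {
    name=$1
    tag=latest
    # A tag is a colon in the final path component (registry ports are untouched)
    case ${name##*/} in
        (*:*) tag=${name##*:}; name=${name%:*};;
    esac
    case $name in
        (*/*/*) registry=${name%%/*}; rest=${name#*/}
                user=${rest%%/*}; repo=${rest#*/};;
        (*/*)   registry=docker.io; user=${name%%/*}; repo=${name#*/};;
        (*)     registry=docker.io; user=library; repo=$name;;
    esac
    echo "$registry/$user/$repo:$tag"
}

expand_image_name ubuntu                              # docker.io/library/ubuntu:latest
expand_image_name jpetazzo/pxe                        # docker.io/jpetazzo/pxe:latest
expand_image_name registry.example.com:5000/team/app  # registry.example.com:5000/team/app:latest
```

The three calls correspond to the library, user, and third-party registry cases described in the sections that follow.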

Library Images

In order to 'pull' the latest official Ubuntu image, the following Docker CLI command can be invoked:

docker pull ubuntu

In this format, the registry location, username and tag have been omitted. The shortened image name directs the Docker engine to pull the latest library image from the ubuntu repository on the Docker Hub Registry. This could also have been achieved using the longhand format:

docker pull

User Images

In order to pull the latest version of an image called pxe that belongs to the user jpetazzo on the Docker Hub Registry, the following command can be used:

docker pull jpetazzo/pxe

In this example, the registry location has been omitted, and so the default Docker Hub Registry is the target for the Docker engine.

Images on Third-Party Registries

Some third-party organisations host their own Docker registries independent of Docker Inc, which they make available to their customers. In order to pull an image that resides on a third-party registry (such as the registry hosted by CoreOS), the registry location needs to be supplied along with the username and repository, e.g.:

docker pull

In this case, a tag has been specified as part of the image name in order to differentiate it from other versions of the image.

Images on Self-Hosted Registries

Finally, we can reference an image that resides on a locally configured, self-hosted registry by specifying the registry location and the repository required:

docker pull

The docker-proxy

Containers created and managed by the Docker platform are able to provide the service running inside the container not only to other co-located containers, but also to remote hosts. Docker achieves this with port forwarding. For a brief introduction to containers, take a look at a previous article.

When a container starts with its port forwarded to the Docker host on which it runs, in addition to the new process that runs inside the container, you may have noticed an additional process on the Docker host called docker-proxy:

 8006 ?        Sl     0:00 docker-proxy -proto tcp -host-ip -host-port 8000 -container-ip -container-port 8000

The purpose of this process is to enable a service consumer to communicate with the service-providing container - but it's only used in particular circumstances. The docker-proxy operates in userland, receiving any packets arriving at the host's specified port that the kernel hasn't 'dropped' or forwarded, and redirecting them to the container's port. The docker-proxy is the same binary as the Docker daemon and Docker client, which the Docker daemon 'reexecs' when it is required.

In order to understand why this process exists, we first need to understand a little about Docker's networking configuration. The default modus operandi for a Docker host is to create a virtual ethernet bridge (called docker0), attach each container's network interface to the bridge, and to use network address translation (NAT) when containers need to make themselves visible to the Docker host and beyond:

Docker Bridge

Access to a container's service is controlled with rules associated with the host's netfilter framework, in both the NAT and filter tables. The general processing flow of packets by netfilter is depicted in this diagram.

If a container's port is to be forwarded to the host as port 8000, then Docker adds some rules to netfilter's NAT table, enabling the container to 'masquerade' as the host using NAT:

Chain PREROUTING (policy ACCEPT 49 packets, 9985 bytes)  
num   pkts bytes target     prot opt in     out       source               destination  
1       80  4152 DOCKER     all  --  *      *              ADDRTYPE match dst-type LOCAL

Chain OUTPUT (policy ACCEPT 1436 packets, 151K bytes)  
num   pkts bytes target     prot opt in     out       source               destination  
1      274 56172 DOCKER     all  --  *      *           !          ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT 1369 packets, 137K bytes)  
num   pkts bytes target     prot opt in     out       source               destination  
1      274 56172 MASQUERADE all  --  *      !docker0  
2        0     0 MASQUERADE tcp  --  *      *            tcp dpt:8000

Chain DOCKER (2 references)  
num   pkts bytes target     prot opt in       out     source               destination  
1        0     0 DNAT       tcp  --  !docker0 *              tcp dpt:8000 to:  
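The listing above can be approximated with iptables commands along the following lines. The container address 172.17.0.2 and the subnet are assumptions for illustration, and the commands are printed rather than executed, since applying them would require root:

```shell
#!/bin/sh
# Print (rather than apply) iptables commands approximating Docker's NAT
# setup; the container address 172.17.0.2 is an assumed example
show() { echo "iptables $*"; }

# Packets addressed to the host itself are diverted to the DOCKER chain
show -t nat -A PREROUTING -m addrtype --dst-type LOCAL -j DOCKER

# In the DOCKER chain, traffic to host port 8000 that did not arrive from
# the bridge is DNAT'ed to the container's socket
show -t nat -A DOCKER ! -i docker0 -p tcp --dport 8000 \
     -j DNAT --to-destination 172.17.0.2:8000

# Outbound traffic from containers is masqueraded as the host
show -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
```

Replacing `show` with `iptables` itself (as root) would install equivalent rules by hand, which is essentially what the Docker daemon does on the container's behalf.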

Netfilter is stateful, which means that it can track connections that have already been established, and in such circumstances it bypasses the NAT table rules. But in order for a connection to be established in the first place, packets are subjected to the scrutiny of the rules in the NAT and filter tables.

Packets destined for the host's socket (the container's forwarded port) are processed by netfilter and tested against the rules in the PREROUTING chain of the NAT table, and provided the destination address of a packet is local to the Docker host (which it is), netfilter jumps to the DOCKER chain for further processing. As long as the packet didn't arrive from the ethernet bridge (i.e. from a container), and provided the packet is addressed to TCP port 8000 on the Docker host, its destination is changed by the DNAT target to the container's socket. As the packet needs to be routed to the container, the rules in the FORWARD chain of the filter table are assessed:

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)  
num   pkts bytes target     prot opt in       out      source              destination  
1       63 10326 DOCKER     all  --  *        docker0   
2       50  9618 ACCEPT     all  --  *        docker0             ctstate RELATED,ESTABLISHED  
3       61  5675 ACCEPT     all  --  docker0 !docker0   
4        0     0 ACCEPT     all  --  docker0  docker0            

Chain DOCKER (1 references)  
num   pkts bytes target     prot opt in       out      source              destination  
1        0     0 ACCEPT     tcp  --  !docker0 docker0            tcp dpt:8000  

The first rule applies, which forces a jump to the DOCKER chain, and the single rule in the chain matches the characteristics of the packet, and 'accepts' the packet for forwarding on to the container's socket. Hence, a remote service consuming process thinks it is communicating with the Docker host, but is being serviced by the container instead.

Similarly, when a container initiates a dialogue with a remote service provider, netfilter's NAT POSTROUTING chain changes the source IP address of packets from the container's IP address, to the address of the host's network interface that is responsible for routing the packets to their required destination. This is achieved with netfilter's MASQUERADE target.

A Docker host makes significant use of netfilter rules to aid NAT, and to control access to the containers it hosts, and the docker-proxy mechanism isn't always required. However, there are certain circumstances where this method of control is not available, which is why Docker also creates an instance of the docker-proxy whenever a container's port is forwarded to the Docker host.

Firstly, in order for a remote host to consume a container's service, the Docker host must act like a router, forwarding traffic to the network associated with the ethernet bridge. A Linux host is not normally configured to be a router, so the kernel parameter net.ipv4.ip_forward needs to be set to 1 (net.ipv6.conf.default.forwarding and net.ipv6.conf.all.forwarding for IPv6). Docker takes care of this if its daemon is started with default settings. If, however, the daemon is started with the --ip-forward and/or --iptables command line options set to false, then Docker can't make use of netfilter rules and has to fall back on the docker-proxy. This scenario is probably quite rare, but it is conceivable that some corporate security policies may impose this constraint.
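The forwarding setting can be checked from userspace, and changed when running as root; a small sketch (note that the write may be refused inside some containers, where /proc/sys is read-only):

```shell
#!/bin/sh
# Report whether IPv4 forwarding is enabled on this host (0 = off, 1 = on)
state=$(cat /proc/sys/net/ipv4/ip_forward 2>/dev/null || echo unknown)
echo "net.ipv4.ip_forward = $state"

# Enabling it requires root, and may be denied in some containers
if [ "$(id -u)" -eq 0 ]; then
    sysctl -w net.ipv4.ip_forward=1 2>/dev/null || \
        echo "could not enable forwarding (read-only /proc/sys?)"
fi
```

A Docker daemon started with default settings performs the equivalent of the privileged branch itself.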

Secondly, even when Docker is able to forward packets using netfilter rules, there is one circumstance where it is not possible to apply them. Unless told otherwise, when a container's port is forwarded to the Docker host, it will be forwarded to all of the host's interfaces, including its loopback interface. But the Linux kernel does not allow the routing of loopback traffic, so it's not possible to apply netfilter NAT rules to packets originating from the loopback interface. Instead, netfilter sends such packets through the filter table's INPUT chain to a local process listening on the designated port - the docker-proxy.

The docker-proxy, then, is a 'catch all' method for allowing container port forwarding to the Docker host. However, it's generally considered that the docker-proxy is an inelegant solution to the problems highlighted above, and when a large range of container ports are exposed, it consumes considerable memory. An attempt was previously made to remove the dependency for the docker-proxy, but this fell foul of the limitations of the aged kernel in RHEL 6.x and CentOS 6.x, which the Docker project feels duty bound to support. Hence, the docker-proxy remains a major constituent part of the Docker experience in all Docker versions up to the current version 1.5. As I write, version 1.6 is due for imminent release, and there have been moves to remove the automatic requirement for the docker-proxy, which I'll cover in another article.

Part 7 - A Basic Container

We've looked at five of the six available namespaces provided by the Linux kernel in a series of previous articles, and we'll take a look at the final namespace, the USER namespace, in a future article. This article looks at how we can combine a number of the namespaces with a specially prepared directory, in which we'll 'jail' our process using the chroot system call. Although our implementation will be missing a few key features that normally accompany container implementations (e.g. cgroups), the resulting environment in which our process will run, can be considered a very rudimentary container of sorts. It isolates the process from several different system resources, and contains the process within a limited filesystem.

The first thing we need to do is prepare a directory on the host, located at /var/local/jail, which will become the root filesystem for the container. We're going to provide just a few binaries for the container to use: env, bash, ps, ls and top.

It's not just a simple matter of copying the binaries to /var/local/jail; each binary relies on shared libraries, and we need to ensure these are also available in the appropriate directory in the container's filesystem. To do this, we can make use of the ldd command, whose purpose is to provide information regarding the shared libraries used by a particular binary. I've created a script which takes care of determining the library dependencies for a binary, and then copying it along with the relevant libraries to the correct locations in the container's filesystem. It also copies the file listing the directories to be searched for libraries, in the event that a required library does not reside in /lib or /usr/lib. The script is available on GitHub.
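The following is a minimal sketch of what such a script might do - an illustration, not the actual script from GitHub. It parses ldd's output for absolute paths, then copies the binary plus its libraries into matching locations under the jail:

```shell
#!/bin/sh
# Sketch: copy a binary and its shared-library dependencies into a jail
# directory, preserving paths (illustrative, not the author's actual script)
copy_into_jail() {
    bin=$1; jail=$2
    mkdir -p "$jail$(dirname "$bin")"
    cp -p "$bin" "$jail$bin"
    # ldd emits lines such as "libc.so.6 => /lib/.../libc.so.6 (0x...)";
    # pick out every field that is an absolute path
    ldd "$bin" 2>/dev/null |
    awk '{for (i = 1; i <= NF; i++) if ($i ~ /^\//) print $i}' |
    while read -r lib; do
        mkdir -p "$jail$(dirname "$lib")"
        cp -p "$lib" "$jail$lib"
    done
}
```

For example, `copy_into_jail /usr/bin/env /var/local/jail` would produce a layout similar to the one shown below.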

Let's demonstrate this for the env binary, which is located at /usr/bin/env. Having previously created the /var/local/jail directory, the env binary and libraries are copied to the correct location under /var/local/jail with the following:

$ sudo ./ /usr/bin/env /var/local/jail
[sudo] password for wolf:
Copying ...

                      env : [OK]
       : [OK] : [OK]
     : [OK]

...... Done

We can repeat this exercise for the other binaries we intend to use within the container, which, more likely than not, are located in either /usr/bin or /bin. Additionally, so that our commands will display nicely when we run them inside the container, we need to provide the relevant portion of the terminfo database. Assuming we have an xterm, we can copy this into the jail (the location of the terminfo database varies from Linux distro to distro, so make sure to copy the files to the correct directory under /var/local/jail):

$ sudo mkdir -p /var/local/jail/lib/terminfo/x
[sudo] password for wolf:
$ sudo cp -p /lib/terminfo/x/* /var/local/jail/lib/terminfo/x

That's the container's filesystem prepared. Now we need to amend the program we have slowly been developing whilst we've been looking at the properties of containers. The changes are available in the invoke_ns6.c source file, which can be found here.

The first change is to add a new command line option, -c, which must be accompanied by a directory path that will become the root of the jail:

// Parse command line options and construct arguments
// to be passed to childFunction
while ((option = getopt(argc, argv, "+hvpmu:ni:c:")) != -1) {
    switch (option) {
        case 'c':
            args.jail = 1;
            args.path = malloc(strlen(optarg) + 1);
            strcpy(args.path, optarg);
            break;
        case 'i':
            if (strcmp("no", optarg) != 0 && strcmp("yes", optarg) != 0) {
                fprintf(stderr, "%s: option requires valid argument -- 'i'\n", argv[0]);
                exit(EXIT_FAILURE);
            }
            if (strcmp("yes", optarg) == 0)
                flags |= CLONE_NEWIPC;
            args.ipc = 1;
            break;
        case 'n':
            flags |= CLONE_NEWNET;
            break;
        case 'u':
            flags |= CLONE_NEWUTS;
            args.hostname = malloc(strlen(optarg) + 1);
            strcpy(args.hostname, optarg);
            break;
        case 'm':
            flags |= CLONE_NEWNS;
            break;
        case 'p':
            flags |= CLONE_NEWPID;
            break;
        case 'v':
            args.verbose = 1;
            break;
        case 'h':

The other main change is to add some code to ensure that if the -c option is supplied, the cloned child process is jailed inside the directory with the chroot system call. The chroot system call changes the root directory of the child process, and we change the current working directory to that root directory, and then create a /proc directory within it:

// If specified, place process in chroot jail
if (args->jail) {
    if (args->verbose)
        printf(" Child: creating chroot jail\n");
    if (chroot(args->path) == -1) {
        perror(" Child: childFunction: chroot");
    } else {
        if (args->verbose)
            printf(" Child: changing directory into chroot jail\n");
        if (chdir("/") == -1)
            perror(" Child: childFunction: chdir");
        if (access("/proc", F_OK) != 0)
            if (mkdir("/proc", 0555) == -1)
                perror(" Child: childFunction: mkdir");
    }
}

We can now invoke our container (with a customised command prompt) with the following command:

$ sudo ./invoke_ns -vpmu calculus -c /var/local/jail env PS1="\h [\W] " TERM=$TERM bash --norc
[sudo] password for wolf:
calculus [/]  

Now that we have an interactive bash command shell running inside the container, we can use the ls, ps and top commands to verify we have a very minimal operating environment, if not a very useful one! It doesn't take much imagination, however, to see the possibilities for containing independent workloads in minimal, lightweight containers.

In reality, a process inside a container needs a few more things than we have provided in our rudimentary version. Thankfully, the excellent work that has been conducted in the open source community, with projects like Docker and LXC, has taken the hard work out of creating and manipulating workloads within containers.