Docker

What's in a name?

For those unfamiliar with Docker, it's a platform for creating and running isolated workloads in containers based on a template image. Images are stored in repositories hosted on registries, and are retrieved using a fully qualified image name (FQIN) as part of a Docker CLI command or API call, as explained here. Docker's method of addressing repositories seems very logical, but just recently a situation arose that resulted in some disagreement between project contributors.

Red Hat is a major platform and cloud provider with aspirations in the embryonic market for container workloads; it is a partner of Docker Inc, and a significant contributor to the Docker project. It's not unusual for Red Hat to adopt open source technologies and to make a value-added product available to its customers as a downstream distribution; this is what it did with OpenStack, for example. However, when Red Hat provided its customers with access to Docker 1.5 via its repositories, it came with some 'experimental' code changes not available in the upstream Docker release. This in itself may not have been a big issue, but the nature of the changes caused Docker Inc's CTO and the Docker project's Chief Architect, Solomon Hykes, to take issue. In his view, the changes broke one of the project's architectural principles: that a user should expect the same result when using a CLI command or API call to retrieve images, wherever they might issue the command or call.

In short, Red Hat's code changes provided two new command line options for the Docker engine, enabling an administrator to add and/or block registries that are accessed when a user searches for, or 'pulls', a Docker image. The specific change that causes the issue relates to how short image names are resolved. In the official upstream version of Docker, a short name references a repository located on the public Docker Hub Registry, whereas in Red Hat's downstream version, the repository is resolved against a pre-defined order of registries, which may not include the Docker Hub Registry at all. These changes may result in entirely different images being referenced by different Docker users issuing the same CLI command or API call, but using Docker hosts that are configured differently.
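To illustrate the difference, consider the same short-name pull issued against two differently configured hosts. This is a sketch only: the daemon option names (reportedly --add-registry and --block-registry) and the corporate registry hostname are assumptions for illustration, not references to the actual patch.

# Upstream Docker: a short name always resolves to the Docker Hub Registry
docker pull mysql        # equivalent to index.docker.io/library/mysql:latest

# Red Hat's downstream build: if the daemon has been configured with a search
# order of registries (e.g. --add-registry registry.mycorp.com) and public
# access blocked (e.g. --block-registry docker.io), the same command may
# resolve to registry.mycorp.com/mysql instead
docker pull mysql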

Red Hat's rationale for the code changes is predicated on feedback received from its customers. Large corporate users, for example, may have a requirement to block access to public registries in order to maintain their security perimeter, or to safeguard their intellectual property or corporate data. Without the changes, a user might inadvertently make use of a public Docker image for a container against corporate policy, or 'push' an image containing sensitive corporate data to the public Docker Hub Registry.

So, who is right in this disagreement? The answer, of course, is that both are, and such differences of opinion are characteristic of open source projects. It's simply not possible to uphold the ideal of every community member's requirements being met by a project, because there will always be conflicts of interest and requirements between community members. At the same time, a project's value and ultimate success rest on its suitability for meeting the needs of its users - in this case, corporations whose aim is to become more agile through expeditious software application delivery. What is important is how disagreements like these are resolved. The open source realm is littered with unresolved conflict, which ultimately leads to a schism as projects get forked into two distinct entities. Sometimes this happens because of perceived undue corporate influence (e.g. Node.js/io.js), sometimes as a result of ideological differences (Debian/Devuan), and probably for many other reasons in between.

Docker and Red Hat, however, need each other in order to succeed in the container world. The two organisations are already formal partners, and Red Hat contributes significantly to the project by way of code, as well as through membership of the Docker Governance and Advisory Board (DGAB). Whilst this won't be the first or last commercially inspired dilemma to befall the Docker project, it is arguably one of the more significant, and it is currently being resolved behind closed doors. In the meantime, a new Red Hat inspired namespace proposal is being discussed for potential inclusion in the Docker Registry 2.0 code.

Referencing Docker Images

Docker images are the templates that determine the nature and behaviour of a container, and Docker stores these images in repositories on hosted registries. An official registry, called the Docker Hub Registry, is hosted by Docker Inc., and contains:

  • A set of official, certified Docker repositories, which are curated by the Docker community
  • Publicly accessible repositories, provided by any individual or organisation with a Docker Hub account
  • Private repositories for individuals and organisations who purchase one of the available plans provided by Docker Inc

The Docker Hub Registry is an incredibly valuable resource, with over 89,000 publicly available repositories of Docker images. But what if you're a security-conscious corporation that wants to keep your intellectual property behind a corporate firewall? Or a third party wanting to provide a value-added service to your customers? You have a choice: you can either buy a subscription for the commercially supported Docker Hub Enterprise, or you can deploy your own instance of the open source registry inside your firewall.

All of these options, however, pose a serious question - how do I address the correct image that I need for my container? For example, how do I make sure that the MySQL image I use for my application is the one that has been carefully crafted by the database administrators inside my organisation, rather than the official MySQL image on the public Docker Hub Registry, or some other random MySQL image provided by an unknown entity? It all comes down to specifying the correct image name when you retrieve an image or invoke a container using the Docker CLI or API, and there is a format that needs to be adhered to. A fully qualified image name (FQIN) consists of three main components: a registry location (with an optional port specification), a username, and a repository name (with an optional tag specification):

hostname[:port]/username/reponame[:tag]

The hostname and optional port specify the location of the registry; if these are omitted, Docker defaults to the Docker Hub Registry at index.docker.io. The next element in the image name is a username and, once again, if this is omitted it corresponds to a special username called library. In the Docker Hub Registry, the library username is reserved for the official, curated Docker images. Finally, a repository name needs to be specified, and optionally an image tag to identify a specific image among the related images in the repository (if the tag is omitted, Docker assumes the tag latest).

Library Images

In order to 'pull' the latest official Ubuntu image, the following Docker CLI command can be invoked:

docker pull ubuntu

In this format, the registry location, username and tag have been omitted. The shortened image name directs the Docker engine to pull the latest library image from the ubuntu repository on the Docker Hub Registry. This could also have been achieved using the longhand format:

docker pull index.docker.io/library/ubuntu:latest

User Images

In order to pull the latest version of an image called pxe that belongs to the user jpetazzo on the Docker Hub Registry, the following command can be used:

docker pull jpetazzo/pxe

In this example, the registry location has been omitted, and so the default Docker Hub Registry is the target for the Docker engine.

Images on Third-Party Registries

Some third-party organisations host their own Docker registries independent of Docker Inc, which they make available to their customers. In order to pull an image that resides on a third party registry (such as CoreOS' Quay.io), the registry location needs to be supplied along with the username and repository, e.g.:

docker pull quay.io/signalfuse/zookeeper:3.4.5-3

In this case, a tag has been specified as part of the image name in order to differentiate it from other versions of the image.

Images on Self-Hosted Registries

Finally, we can reference an image that resides on a locally configured, self-hosted registry by specifying the registry location and the repository required:

docker pull internal.mycorp.com:5000/revealjs:latest
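Getting an image into a self-hosted registry follows the same naming scheme. As a rough sketch, re-using the illustrative registry location and repository name from the command above, an existing local image is tagged with the registry's location and then pushed:

docker tag revealjs internal.mycorp.com:5000/revealjs:latest
docker push internal.mycorp.com:5000/revealjs:latest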

The docker-proxy

Containers created and managed by the Docker platform are able to provide the service running inside the container not only to other co-located containers, but also to remote hosts. Docker achieves this with port forwarding. For a brief introduction to containers, take a look at a previous article.
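As an example (the image name is purely illustrative), a container's port is published to the Docker host with the -p option when the container is started, mapping host port 8000 to port 8000 inside the container:

docker run -d -p 8000:8000 mycorp/webapp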

When a container starts with its port forwarded to the Docker host on which it runs, you may notice that, in addition to the new process running inside the container, an additional process appears on the Docker host, called docker-proxy:

  PID TTY      STAT   TIME COMMAND
 8006 ?        Sl     0:00 docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 8000 -container-ip 172.17.0.2 -container-port 8000

The purpose of this process is to enable a service consumer to communicate with the service-providing container .... but it's only used in particular circumstances. The docker-proxy operates in userland, and simply receives any packets arriving at the host's specified port that the kernel hasn't 'dropped' or forwarded, and redirects them to the container's port. The docker-proxy is the same binary as the Docker daemon and Docker client, which the Docker daemon 'reexecs' when it is required.

In order to understand why this process exists, we first need to understand a little about Docker's networking configuration. The default modus operandi for a Docker host is to create a virtual ethernet bridge (called docker0), attach each container's network interface to the bridge, and use network address translation (NAT) when containers need to make themselves visible to the Docker host and beyond:

Docker Bridge

Access to a container's service is controlled with rules associated with the host's netfilter framework, in both the NAT and filter tables. The general processing flow of packets through netfilter is depicted in this diagram.
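The rules that Docker installs, shown in the listings that follow, can be inspected on the Docker host with the iptables command, for example:

sudo iptables -t nat -L -n -v --line-numbers
sudo iptables -t filter -L -n -v --line-numbers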

If port 8000 on the container at 172.17.0.2 is to be forwarded to port 8000 on the host, then Docker adds some rules to netfilter's NAT table, enabling the container to 'masquerade' as the host using NAT:

Chain PREROUTING (policy ACCEPT 49 packets, 9985 bytes)  
num   pkts bytes target     prot opt in     out       source               destination  
1       80  4152 DOCKER     all  --  *      *         0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

Chain OUTPUT (policy ACCEPT 1436 packets, 151K bytes)  
num   pkts bytes target     prot opt in     out       source               destination  
1      274 56172 DOCKER     all  --  *      *         0.0.0.0/0           !127.0.0.0/8          ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT 1369 packets, 137K bytes)  
num   pkts bytes target     prot opt in     out       source               destination  
1      274 56172 MASQUERADE all  --  *      !docker0  172.17.0.0/16        0.0.0.0/0  
2        0     0 MASQUERADE tcp  --  *      *         172.17.0.2           172.17.0.2           tcp dpt:8000

Chain DOCKER (2 references)  
num   pkts bytes target     prot opt in       out     source               destination  
1        0     0 DNAT       tcp  --  !docker0 *       0.0.0.0/0            0.0.0.0/0            tcp dpt:8000 to:172.17.0.2:8000  

Netfilter is stateful, which means that it can track connections that have already been established, and in such circumstances it bypasses the NAT table rules. But in order for a connection to be established in the first place, packets are subjected to the scrutiny of the rules in the NAT and filter tables.

Packets destined for the host's socket (the container's forwarded port) are processed by netfilter and tested against the rules in the PREROUTING chain of the NAT table. Provided the destination address of a packet is local to the Docker host (which it is), netfilter jumps to the DOCKER chain for further processing. As long as the packet didn't arrive from the ethernet bridge (i.e. from a container), and provided it is addressed to TCP port 8000 on the Docker host, its destination is changed to 172.17.0.2:8000 - the container's socket - by the DNAT target. As the packet now needs to be routed to the container, the rules in the FORWARD chain of the filter table are assessed:

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)  
num   pkts bytes target     prot opt in       out      source              destination  
1       63 10326 DOCKER     all  --  *        docker0  0.0.0.0/0           0.0.0.0/0  
2       50  9618 ACCEPT     all  --  *        docker0  0.0.0.0/0           0.0.0.0/0            ctstate RELATED,ESTABLISHED  
3       61  5675 ACCEPT     all  --  docker0 !docker0  0.0.0.0/0           0.0.0.0/0  
4        0     0 ACCEPT     all  --  docker0  docker0  0.0.0.0/0           0.0.0.0/0           

Chain DOCKER (1 references)  
num   pkts bytes target     prot opt in       out      source              destination  
1        0     0 ACCEPT     tcp  --  !docker0 docker0  0.0.0.0/0           172.17.0.2           tcp dpt:8000  

The first rule applies, forcing a jump to the DOCKER chain, where the single rule matches the characteristics of the packet and 'accepts' it for forwarding on to the container's socket. Hence, a remote service-consuming process thinks it is communicating with the Docker host, but is actually being serviced by the container.

Similarly, when a container initiates a dialogue with a remote service provider, netfilter's NAT POSTROUTING chain changes the source IP address of packets from the container's IP address to the address of the host's network interface that is responsible for routing the packets to their required destination. This is achieved with netfilter's MASQUERADE target.

A Docker host makes significant use of netfilter rules to aid NAT and to control access to the containers it hosts, so the docker-proxy mechanism isn't always required. However, there are certain circumstances where this method of control is not available, which is why Docker also creates an instance of the docker-proxy whenever a container's port is forwarded to the Docker host.

Firstly, in order for a remote host to consume a container's service, the Docker host must act like a router, forwarding traffic to the network associated with the ethernet bridge. A Linux host is not normally configured to be a router, so the kernel parameter net.ipv4.ip_forward needs to be set to 1 (net.ipv6.conf.default.forwarding and net.ipv6.conf.all.forwarding for IPv6). Docker takes care of this if its daemon is started with default settings. If, however, the daemon is started with the --ip-forward and/or --iptables command line options set to false, then Docker can't make use of netfilter rules and has to fall back on the docker-proxy. This scenario is probably quite rare, but it is conceivable that some corporate security policies may impose this constraint.
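Whether the host is forwarding can be checked, and if necessary enabled, with sysctl:

sysctl net.ipv4.ip_forward
sudo sysctl -w net.ipv4.ip_forward=1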

Secondly, even when Docker is able to forward packets using netfilter rules, there is one circumstance where those rules can't be applied. Unless told otherwise, when a container's port is forwarded to the Docker host, it is forwarded to all of the host's interfaces, including its loopback interface. But the Linux kernel does not allow the routing of loopback traffic, and therefore it's not possible to apply netfilter NAT rules to packets originating from 127.0.0.0/8. Instead, netfilter sends these packets through the filter table's INPUT chain to a local process listening on the designated port - the docker-proxy.
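The proxy's listening socket is visible on the Docker host; a quick check (assuming the forwarded port 8000 used in the example above):

sudo ss -tlnp | grep docker-proxy

Traffic arriving via 127.0.0.1:8000 is therefore serviced by the docker-proxy in userland, rather than by the DNAT rule.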

The docker-proxy, then, is a 'catch all' method for allowing container port forwarding to the Docker host. However, it's generally considered an inelegant solution to the problems highlighted above, and when a large range of container ports is exposed, it can consume considerable memory. An attempt was previously made to remove the dependency on the docker-proxy, but this fell foul of the limitations of the aged kernel in RHEL 6.x and CentOS 6.x, which the Docker project feels duty bound to support. Hence, the docker-proxy remains a major constituent part of the Docker experience in all versions up to the current version 1.5. As I write, version 1.6 is due for imminent release, and there have been moves to remove the automatic requirement for the docker-proxy, which I'll cover in another article.

Part 7 - A Basic Container

We've looked at five of the six available namespaces provided by the Linux kernel in a series of previous articles, and we'll take a look at the final namespace, the USER namespace, in a future article. This article looks at how we can combine a number of the namespaces with a specially prepared directory, in which we'll 'jail' our process using the chroot system call. Although our implementation will be missing a few key features that normally accompany container implementations (e.g. cgroups), the resulting environment in which our process runs can be considered a very rudimentary container of sorts: it isolates the process from several different system resources, and contains the process within a limited filesystem.

The first thing we need to do is prepare a directory on the host, located at /var/local/jail, which will become the root filesystem for the container. We're going to provide just a few binaries for the container to use: env, bash, ps, ls and top.
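Assuming this location, the directory can be created with:

$ sudo mkdir -p /var/local/jail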

It's not just a simple matter of copying the binaries to /var/local/jail; each binary relies on shared libraries, and we need to ensure these are also available in the appropriate directories in the container's filesystem. To do this, we can make use of the ldd command, whose purpose is to provide information about the shared libraries used by a particular binary. I've created a script called binlibdepcp.sh, which takes care of determining the library dependencies of a binary, and then copies the binary, along with the relevant libraries, to the correct locations in the container's filesystem. It also copies the ld.so.cache file, the dynamic linker's cache of shared library locations, which is consulted when a required library does not reside in /lib or /usr/lib. The script is available on GitHub.
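To get a feel for what the script is doing, ldd can be run against a binary directly; for example (the exact libraries and paths reported will vary between distributions):

$ ldd /usr/bin/env

For env, this reports the C library (libc.so.6) and the dynamic linker (ld-linux-x86-64.so.2) that also appear in the script output below.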

Let's demonstrate this for the env binary, which is located at /usr/bin/env. Having previously created the /var/local/jail directory, the env binary and libraries are copied to the correct location under /var/local/jail with the following:

$ sudo ./binlibdepcp.sh /usr/bin/env /var/local/jail
[sudo] password for wolf:
Copying ...

                      env : [OK]
                libc.so.6 : [OK]
     ld-linux-x86-64.so.2 : [OK]
              ld.so.cache : [OK]

...... Done

We can repeat this exercise for the other binaries we intend to use within the container, which, more likely than not, are located in either /usr/bin or /bin. Additionally, so that our commands will display nicely when we run them inside the container, we need to provide the relevant portion of the terminfo database. Assuming we have an xterm, we can copy this into the jail (the location of the terminfo database varies from Linux distro to distro, so make sure to copy the files to the correct directory under /var/local/jail):

$ sudo mkdir -p /var/local/jail/lib/terminfo/x
[sudo] password for wolf:
$ sudo cp -p /lib/terminfo/x/* /var/local/jail/lib/terminfo/x

That's the container's filesystem prepared. Now we need to amend the program we have slowly been developing whilst we've been looking at the properties of containers. The changes are available in the invoke_ns6.c source file, which can be found here.

The first change is to add a new command line option, -c, which must be accompanied by a directory path that will be the root of the jail:

// Parse command line options and construct arguments
// to be passed to childFunction
while ((option = getopt(argc, argv, "+hvpmu:ni:c:")) != -1) {  
    switch (option) {
        case 'c':
            args.jail = 1;
            args.path = malloc(strlen(optarg) + 1);
            strcpy(args.path, optarg);
            break;
        case 'i':
            if (strcmp("no", optarg) != 0 && strcmp("yes", optarg) != 0) {
                fprintf(stderr, "%s: option requires valid argument -- 'i'\n", argv[0]);
                usage(argv[0]);
                exit(EXIT_FAILURE);
            }
            else
                if (strcmp("yes", optarg) == 0)
                    flags |= CLONE_NEWIPC;
            args.ipc = 1;
            break;
        case 'n':
            flags |= CLONE_NEWNET;
            break;
        case 'u':
            flags |= CLONE_NEWUTS;
            args.hostname = malloc(strlen(optarg) + 1);
            strcpy(args.hostname, optarg);
            break;
        case 'm':
            flags |= CLONE_NEWNS;
            break;
        case 'p':
            flags |= CLONE_NEWPID;
            break;
        case 'v':
            args.verbose = 1;
            break;
        case 'h':
            usage(argv[0]);
            exit(EXIT_SUCCESS);
        default:
            usage(argv[0]);
            exit(EXIT_FAILURE);
    }
}

The other main change is to add some code to ensure that, if the -c option is supplied, the cloned child process is jailed inside the directory using the chroot system call. The chroot system call changes the root directory of the child process; we then change the current working directory to that new root, and create a /proc directory within it:

// If specified, place process in chroot jail
if (args->jail) {  
    if (args->verbose)
        printf(" Child: creating chroot jail\n");
    if (chroot(args->path) == -1) {
        perror(" Child: childFunction: chroot");
        exit(EXIT_FAILURE);
    }
    else {
        if (args->verbose)
            printf(" Child: changing directory into chroot jail\n");
        if (chdir("/") == -1) {
            perror(" Child: childFunction: chdir");
            exit(EXIT_FAILURE);
        }
        if (access("/proc", F_OK) != 0)
            if (mkdir("/proc", 0555) == -1) {
                perror(" Child: childFunction: mkdir");
                exit(EXIT_FAILURE);
            }
    }
}

We can now invoke our container (with a customised command prompt) with the following command:

$ sudo ./invoke_ns -vpmu calculus -c /var/local/jail env PS1="\h [\W] " TERM=$TERM bash --norc
[sudo] password for wolf:
calculus [/]  

Now that we have an interactive bash command shell running inside the container, we can use the ls, ps and top commands to verify we have a very minimal operating environment, if not a very useful one! It doesn't take much imagination, however, to see the possibilities for containing independent workloads in minimal, lightweight containers.
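For instance (a quick sanity check; exactly what these commands report depends on what was copied into the jail, and ps and top rely on a proc filesystem being mounted at /proc):

calculus [/] ls /
calculus [/] ps -ef
calculus [/] top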

In reality, a process inside a container needs a few more things than we have provided in our rudimentary version. Thankfully, the excellent work that has been conducted in the open source community with projects like Docker and LXC has taken the hard work out of creating and manipulating workloads within containers.

Container Mania

Nearly 60 years after Malcolm McLean developed the intermodal shipping container for the transportation of goods, it seems its computing equivalent has arrived, amid considerable interest, much discussion, and occasional controversy.

Containers are a form of virtualisation, but unlike the traditional virtual machine, they are very light in terms of footprint and resource usage. Applications running in containers don't need a full-blown guest operating system to function; they just need the bare minimum in terms of OS binaries and libraries, and they share the host's kernel with other containers and the host itself. The lightweight nature of containers, and the speed with which container workloads can be provisioned, must however be weighed against a reduced level of isolation compared to a traditional virtual machine, and the fact that containers are currently only available on the Linux OS. Equivalent capabilities exist in other *nix operating systems, such as Solaris (Zones) and FreeBSD (Jails), but not on the Windows platform .... yet. When it comes to choosing between containers and virtual machines, it's a matter of horses for courses.

So, why now? What has provoked the current interest and activity? The sudden popularity of containers has much to do with technological maturity, inspired innovation, and an evolving need.

Maturity:
Whilst some aspects of the technology that provides Linux containers have been around for a number of years, it's true to say that the 'total package' has only recently matured to a level where it is available in the kernels shipped with most Linux distributions.

Innovation:
Containers are an abstraction of Linux kernel namespaces and control groups (or cgroups), and as such, creating and invoking a container requires some effort and knowledge on the part of the user. This has inhibited their take-up as a means of isolating workloads. Enter, stage left, the likes of Docker (with its libcontainer library), Rocket, LXC and lmctfy, all of which serve to commoditise the container. Docker, in particular, has captured the hearts and minds of the DevOps community with its platform for delivering distributed applications in containers.
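That commoditisation is most evident in how little effort it now takes to launch an isolated workload; with Docker, for example, a single command pulls an image and starts a shell in a new container:

docker run -it ubuntu /bin/bash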

Need:
Containers are a perfect fit for a recent trend in architecting software applications as small, discrete, independent microservices. Whilst there is no formal definition of a microservice, it is generally considered to be a highly decoupled, independent process with a specific function, which often communicates via a RESTful HTTP API. It's entirely possible to use containers to run multiple processes (as is the case with LXC), but the approach taken by Docker and Rocket is to encourage a single process per container, fitting neatly with the microservice aspiration.

The fact that all major cloud and operating system vendors, including Microsoft, are busy developing their container capabilities is evidence enough that containers will have a big part to play in workload deployment in the coming years. This means the stakes are high for the organisations behind the different technologies, which has led to some differences of opinion. On the whole, however, the majority of the technologies are being developed in the open, using a community-based model, which should significantly aid continued innovation, maturity, and adoption.

This article serves as an introduction to a series of articles that examine the fundamental building blocks for containers; namespaces and cgroups.