Docker

Referencing Docker Images

Docker images are the templates that determine the nature and behaviour of a container, and Docker stores these images in repositories on hosted registries. An official registry, the Docker Hub Registry, is hosted by Docker Inc. and contains:

  • A set of official, certified Docker repositories, which are curated by the Docker community
  • Publicly accessible repositories, provided by any individual or organisation with a Docker Hub account
  • Private repositories for individuals and organisations who purchase one of the available plans provided by Docker Inc

The Docker Hub Registry is an incredibly valuable resource, with over 89,000 publicly available repositories of Docker images. But what if you're a security-conscious corporation that wants to keep its intellectual property proprietary, behind a corporate firewall? Or a third party wanting to provide a value-added service to your customers? You have a choice: you can either buy a subscription to the commercially supported Docker Hub Enterprise, or deploy your own instance of the open source registry inside your firewall.

All of these options, however, pose a serious question: how do I address the correct image that I need for my container? For example, how do I make sure that the MySQL image I use for my application is the one that has been carefully crafted by the database administrators inside my organisation, rather than the official MySQL image on the public Docker Hub Registry, or even some other random MySQL image provided by an unknown entity? It all comes down to specifying the correct image name when you retrieve an image or invoke a container using the Docker CLI or API, and there is a format that needs to be adhered to. A fully qualified image name (FQIN) consists of three main components: a registry location (with an optional port specification), a username, and a repository name (with an optional tag specification):

hostname[:port]/username/reponame[:tag]

The hostname and optional port specify the location of the registry; if these are omitted, Docker defaults to the Docker Hub Registry at index.docker.io. The next element in the image name is a username, and once again, if this is omitted, it corresponds to a special username called library. In the Docker Hub Registry, the library username is reserved for the official, curated Docker images. Finally, a repository name needs to be specified, optionally followed by an image tag to identify a specific image among its related images in the repository (if the tag is omitted, Docker assumes the tag latest).

Library Images

In order to 'pull' the latest official Ubuntu image, the following Docker CLI command can be invoked:

docker pull ubuntu

In this format, the registry location, username and tag have been omitted. The shortened image name directs the Docker engine to pull the latest library image from the ubuntu repository on the Docker Hub Registry. This could also have been achieved using the longhand format:

docker pull index.docker.io/library/ubuntu:latest

User Images

In order to pull the latest version of an image called pxe that belongs to the user jpetazzo on the Docker Hub Registry, the following command can be used:

docker pull jpetazzo/pxe

In this example, the registry location has been omitted, and so the default Docker Hub Registry is the target for the Docker engine.

Images on Third-Party Registries

Some third-party organisations host their own Docker registries independently of Docker Inc, which they make available to their customers. In order to pull an image that resides on a third-party registry (such as CoreOS' Quay.io), the registry location needs to be supplied along with the username and repository, e.g.:

docker pull quay.io/signalfuse/zookeeper:3.4.5-3

In this case, a tag has been specified as part of the image name in order to differentiate it from other versions of the image.

Images on Self-Hosted Registries

Finally, we can reference an image that resides on a locally configured, self-hosted registry by specifying the registry location and the repository required:

docker pull internal.mycorp.com:5000/revealjs:latest
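
Conversely, an image typically ends up on a self-hosted registry by being tagged with the registry's location and then pushed to it. A minimal sketch, assuming a local image called revealjs and reusing the hypothetical internal registry above:

docker tag revealjs internal.mycorp.com:5000/revealjs:latest
docker push internal.mycorp.com:5000/revealjs:latest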

The docker-proxy

Containers created and managed by the Docker platform are able to provide the service running inside the container not only to other co-located containers, but also to remote hosts. Docker achieves this with port forwarding. For a brief introduction to containers, take a look at a previous article.
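
By way of illustration, a container's port is published to the host when the container is started, along the following lines (the image name here is purely illustrative):

# Forward port 8000 on all of the host's interfaces to port 8000 in the container
docker run -d -p 8000:8000 my-web-app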

When a container is started with its port forwarded to the Docker host on which it runs, then in addition to the new process that runs inside the container, you may notice an additional process on the Docker host called docker-proxy:

  PID TTY      STAT   TIME COMMAND
 8006 ?        Sl     0:00 docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 8000 -container-ip 172.17.0.2 -container-port 8000

The purpose of this process is to enable a service consumer to communicate with the service-providing container .... but it's only used in particular circumstances. The docker-proxy operates in userland, simply receiving any packets arriving at the host's specified port that the kernel hasn't 'dropped' or forwarded, and redirecting them to the container's port. The docker-proxy is the same binary as the Docker daemon and Docker client, which the Docker daemon 'reexecs' when it is required.

In order to understand why this process exists, we first need to understand a little about Docker's networking configuration. The default modus operandi for a Docker host is to create a virtual ethernet bridge (called docker0), attach each container's network interface to the bridge, and to use network address translation (NAT) when containers need to make themselves visible to the Docker host and beyond:

[Figure: Docker bridge networking, with containers attached to the docker0 bridge and NAT providing access to the host and beyond]
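
The bridge can be examined on the Docker host with standard tooling; for example (assuming the default docker0 bridge, and that the bridge-utils package provides brctl):

# Show the bridge's IP address and state
ip addr show docker0

# List the container veth interfaces attached to the bridge
brctl show docker0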

Access to a container's service is controlled with rules associated with the host's netfilter framework, in both the NAT and filter tables. The general processing flow of packets through netfilter is depicted in this diagram.

If port 8000 of a container with the IP address 172.17.0.2 is to be forwarded to the host as port 8000, then Docker adds some rules to netfilter's NAT table, enabling the container to 'masquerade' as the host using NAT:

Chain PREROUTING (policy ACCEPT 49 packets, 9985 bytes)  
num   pkts bytes target     prot opt in     out       source               destination  
1       80  4152 DOCKER     all  --  *      *         0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

Chain OUTPUT (policy ACCEPT 1436 packets, 151K bytes)  
num   pkts bytes target     prot opt in     out       source               destination  
1      274 56172 DOCKER     all  --  *      *         0.0.0.0/0           !127.0.0.0/8          ADDRTYPE match dst-type LOCAL

Chain POSTROUTING (policy ACCEPT 1369 packets, 137K bytes)  
num   pkts bytes target     prot opt in     out       source               destination  
1      274 56172 MASQUERADE all  --  *      !docker0  172.17.0.0/16        0.0.0.0/0  
2        0     0 MASQUERADE tcp  --  *      *         172.17.0.2           172.17.0.2           tcp dpt:8000

Chain DOCKER (2 references)  
num   pkts bytes target     prot opt in       out     source               destination  
1        0     0 DNAT       tcp  --  !docker0 *       0.0.0.0/0            0.0.0.0/0            tcp dpt:8000 to:172.17.0.2:8000  
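
These listings, and the filter table listing below, can be reproduced on a Docker host with the iptables command (the exact counters and rules will vary from host to host):

# List the NAT table rules with packet counts and rule numbers
sudo iptables -t nat -L -n -v --line-numbers

# List the filter table rules in the same format
sudo iptables -t filter -L -n -v --line-numbers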

Netfilter is stateful, which means that it can track connections that have already been established, and in such circumstances it bypasses the NAT table rules. But in order for a connection to be established in the first place, packets are subjected to the scrutiny of the rules in the NAT and filter tables.

Packets destined for the host's socket (the container's forwarded port) are processed by netfilter and tested against the rules in the PREROUTING chain of the NAT table, and provided the destination address of a packet is local to the Docker host (which it is), netfilter jumps to the DOCKER chain for further processing. As long as the packet didn't arrive from the ethernet bridge (i.e. from a container), and provided it is addressed to TCP port 8000 on the Docker host, its destination is changed by the DNAT target to 172.17.0.2:8000, which is the container's socket. As the packet now needs to be routed to the container, the rules in the FORWARD chain of the filter table are assessed:

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)  
num   pkts bytes target     prot opt in       out      source              destination  
1       63 10326 DOCKER     all  --  *        docker0  0.0.0.0/0           0.0.0.0/0  
2       50  9618 ACCEPT     all  --  *        docker0  0.0.0.0/0           0.0.0.0/0            ctstate RELATED,ESTABLISHED  
3       61  5675 ACCEPT     all  --  docker0 !docker0  0.0.0.0/0           0.0.0.0/0  
4        0     0 ACCEPT     all  --  docker0  docker0  0.0.0.0/0           0.0.0.0/0           

Chain DOCKER (1 references)  
num   pkts bytes target     prot opt in       out      source              destination  
1        0     0 ACCEPT     tcp  --  !docker0 docker0  0.0.0.0/0           172.17.0.2           tcp dpt:8000  

The first rule applies, forcing a jump to the DOCKER chain, where the single rule matches the characteristics of the packet and 'accepts' it for forwarding on to the container's socket. Hence, a remote service-consuming process thinks it is communicating with the Docker host, but is actually being serviced by the container.
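
From a remote host, then, the service is simply addressed via the Docker host; for example (the hostname is purely illustrative):

# The request is addressed to the Docker host, but answered by the container
curl http://dockerhost.mycorp.com:8000/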

Similarly, when a container initiates a dialogue with a remote service provider, netfilter's NAT POSTROUTING chain changes the source IP address of packets from the container's IP address to the address of the host's network interface that is responsible for routing the packets to their required destination. This is achieved with netfilter's MASQUERADE target.

A Docker host makes significant use of netfilter rules to aid NAT, and to control access to the containers it hosts, and the docker-proxy mechanism isn't always required. However, there are certain circumstances where this method of control is not available, which is why Docker also creates an instance of the docker-proxy whenever a container's port is forwarded to the Docker host.

Firstly, in order for a remote host to consume a container's service, the Docker host must act like a router, forwarding traffic to the network associated with the ethernet bridge. A Linux host is not normally configured to be a router, so the kernel parameter net.ipv4.ip_forward needs to be set to 1 (net.ipv6.conf.default.forwarding and net.ipv6.conf.all.forwarding for IPv6). Docker takes care of this if its daemon is started with default settings. If, however, the daemon is started with the --ip-forward and/or --iptables command line options set to false, then Docker can't make use of netfilter rules and has to fall back on the docker-proxy. This scenario is probably quite rare, but it is conceivable that some corporate security policies may impose this constraint.
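
Both the kernel parameter and the daemon behaviour described above can be checked and exercised along these lines (a sketch, using the Docker 1.x daemon options named in the text):

# Check whether the kernel is currently willing to forward IPv4 packets
sysctl net.ipv4.ip_forward

# Start the daemon with netfilter manipulation disabled, forcing reliance on the docker-proxy
docker -d --ip-forward=false --iptables=false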

Secondly, even when Docker is able to forward packets using netfilter rules, there is one circumstance where those rules cannot be applied. Unless told otherwise, when a container's port is forwarded to the Docker host, it is forwarded to all of the host's interfaces, including its loopback interface. But the Linux kernel does not allow the routing of loopback traffic, and therefore it's not possible to apply netfilter NAT rules to packets originating from 127.0.0.0/8. Instead, netfilter sends such packets through the filter table's INPUT chain to a local process listening on the designated port: the docker-proxy.
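
With a container's port 8000 published as in the earlier example, both points can be observed on the Docker host itself (output omitted):

# A userland process (the docker-proxy) is listening on the published port
sudo ss -lntp | grep 8000

# Traffic arriving via the loopback interface is serviced by the docker-proxy rather than via NAT
curl http://127.0.0.1:8000/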

The docker-proxy, then, is a 'catch all' method for allowing container port forwarding to the Docker host. However, the docker-proxy is generally considered an inelegant solution to the problems highlighted above, and when a large range of container ports is exposed, it consumes considerable memory. An attempt was previously made to remove the dependency on the docker-proxy, but this fell foul of the limitations of the ageing kernel in RHEL 6.x and CentOS 6.x, which the Docker project feels duty-bound to support. Hence, the docker-proxy remains a major constituent part of the Docker experience in all Docker versions up to the current version 1.5. As I write, version 1.6 is due for imminent release, and there have been moves to remove the automatic requirement for the docker-proxy, which I'll cover in another article.

Part 7 - A Basic Container

We've looked at five of the six available namespaces provided by the Linux kernel in a series of previous articles, and we'll take a look at the final namespace, the USER namespace, in a future article. This article looks at how we can combine a number of the namespaces with a specially prepared directory, in which we'll 'jail' our process using the chroot system call. Although our implementation will be missing a few key features that normally accompany container implementations (e.g. cgroups), the resulting environment in which our process runs can be considered a very rudimentary container of sorts. It isolates the process from several different system resources, and contains the process within a limited filesystem.

The first thing we need to do is prepare a directory on the host, located at /var/local/jail, which will become the root filesystem for the container. We're going to provide just a few binaries for the container to use: env, bash, ps, ls and top.

It's not just a simple matter of copying the binaries to /var/local/jail: each binary relies on shared libraries, and we also need to ensure these are available in the appropriate directories in the container's filesystem. To do this, we can make use of the ldd command, whose purpose is to provide information about the shared libraries used by a particular binary. I've created a script called binlibdepcp.sh, which takes care of determining the library dependencies of a binary and then copying it, along with the relevant libraries, to the correct locations in the container's filesystem. It also copies the ld.so.cache file, which the dynamic linker uses to locate libraries that do not reside in the default /lib or /usr/lib directories. The script is available on GitHub.
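
For example, the shared library dependencies of the env binary can be listed with ldd; the output looks something like this (the exact paths and load addresses vary from system to system):

$ ldd /usr/bin/env
    linux-vdso.so.1 =>  (0x...)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x...)
    /lib64/ld-linux-x86-64.so.2 (0x...)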

Let's demonstrate this for the env binary, which is located at /usr/bin/env. Having previously created the /var/local/jail directory, the env binary and libraries are copied to the correct location under /var/local/jail with the following:

$ sudo ./binlibdepcp.sh /usr/bin/env /var/local/jail
[sudo] password for wolf:
Copying ...

                      env : [OK]
                libc.so.6 : [OK]
     ld-linux-x86-64.so.2 : [OK]
              ld.so.cache : [OK]

...... Done

We can repeat this exercise for the other binaries we intend to use within the container. Additionally, so that our commands will display nicely when we run them inside the container, we need to provide the relevant portion of the terminfo database. Assuming we have an xterm, we can copy this into the jail:

$ sudo mkdir -p /var/local/jail/lib/terminfo/x
[sudo] password for wolf:
$ sudo cp -p /lib/terminfo/x/* /var/local/jail/lib/terminfo/x

That's the container's filesystem prepared. Now we need to amend the program we have slowly been developing whilst we've been looking at the properties of containers. The changes are available in the invoke_ns6.c source file.

The first change is to add a new command line option, -c, which must be accompanied by a directory path that will become the root of the jail:

// Parse command line options and construct arguments
// to be passed to childFunction
while ((option = getopt(argc, argv, "+hvpmu:ni:c:")) != -1) {  
    switch (option) {
    case 'c':
        args.jail = 1;
        args.path = malloc(sizeof(char) * (strlen(optarg) + 1));
        strcpy(args.path, optarg);
        break;
    case 'i':
        if (strcmp("no", optarg) != 0 && strcmp("yes", optarg) != 0) {
            fprintf(stderr, "%s: option requires valid argument -- 'i'\n", argv[0]);
            usage(argv[0]);
            exit(EXIT_FAILURE);
        }
        else
            if (strcmp("yes", optarg) == 0)
                flags |= CLONE_NEWIPC;
        args.ipc = 1;
        break;
    case 'n':
        flags |= CLONE_NEWNET;
        break;
    case 'u':
        flags |= CLONE_NEWUTS;
        args.hostname = malloc(sizeof(char) * (strlen(optarg) + 1));
        strcpy(args.hostname, optarg);
        break;
    case 'm':
        flags |= CLONE_NEWNS;
        break;
    case 'p':
        flags |= CLONE_NEWPID;
        break;
    case 'v':
        args.verbose = 1;
        break;
    case 'h':
        usage(argv[0]);
        exit(EXIT_SUCCESS);
    default:
        usage(argv[0]);
        exit(EXIT_FAILURE);
    }
}

The other main change is to add some code to ensure that, if the -c option is supplied, the cloned child process is jailed inside the specified directory with the chroot system call. The chroot system call changes the root directory of the child process; we then change the current working directory to that new root, and create a /proc directory within it:

// If specified, place process in chroot jail
if (args->jail) {  
    if (args->verbose)
        printf(" Child: creating chroot jail\n");
    if (chroot(args->path) == -1) {
        perror(" Child: chroot");
        exit(EXIT_FAILURE);
    }
    else {
        if (args->verbose)
            printf(" Child: changing directory into chroot jail\n");
        if (chdir("/") == -1) {
            perror(" Child: chdir");
            exit(EXIT_FAILURE);
        }
        if (access("/proc", F_OK) != 0)
            if (mkdir("/proc", 0700) == -1) {
               perror(" Child: mkdir");
                exit(EXIT_FAILURE);
        }
    }
}

We can now invoke our container (with a customised command prompt) with the following command:

$ sudo ./invoke_ns -vpmu calculus -c /var/local/jail \
env PS1="\[\e[34m\]\h\[\e[m\] [\[\e[31m\]\W\[\e[m\]] " bash --norc  
[sudo] password for wolf:
calculus [/]  

Now that we have an interactive bash command shell running inside the container, we can use the ls, ps and top commands to verify we have a very minimal operating environment, if not a very useful one! It doesn't take much imagination, however, to see the possibilities for containing independent workloads in minimal, lightweight containers.

In reality, a process inside a container needs a few more things than we have provided in our rudimentary version. Thankfully, the excellent work that has been conducted in the open source community on projects like Docker and LXC has taken the hard work out of creating and manipulating workloads within containers.

Container Mania

Nearly 60 years after Malcolm McLean developed the intermodal shipping container for the transportation of goods, it seems its computing equivalent has arrived, amid considerable interest, much comment, and occasional controversy.

Containers are a form of virtualisation, but unlike the traditional virtual machine, they are very light in terms of footprint and resource usage. Applications running in containers don't need a full-blown guest operating system to function; they just need the bare minimum in terms of OS binaries and libraries, and they share the host's kernel with other containers and the host itself. The lightweight nature of containers, and the speed with which container workloads can be provisioned, must however be weighed against a reduced level of isolation compared with a traditional virtual machine, and the fact that they are currently only available on the Linux OS. Equivalent capabilities exist in other *nix operating systems, such as Solaris (Zones) and FreeBSD (Jails), but not on the Windows platform .... yet. When it comes to choosing between containers and virtual machines, it's a matter of horses for courses.

So, why now? What has provoked the current interest and activity? The sudden popularity of containers has much to do with technological maturity, inspired innovation, and an evolving need.

Maturity:
Whilst some aspects of the technology that provides Linux containers have been around for a number of years, it's true to say that the 'total package' has only recently matured to a level where it is inherent in the kernels shipped with most Linux distributions.

Innovation:
Containers are an abstraction of Linux kernel namespaces and control groups (or cgroups), and as such, creating and invoking a container requires some effort and knowledge on the part of the user. This has inhibited their take-up as a means of isolating workloads. Enter, stage left, the likes of Docker (with its libcontainer library), Rocket, LXC and lmctfy, all of which serve to commoditise the container. Docker, in particular, has captured the hearts and minds of the DevOps community with its platform for delivering distributed applications in containers.

Need:
Containers are a perfect fit for a recent trend towards architecting software applications as small, discrete, independent microservices. Whilst there is no formal definition of a microservice, it is generally considered to be a highly decoupled, independent process with a specific function, which often communicates via a RESTful HTTP API. It's entirely possible to use containers to run multiple processes (as is the case with LXC), but the approach taken by Docker and Rocket is to encourage a single process per container, fitting neatly with the microservice aspiration.

The fact that all major cloud and operating system vendors, including Microsoft, are busy developing their container capabilities is evidence enough that containers will have a big part to play in workload deployment in the coming years. This means the stakes are high for the organisations behind the different technologies, which has led to some differences of opinion. On the whole, however, the majority of the technologies are being developed in the open, using a community-based model, which should significantly aid continued innovation, maturity, and adoption.

This article serves as an introduction to a series of articles that examine the fundamental building blocks for containers: namespaces and cgroups.