Diminishing the Powers of a Container Workload

Managing the privileges available to a container is important to the ongoing integrity of the container and the host on which it runs. With privilege comes power, and the potential to abuse that power, wittingly or unwittingly.

A simple container example serves to illustrate:

$ sudo docker container run -itd --name test alpine sh
$ sudo docker container top test -eo pid
PID
2140
$ ps -fp 2140
UID        PID  PPID  C STIME TTY          TIME CMD
root      2140  2109  0 10:31 pts/0    00:00:00 sh

A container is started in detached mode; we retrieve the process ID from the perspective of the default (or host's) PID namespace, list the process, and find that the UID (user ID) associated with the container's process is root. It turns out that the sets of UIDs and GIDs (group IDs) are the same for the container and the host, because containers are started as the privileged user with UID/GID=0 (aka root, or the superuser). A big 'if', perhaps, but if the container's process were able to break out of the confines of the container, it would have root access on the host.

There are lots of things we can do to mitigate this risk. Docker removes a lot of potentially pernicious privileges by dropping capabilities, and applies other security mechanisms to minimize the potential attack surface. We can even make use of user namespaces, by configuring the Docker daemon to map a UID/GID range from the host onto another range in the container. This means a container's process, running as the privileged root user, will map to a non-privileged user on the host.

If you're able to make use of the --userns-remap config option on the daemon to perform this mapping, you absolutely should. Unfortunately, it's not always possible or desirable to do so - another story, another post! This puts us back to square one; what can we do to minimize the risk? The simple answer is that we should always be guided by the principle of least privilege. Often, containers need privileges that are associated with the root user, but if they don't, you should take action to run your containers as a benign user. How do you achieve this?
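For reference, the remapping mentioned above is enabled with a single property in the daemon's configuration file. This is a sketch, not a full guide: the special value 'default' instructs Docker to create a user called dockremap, and to use the subordinate UID/GID ranges allocated to it in /etc/subuid and /etc/subgid:

```json
{
  "userns-remap": "default"
}
```

The file is normally /etc/docker/daemon.json, and the daemon must be restarted for the setting to take effect.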

A Simple Example

Let's take a simple Dockerfile example, which defines a Docker image for the AWS CLI. This use case might be more suited to a local developer's laptop, rather than a sensitive, production-based environment, but it will serve as an illustration. The image enables us to install and run AWS CLI commands in a container, rather than on the host itself:

FROM alpine:latest

# Define build argument for AWS CLI version
ARG VERSION

# Install dependencies, AWS CLI and clean up.
RUN set -ex                                     && \
    apk add --no-cache                             \
        python                                     \
        groff                                      \
        less                                       \
        py-pip                                  && \
    pip --no-cache-dir install awscli==$VERSION && \
    apk del py-pip

CMD ["help"]
ENTRYPOINT ["aws"]

Assuming the contents of the above are in a file called Dockerfile, located in the current working directory, we can use the docker image build command to build this image. Assuming we have made the local user a member of the docker group - which, for convenience, provides unfettered access to the Docker CLI (something that should only ever be done in a development environment) - the following will create the image:

$ docker image build --build-arg VERSION="1.14.38" -t aws:v1 .

We could then check the image works as intended, by running a container derived from the image. This is equivalent to running the command aws --version in a non-containerized environment:

$ docker container run --rm --name aws aws:v1 --version
aws-cli/1.14.38 Python/2.7.14 Linux/4.4.0-112-generic botocore/1.8.42  

This is all well and good, but as we didn't take any action to curtail any privileges, the container ran as the root user, with UID/GID=0. This level of privilege is not necessary to run AWS CLI commands, so let's do something about it!

Using a Non-privileged User

To fix this, we can add a non-privileged user to the image, and then 'set' the user for the image to the non-privileged user, so that a derived container's process is no longer privileged. The changes to the Dockerfile might look something like this:

FROM alpine:latest

# Define build argument for AWS CLI version
ARG VERSION

# Install dependencies, AWS CLI and clean up.
RUN set -ex                                     && \
    apk add --no-cache                             \
        python                                     \
        groff                                      \
        less                                       \
        py-pip                                  && \
    pip --no-cache-dir install awscli==$VERSION && \
    apk del py-pip                              && \
    addgroup aws                                && \
    adduser -D -G aws aws

USER aws

WORKDIR /home/aws

CMD ["help"]
ENTRYPOINT ["aws"]

All we've done is add two commands to the RUN instruction: one to add a group called aws, and one to add a user called aws that belongs to the aws group. In order to make use of the aws user, however, we also have to set the user with the USER Dockerfile instruction, and whilst we're at it, we'll set the working context in the filesystem to its home directory, courtesy of the WORKDIR instruction. We can re-build the image, tagging it as v2 this time:

$ docker image build --build-arg VERSION="1.14.38" -t aws:v2 .

Now that we have a new variant of the aws image, we'll run up a new container, but we'll not specify any command line arguments, which means the argument for the aws command will be help, as specified with the CMD instruction in the Dockerfile:

$ docker container run --rm -it --name aws aws:v2

Unsurprisingly, this will list help for the AWS CLI, piped to less, which gives us the opportunity to poke around whilst the container is still running. In another terminal on the host, if we repeat the exercise we carried out earlier, when we looked for the container's process(es), we get the following:

$ docker container top aws -eo pid
PID
2436
2487
$ ps -fp 2436,2487
UID        PID  PPID  C STIME TTY          TIME CMD
rackham   2436  2407  0 14:27 pts/0    00:00:00 /usr/bin/python2 /usr/bin/aws he
rackham   2487  2436  0 14:27 pts/0    00:00:00 less -R

It reports that the processes are running with the UID associated with the user rackham. In actual fact, the UID 1000 is associated with the user rackham on the host, but in the container, the UID 1000 is associated with the user aws:

$ id -u rackham
1000
$ docker container exec -it aws id
uid=1000(aws) gid=1000(aws) groups=1000(aws)

What really matters is the UID, not the user name it translates to, as the kernel works with the UID when it comes to access control. With these trivial changes to the image, our container is happily running as a non-privileged user, which should provide us with some peace of mind.
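To see that names are just a veneer over the numeric IDs, compare ls -l with ls -n on any file: the kernel records only the numbers, and the names are resolved from /etc/passwd and /etc/group at display time. A quick sketch, using a throwaway file:

```shell
# The kernel stores only numeric UIDs/GIDs against files and processes;
# user and group names are looked up at display time. 'ls -n' skips the
# lookup and shows the raw numeric IDs.
touch /tmp/uid-demo
ls -n /tmp/uid-demo | awk '{print $3, $4}'   # numeric UID and GID of the owner
rm /tmp/uid-demo
```

This is why a UID can translate to rackham on the host and aws in the container: the two user databases differ, but the number is the same.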

IDs and Bind Mounts

There is something missing from the AWS CLI image, however. In order to do anything meaningful, the AWS CLI commands need access to the user's AWS configuration and credentials in order to access the AWS API. Obviously, we shouldn't bake these into the image, especially if we intend to share the image with others! We could pass them as environment variables, but whilst this might be a means for injecting configuration items into a container, it's not safe for sensitive data such as credentials. If you allow others access to the same Docker daemon, without limiting access using an access authorization plugin, environment variables will be exposed to anybody who uses the docker container inspect command. Another approach would be to bind mount the files containing the relevant data into the container at run time. In fact, if we want to make use of the aws configure command to update our local AWS configuration, this is the only way we can update those files when using a container.

On Linux, the AWS config and credentials files are normally located in $HOME/.aws, so we need to bind mount this directory inside the container at /home/aws/.aws, the home directory of the container's user. We need to do this each time we want to execute an AWS CLI command using a container. Let's try this out, and attempt to list the instances running in the default region, which is specified in the AWS config file located in /home/aws/.aws. This command is equivalent to running aws ec2 describe-instances:

$ docker container run --rm -it --mount type=bind,source=$HOME/.aws,target=/home/aws/.aws \
--name aws aws:v2 ec2 describe-instances
You must specify a region. You can also configure your region by running "aws configure"  

That didn't go too well! The error message suggests that the aws command can't find the files. After we've ascertained that the local user's UID/GID is 1001, if we run another container, override the container's entrypoint, and run ls -l ./.aws, we can see the reason for the error:

$ id
uid=1001(baxter) gid=1001(baxter) groups=1001(baxter),27(sudo),999(docker)  
$ docker container run --rm -it --mount type=bind,source=$HOME/.aws,target=/home/aws/.aws \
--entrypoint ls --name aws aws:v2 -l ./.aws
total 8  
-rw-------    1 1001     1001           149 Feb 13 16:20 config
-rw-------    1 1001     1001           229 Feb 13 15:42 credentials

The files are present inside the container, but they are owned by UID/GID=1001. Remember, whilst we didn't specify a deterministic UID/GID for the container's user in the image, the addgroup and adduser commands created the aws user with UID/GID=1000. There is a mismatch between the UID/GIDs, and the file permissions are such that the container's user cannot read or write the files.
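The same effect can be simulated outside Docker. Credentials files are typically mode 600, so only the owning UID can read them; a process running under any other UID is refused, regardless of what name that UID maps to. A sketch with a throwaway file (assumes GNU stat, as found on Linux):

```shell
# Mode 600 restricts read/write to the owning UID alone. A process with
# any other UID - such as the container's aws user (UID 1000) when the
# files are owned by UID 1001 - will get 'Permission denied'.
touch /tmp/credentials-demo
chmod 600 /tmp/credentials-demo
stat -c '%a' /tmp/credentials-demo    # prints the mode: 600
rm /tmp/credentials-demo
```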

This is a big problem. We've been careful to ensure that our container runs with diminished privileges, but we've ended up with a problem to resolve as a consequence.

We could try to circumvent this problem by using the --user config option to the docker container run command, specifying that the container gets run with UID/GID=1001 instead of 1000:

$ docker container run --rm -it --mount type=bind,source=$HOME/.aws,target=/home/aws/.aws \
--user 1001:1001 --name aws aws:v2 ec2 describe-instances
You must specify a region. You can also configure your region by running "aws configure".  

This error message is starting to become familiar. The reason, this time, is that there is no 'environment' ($HOME, to be precise) for a user with UID/GID=1001, which the AWS CLI needs in order to locate the config and credentials files. This is because there is no user configured in the container's filesystem with UID/GID=1001. We might be tempted to pass a HOME environment variable to the docker container run command, or even to alter the Dockerfile to provide deterministic values for the UID/GID. If we succumb to these seductions, we're in danger of making the image very specific to a given host, and relying too much on a consumer of our image to figure out how to work around these idiosyncrasies. A better option would be to add the aws user after the container has been created, which gives us the ability to add the user with the required UID/GID. Let's see how to do this.

Defer Stepping Down to a Non-privileged User

The image for the AWS CLI is immutable, so we can't define a 'variable' aws user in the Dockerfile. Instead, we can make use of an entrypoint script, which gets executed when the container starts. It replaces the aws command that is specified as the entrypoint in the Dockerfile. Here's a revised Dockerfile:

FROM alpine:latest

# Define build time argument for AWS CLI version
ARG VERSION

# Add default UID for 'aws' user
ENV AWS_UID=1000

# Install dependencies, AWS CLI and clean up.
RUN set -ex                                     && \
    apk add --no-cache                             \
        python                                     \
        groff                                      \
        less                                       \
        py-pip                                     \
        su-exec                                 && \
    pip --no-cache-dir install awscli==$VERSION && \
    apk del py-pip                              && \
    mkdir -p /home/aws

COPY /usr/local/bin/

WORKDIR /home/aws

CMD ["help"]
ENTRYPOINT [""]

In addition to changing the entrypoint and copying the script from the build context with the COPY instruction, we've added an environment variable specifying a default UID for the aws user (in case the user neglects to provide one), removed the commands from the RUN instruction for creating the user, and added a command to create the mount point for the bind mount. We've also added a utility to the image called su-exec, which will enable our script to step down from the root user to the aws user at the last moment.

Let's get to the entrypoint script itself:

#!/bin/sh

# If --user is used on command line, cut straight to aws command.
# The command will fail, unless the AWS region and profile have
# been provided as command line arguments or envs.
if [ "$(id -u)" != '0' ]; then
    exec aws "$@"
fi

# Add 'aws' user using $AWS_UID and $AWS_GID
if [ ! -z "${AWS_GID+x}" ] && [ "$AWS_GID" != "$AWS_UID" ]; then
    addgroup -g $AWS_GID aws
    adduser -D -G aws -u $AWS_UID aws
else
    adduser -D -u $AWS_UID aws
fi

# Step down from root to aws, and run command
exec su-exec aws aws "$@"

When the script is invoked, it is running with the all-powerful UID/GID=0, unless the user has invoked the container using the --user config option. The script needs root privileges to create the aws user, so if it's invoked as any other user, it won't be possible to create the aws user. Hence, a check is made early on in the script, and if the user associated with the container's process is not UID=0, we simply use exec to replace the script with the aws command, along with any arguments passed at the end of the command which invoked the container (e.g. ec2 describe-instances). In this scenario, the command will fail if a default region and credentials are required but not provided.
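One subtlety worth calling out is the script's "${AWS_GID+x}" test: the +x form of parameter expansion yields 'x' only when the variable is set, which lets the script distinguish 'unset' from 'set but empty' reliably. A standalone sketch of the behaviour:

```shell
# The "${VAR+x}" parameter expansion produces 'x' when VAR is set
# (even to an empty string), and nothing at all when VAR is unset.
unset AWS_GID
[ -z "${AWS_GID+x}" ] && echo "AWS_GID is unset"

AWS_GID=3000
[ -n "${AWS_GID+x}" ] && echo "AWS_GID is set to $AWS_GID"
```

This is why the script only creates a bespoke group when the invoking user has explicitly supplied AWS_GID.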

What we would prefer the user to do instead is specify an environment variable, AWS_UID (and, optionally, AWS_GID), on the command line, which reflects the owner of the AWS config and credentials files on the host. Using this variable, the script will create the aws user with a corresponding UID/GID, before the script is replaced with the desired AWS CLI command, which is executed as the aws user, courtesy of the su-exec utility. First we must re-build the image, and when that's done, let's also create an alias for invoking the AWS CLI container:

$ docker image build --build-arg VERSION="1.14.38" -t aws:v3 .
$ alias aws='docker container run --rm -it --mount type=bind,source=$HOME/.aws,target=/home/aws/.aws --env AWS_UID=$UID --name aws aws:v3'

In the Docker CLI command we've aliased, we've defined the AWS_UID environment variable for use inside the container, set to the UID of the user invoking the container. All that's left to do is test the new configuration, using the alias:

$ aws ec2 describe-instances --query 'Reservations[*].Instances[*].[InstanceId,State.Name]'

This time we're successful!

Stepping down from the root user for our containerized AWS CLI is a fairly trivial example use case. The technique of stepping down to a non-privileged user in an entrypoint script, however, is very common for applications that require privileges to perform some initialisation prior to invoking the application associated with the container. You might want to create a database, for example, or apply some configuration based on the characteristics of the host, or the command line arguments provided at run time.


If we hadn't undertaken this exercise to reduce the privileges available inside a container derived from our AWS CLI image, the task of creating the image would have been quite straightforward. However, in taking the time and expending a little effort, we have taken a considerable step towards minimizing the risk of privilege escalation inside the container, which in turn helps to reduce the risk of compromising the host itself. Running containers with a non-privileged user is one of many steps we can take to secure the containers we run, especially when they are deployed to a production environment.

If you want to find out what else you can do to make your containers more secure, check out my hosted training course - Securing Docker Container Workloads.

Docker Tip: Customising Docker CLI Output



Docker provides a comprehensive API and CLI to its platform. This article is concerned with customising the output returned by Docker CLI commands.

There are a large number of Docker client CLI commands, which provide information relating to various Docker objects on a given Docker host or Swarm cluster. Generally, this output is provided in a tabular format. An example, which all Docker users will have come across, is the docker container ls command, which provides a list of running containers:

$ docker container ls
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS                  NAMES  
43195e559b42        wordpress           "docker-entrypoint..."   47 seconds ago      Up 46 seconds>80/tcp   wp  
f7926468281f        mariadb             "docker-entrypoint..."   2 minutes ago       Up 2 minutes        3306/tcp               mysql  

Customising Command Output

Sometimes, all of this information is too much, and you may find yourself wanting to format the output just how you'd like it. You might want to do this to de-clutter the output, for aesthetic purposes, or to format the output as input to scripts. This is quite straightforward to do, as a large number of CLI commands have a config option, --format, just for this purpose. The format of the output needs to be specified using a Golang template, which translates a JSON object into the desired format. For example, if we're only interested in the container ID, image, status, exposed ports and name, we could get this with the following (the \t specifies a tab):

$ docker container ls --format '{{.ID}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}\t{{.Names}}'
43195e559b42    wordpress   Up 41 minutes>80/tcp    wp  
f7926468281f    mariadb Up 43 minutes   3306/tcp    mysql  

This provides us with the reduced amount of information we specified, but it looks a bit shoddy. We can add the table directive to improve the look:

$ docker container ls --format 'table {{.ID}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}\t{{.Names}}'
CONTAINER ID        IMAGE               STATUS              PORTS                  NAMES  
43195e559b42        wordpress           Up About an hour>80/tcp   wp  
f7926468281f        mariadb             Up About an hour    3306/tcp               mysql  
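One reason the tab-separated form is useful is that it feeds scripts cleanly: fields split unambiguously on the tab delimiter. A sketch, using simulated --format output (the container values are assumed, not live docker output), that extracts just the names field with awk:

```shell
# Simulated tab-separated output from a '--format' template; awk splits
# on the tab delimiter to pull out a single field (the container name).
printf '43195e559b42\twordpress\twp\nf7926468281f\tmariadb\tmysql\n' \
    | awk -F'\t' '{print $3}'
```

The same approach works with cut -f, or with read in a shell while loop.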

Docker actually uses a template applied to a JSON object to generate the default output you see when no user-defined formatting is applied. The default table format for listing all of the container objects is:

table {{.ID}}\t{{.Image}}\t{{.Command}}\t{{.RunningFor}}\t{{.Status}}\t{{.Ports}}\t{{.Names}}  

This is not the complete set of fields available in the output, however. We can find all of the fields associated with the container object with:

$ docker container ls --format '{{json .}}' | jq '.'
{
  "Command": "\"docker-entrypoint...\"",
  "CreatedAt": "2017-07-24 16:23:25 +0100 BST",
  "ID": "43195e559b42",
  "Image": "wordpress",
  "Labels": "",
  "LocalVolumes": "1",
  "Mounts": "c71e998f250e...",
  "Names": "wp",
  "Networks": "wp",
  "Ports": ">80/tcp",
  "RunningFor": "About an hour ago",
  "Size": "0B",
  "Status": "Up About an hour"
}
{
  "Command": "\"docker-entrypoint...\"",
  "CreatedAt": "2017-07-24 16:21:33 +0100 BST",
  "ID": "f7926468281f",
  "Image": "mariadb",
  "Labels": "",
  "LocalVolumes": "1",
  "Mounts": "acaa1732009a...",
  "Names": "mysql",
  "Networks": "wp",
  "Ports": "3306/tcp",
  "RunningFor": "About an hour ago",
  "Size": "0B",
  "Status": "Up About an hour"
}

Notice that there are some keys in each of the objects missing from the default output: Labels, LocalVolumes, Mounts and Networks, to name a few. Hence, we could customise our output further, for example by replacing the Status field with the Networks field:

$ docker container ls --format 'table {{.ID}}\t{{.Image}}\t{{.Networks}}\t{{.Ports}}\t{{.Names}}'
CONTAINER ID        IMAGE               NETWORKS            PORTS                  NAMES  
43195e559b42        wordpress           bridgey,wp>80/tcp   wp  
f7926468281f        mariadb             wp                  3306/tcp               mysql  

Making a Customisation Permanent

The --format config option is great if you want to customise the output in a specific way for a particular use case. It would be a significant PITA, however, if you had to remember this syntax each time you issued a command in order to perpetually have customised output. You could, of course, create an alias or a script. Docker, however, allows you to make this customisation more permanent, with the use of a configuration file. When a user on a Docker host logs in to the Docker Hub for the very first time, using the docker login command, a file called config.json is created in a directory called .docker in the user's home directory. This file is used by Docker to hold JSON-encoded properties, including a user's credentials. It can also be used to hold the format template for the docker container ls command, using the psFormat property. The property is called psFormat after the old version of the command name, docker ps. A config.json file might look like this:

$ cat config.json
{
    "auths": {},
    "psFormat": "table {{.ID}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}\t{{.Names}}\t{{.Networks}}"
}

The psFormat property is the JSON key, whilst the value is the required template for configuring the command output.

With the psFormat property defined, every time you use the docker container ls command, you'll get the customised output you desire. It's possible to override the customisation on a case by case basis, simply by using the --format config option, which takes precedence. Take care when editing the config file; incorrect syntax could render all properties invalid.
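Because a syntax error invalidates the whole file, it's worth validating config.json as JSON after editing it. A sketch, assuming python3 is available on the host (the file path here is illustrative, not your real config):

```shell
# Validate a Docker client config file as JSON before relying on it;
# 'python3 -m json.tool' exits non-zero on malformed JSON.
printf '%s\n' '{"auths": {}, "psFormat": "table {{.ID}}\t{{.Names}}"}' > /tmp/config-demo.json
python3 -m json.tool /tmp/config-demo.json > /dev/null && echo "valid JSON"
rm /tmp/config-demo.json
```

jq '.' works equally well as a validator, if it's installed.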

Valid Command Customisation Properties

Whilst the output of a large number of commands can be formatted using the --format config option, permanent customisation via a property defined in the config.json file is mainly reserved for commands listing particular objects. A complete list of the commands, their relevant config property, and default template, is provided in the table below:

Command               Property         Default Template
docker config ls      configFormat     table {{.ID}}\t{{.Name}}\t{{.CreatedAt}}\t{{.UpdatedAt}}
docker container ls   psFormat         table {{.ID}}\t{{.Image}}\t{{.Command}}\t{{.RunningFor}}\t{{.Status}}\t{{.Ports}}\t{{.Names}}
docker image ls       imagesFormat     table {{.Repository}}\t{{.Tag}}\t{{.ID}}\t{{.CreatedSince}}\t{{.Size}}
docker network ls     networksFormat   table {{.ID}}\t{{.Name}}\t{{.Driver}}\t{{.Scope}}
docker node ls        nodesFormat      table {{.ID}} {{if .Self}}*{{else}} {{end}}\t{{.Hostname}}\t{{.Status}}\t{{.Availability}}\t{{.ManagerStatus}}
docker plugin ls      pluginsFormat    table {{.ID}}\t{{.Name}}\t{{.Description}}\t{{.Enabled}}
docker secret ls      secretFormat     table {{.ID}}\t{{.Name}}\t{{.CreatedAt}}\t{{.UpdatedAt}}
docker service ls     servicesFormat   table {{.ID}}\t{{.Name}}\t{{.Mode}}\t{{.Replicas}}\t{{.Image}}\t{{.Ports}}
docker service ps     tasksFormat      table {{.ID}}\t{{.Name}}\t{{.Image}}\t{{.Node}}\t{{.DesiredState}}\t{{.CurrentState}}\t{{.Error}}\t{{.Ports}}
docker volume ls      volumesFormat    table {{.Driver}}\t{{.Name}}

The output of a couple of additional Docker CLI commands can also be defined in the config.json file. The first of these is the format associated with the output of the docker stats command. This command provides rudimentary, real-time resource consumption for running containers, and the statsFormat property allows for customising which metrics are displayed:

Command        Property      Default Template
docker stats   statsFormat   table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.NetIO}}\t{{.BlockIO}}\t{{.PIDs}}

The second additional property available is used to format the output associated with the docker service inspect command. Historically, inspect commands, for example docker container inspect, provide JSON output. Docker's maintainers decided that, whilst the docker service inspect command warranted having its output rendered in a more readable format than JSON, they didn't want to break the expected behaviour associated with the inspect commands for other objects. As a compromise, in addition to providing a --pretty config option for the command itself, it's also possible to set the default output to pretty using the serviceInspectFormat property in the config.json file:

Command                  Property               Useful Template
docker service inspect   serviceInspectFormat   pretty

Secrets Come to Docker


The provision of secure, authenticated access to sensitive data on IT systems is an integral component of systems design. The secrets that users or peer IT services employ for accessing sensitive data published by an IT service come in a variety of guises: passwords, X.509 certificates, SSL/TLS keys, GPG keys, SSH keys and so on. Managing and controlling these secrets in service-oriented environments is non-trivial. With the continued advance in the adoption of the microservices architecture pattern for software applications, and their common implementation as distributed, immutable containers, this challenge has been exacerbated. How do you de-couple the secret from the template (image) of the container? How do you provide the container with the secret without compromising it? Where will the container be running, so as to provide it with the secret? How do you change the secret without interrupting the consumption of the service?

Docker Engine 1.13.0, released recently, introduced a new primary object: the secret. In conjunction with new API endpoints and CLI commands, the secret object is designed for handling secrets in a multi-container, multi-node environment - a 'swarm mode' cluster. It is not intended or available for use outside of a swarm mode cluster. Whilst the management of secrets is an oft-requested feature for Docker (particularly in the context of building Docker images), it's unclear if or when a secrets solution will be implemented for the standalone Docker host context. For now, people have been encouraged to use the 'service' abstraction in place of deploying individual containers. This requires bootstrapping a swarm mode cluster, even if it only contains a single node, and even if the service you deploy comprises only a single task. It's a good job it's as simple as:

$ docker swarm init

How are secrets created?

Creating a secret with the Docker client is a straightforward exercise:

$ < /dev/urandom tr -dc 'a-z0-9' | head -c 32 | docker secret create db_pw -

In this simple example, the Docker CLI reads the content of the secret from STDIN, but it could equally well be a file. The content of the secret can be anything, provided its size is no more than the limit of 500 KB. As with all Docker objects, there are API endpoints and CLI commands for inspecting, listing and removing secrets: docker secret inspect, docker secret ls, docker secret rm. Inspecting the secret provides the following:

$ docker secret inspect db_pw
[
    {
        "ID": "joptoh9y7x8galitn4ztnk86r",
        "Version": {
            "Index": 44
        },
        "CreatedAt": "2017-01-23T13:52:35.810853263Z",
        "UpdatedAt": "2017-01-23T13:52:35.810853263Z",
        "Spec": {
            "Name": "db_pw"
        }
    }
]

Inspecting the secret doesn't (obvs) show you the content of the secret. It shows the creation time of the secret, and whilst the output displays an UpdatedAt key, secrets cannot be updated via the CLI at present. There is, however, an API endpoint for updating secrets.

The Spec key provides some detail about the secret, just the name in the above example. Like most objects in Docker, it is possible to associate labels with secrets when they are created, and labels appear as part of the value of the Spec key.
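Returning to the creation pipeline at the top of this section, it's worth unpicking: /dev/urandom supplies random bytes, tr -dc filters them down to lowercase alphanumerics, and head -c caps the length at 32 bytes, comfortably within the 500 KB limit. A sketch of the same payload generation on its own, without docker:

```shell
# Generate a 32-character, lowercase alphanumeric secret payload - the
# same pipeline that was piped into 'docker secret create' above.
pw=$(< /dev/urandom tr -dc 'a-z0-9' | head -c 32)
printf '%s' "$pw" | wc -c    # length in bytes: 32
```

Anything readable from STDIN or a file would do just as well; the randomness here is only for generating a throwaway password.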

How are secrets consumed?

Secrets are consumed by services through explicit association. Services are implemented as tasks (individual containers), which can be scheduled on any node within the swarm cluster. If a service comprises multiple tasks, an associated secret is accessible to any of the tasks, whichever node they are running on.

A service can be created with access to a secret, using the --secret flag:

$ docker service create --name app --secret db_pw my_app:1.0

In addition, a previously created service can be granted access to an additional secret, or have secrets revoked, using the --secret-add and --secret-rm flags in conjunction with docker service update.

Where are secrets kept?

A swarm mode cluster uses the Raft Consensus Algorithm to ensure that the nodes participating in the management of the cluster agree on the state of the cluster. Part of this process involves the replication of the state to all manager nodes in the form of a log.

The implementation of secrets in Docker swarm mode takes advantage of the highly consistent, distributed nature of Raft by writing secrets to the Raft log, which means they are replicated to each of the manager nodes. The Raft log on each manager node is held in memory whilst the cluster is operating, and is encrypted in Docker 1.13.0+.

How does a container access a secret?

A container that is a task of a service with access to a secret has the secret mounted onto its filesystem under /run/secrets, which is a tmpfs filesystem residing in memory. For example, if the secret is called db_pw, it's available inside the container at /var/run/secrets/db_pw for as long as the container is running (/var/run is a symlink to /run). If the container is halted for any reason, /run/secrets is no longer a component of the container's filesystem, and the secret is also flushed from the hosting node's memory.

The secrets user interface provides some flexibility regarding a service's consumption of a secret. The secret can be mounted with a different name to the one provided during its creation, and it's possible to set the UID, GID and mode for the secret. For example, the db_pw secret could be made available inside container tasks with the following attributes:

$ docker service create --name app --secret source=db_pw,target=password,uid=2000,gid=3000,mode=0400 my_app:1.0

Inside the container, this would yield:

root@a61281217232:~# ls -l /var/run/secrets  
total 8  
-r--r--r-- 1 root root 32 Jan 23 11:49 my_secret
-r-------- 1 2000 3000 32 Jan 23 11:49 password

How are secrets updated?

By design, secrets in Docker swarm mode are immutable. If a secret needs to be rotated, it must first be removed from the service before being replaced with a new secret. The replacement secret can be mounted at the same location. Let's take a look at an example. First, we'll create a secret, using a version number in the secret name, before adding it to a service as password:

$ < /dev/urandom tr -dc 'a-z0-9' | head -c 32 | docker secret create my_secret_v1.0 -
$ docker service create --name nginx --secret source=my_secret_v1.0,target=password nginx

Once the task is running, the secret will be available in the container at /var/run/secrets/password. If the secret needs to be changed, the service can be updated to reflect this:

$ < /dev/urandom tr -dc 'a-z0-9' | head -c 32 | docker secret create my_secret_v1.1 -
$ docker service update --secret-rm my_secret_v1.0 --secret-add source=my_secret_v1.1,target=password nginx

Each service update results in the replacement of existing tasks based on the update policy defined by the --update-parallelism and --update-delay flags (1 and 0s by default, respectively). If the service comprises multiple tasks, and the update is configured to be applied over a period of time, then some tasks will be using the old secret, whilst the updated tasks will be using the new secret. Clearly, some co-ordination needs to take place between service providers and consumers when secrets are changed!

After the update, the new secret is available to all tasks that make up the service, and the old secret can be removed (a secret can't be removed whilst a service is still using it):

$ docker secret rm my_secret_v1.0

To summarise, secrets support was introduced in Docker 1.13.0:

  • A new secrets object, along with API endpoints and CLI commands
  • Available in swarm mode only
  • Secrets are stored in the Raft log associated with the swarm cluster
  • Mounted in tmpfs inside a container

Explaining Docker Image IDs

When Docker v1.10 came along, there was a fairly seismic change in the way the Docker Engine handles images. Whilst this was publicised well, and there was little impact on the general usage of Docker (image migration aside), there were some UI changes which sparked some confusion. So, what was the change, and why does the docker history command show some IDs as <missing>?

$ docker history debian
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT  
1742affe03b5        10 days ago         /bin/sh -c #(nop) CMD ["/bin/bash"]             0 B  
<missing>           10 days ago         /bin/sh -c #(nop) ADD file:5d8521419ad6cfb695   125.1 MB  

First, some background. A docker image is a read-only template for creating containers, and provides a filesystem based on an ordered union of multiple layers of files and directories, which can be shared with other images and containers. Sharing of image layers is a fundamental component of the Docker platform, and is possible through the implementation of a copy-on-write (COW) mechanism. During its lifetime, if a container needs to change a file from the read-only image that provides its filesystem, it copies the file up to its own private read-write layer before making the change.
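
The COW mechanism can be pictured with a toy model - this is an illustration of the idea only, not Docker's storage-driver code; the UnionFS class and its methods are invented for the sketch:

```python
class UnionFS:
    """Toy union filesystem: read-only image layers plus one
    private read-write layer belonging to the container."""

    def __init__(self, image_layers):
        self.image_layers = image_layers  # lowest layer first
        self.rw_layer = {}                # container's private layer

    def read(self, path):
        # Search the stack top-down: the container's layer wins,
        # then the uppermost image layer that holds the path
        if path in self.rw_layer:
            return self.rw_layer[path]
        for layer in reversed(self.image_layers):
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # The change is 'copied up' into the read-write layer;
        # the image layers are never modified
        self.rw_layer[path] = data
```

A write shadows the image's copy of the file for this container only; other containers sharing the same image layers are unaffected.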

A layer or 'diff' is created during the Docker image build process, and results when commands are run in a container, which produce new or modified files and directories. These new or modified files and directories are 'committed' as a new layer. The output of the docker history command above shows that the debian image has two layers.

Historical Perspective

Historically (pre Docker v1.10), each time a new layer was created as a result of a commit action, Docker also created a corresponding image, which was identified by a randomly generated 256-bit identifier, usually referred to as an image ID (presented in the UI as either a short 12-digit or a long 64-digit hex string). Docker stored the layer contents in a directory with a name matching the image ID. Internally, the image consisted of a configuration object, which held the characteristics of the image, including its ID and the ID of the image's parent image. In this way, Docker was able to construct a filesystem for a container, with each image in turn referencing its parent and the corresponding layer content, until the base image was reached, which had no parent. Optionally, each image could also be tagged with a meaningful name (e.g. my_image:1.0), but this was usually reserved for the leaf image. This is depicted in the diagram below:

Using the docker inspect command would yield:

$ docker inspect my_image:1.0
        "Id": "ca1f5f48ef431c0818d5e8797dfe707557bdc728fe7c3027c75de18f934a3b76",
        "Parent": "91bac885982d2d564c0e1869e8b8827c435eead714c06d4c670aaae616c1542c"

This method served Docker well for a sustained period, but over time it was perceived to be sub-optimal for a variety of reasons. One of the big drivers for change was the lack of a means of detecting whether an image's contents had been tampered with during a push to or pull from a registry, such as the Docker Hub. This attracted robust criticism from the community at large, and led to a series of changes, culminating in content addressable IDs.

Content Addressable IDs

Since Docker v1.10, generally, images and layers are no longer synonymous. Instead, an image directly references one or more layers that eventually contribute to a derived container's filesystem.

Layers are now identified by a digest, which takes the form algorithm:hex - in practice, sha256: followed by a 64-character hex hash.

The hex element is calculated by applying the algorithm (SHA256) to a layer's content. If the content changes, then the computed digest will also change, meaning that Docker can check the retrieved contents of a layer against its published digest in order to verify its content. Layers have no notion of an image or of belonging to an image; they are merely collections of files and directories.
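
A sketch of the computation (illustrative content, not a real layer archive):

```python
import hashlib

def layer_digest(content: bytes) -> str:
    # A digest is the algorithm name plus the hex hash of the content
    return "sha256:" + hashlib.sha256(content).hexdigest()

original = layer_digest(b"layer contents")
tampered = layer_digest(b"layer contents!")
# Any change to the content yields a different digest, which is what
# allows Docker to verify a layer against its published digest
```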

A Docker image now consists of a configuration object, which (amongst other things) contains an ordered list of layer digests, which enables the Docker Engine to assemble a container's filesystem with reference to layer digests rather than parent images. The image ID is also a digest, and is a computed SHA256 hash of the image configuration object, which contains the digests of the layers that contribute to the image's filesystem definition. The following diagram depicts the relationship between image and layers post Docker v1.10:

(The digests for the image and layers have been shortened for readability.)
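
The chain of hashes can be sketched with a pared-down configuration object - the real config carries far more metadata and a canonical JSON form, so the field names and resulting hash here are purely illustrative:

```python
import hashlib
import json

# Two made-up layer diffs, identified by their content digests
diff_ids = ["sha256:" + hashlib.sha256(b).hexdigest()
            for b in (b"base layer", b"app layer")]

# A minimal stand-in for the image configuration object
config = {"architecture": "amd64",
          "rootfs": {"type": "layers", "diff_ids": diff_ids}}

# The image ID is the SHA256 hash of the configuration JSON, so it
# changes whenever any referenced layer digest changes
image_id = hashlib.sha256(json.dumps(config).encode()).hexdigest()
```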

The diff directory for storing the layer content is now named after a randomly generated 'cache ID', and the Docker Engine maintains the link between the layer and its cache ID, so that it knows where to locate the layer's content on disk.

So, when a Docker image is pulled from a registry, and the docker history command is used to reveal its contents, the output provides something similar to:

$ docker history swarm
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT  
c54bba046158        9 days ago          /bin/sh -c #(nop) CMD ["--help"]                0 B  
<missing>           9 days ago          /bin/sh -c #(nop) ENTRYPOINT &{["/swarm"]}      0 B  
<missing>           9 days ago          /bin/sh -c #(nop) VOLUME [/.swarm]              0 B  
<missing>           9 days ago          /bin/sh -c #(nop) EXPOSE 2375/tcp               0 B  
<missing>           9 days ago          /bin/sh -c #(nop) ENV SWARM_HOST=:2375          0 B  
<missing>           9 days ago          /bin/sh -c #(nop) COPY dir:b76b2255a3b423981a   0 B  
<missing>           9 days ago          /bin/sh -c #(nop) COPY file:5acf949e76228329d   277.2 kB  
<missing>           9 days ago          /bin/sh -c #(nop) COPY file:a2157cec2320f541a   19.06 MB  

The command provides detail about the image and the layers it is composed of. The <missing> value in the IMAGE field for all but one of the layers is misleading and a little unfortunate. It conveys the suggestion of an error, but there is no error, as layers are no longer synonymous with a corresponding image and ID. I think it would have been more appropriate to have left the field blank. Also, the image ID appears to be associated with the uppermost layer, but in fact the image ID doesn't 'belong' to any of the layers. Rather, the layers collectively belong to the image, and provide its filesystem definition.

Locally Built Images

Whilst this narrative for content addressable images holds true for all Docker images post Docker v1.10, locally built images on a Docker host are treated slightly differently. The generic content of an image built locally remains the same - it is a configuration object containing configuration items, including an ordered list of layer digests.

However, when a layer is committed during an image build on a local Docker host, an 'intermediate' image is created at the same time. Just like all other images, it has a configuration object containing an ordered list of the layer digests that are to be incorporated as part of the image, and its ID or digest is a hash of that configuration object. Intermediate images aren't tagged with a name, but they do have a 'Parent' key, which contains the ID of the parent image.

The purpose of the intermediate images and the reference to parent images, is to facilitate the use of Docker's build cache. The build cache is another important feature of the Docker platform, and is used to help the Docker Engine make use of pre-existing layer content, rather than regenerating the content needlessly for an identical build command. It makes the build process more efficient. When an image is built locally, the docker history command might provide output similar to the following:

$ docker history jbloggs/my_image:latest 
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT  
26cca5b0c787        52 seconds ago      /bin/sh -c #(nop) CMD ["/bin/sh" "-c" "/bin/b   0 B  
97e47fb9e0a6        52 seconds ago      /bin/sh -c apt-get update &&     apt-get inst   16.98 MB  
1742affe03b5        13 days ago         /bin/sh -c #(nop) CMD ["/bin/bash"]             0 B  
<missing>           13 days ago         /bin/sh -c #(nop) ADD file:5d8521419ad6cfb695   125.1 MB  

In this example, the top two layers are created during the local image build, whilst the bottom layers came from the base image for the build (e.g. Dockerfile instruction FROM debian). We can use the docker inspect command to review the layer digests associated with the image:

$ docker inspect jbloggs/my_image:latest 
        "RootFS": {
            "Type": "layers",
            "Layers": [

The docker history command shows the image as having four layers, but docker inspect suggests just three. This is because the two CMD instructions produce only metadata for the image, don't add any content, and therefore the 'diff' is empty. The digest 5f70bf18a086 is the SHA256 hash of an empty layer, and is shared by both of the layers in question.
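
An empty layer 'diff' is just an empty tar archive - two 512-byte end-of-archive blocks of zeros - so its digest can be computed directly. Hashing it should reproduce the 5f70bf18a086... value seen in the docker output above, though treat this as an illustration rather than a guarantee:

```python
import hashlib

# An empty tar archive is 1024 bytes of zeros (two end-of-archive
# blocks); this is the 'diff' committed by metadata-only instructions
empty_tar = bytes(1024)
digest = "sha256:" + hashlib.sha256(empty_tar).hexdigest()
print(digest)
```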

When a locally built image is pushed to a registry, only the leaf image is uploaded along with its constituent layers, and a subsequent pull by another Docker host will not yield any intermediate parent images. This is because, once the image is made available to other potential users on different Docker hosts via a registry, it effectively becomes read-only, and the components that support the build cache are no longer required. <missing> is inserted into the history output in place of the intermediate image IDs.

Pushing the image to a registry might yield:

$ docker push jbloggs/my_image:latest
The push refers to a repository []  
f22bfbc1df82: Pushed  
5f70bf18a086: Layer already exists  
4dcab49015d4: Layer already exists  
latest: digest: sha256:7f63e3661b1377e2658e458ac1ff6d5e0079f0cfd9ff2830786d1b45ae1bb820 size: 3147  

In this example, only one layer has been pushed, as two of the layers already exist in the registry, referenced by one or more other images which use the same content.

A Final Twist

The digests that Docker uses for layer 'diffs' on a Docker host, contain the sha256 hash of the tar archived content of the diff. Before the layer is uploaded to a registry as part of a push, it is compressed for bandwidth efficiency. A manifest is also created to describe the contents of the image, and it contains the digests of the compressed layer content. Consequently, the digests for the layers in the manifest are different to those generated in their uncompressed state. The manifest is also pushed to the registry.
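
The divergence between the two digests is easy to demonstrate with the stdlib (illustrative bytes, not a real layer archive):

```python
import gzip
import hashlib

diff = b"example layer diff contents"

# Content digest: hash of the uncompressed (tar archived) diff
content_digest = "sha256:" + hashlib.sha256(diff).hexdigest()

# Distribution digest: hash of the compressed bytes actually pushed
distribution_digest = "sha256:" + hashlib.sha256(gzip.compress(diff)).hexdigest()

# Compression changes the bytes, so the two digests never match
assert content_digest != distribution_digest
```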

The digest of a compressed layer diff can be referred to as a 'distribution digest', whilst the digest for the uncompressed layer diff can be referred to as a 'content digest'. Hence, when we pull our example image on a different Docker host, the docker pull command gives the following output:

$ docker pull jbloggs/my_image
Using default tag: latest  
latest: Pulling from jbloggs/my_image

51f5c6a04d83: Pull complete  
a3ed95caeb02: Pull complete  
9a246d793396: Pull complete  
Digest: sha256:7f63e3661b1377e2658e458ac1ff6d5e0079f0cfd9ff2830786d1b45ae1bb820  
Status: Downloaded newer image for jbloggs/my_image:latest  

The distribution digests in the output of the docker pull command, are very different to the digests reported by the docker push command. But, the pull will decompress the layers, and the output of a docker inspect command will provide the familiar content digests that we saw after the image build.


Following the changes to image and layer handling in Docker v1.10:

  • A Docker image provides a filesystem for a derived container based on the references it stores to layer diffs
  • Layer diffs are referenced using a digest, which contains an SHA256 hash of an archive of the diff's contents
  • A Docker image's ID is a digest, which contains an SHA256 hash of the image's JSON configuration object
  • Docker creates intermediate images during a local image build, for the purposes of maintaining a build cache
  • An image manifest is created and pushed to a Docker registry when an image is pushed
  • An image manifest contains digests of the image's layers, which contain the SHA256 hashes of the compressed, archived diff contents

Docker Overlay Networking

Everyone knows that in the early days of the Docker platform's existence, more emphasis was placed on the Dev side of the DevOps equation. Effectively, that meant that Docker provided a good experience for developing software applications, but a sub-optimal one for running those applications in production. Nowhere was this more evident than in the native networking capabilities, which limited inter-container communication to a single Docker host (unless you employed some creative glue and sticky tape).

That all changed with Docker's acquisition of SocketPlane, and the subsequent release of Docker 1.9 in November 2015. The team from SocketPlane helped to completely overhaul the platform's networking capabilities, with the introduction of a networking library called libnetwork. Libnetwork implements Docker's Container Network Model (CNM), and via its API, specific networking drivers provide container networking capabilities based on the CNM abstraction. Docker has in-built drivers, but also supports third party plugin drivers, such as Weave Net and Calico.

One of the in-built drivers is the overlay driver, which provides one of the hitherto most sought-after features - cross-host Docker networking for containers. It's based on the VXLAN principle, which encapsulates layer 2 ethernet frames in layer 4 (UDP) packets to enable overlay networking. Let's see how to set this up in Docker.
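
To make the encapsulation concrete, here's a sketch of just the 8-byte VXLAN header that sits at the front of the UDP payload, ahead of the encapsulated ethernet frame (following the RFC 7348 layout; the function itself is illustrative):

```python
import struct

def vxlan_header(vni: int) -> bytes:
    # RFC 7348: 8 bits of flags (0x08 = VNI present), 24 reserved
    # bits, a 24-bit VXLAN Network Identifier, then 8 reserved bits
    return struct.pack("!II", 0x08 << 24, vni << 8)

# The datagram on the wire is then:
#   outer ethernet/IP/UDP headers | vxlan_header(vni) | original frame
header = vxlan_header(42)
```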

To demonstrate the use of overlay networks in Docker, I'll use a variation of Dj Walker-Morgan's goredchat application, a simple chat application that uses the Redis database engine to register chat users, and for routing chat messages. We'll create a Redis instance, and two client sessions using goredchat (all running in containers), but on different Docker hosts and connected via an overlay network.

Establish a Key/Value Store

The first thing we need to do is establish a key/value store that Docker's overlay networking requires - it's used to hold network state. We will use HashiCorp's Consul key/value store (other choices are CoreOS' etcd and Apache Software Foundation's Zookeeper), and the easiest way to do this is to run it in a container on a dedicated VM using Virtualbox. To simplify things, we'll use Docker Machine to create the VM.

Create the VM:

$ docker-machine create -d virtualbox kv-store

Next, we need to point our Docker client at the Docker daemon running on the kv-store VM:

$ eval $(docker-machine env kv-store)

Now we need to start a container running Consul, and we'll use the popular progrium/consul Docker image from the Docker Hub. The container needs some ports forwarded to its VM host, and can be started with the following command:

$ docker run -it -d --restart unless-stopped -p 8400:8400 -p 8500:8500 \
> -p 8600:53/udp -h consul progrium/consul -server -bootstrap

Consul will run in the background, and will be available for storing key/value pairs relating to the state of Docker overlay networks for Docker hosts using the store. When one or more Docker hosts are configured to make use of a key/value store, they often do so as part of a cluster arrangement, but being part of a formal cluster (e.g. Docker Swarm) is not a requirement for participating in overlay networks.

Create Three Additional Docker Hosts

Next, we'll create three more VMs, each running a Docker daemon, which we'll use to host containers that will be connected to the overlay network. Each of the Docker daemons running on these machines needs to be made aware of the KV store, and of each other. To achieve this, we need to configure each Docker daemon with the --cluster-store and --cluster-advertise configuration options, which need to be supplied via the Docker Machine --engine-opt configuration option:

$ for i in {1..3}; do docker-machine create -d virtualbox \
> --engine-opt "cluster-store=consul://$(docker-machine ip kv-store):8500" \
> --engine-opt "cluster-advertise=eth1:2376" \
> host0${i}; done

To see if all the VMs are running as expected:

$ docker-machine ls -f "table {{.Name}}  \t{{.State}}\t{{.URL}}\t{{.DockerVersion}}"
NAME         STATE     URL                         DOCKER  
host01       Running   tcp://   v1.10.2  
host02       Running   tcp://   v1.10.2  
host03       Running   tcp://   v1.10.2  
kv-store     Running   tcp://   v1.10.2  

Create an Overlay Network

We now need to run some Docker CLI commands on each of the Docker hosts we have created. There are numerous ways of doing this:

  1. Establish an ssh session on the VM in question, using the docker-machine ssh command
  2. Point the local Docker client at the Docker host in question, using the docker-machine env command
  3. Run one-time commands against the particular Docker host using the docker-machine config command

The overlay network can be created using any of the Docker hosts:

$ docker $(docker-machine config host01) network create -d overlay my_overlay

We can check that the overlay network my_overlay can be seen from each of the Docker hosts (substituting each individual host for host01):

$ docker $(docker-machine config host01) network ls -f name=my_overlay
NETWORK ID          NAME                DRIVER  
0223fc182bd3        my_overlay          overlay  

Create a Container Running the Redis KV Database Engine

Having created the overlay network, we now need to start a Redis server running in a container on Docker host host01, using the library image found on the Docker Hub registry.

$ docker $(docker-machine config host01) run -d --restart unless-stopped \
> --net-alias redis_svr --net my_overlay redis:alpine redis-server \
> --appendonly yes

The library Redis image will be pulled from the Docker Hub registry, and the Docker CLI will return the container ID, e.g.


In order to connect the container to the my_overlay network, the docker run client command is given the --net my_overlay configuration option, along with --net-alias redis_svr, which provides a network-specific alias for the container, redis_svr. The alias can be used to look up the container. The Redis library image exposes port 6379, which will be accessible to containers connected to the my_overlay network. To test whether the redis_svr container is listening for connections, we can run the Redis client in an ephemeral container, also on host01:

$ docker $(docker-machine config host01) run --rm --net my_overlay \
> redis:alpine redis-cli -h redis_svr ping

Notice that Docker's embedded DNS server resolves the redis_svr name, and the Redis server responds to the Redis client's ping with a PONG.

Create a Container Running goredchat on host02

Now that we have established that a Redis server container is connected to the my_overlay network and listening on port 6379, we can attempt to consume its service from a container running on a different Docker host.

The goredchat image can be found on the Docker Hub registry, and can be run like a binary, with command line options. To find out how it functions, I can run the following command:

$ docker $(docker-machine config host02) run --rm --net my_overlay nbrown/goredchat --help
Usage: /goredchat [-r URL] username  
  e.g. /goredchat -r redis://redis_svr:6379 antirez

  If -r URL is not used, the REDIS_URL env must be set instead

Now let's run a container using the -r configuration option to address the Redis server. If you're re-creating these steps in your own environment, don't forget to add the -it configuration options to enable you to interact with the container:

$ docker $(docker-machine config host02) run -it --rm --net my_overlay \
> nbrown/goredchat -r redis://redis_svr:6379 bill

Welcome to goredchat bill! Type /who to see who's online, /exit to exit.  

I can find out who's online by typing /who:


The goredchat client has created a TCP socket connection from the container on host02 to the Redis server container on host01 over the my_overlay network, and queried the Redis database engine.

Create a Container Running goredchat on host03

In another bash command shell, we can start another instance of goredchat, this time on host03.

$ docker $(docker-machine config host03) run -it --rm --net my_overlay \
> nbrown/goredchat -r redis://redis_svr:6379 brian

Welcome to goredchat brian! Type /who to see who's online, /exit to exit.


Brian and Bill can chat through the goredchat client, which uses a Redis server for subscribing and publishing to a message channel. All components of the chat service run in containers, on different Docker hosts, but connected to the same overlay network.

Docker's new networking capabilities have some detractors, but libnetwork is a significant step forward, is evolving, and is designed to support multiple use cases, whilst maintaining a consistent and familiar user experience.