Namespaces

Part 7 - A Basic Container

We've looked at five of the six available namespaces provided by the Linux kernel in previous articles in this series, and we'll take a look at the final namespace, the USER namespace, in a future article. This article looks at how we can combine a number of the namespaces with a specially prepared directory, in which we'll 'jail' our process using the chroot system call. Although our implementation will be missing a few key features that normally accompany container implementations (e.g. cgroups), the resulting environment in which our process will run can be considered a very rudimentary container of sorts: it isolates the process from several system resources, and confines it within a limited filesystem.

The first thing we need to do is prepare a directory on the host, located at /var/local/jail, which will become the root filesystem for the container. We're going to provide just a few binaries for the container to use: env, bash, ps, ls and top.

It's not simply a matter of copying the binaries to /var/local/jail: each binary relies on shared libraries, and we need to ensure these are also available in the appropriate directory in the container's filesystem. To do this, we can make use of the ldd command, whose purpose is to report the shared libraries used by a particular binary. I've created a script called binlibdepcp.sh, which takes care of determining the library dependencies for a binary, and then copies the binary along with the relevant libraries to the correct locations in the container's filesystem. It also copies the ld.so.cache file, the cache of libraries found in the configured library directories, which the dynamic linker consults in the event that a required library does not reside in /lib or /usr/lib. The script is available on GitHub.
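For example, running ldd against the env binary shows which libraries need to follow it into the jail. The output below is illustrative; paths and load addresses vary from distro to distro:

$ ldd /usr/bin/env
    linux-vdso.so.1 =>  (0x00007ffc...)
    libc.so.6 => /lib64/libc.so.6 (0x00007f...)
    /lib64/ld-linux-x86-64.so.2 (0x00007f...)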

Let's demonstrate this for the env binary, which is located at /usr/bin/env. Having previously created the /var/local/jail directory, the env binary and libraries are copied to the correct location under /var/local/jail with the following:

$ sudo ./binlibdepcp.sh /usr/bin/env /var/local/jail
[sudo] password for wolf:
Copying ...

                      env : [OK]
                libc.so.6 : [OK]
     ld-linux-x86-64.so.2 : [OK]
              ld.so.cache : [OK]

...... Done

We can repeat this exercise for the other binaries we intend to use within the container, which, more likely than not, are located in either /usr/bin or /bin. Additionally, so that our commands display nicely when we run them inside the container, we need to provide the relevant portion of the terminfo database. Assuming we're using an xterm, we can copy this into the jail (the location of the terminfo database varies from Linux distro to distro, so make sure to copy the files to the correct directory under /var/local/jail):

$ sudo mkdir -p /var/local/jail/lib/terminfo/x
[sudo] password for wolf:
$ sudo cp -p /lib/terminfo/x/* /var/local/jail/lib/terminfo/x

That's the container's filesystem prepared. Now we need to amend the program we have slowly been developing whilst we've been exploring namespaces. The changes are available in the invoke_ns6.c source file, which can be found here.

The first change is to add a new command line option, -c, which must be accompanied by a directory path, which will be the root of the jail:

// Parse command line options and construct arguments
// to be passed to childFunction
while ((option = getopt(argc, argv, "+hvpmu:ni:c:")) != -1) {  
    switch (option) {
        case 'c':
            args.jail = 1;
            args.path = malloc(sizeof(char) * (strlen(optarg) + 1));
            strcpy(args.path, optarg);
            break;
        case 'i':
            if (strcmp("no", optarg) != 0 && strcmp("yes", optarg) != 0) {
                fprintf(stderr, "%s: option requires valid argument -- 'i'\n", argv[0]);
                usage(argv[0]);
                exit(EXIT_FAILURE);
            }
            else
                if (strcmp("yes", optarg) == 0)
                    flags |= CLONE_NEWIPC;
            args.ipc = 1;
            break;
        case 'n':
            flags |= CLONE_NEWNET;
            break;
        case 'u':
            flags |= CLONE_NEWUTS;
            args.hostname = malloc(sizeof(char) * (strlen(optarg) + 1));
            strcpy(args.hostname, optarg);
            break;
        case 'm':
            flags |= CLONE_NEWNS;
            break;
        case 'p':
            flags |= CLONE_NEWPID;
            break;
        case 'v':
            args.verbose = 1;
            break;
        case 'h':
            usage(argv[0]);
            exit(EXIT_SUCCESS);
        default:
            usage(argv[0]);
            exit(EXIT_FAILURE);
    }
}

The other main change is to add some code to ensure that, if the -c option is supplied, the cloned child process is jailed inside the directory with the chroot system call. The chroot system call changes the root directory of the child process; we then change the current working directory to that new root, and create a /proc directory within it:

// If specified, place process in chroot jail
if (args->jail) {  
    if (args->verbose)
        printf(" Child: creating chroot jail\n");
    if (chroot(args->path) == -1) {
        perror(" Child: childFunction: chroot");
        exit(EXIT_FAILURE);
    }
    else {
        if (args->verbose)
            printf(" Child: changing directory into chroot jail\n");
        if (chdir("/") == -1) {
            perror(" Child: childFunction: chdir");
            exit(EXIT_FAILURE);
        }
        if (access("/proc", F_OK) != 0)
            if (mkdir("/proc", 0555) == -1) {
                perror(" Child: childFunction: mkdir");
                exit(EXIT_FAILURE);
            }
    }
}

We can now invoke our container (with a customised command prompt) with the following command:

$ sudo ./invoke_ns -vpmu calculus -c /var/local/jail env PS1="\h [\W] " TERM=$TERM bash --norc
[sudo] password for wolf:
calculus [/]  

Now that we have an interactive bash command shell running inside the container, we can use the ls, ps and top commands to verify that we have a very minimal operating environment, if not a very useful one! It doesn't take much imagination, however, to see the possibilities for containing independent workloads in minimal, lightweight containers.
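For instance, a process listing from inside the jail might look something like the following (illustrative output; it relies on the new procfs our program mounts when both the PID and MNT namespace options are given):

calculus [/] ps
  PID TTY          TIME CMD
    1 ?        00:00:00 bash
    4 ?        00:00:00 ps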

In reality, a process inside a container needs a few more things than we have provided in our rudimentary version. Thankfully, the excellent work that has been conducted in the open source community on projects like Docker and LXC has taken the hard work out of creating and manipulating workloads within containers.

Part 6 - IPC Namespace

The IPC namespace is used for isolating System V IPC objects and POSIX message queues. The clone flag used to achieve this is CLONE_NEWIPC. We've adapted our program from previous articles to create a POSIX message queue, which will be used to pass a message between a process and its cloned child process.

To demonstrate how the IPC namespace establishes isolation of a POSIX message queue, we'll create a message queue in the parent process, and attempt to open it from within the child process, in order to send a message to the parent process. We'll do this once with the cloned child process remaining in the global IPC namespace, and then again with it created within a new IPC namespace.

So, first off, we need to embellish the program, which is available here (invoke_ns5.c), with a new command line option, -i. The -i option must be accompanied by a string of either 'yes' or 'no'. If 'yes' is specified, the child process will be cloned in a new IPC namespace; if 'no' is specified, it remains in the global IPC namespace.

The command line parsing now looks like this:

// Parse command line options and construct arguments
// to be passed to childFunction
while ((option = getopt(argc, argv, "+hvpmu:ni:")) != -1) {  
    switch (option) {
        case 'i':
            if (strcmp("no", optarg) != 0 && strcmp("yes", optarg) != 0) {
                fprintf(stderr, "%s: option requires valid argument -- 'i'\n", argv[0]);
                usage(argv[0]);
                exit(EXIT_FAILURE);
            }
            else
                if (strcmp("yes", optarg) == 0)
                    flags |= CLONE_NEWIPC;
            args.ipc = 1;
            break;
        case 'n':
            flags |= CLONE_NEWNET;
            break;
        case 'u':
            flags |= CLONE_NEWUTS;
            args.hostname = malloc(sizeof(char) * (strlen(optarg) + 1));
            strcpy(args.hostname, optarg);
            break;
        case 'm':
            flags |= CLONE_NEWNS;
            break;
        case 'p':
            flags |= CLONE_NEWPID;
            break;
        case 'v':
            args.verbose = 1;
            break;
        case 'h':
            usage(argv[0]);
            exit(EXIT_SUCCESS);
        default:
            usage(argv[0]);
            exit(EXIT_FAILURE);
    }
}

We also need to define the message queue name:

// Define message queue name
const char *mq_name = "/ipc_namespace";  

If the -i command line option has been specified, before making the call to clone, the parent calls the prepareMQ function:

// Prepare message queue
if (args.ipc) {  
    mq = prepareMQ(&args);
    if (mq == -1)
        exit(EXIT_FAILURE);
}

The prepareMQ function creates a message queue (in the global IPC namespace), and opens it for reading and writing, with a maximum message size of 81 bytes (80 characters plus a terminating byte):

// Prepare message queue
mqd_t prepareMQ(void *child_args)  
{
    struct arguments *args = child_args;
    mqd_t            mq;
    int              oflags = O_CREAT | O_RDWR;
    struct mq_attr   attr;

    attr.mq_flags   = 0;
    attr.mq_maxmsg  = 10;
    attr.mq_msgsize = 81;
    attr.mq_curmsgs = 0;

    mq = mq_open(mq_name, oflags, 0644, &attr);
    if (mq != -1) {
        if (args->verbose)
            printf("Parent: opening message queue %s\n", mq_name);
    }
    else
        perror("Parent: prepareMQ: mq_open");

    return mq;
}

Once the message queue has been established, the child process is created with the clone system call, and the parent then waits up to 60 seconds for a message from the child process before timing out. If a message is received from the child, the parent echoes it to stdout:

// Read message from child on message queue
if (args.ipc) {  
    if (clock_gettime(CLOCK_REALTIME, &timeout) == 0) {
        timeout.tv_sec += 60;
        if (mq_getattr(mq, &attr) != -1) {
            msg = malloc(attr.mq_msgsize);
            if (mq_timedreceive(mq, msg, attr.mq_msgsize, NULL, &timeout) != -1) {
                if (args.verbose)
                    printf("Parent: received message from child\n");
                printf("\n    Parent: the following message was received from the child\n     >> %s\n\n", msg);
            }
            else
                perror("Parent: main: mq_timedreceive");
            free(msg);
        }
        else
            perror("Parent: main: mq_getattr");
    }
    else
        perror("Parent: main: clock_gettime");
}

Once the child process has terminated, the parent closes and removes the message queue:

// Remove message queue
if (args.ipc) {  
    if (args.verbose)
        printf("Parent: closing message queue %s\n", mq_name);
    if (mq_close(mq) == -1)
        perror("Parent: main: mq_close");
    if (args.verbose)
        printf("Parent: removing message queue %s\n", mq_name);
    if (mq_unlink(mq_name) == -1)
        perror("Parent: main: mq_unlink");
}

As for the cloned child, we have added some code to childFunction, which is executed when the child process is created. If the -i command line option has been specified, the child attempts to open the message queue created by the parent, and prompts the user to enter a message, which is then sent to the message queue. If the child gets this far, it closes the message queue and continues to process the other namespace requirements:

// Send message to parent if -i option provided
if (args->ipc) {  
    if (args->verbose)
        printf(" Child: opening message queue %s\n", mq_name);
    mq = mq_open(mq_name, oflags);
    if (mq != -1) {
        if (mq_getattr(mq, &attr) != -1) {
            msg = malloc(attr.mq_msgsize);
            printf("\n     Child: enter a message to send to the parent process (MAX 80 chars)\n     >> ");
            if (fgets(msg, attr.mq_msgsize, stdin) != NULL) {
                msg[strcspn(msg, "\n")] = '\0';
                printf("\n");
                if (args->verbose)
                    printf(" Child: sending message to parent\n");
                if (mq_send(mq, msg, strlen(msg) + 1, 0) == -1)
                    perror(" Child: childFunction: mq_send");
            }
            else
                perror(" Child: childFunction: fgets");
            if (args->verbose)
                printf(" Child: closing message queue %s\n", mq_name);
            if (mq_close(mq) == -1)
                perror(" Child: childFunction: mq_close");
            free(msg);
        }
        else
            perror(" Child: childFunction: mq_getattr");
    }
    else
        perror(" Child: childFunction: mq_open");
}

Now let's see what happens when we execute the program. First, we need to compile it, remembering to specify -lrt so that the POSIX real-time library, which contains the message queue API, is linked into the executable.
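Assuming gcc and the source file name used above, the build step looks something like this:

$ gcc invoke_ns5.c -o invoke_ns -lrt

In the first instance we'll specify -i no, which means the cloned child process remains in the same IPC namespace as its parent, the global IPC namespace: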

$ sudo ./invoke_ns -vi no
[sudo] password for wolf:
Parent: PID of parent is 11944  
Parent: opening message queue /ipc_namespace  
Parent: PID of child is 11945  
 Child: PID of child is 11945
 Child: opening message queue /ipc_namespace

     Child: enter a message to send to the parent process (MAX 80 chars)
     >>

At this point, we need to type a message of up to 80 characters, perhaps ....

I will return to haunt you with peculiar piano riffs  

After hitting return, the message is sent to the message queue, which the parent reads and echoes to stdout. In this instance, because we have only specified the -i no option, the child process then exits, which triggers the parent to close and remove the message queue before exiting itself:

     Child: enter a message to send to the parent process (MAX 80 chars)
     >> I will return to haunt you with peculiar piano riffs

 Child: sending message to parent
 Child: closing message queue /ipc_namespace
Parent: received message from child

    Parent: the following message was received from the child
     >> I will return to haunt you with peculiar piano riffs

Parent: closing message queue /ipc_namespace  
Parent: removing message queue /ipc_namespace  
Parent: ./invoke_ns - Finishing up  

If we now run the program again, but this time specify -i yes as the command line option, the child process will be created in a new IPC namespace and will not be able to open the message queue, because it can't see it:

$ sudo ./invoke_ns -vi yes
[sudo] password for wolf:
Parent: PID of parent is 12508  
Parent: opening message queue /ipc_namespace  
Parent: PID of child is 12509  
 Child: PID of child is 12509
 Child: opening message queue /ipc_namespace
 Child: childFunction: mq_open: No such file or directory

After 60 seconds, the parent's wait for reading the message queue times out:

Parent: main: mq_timedreceive: Connection timed out  
Parent: closing message queue /ipc_namespace  
Parent: removing message queue /ipc_namespace  
Parent: ./invoke_ns - Finishing up  

Each IPC namespace has its own message queue filesystem, which is only visible to processes residing in the same IPC namespace.
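In fact, we can see the parent's queue as a file, because the kernel exposes each IPC namespace's message queues via the mqueue filesystem, as described in mq_overview(7). While the program is waiting for input during the first run, mounting and listing that filesystem from another shell in the global IPC namespace should show something like the following (/dev/mqueue is the conventional mount point, and may already be mounted on some distros):

$ sudo mkdir -p /dev/mqueue
$ sudo mount -t mqueue none /dev/mqueue
$ ls /dev/mqueue
ipc_namespace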

In the next article, we'll look at cloning a process into several namespaces, along with a minimal chroot filesystem, in order to create a 'rustic' container.

Part 5 - NET Namespace

So far in this series, we've looked at isolating processes in PID, MNT and UTS namespaces. The next namespace in this sequence is the NET namespace, which allows you to isolate a process in terms of its network stack. That is, a process that is cast into a new NET namespace using one of the system calls clone or unshare, or into an existing NET namespace with setns, is isolated from the host's network stack, and instead enjoys its own private network stack associated with the namespace. The resources that are isolated include routes, firewall rules, network devices, and ports.

We've been developing a program to demonstrate how each of the namespaces works, and the changes that incorporate the NET namespace can be found in invoke_ns4.c, which is available here. The changes made since the last iteration are trivial: we've added a new command line option, -n, which specifies that the cloned process is to be created in a new NET namespace, and we add the CLONE_NEWNET constant to the clone flags when parsing the command line options:

// Parse command line options and construct arguments
// to be passed to childFunction
while ((option = getopt(argc, argv, "+hvpmu:n")) != -1) {  
    switch (option) {
        case 'n':
            flags |= CLONE_NEWNET;
            break;
        case 'u':
            flags |= CLONE_NEWUTS;
            args.hostname = malloc(sizeof(char) * (strlen(optarg) + 1));
            strcpy(args.hostname, optarg);
            break;
        case 'm':
            flags |= CLONE_NEWNS;
            break;
        case 'p':
            flags |= CLONE_NEWPID;
            break;
        case 'v':
            args.verbose = 1;
            break;
        case 'h':
            usage(argv[0]);
            exit(EXIT_SUCCESS);
        default:
            usage(argv[0]);
            exit(EXIT_FAILURE);
    }
}

Once the new version of the program has been compiled, we can invoke it with the following command line options and arguments:

$ sudo ./invoke_ns -vpmnu calculus env PS1="\h\[\] [\[\]\w\[\]] " PATH=$PATH bash --norc
[sudo] password for wolf:
Parent: PID of parent is 6142  
Parent: PID of child is 6143  
 Child: PID of child is 1
 Child: Executing command bash ...
calculus [/home/wolf]  

The cloned process is created in a number of namespaces, including a new NET namespace. Just as we've done in the previous examples, we can confirm that the process resides in its own NET namespace, distinct from the global NET namespace of all the other processes running on the system:

$ sudo ls -l /proc/6142/ns && sudo ls -l /proc/6143/ns
[sudo] password for wolf:
total 0  
lrwxrwxrwx 1 root root 0 Mar 12 12:21 ipc -> ipc:[4026531839]  
lrwxrwxrwx 1 root root 0 Mar 12 12:21 mnt -> mnt:[4026531840]  
lrwxrwxrwx 1 root root 0 Mar 12 12:21 net -> net:[4026531956]  
lrwxrwxrwx 1 root root 0 Mar 12 12:21 pid -> pid:[4026531836]  
lrwxrwxrwx 1 root root 0 Mar 12 12:21 user -> user:[4026531837]  
lrwxrwxrwx 1 root root 0 Mar 12 12:21 uts -> uts:[4026531838]  
total 0  
lrwxrwxrwx 1 root root 0 Mar 12 12:21 ipc -> ipc:[4026531839]  
lrwxrwxrwx 1 root root 0 Mar 12 12:21 mnt -> mnt:[4026532430]  
lrwxrwxrwx 1 root root 0 Mar 12 12:21 net -> net:[4026532434]  
lrwxrwxrwx 1 root root 0 Mar 12 12:21 pid -> pid:[4026532432]  
lrwxrwxrwx 1 root root 0 Mar 12 12:21 user -> user:[4026531837]  
lrwxrwxrwx 1 root root 0 Mar 12 12:21 uts -> uts:[4026532431]  

At the bash command prompt of the cloned process, we can see what the new NET namespace provides the process:

calculus [/home/wolf] ip link show  
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT qlen 1  
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

The kernel has simply provided a loopback interface, which we can bring up with ip link set lo up.

This is very spartan, and it is up to the initiator to configure a meaningful network stack for the namespace. Assuming that we want any processes running in the new NET namespace to be able to communicate beyond the boundaries of the namespace, we need to establish a 'tunnel' between the global NET namespace and our newly created NET namespace.
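The routing table is just as bare as the interface list; before any configuration, listing the routes from within the new namespace produces no output at all:

calculus [/home/wolf] ip route show  
calculus [/home/wolf]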

To do this, we can create a pair of virtual ethernet interfaces (a veth pair), one of which will reside in the new NET namespace, and the other in the global NET namespace. The two interfaces are linked to one another like a pipe: whatever appears at one end finds its way to the other.

The veth pair is created in the global NET namespace, and then we move one of the interfaces into our new NET namespace. We'll call the interface that will remain in the global NET namespace 'host', whilst the interface that is to be moved to the new NET namespace will be called 'guest'. In a bash command shell residing in the global NET namespace, we use the following to create the veth pair:

$ sudo ip link add host type veth peer name guest

When this is done, we can see the newly created veth pair by listing the interfaces available in the global NET namespace:

$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT qlen 1  
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000  
    link/ether 08:00:27:fd:d7:6a brd ff:ff:ff:ff:ff:ff
3: guest@host: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000  
    link/ether 4a:47:28:07:d4:d6 brd ff:ff:ff:ff:ff:ff
4: host@guest: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000  
    link/ether de:7b:af:36:05:e3 brd ff:ff:ff:ff:ff:ff

In order to move the 'guest' interface into the new NET namespace, we need to use the PID of the cloned child process as it appears in the global PID namespace, that is 6143. The command simply moves the interface called 'guest' into the NET namespace in which PID 6143 resides:

$ sudo ip link set guest netns 6143

We can check that this has been accomplished by issuing the ip link show command in both NET namespaces. In the global NET namespace:

$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT qlen 1  
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 1000  
    link/ether 08:00:27:fd:d7:6a brd ff:ff:ff:ff:ff:ff
4: host@if3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000  
    link/ether de:7b:af:36:05:e3 brd ff:ff:ff:ff:ff:ff link-netnsid 0

And in the new NET namespace:

calculus [/home/wolf] ip link show  
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT qlen 1  
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: guest@if4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000  
    link/ether 4a:47:28:07:d4:d6 brd ff:ff:ff:ff:ff:ff link-netnsid 0

We can assign an IP address to each of the interfaces in the veth pair. In the new NET namespace, we can execute the following:

calculus [/home/wolf] ip addr add 10.0.0.2/24 dev guest  
calculus [/home/wolf] ip link set guest up  
calculus [/home/wolf] ip addr show guest  
3: guest@if4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN qlen 1000  
    link/ether 4a:47:28:07:d4:d6 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.0.2/24 scope global guest
       valid_lft forever preferred_lft forever

And in the global NET namespace:

$ sudo ip addr add 10.0.0.1/24 dev host
$ sudo ip link set host up
$ ip addr show host
4: host@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000  
    link/ether de:7b:af:36:05:e3 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.0.1/24 scope global host
       valid_lft forever preferred_lft forever
    inet6 fe80::dc7b:afff:fe36:5e3/64 scope link 
       valid_lft forever preferred_lft forever

Finally, we can check that each interface is reachable from the other by pinging from each end.
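From within the new NET namespace, for example, the global end of the pair should now respond (illustrative output):

calculus [/home/wolf] ping -c 2 10.0.0.1  
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.087 ms
64 bytes from 10.0.0.1: icmp_seq=2 ttl=64 time=0.062 ms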

In order to make a NET namespace more useful, additional work would need to be carried out to establish a suitable network configuration for a specific purpose, e.g. creating a bridge on the host for bridging to a physical network interface.
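As a rough sketch of that idea, and assuming a bridge named br0, the host-side commands might look something like the following; the details (addressing, forwarding, firewall rules) will vary, and enslaving a physical interface such as enp0s3 will disrupt any networking currently configured on it:

$ sudo ip link add br0 type bridge
$ sudo ip link set br0 up
$ sudo ip link set host master br0
$ sudo ip link set enp0s3 master br0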

In the next article, we'll take a look at the IPC namespace.


Additional notes on NET namespaces

Physical network interfaces can only exist in one NET namespace at a time, and are returned to the global NET namespace when the NET namespace they reside in terminates.

Part 4 - UTS Namespace

The UTS namespace is used to isolate two specific elements of the system that relate to the uname system call. UTS is an abbreviation of UNIX Time Sharing, a term that dates back to the fledgling days of UNIX, when multi-user, multi-tasking operating systems were a novelty.

[Image: Ken Thompson & Dennis Ritchie]

The UTS namespace is named after the data structure (struct utsname) used to store the information returned by the uname system call. Specifically, the UTS namespace isolates the hostname and the NIS domain name. NIS, an abbreviation of Network Information Service, is a legacy directory service created by Sun Microsystems. In short, the UTS namespace is about isolating hostnames.
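To make this concrete, here's a minimal standalone sketch (separate from our invoke_ns program) that prints the two fields the UTS namespace isolates. Note that the domainname field of struct utsname is a GNU extension, hence the _GNU_SOURCE definition:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/utsname.h>

int main(void)
{
    struct utsname uts;

    // uname reports the values for the calling process's UTS namespace
    if (uname(&uts) == -1) {
        perror("uname");
        return 1;
    }
    printf("hostname   : %s\n", uts.nodename);
    printf("domainname : %s\n", uts.domainname);   // the NIS domain name
    return 0;
}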

To demonstrate its use, the program we used in the last article about the MNT namespace has been further adapted as invoke_ns3.c, and is available here. Unsurprisingly, we've added a new command line option to our program, -u, which must be accompanied by a string that will be used to set the hostname in our newly created UTS namespace:

$ ./invoke_ns -h
Usage: ./invoke_ns [options] [cmd [arg...]]  
Options can be:  
    -h           display this help message
    -v           display verbose messages
    -p           new PID namespace
    -m           new MNT namespace
    -u hostname  new UTS namespace with associated hostname

We've added some code to parse the command line when -u is specified, which adds the CLONE_NEWUTS constant to the clone flags, and stores the hostname in some dynamically allocated memory. The pointer to that memory is passed, as part of the args structure, to the clone system call, which provides the arguments to childFunction:

// Parse command line options and construct arguments
// to be passed to childFunction
while ((option = getopt(argc, argv, "+hvpmu:")) != -1) {  
    switch (option) {
        case 'u':
            flags |= CLONE_NEWUTS;
            args.hostname = malloc(sizeof(char) * (strlen(optarg) + 1));
            strcpy(args.hostname, optarg);
            break;
        case 'm':
            flags |= CLONE_NEWNS;
            break;
        case 'p':
            flags |= CLONE_NEWPID;
            break;
        case 'v':
            args.verbose = 1;
            break;
        case 'h':
            usage(argv[0]);
            exit(EXIT_SUCCESS);
        default:
            usage(argv[0]);
            exit(EXIT_FAILURE);
    }
}

In order to set the hostname inside the new UTS namespace, we need to add some code to childFunction using the sethostname system call:

// Set new hostname in UTS namespace if applicable
if (args->flags & CLONE_NEWUTS)  
    if (sethostname(args->hostname, strlen(args->hostname)) == -1)
        perror(" Child: childFunction: sethostname");

Having compiled invoke_ns3.c, we can run the program with sudo ./invoke_ns -vpmu calculus env PS1='\h\[\] [\[\]\w\[\]] ' bash, which sets the hostname inside the new UTS namespace to 'calculus'. As part of the command specified for the cloned process to run, we've set the bash shell prompt (using the environment variable PS1) to include the hostname, which is, of course, 'calculus'. We can also use the hostname command to echo the hostname set for the UTS namespace:

$ sudo ./invoke_ns -vpmu calculus env PS1='\h\[\] [\[\]\w\[\]] ' bash --norc
Parent: PID of parent is 16375  
Parent: PID of child is 16376  
 Child: PID of child is 1
 Child: executing command env ...
calculus [/home/wolf] hostname  
calculus  

Hence, it's possible to give processes running on a given host system an entirely different identity in terms of hostname, provided they exist in their own unique UTS namespace.

The next article will cover the NET namespace.

Part 3 - MNT Namespace

In the last article about namespaces, we looked at the PID namespace. This time, we'll take a look at the MNT namespace.

MNT namespaces isolate a set of mount points for a process or processes in a given namespace, providing the opportunity for different processes on a system to have different views of the host's filesystem. When a new MNT namespace is created with either of the system calls clone or unshare, it is initialised with a copy of the set of mounts that exist in the MNT namespace where the calling process resides. Any subsequent changes to the set of mounts in the 'parent' or 'child' MNT namespace are not reflected in the other, unless a given mount is a shared mount.
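We can see this isolation without writing any code, using the unshare command from util-linux; a tmpfs mounted inside the new MNT namespace never appears in the original one. The session below is illustrative (recent versions of unshare make the copied mounts private by default):

$ sudo unshare --mount --propagation private bash
# mount -t tmpfs tmpfs /mnt
# findmnt /mnt
TARGET SOURCE FSTYPE OPTIONS
/mnt   tmpfs  tmpfs  rw,relatime
# exit
$ findmnt /mnt
$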

In the article where we discussed PID namespaces, we discovered that the process we cloned into a new PID namespace appeared as if it resided in the PID namespace of its parent. This is because the parent and child processes shared the same MNT namespace, and hence the same view of the filesystem (including the procfs mounted at /proc), so when we listed the contents of /proc/$$/ns for the process in the new PID namespace (which has PID 1), we actually listed the namespaces of the 'init' process in the global PID namespace! That's very confusing.

To get around this problem, as well as cloning our process into a new PID namespace, we'll also create a new MNT namespace for it at the same time, and mount a new procfs. The program we started with in the previous article has been adapted as invoke_ns2.c, and can be downloaded here.

There are a couple of trivial changes; the first is the addition of the -m command line option to specify creation of a new MNT namespace, and the second is the parsing of that option, which adds the CLONE_NEWNS constant to the clone flags:

// Parse command line options and construct arguments
// to be passed to childFunction
while ((option = getopt(argc, argv, "+hvpm")) != -1) {  
    switch (option) {
        case 'm':
            flags |= CLONE_NEWNS;
            break;
        case 'p':
            flags |= CLONE_NEWPID;
            break;
        case 'v':
            args.verbose = 1;
            break;
        case 'h':
            usage(argv[0]);
            exit(EXIT_SUCCESS);
        default:
            usage(argv[0]);
            exit(EXIT_FAILURE);
    }
}

Notice that the clone flag constant for the MNT namespace does not follow the convention associated with the constants for the other namespace types, i.e. CLONE_NEWTYPE. Instead of CLONE_NEWMNT, it is CLONE_NEWNS; the MNT namespace was the first to appear in the kernel, which suggests that no other namespaces were planned at that point in time.

The bigger change to the program, is the following addition to the childFunction, which gets called by the child process:

// Mount new proc instance in new mount namespace if and only if
// the child exists in both a new PID and MNT namespace
if ((args->flags & CLONE_NEWPID) && (args->flags & CLONE_NEWNS)) {  
    if (mount("none", "/proc", "", MS_REC|MS_PRIVATE, NULL) == -1)
        perror(" Child: childFunction: mount");
    if (mount("proc", "/proc", "proc", 0, NULL) == -1)
        perror(" Child: childFunction: mount");
}

This code recursively marks the /proc mount private in the new namespace (systemd makes all mounts 'shared' by default), and then mounts a new procfs at /proc within the newly created MNT namespace. Having compiled the modified code contained in invoke_ns2.c, we can run the program with the additional argument:

$ sudo ./invoke_ns -vpm env PS1="ns # " bash --norc
[sudo] password for wolf: 
Parent: PID of parent is 16179  
Parent: PID of child is 16180  
 Child: PID of child is 1
 Child: executing command env ...
ns #  

On the face of it, nothing is any different to the version we ran when we were looking at PID namespaces. Listing the namespace directories for the parent and child in the global namespace yields:

$ sudo ls -l /proc/16179/ns && sudo ls -l /proc/16180/ns
[sudo] password for wolf: 
total 0  
lrwxrwxrwx. 1 root root 0 Sep  7 13:07 ipc -> ipc:[4026531839]  
lrwxrwxrwx. 1 root root 0 Sep  7 13:07 mnt -> mnt:[4026531840]  
lrwxrwxrwx. 1 root root 0 Sep  7 13:07 net -> net:[4026531956]  
lrwxrwxrwx. 1 root root 0 Sep  7 13:07 pid -> pid:[4026531836]  
lrwxrwxrwx. 1 root root 0 Sep  7 13:07 user -> user:[4026531837]  
lrwxrwxrwx. 1 root root 0 Sep  7 13:07 uts -> uts:[4026531838]  
total 0  
lrwxrwxrwx. 1 root root 0 Sep  7 13:07 ipc -> ipc:[4026531839]  
lrwxrwxrwx. 1 root root 0 Sep  7 13:07 mnt -> mnt:[4026532118]  
lrwxrwxrwx. 1 root root 0 Sep  7 13:07 net -> net:[4026531956]  
lrwxrwxrwx. 1 root root 0 Sep  7 13:07 pid -> pid:[4026532173]  
lrwxrwxrwx. 1 root root 0 Sep  7 13:07 user -> user:[4026531837]  
lrwxrwxrwx. 1 root root 0 Sep  7 13:07 uts -> uts:[4026531838]  

As expected, the parent and child processes reside in different PID and MNT namespaces. When listing the namespaces from the bash command shell prompt of the child process within the new namespaces, however, we get what we expect this time:

ns # ls -l /proc/$$/ns  
total 0  
lrwxrwxrwx. 1 root root 0 Sep  7 13:09 ipc -> ipc:[4026531839]  
lrwxrwxrwx. 1 root root 0 Sep  7 13:09 mnt -> mnt:[4026532118]  
lrwxrwxrwx. 1 root root 0 Sep  7 13:09 net -> net:[4026531956]  
lrwxrwxrwx. 1 root root 0 Sep  7 13:09 pid -> pid:[4026532173]  
lrwxrwxrwx. 1 root root 0 Sep  7 13:09 user -> user:[4026531837]  
lrwxrwxrwx. 1 root root 0 Sep  7 13:09 uts -> uts:[4026531838]  

PID 1 within the new PID and MNT namespaces reports exactly the same namespaces as PID 16180 does in the global namespaces; they are representations of the same process. Additionally, if we do a process listing inside the new namespaces, we see just two processes, the bash command shell and ps itself:

ns # ps  
  PID TTY          TIME CMD
    1 pts/0    00:00:00 bash
    3 pts/0    00:00:00 ps

In the next article, we'll take a look at the UTS namespace.