PID Namespace

A PID namespace isolates process ID (PID) numbers, allowing different processes running on the same Linux system to have the same PID. Processes with the same PID don’t clash, however, because they exist in different PID namespaces and are isolated from each other.

To demonstrate how PID namespaces work, we’ll use a C program built around the clone system call, which can be downloaded from GitHub. We’ll develop the program in future articles to demonstrate the other namespaces. Let’s take a look at the key parts of the first iteration of the program, invoke_ns1.c.

First, compile it with gcc -o invoke_ns ./invoke_ns1.c. Running ./invoke_ns -h then shows how the program is used:

Usage: ./invoke_ns [options] [cmd [arg...]]
Options can be:
    -h           display this help message
    -v           display verbose messages
    -p           new PID namespace

As well as the -h option, the program also takes a -p option to specify the creation of a new PID namespace for the new cloned process, and a -v option to specify verbosity, which is useful for understanding what is happening as the program executes. The program also takes an optional command and associated arguments to execute in the cloned process within the new namespace.

If the -p command line option is used, a flag is set with the CLONE_NEWPID constant, which will ultimately get passed to the clone system call:

// Parse command line options and construct arguments
// to be passed to childFunction
while ((option = getopt(argc, argv, "+hvp")) != -1) {  
    switch (option) {
        case 'p':
            flags |= CLONE_NEWPID;
            break;
        case 'v':
            args.verbose = 1;
            break;
        case 'h':
            usage(argv[0]);
            exit(EXIT_SUCCESS);
        default:
            usage(argv[0]);
            exit(EXIT_FAILURE);
    }
}
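
The leading ‘+’ in the option string deserves a note: it tells GNU getopt to stop at the first non-option argument, so that options meant for the command (such as the --norc we pass to bash later) are not consumed by invoke_ns. The following is a minimal sketch of that behaviour, not the article's code; the struct parsed_opts and the function parse_opts are hypothetical names introduced for illustration.

```c
#define _GNU_SOURCE
#include <unistd.h>

/* Hypothetical container for the parsed options (not the article's struct). */
struct parsed_opts {
    int verbose;     /* set by -v */
    int new_pid_ns;  /* set by -p */
    int first_cmd;   /* index in argv of the first command word */
};

/* Parse -v and -p, stopping at the first non-option argument.
 * Returns 0 on success, -1 on an unknown option. */
int parse_opts(int argc, char *argv[], struct parsed_opts *out)
{
    int option;

    out->verbose = 0;
    out->new_pid_ns = 0;
    out->first_cmd = argc;
    optind = 1;   /* restart scanning, so the function can be called again */
    opterr = 0;   /* keep getopt from printing its own diagnostics */

    /* The leading '+' makes GNU getopt stop at the first non-option,
     * so options that belong to the command are left untouched. */
    while ((option = getopt(argc, argv, "+vp")) != -1) {
        switch (option) {
        case 'v':
            out->verbose = 1;
            break;
        case 'p':
            out->new_pid_ns = 1;
            break;
        default:
            return -1;
        }
    }
    out->first_cmd = optind;
    return 0;
}
```

Called with the arguments invoke_ns -p ls -v, the sketch sets new_pid_ns, leaves verbose clear, and reports first_cmd as 2, so the -v stays with ls.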

Once the command line options have been processed, if a command and associated arguments have been provided to the program at the command line, these are copied into some memory allocated for the purpose. The pointer to the memory block, args.command, along with some other information, will be another argument to the clone system call:

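// Assemble command to be executed in namespace
if (optind != argc) {
    args.command = malloc(sizeof(char *) * (argc - optind + 1));
    if (args.command == NULL) {
        perror("Parent: main: malloc");
        exit(EXIT_FAILURE);
    }
    for (i = optind; i < argc; i++) {
        args.command[i - optind] = malloc(strlen(argv[i]) + 1);
        if (args.command[i - optind] == NULL) {
            perror("Parent: main: malloc");
            exit(EXIT_FAILURE);
        }
        strcpy(args.command[i - optind], argv[i]);
    }
    args.command[argc - optind] = NULL;
}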

Just before the call to clone, we need to allocate a separate memory block to act as the stack for the child process that will get created, so that it doesn’t interfere with the stack associated with the parent:

// Allocate heap for child's stack
child_stack = malloc(STACK_SIZE);
if (child_stack == NULL) {
    perror("Parent: main: malloc");
    exit(EXIT_FAILURE);
}
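
As an aside, malloc is not the only way to obtain the child's stack. A common alternative is an anonymous mmap region, which is what the sketch below uses. This is a swapped-in technique rather than what invoke_ns1.c does; the function names are hypothetical, and the STACK_SIZE value is an assumption, since the article does not show its definition.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define STACK_SIZE (1024 * 1024)   /* assumed size; not taken from the article */

/* Map an anonymous region for use as a child stack.
 * Returns the lowest address of the region, or NULL on failure. */
char *alloc_child_stack(void)
{
    void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

    return stack == MAP_FAILED ? NULL : stack;
}

/* clone() wants the highest address of the region, because the stack
 * grows downward on almost every architecture Linux supports. */
char *child_stack_top(char *stack)
{
    return stack + STACK_SIZE;
}

/* Unmap a stack obtained from alloc_child_stack(). Returns 0 on success. */
int free_child_stack(char *stack)
{
    return munmap(stack, STACK_SIZE);
}
```

Either approach works for our purposes; the mmap variant simply keeps the stack apart from the ordinary heap.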

Finally, we get to execute the clone system call:

// Clone child process
child = clone(childFunction, child_stack + STACK_SIZE, flags | SIGCHLD, &args);

The childFunction argument is a function that gets executed by the child process after it has been created. We’ll discuss this further in a moment. The other arguments are a pointer to the top of the child’s stack (the stack grows downward, hence child_stack + STACK_SIZE), the ‘clone flags’ combined with SIGCHLD, which is the signal sent to the parent when the child terminates, and a pointer to our command, its arguments and ancillary information, which is passed to childFunction.
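
To see the call in isolation, here is a minimal sketch that clones a child, lets it run a function, and collects its exit status. The names run_cloned and exit_with are hypothetical, and STACK_SIZE is an assumed value. Passing 0 for ns_flags needs no privilege, which makes the sketch easy to try; passing CLONE_NEWPID would request a new PID namespace, as invoke_ns does.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)   /* assumed size; not taken from the article */

/* Start fn(arg) in a cloned child and wait for it to finish.
 * ns_flags would be CLONE_NEWPID to request a new PID namespace, or 0.
 * Returns the child's exit status, or -1 if something went wrong. */
int run_cloned(int (*fn)(void *), void *arg, int ns_flags)
{
    char *stack;
    pid_t child;
    int status;
    int result = -1;

    stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* Pass the top of the block, since the stack grows downward.
     * SIGCHLD is the signal the parent receives when the child ends.
     * clone fails with EPERM if CLONE_NEWPID is requested without
     * the necessary privilege. */
    child = clone(fn, stack + STACK_SIZE, ns_flags | SIGCHLD, arg);
    if (child != -1 && waitpid(child, &status, 0) != -1 && WIFEXITED(status))
        result = WEXITSTATUS(status);

    free(stack);
    return result;
}

/* Example child function: its return value becomes the exit status. */
int exit_with(void *arg)
{
    return *(int *)arg;
}
```

With int code = 7, the call run_cloned(exit_with, &code, 0) returns 7, because the value returned by the child function becomes the child's exit status.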

Staying with the parent process for the moment, if the clone system call is successful, the parent waits for the child process to terminate before the program exits:

// Wait for child to finish
if (waitpid(child, NULL, 0) == -1) {
    perror("Parent: main: waitpid");
    exit(EXIT_FAILURE);
}
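
The program passes NULL as the second argument to waitpid, so the child's exit status is discarded. If we wanted to know how the child ended, we could collect the status and decode it, along the lines of the following sketch. The function names are hypothetical, and folding a fatal signal into 128 plus the signal number is a convention borrowed from shells, not something invoke_ns does.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Wait for the child and fold its fate into one number: the exit code if
 * it exited, 128 + signal number if a signal killed it, or -1 if waitpid
 * itself failed. */
int wait_for_child(pid_t child)
{
    int status;

    if (waitpid(child, &status, 0) == -1)
        return -1;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}

/* Demo helper: fork a child that either exits with `code` or, if `sig`
 * is non-zero, sends that signal to itself. Returns the child's PID in
 * the parent, or -1 if fork failed. */
pid_t spawn_child(int code, int sig)
{
    pid_t child = fork();

    if (child == 0) {
        if (sig != 0)
            raise(sig);
        _exit(code);
    }
    return child;
}
```

For example, wait_for_child(spawn_child(3, 0)) returns 3, while a child that kills itself with SIGKILL yields 128 + SIGKILL.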

In the meantime, the clone system call has created a new child process (within a new PID namespace, if -p was given), which calls the function childFunction:

// Function passed to the clone system call
int childFunction(void *child_args)
{
    struct arguments *args = child_args;

    if (args->verbose)
        printf(" Child: PID of child is %d\n", getpid());

    // Execute command if given
    if (args->command != NULL) {
        if (clearenv() != 0)
            fprintf(stderr, " Child: childFunction: couldn't clear environment\n");
        if (args->verbose)
            printf(" Child: executing command %s ...\n", args->command[0]);
        execvp(args->command[0], &args->command[0]);
    }
    else
        exit(EXIT_SUCCESS);

    perror(" Child: childFunction: execvp");
    exit(EXIT_FAILURE);
}

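Its purpose is simply to execute the command and its arguments, if one has been provided at the command line. It does this using the execvp library function (a wrapper around the execve system call), which searches for the command and replaces the process image with a new one based on it. execvp only returns if it fails, which is why the perror call that follows it is reached only on error. That’s it! Let’s see how it works in practice.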
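
The ‘only returns on failure’ behaviour of execvp can be observed with a small sketch. It uses fork rather than clone purely to keep the example short, and the name run_command and the exit codes 126 and 127 are choices made for illustration (127 follows the shells' convention for a command that could not be run). Note that once clearenv has removed PATH, execvp falls back to a default search path, which typically includes /bin and /usr/bin.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run cmd (a NULL-terminated vector) in a forked child with an empty
 * environment. Returns the command's exit code, 127 if it could not be
 * executed, or -1 if fork or waitpid failed. */
int run_command(char *const cmd[])
{
    int status;
    pid_t child = fork();

    if (child == -1)
        return -1;

    if (child == 0) {
        if (clearenv() != 0)
            _exit(126);
        /* With PATH gone, execvp uses a default search path */
        execvp(cmd[0], cmd);
        /* Only reached if execvp failed; on success the process image
         * has been replaced and this code no longer exists. */
        _exit(127);
    }

    if (waitpid(child, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Running the standard true command this way returns 0, whereas a command name that does not exist anywhere on the search path returns 127.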

First, some information on how to determine which namespaces a process is running in. Each namespace for a given process is represented by a file in the procfs directory /proc/<pid>/ns, where <pid> is the PID of the process in question. In a shell, $$ expands to the PID of the shell itself, so we can inspect the shell’s own namespaces:

$ ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 wolf wolf 0 Mar  9 15:29 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 wolf wolf 0 Mar  9 15:29 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 wolf wolf 0 Mar  9 15:29 net -> net:[4026531956]
lrwxrwxrwx 1 wolf wolf 0 Mar  9 15:29 pid -> pid:[4026531836]
lrwxrwxrwx 1 wolf wolf 0 Mar  9 15:29 user -> user:[4026531837]
lrwxrwxrwx 1 wolf wolf 0 Mar  9 15:29 uts -> uts:[4026531838]

The files are special symlinks named after the namespace type (e.g. mnt). Each points to an inode that represents a unique namespace, which the kernel reports as a combination of the namespace type and the inode number. Hence, if two different processes exist in the same namespace, their symlinks will point to the same inode.

Every process exists in one namespace of each type, whether it is the global namespace (sometimes called the default or root namespace), or one that has been subsequently created using clone or unshare. You can test this out by starting two separate command shells, issuing the command ls -l /proc/$$/ns in each, and comparing the results. The output from each should be identical, as both shells belong to the same global namespaces, including the global PID namespace.
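
The same comparison can be made from a program. The sketch below reads the symlink target with readlink, and decides whether two processes share a namespace by comparing the device and inode numbers that stat reports for their symlinks. The function names ns_name and same_namespace are hypothetical, and examining another user's process may fail for lack of permission.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the target of /proc/<pid>/ns/<type>, e.g. "pid:[4026531836]".
 * Returns 0 on success, -1 on failure. */
int ns_name(pid_t pid, const char *type, char *buf, size_t len)
{
    char path[64];
    ssize_t n;

    if (len == 0)
        return -1;
    snprintf(path, sizeof(path), "/proc/%ld/ns/%s", (long)pid, type);
    n = readlink(path, buf, len - 1);
    if (n == -1)
        return -1;
    buf[n] = '\0';   /* readlink does not terminate the string */
    return 0;
}

/* Return 1 if both processes are in the same namespace of the given
 * type, 0 if not, -1 if either symlink could not be examined. */
int same_namespace(pid_t a, pid_t b, const char *type)
{
    char path[64];
    struct stat sa, sb;

    snprintf(path, sizeof(path), "/proc/%ld/ns/%s", (long)a, type);
    if (stat(path, &sa) == -1)
        return -1;
    snprintf(path, sizeof(path), "/proc/%ld/ns/%s", (long)b, type);
    if (stat(path, &sb) == -1)
        return -1;
    /* Same device and same inode means same namespace */
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

A process always shares a namespace with itself, so same_namespace(getpid(), getpid(), "pid") returns 1. Comparing the two PIDs that invoke_ns prints for the parent and child is the more interesting case.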

Let’s create a new PID namespace for a cloned process using invoke_ns, and let’s ascertain whether the child process exists in a new PID namespace. We can create the new PID namespace specifying that the child process execute an interactive bash command shell:

$ sudo ./invoke_ns -vp env PS1="ns # " bash --norc
[sudo] password for wolf: 
Parent: PID of parent is 16034
Parent: PID of child is 16035
 Child: PID of child is 1
 Child: executing command env ...
ns #

Note that we have to be a privileged user to create the PID namespace, because the CAP_SYS_ADMIN capability is required. We’ve created the new PID namespace and specified that an interactive bash command shell be executed using the command prompt ‘ns #’. The parent process reports that it has a PID of 16034, and that the cloned child process has a PID of 16035. The child, however, reports that it has a PID of 1, and is executing the command ‘env ...’. Although the child process exists in the newly created namespace, it also has a PID in the namespace from which it was created, in this case the global PID namespace.
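
A program can find out in advance whether it holds CAP_SYS_ADMIN. One way, sketched below, is to read the effective capability mask from the CapEff line of /proc/self/status and test bit 21, which is the number assigned to CAP_SYS_ADMIN. The function name have_cap_sys_admin and the macro CAP_SYS_ADMIN_BIT are hypothetical; invoke_ns itself makes no such check, and simply lets clone fail.

```c
#define _GNU_SOURCE
#include <stdio.h>

#define CAP_SYS_ADMIN_BIT 21   /* numeric value of CAP_SYS_ADMIN */

/* Return 1 if the calling process has CAP_SYS_ADMIN in its effective
 * set, 0 if it does not, -1 if /proc/self/status could not be parsed. */
int have_cap_sys_admin(void)
{
    char line[256];
    unsigned long long eff;
    int result = -1;
    FILE *fp = fopen("/proc/self/status", "r");

    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof(line), fp) != NULL) {
        /* The line holds a hexadecimal bitmask after the "CapEff:" label */
        if (sscanf(line, "CapEff: %llx", &eff) == 1) {
            result = (int)((eff >> CAP_SYS_ADMIN_BIT) & 1);
            break;
        }
    }
    fclose(fp);
    return result;
}
```

Run as an ordinary user, the function would normally return 0, and under sudo it would normally return 1, although a container may withhold the capability even from root.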

In a bash command shell in the global PID namespace (i.e. not at the prompt ‘ns #’), we can list the namespaces for the parent (16034) and child (16035) processes:

$ sudo ls -l /proc/16034/ns && sudo ls -l /proc/16035/ns
[sudo] password for wolf: 
total 0
lrwxrwxrwx. 1 root root 0 Sep  7 10:52 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 root root 0 Sep  7 10:52 mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 root root 0 Sep  7 10:52 net -> net:[4026531956]
lrwxrwxrwx. 1 root root 0 Sep  7 10:52 pid -> pid:[4026531836]
lrwxrwxrwx. 1 root root 0 Sep  7 10:52 user -> user:[4026531837]
lrwxrwxrwx. 1 root root 0 Sep  7 10:52 uts -> uts:[4026531838]
total 0
lrwxrwxrwx. 1 root root 0 Sep  7 10:52 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 root root 0 Sep  7 10:52 mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 root root 0 Sep  7 10:52 net -> net:[4026531956]
lrwxrwxrwx. 1 root root 0 Sep  7 10:52 pid -> pid:[4026532118]
lrwxrwxrwx. 1 root root 0 Sep  7 10:52 user -> user:[4026531837]
lrwxrwxrwx. 1 root root 0 Sep  7 10:52 uts -> uts:[4026531838]

All of the namespaces are identical, apart from the PID namespaces, which indicates that the parent and child processes are in different PID namespaces. A strange thing happens, however, if we list the namespaces for a process from within our bash command shell in the new PID namespace:

ns # ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 root root 0 Mar  9 17:03 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Mar  9 17:03 mnt -> mnt:[4026531840]
lrwxrwxrwx 1 root root 0 Mar  9 17:03 net -> net:[4026531956]
lrwxrwxrwx 1 root root 0 Mar  9 17:03 pid -> pid:[4026531836]
lrwxrwxrwx 1 root root 0 Mar  9 17:03 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Mar  9 17:03 uts -> uts:[4026531838]

From within the new PID namespace, listing the namespaces suggests that the child process still exists in the global PID namespace (4026531836). This appears to contradict what we saw from outside the namespace, and we’ll explain why in the next article, which discusses the Mount namespace.

Additional notes on PID namespaces

When a new PID namespace is created, the first process that is created in the namespace is ‘init-like’ in nature, and is given the PID 1. It is ‘init-like’ in behaviour, because it assumes parenthood of any orphaned processes in the namespace, and when it terminates, all other processes which belong in the same namespace are terminated with the SIGKILL signal, and the namespace is subsequently removed.

PID namespaces can be nested up to a depth of 32, and any given process in a descendant namespace will have one PID for each of its ancestor namespaces, plus a PID for the namespace in which it resides. A process in a descendant namespace can be ‘seen’ or operated on (subject to permissions) by processes in ancestor namespaces and by those in its own namespace, but not by processes in sibling or descendant namespaces.
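
On Linux 4.1 and later, the multiple PIDs of a process can be inspected directly: the NSpid line of /proc/<pid>/status lists its PID in each nested namespace, starting with the namespace of the procfs mount being read and ending with the innermost one. The following sketch parses that line; the function names ns_pids and own_ns_pids are hypothetical.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill pids[] with the PIDs that process `pid` has in each nested PID
 * namespace, outermost first, as listed on the NSpid line.
 * Returns the number of entries stored, or -1 on failure. */
int ns_pids(pid_t pid, long pids[], int max)
{
    char path[64], line[512];
    int count = -1;
    FILE *fp;

    snprintf(path, sizeof(path), "/proc/%ld/status", (long)pid);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof(line), fp) != NULL) {
        char *p, *end;

        if (strncmp(line, "NSpid:", 6) != 0)
            continue;
        count = 0;
        p = line + 6;
        while (count < max) {
            long v = strtol(p, &end, 10);
            if (end == p)   /* no more numbers on the line */
                break;
            pids[count++] = v;
            p = end;
        }
        break;
    }
    fclose(fp);
    return count;
}

/* Convenience wrapper for the calling process. */
int own_ns_pids(long pids[], int max)
{
    return ns_pids(getpid(), pids, max);
}
```

For a process in the global PID namespace the list has a single entry. For the shell started by invoke_ns -p, reading its status file from the global namespace would be expected to show two entries, the second being 1.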