Sandbox Your Program Using FreeBSD's Capsicum

By Jake Freeland on September 1, 2023

In an age where data breaches cost upwards of four million dollars on average, it is surprising that only 51% of companies are planning on increasing security investments¹. This figure can likely be attributed to the high costs of implementing robust security solutions.

A lot of security frameworks are complicated. Capsicum differentiates itself with its simplicity. Developers can integrate Capsicum with relative ease, securing their programs and reducing those high implementation costs.

This article showcases what Capsicum is capable of, and provides a thorough guide on integrating the framework into new and existing programs.

Capsicum

Capsicum is a lightweight security framework that provides primitives for limiting the capabilities of a program. More specifically, Capsicum allows developers to isolate their programs in a security sandbox. The framework is designed around the principle of least privilege, where programs only have access to resources that are required for operation.

The Capability Sandbox

On systems that support Capsicum, a program may be sandboxed using cap_enter(2):

#include <sys/capsicum.h>
#include <stdio.h>

int
main(void)
{
    /* Enter Capsicum's capability mode. */
    cap_enter();
    printf("Hello world from capability mode\n");
    return (0);
}

NOTE: Code snippets in this document omit error handling for brevity. In practice, error handling should be done where applicable.

There is no need to link against extra libraries; Capsicum support is bundled into the standard C library (libc).

Capsicum allows developers to sandbox their programs using capability mode. Once a program enters capability mode, using cap_enter(2), it will not be able to acquire new resources by itself. For example, opening a file using open(2) will trigger a capability violation, causing the culprit function to fail and set errno to ECAPMODE: Not permitted in capability mode.

One approach to program Capsicumization is to open all resources before entering capability mode. Resources that belong to the program beforehand may be used inside of the sandbox.
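The pre-opening approach can be sketched as follows. This is a minimal illustration rather than a reference implementation: the names enter_sandbox() and count_lines_sandboxed() are hypothetical, the cap_enter(2) call is guarded so the sketch also compiles on systems without Capsicum, and treating ENOSYS as non-fatal is an assumption made for the example.

```c
#include <sys/types.h>
#ifdef __FreeBSD__
#include <sys/capsicum.h>
#endif
#include <errno.h>
#include <unistd.h>

/*
 * Enter capability mode where the platform offers it.  ENOSYS means the
 * kernel lacks Capsicum support; tolerating it is an assumption made
 * for this sketch.
 */
static int
enter_sandbox(void)
{
#ifdef __FreeBSD__
    if (cap_enter() < 0 && errno != ENOSYS)
        return (-1);
#endif
    return (0);
}

/*
 * Count the newlines readable from fd.  The caller opened the
 * descriptor beforehand, so read(2) keeps working after the sandbox
 * has been entered.  Returns the line count, or -1 on error.
 */
long
count_lines_sandboxed(int fd)
{
    char buf[4096];
    ssize_t n;
    long lines = 0;

    if (enter_sandbox() < 0)
        return (-1);
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] == '\n')
                lines++;
    }
    return (n < 0 ? -1 : lines);
}
```

A caller would open its input first, for example with open(2) or pipe(2), and only then hand the descriptor to count_lines_sandboxed(). Note that entering capability mode is irreversible for the calling process.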

If a program requires access to an unknown number of resources that exist in a certain subdomain, then the openat(2), mkdirat(2), bindat(2), and other *at() system calls may be useful. These functions require an additional descriptor argument that serves as a relative reference to open new resources.

int dirfd;

dirfd = open("/home/jfree", O_RDONLY | O_DIRECTORY);
cap_enter();

/* Open "/home/jfree/foo". */
if (openat(dirfd, "foo", O_RDONLY) < 0)
    printf("This will not happen\n");

In this case, dirfd is obtained before entering capability mode. Once the program enters capability mode, dirfd is used as a relative reference point to access /home/jfree/foo. It is not possible to access resources outside of the subdirectory domain provided by a relative reference.

int dirfd;

dirfd = open("/home/jfree", O_RDONLY | O_DIRECTORY);
cap_enter();

/* Open "/home/beastie". */
if (openat(dirfd, "../beastie", O_RDONLY) < 0)
    printf("This will happen\n");

The openat(2) call will fail because the ../beastie path leads to a directory that does not fall under the directory hierarchy provided by dirfd.

Interprocess communication is also curtailed. A process in capability mode may not send signals to other processes and named shared memory objects are prohibited. Certain system interfaces are not allowed at all, such as reboot(2) and kldload(2). Fortunately, there are ways to work around all of these restrictions.

Sandboxing Compartments

Pre-opening resources works great for programs that have predictable resource requirements, but some programs require resources on-demand. In this case, the developer can opt to only sandbox specific parts of their program.

If the entirety of a program will not work in a sandbox, it may be possible to compartmentalize it. Compartmentalization is the act of splitting a program into compartments, each with its own basic purpose. With a compartmentalized architecture, a developer can keep trusted code outside of the sandbox and isolate insecure or dangerous code inside of a sandboxed compartment. If a security vulnerability is found in the dangerous code, its impact is confined to that compartment.

Capsicum provides an intuitive interface for sandboxing specific parts of a program. At any point, a program can spawn a new child process that executes dangerous code inside of capability mode. Interprocess communication primitives like pipes and sockets can allow data exchange without raising capability violations.

pid_t pid;
int pipefd[2], result;

pipe(pipefd);
/*
 * Create a child process and isolate it in a capability
 * sandbox where it can execute dangerous code.
 */
pid = fork();
if (pid == 0) {
    close(pipefd[0]);
    cap_enter();
    result = dangerous_function();
    write(pipefd[1], &result, sizeof(result));
    exit(0);
}
close(pipefd[1]);
/* Fetch result from sandboxed child. */
read(pipefd[0], &result, sizeof(result));
printf("Result: %d\n", result);
/* Continue normal execution in parent. */

The parent process can live outside of the sandbox while its child executes dangerous code in isolation. If a program is already compartmentalized, then its developer can start by sandboxing each compartment. Most compartments will likely need some refactoring for capability mode, but the developer can pick and choose what needs to be sandboxed. When done right, this is less work than sandboxing the entire program, with a substantial increase in security.
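With error handling added, the fork-and-pipe pattern might look like the sketch below. The helper run_sandboxed() and the stand-in example_worker() are hypothetical names introduced for illustration. The cap_enter(2) call is guarded so the code also builds where Capsicum is absent, and tolerating ENOSYS is an assumption of this sketch.

```c
#include <sys/types.h>
#include <sys/wait.h>
#ifdef __FreeBSD__
#include <sys/capsicum.h>
#endif
#include <errno.h>
#include <unistd.h>

/* Hypothetical stand-in for code that handles untrusted input. */
int
example_worker(int x)
{
    return (x * x);
}

/*
 * Run fn(arg) in a child process, sandboxed where Capsicum is
 * available, and store its return value in *out.
 * Returns 0 on success and -1 on any failure.
 */
int
run_sandboxed(int (*fn)(int), int arg, int *out)
{
    int pipefd[2], result, status;
    ssize_t n;
    pid_t pid;

    if (pipe(pipefd) < 0)
        return (-1);
    pid = fork();
    if (pid < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return (-1);
    }
    if (pid == 0) {
        close(pipefd[0]);
#ifdef __FreeBSD__
        /* ENOSYS: kernel without Capsicum; tolerated in this sketch. */
        if (cap_enter() < 0 && errno != ENOSYS)
            _exit(1);
#endif
        result = fn(arg);
        n = write(pipefd[1], &result, sizeof(result));
        _exit(n == (ssize_t)sizeof(result) ? 0 : 1);
    }
    close(pipefd[1]);
    n = read(pipefd[0], &result, sizeof(result));
    close(pipefd[0]);
    /* Reap the child; retry if interrupted by a signal. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return (-1);
    }
    if (n != (ssize_t)sizeof(result) ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return (-1);
    *out = result;
    return (0);
}
```

The child uses _exit(2) rather than exit(3) so that buffers inherited from the parent are not flushed twice. The parent only trusts the result when the child both wrote a complete value and exited cleanly.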

Requesting Resources With libcasper(3)

Some programs were not designed to be compartmentalized. Developers of these programs could rearchitect their software, but this often requires a lot of time and resources. Luckily, the libcasper(3) library assists developers that have complex programs where compartmentalization is not effective. Developers can use the interface provided by libcasper(3) to acquire new resources while inside of the capability sandbox.

Using a Casper Service

Before a program enters capability mode, it can open a communication channel with a casper service. Casper services are processes that run outside of the capability sandbox, alongside a calling process. The aforementioned communication channel can be used to request new resources from the casper service.

NOTE: libcasper(3) channels should be opened before entering capability mode; otherwise, the casper process will inherit the parent's sandbox.

cap_channel_t *cap_casper, *cap_net;
struct addrinfo *res;
int s;

/* Acquire the capability to access libcasper(3) services. */
cap_casper = cap_init();

/*
 * Use the cap_casper capability to open a communication
 * channel with the "system.net" casper service.
 */
cap_net = cap_service_open(cap_casper, "system.net");

/*
 * We do not have any more casper services to open.
 * Close the casper capability.
 */
cap_close(cap_casper);

/*
 * Use the "cap_" variant of getaddrinfo(), provided by
 * the cap_net(3) library.
 */
cap_getaddrinfo(cap_net, "freebsd.org", "80", NULL, &res);

s = socket(res->ai_family, res->ai_socktype, res->ai_protocol);

/*
 * Use the "cap_" variant of connect(), provided by
 * the cap_net(3) library.
 */
cap_connect(cap_net, s, res->ai_addr, res->ai_addrlen);

The cap_net(3) library uses libcasper(3) to provide capability-enabled libc networking functions that would otherwise fail with ECAPMODE. Functions that use the libcasper(3) interface are conventionally prefixed with cap_ to indicate that they succeed inside of capability mode. There are several other casper service libraries available, similar to cap_net(3), that provide cap_-prefixed libc functions. The full list can be found on the libcasper(3) manual page.

If a program supports building without libcasper(3), the developer can use cap_ functions without surrounding them with #ifdef WITH_CASPER. Most casper services define their cap_ functions as macros that expand to their non-cap_ form when WITH_CASPER is not defined.
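The fallback pattern can be illustrated with a small sketch. Everything prefixed with demo_ is hypothetical and is not taken from any real casper service header; the sketch only shows how a macro can drop the channel argument so that caller code compiles unchanged in both configurations.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/*
 * Hypothetical header for a service called "demo".  With WITH_CASPER
 * defined, the cap_ name is a real function that talks to the service.
 * Without it, the name is a macro that drops the channel argument and
 * calls the plain libc function.
 */
struct demo_channel;

#ifdef WITH_CASPER
int demo_cap_getaddrinfo(struct demo_channel *chan, const char *hostname,
    const char *servname, const struct addrinfo *hints,
    struct addrinfo **res);
#else
#define demo_cap_getaddrinfo(chan, hostname, servname, hints, res) \
    getaddrinfo(hostname, servname, hints, res)
#endif

/* Caller code is identical in both configurations. */
int
resolve_numeric(struct demo_channel *chan, const char *addr,
    const char *port, struct addrinfo **res)
{
    struct addrinfo hints = { 0 };

    (void)chan; /* Unused when the macro form is active. */
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    /* Numeric-only lookups keep this sketch independent of DNS. */
    hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;
    return (demo_cap_getaddrinfo(chan, addr, port, &hints, res));
}
```

Because the macro discards its first argument, a build without WITH_CASPER can pass any placeholder for the channel, including NULL.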

Creating a Casper Service

Casper service libraries do an excellent job at hiding the libcasper(3) interface so program developers do not need to interact with it. Sometimes a program needs access to a resource that is not provided by an existing casper service library. In this circumstance, program developers can create their own libcasper(3) service.

All libcasper(3) services are built on top of the CREATE_SERVICE(3) macro:

CREATE_SERVICE(name, limit_func, command_func, flags);

Each argument in CREATE_SERVICE(3) serves an important purpose, but the command_func function pointer is noteworthy because it is a service's main routine. When a service is opened using cap_service_open(3), it will wait for a command. Once a command is received, it is passed into command_func's cmd argument.

/*
 * The command function used by the cap_net(3) casper service
 * library.
 */
static int
net_command(const char *cmd, const nvlist_t *limits, nvlist_t *nvlin,
    nvlist_t *nvlout)
{
    if (strcmp(cmd, "bind") == 0)
        return (net_bind(limits, nvlin, nvlout));
    else if (strcmp(cmd, "connect") == 0)
        return (net_connect(limits, nvlin, nvlout));
    else if (strcmp(cmd, "gethostbyname") == 0)
        return (net_gethostbyname(limits, nvlin, nvlout));
    else if (strcmp(cmd, "gethostbyaddr") == 0)
        return (net_gethostbyaddr(limits, nvlin, nvlout));
    else if (strcmp(cmd, "getnameinfo") == 0)
        return (net_getnameinfo(limits, nvlin, nvlout));
    else if (strcmp(cmd, "getaddrinfo") == 0)
        return (net_getaddrinfo(limits, nvlin, nvlout));

    return (EINVAL);
}

CREATE_SERVICE("system.net", net_limit, net_command, 0);

Most command_funcs compare their cmd string against a set of string literals. If there is a match, then an associated function is called. The casper service exists outside of the capability sandbox, so all functions called inside of command_func will execute with ambient authority, meaning that they can acquire new resources at-will.

A program and a casper service may exchange resources using the cap_xfer_nvlist(3) function. The command string and resources necessary for the given command must be wrapped in an nvlist(9) before being transferred. Nvlists can store numbers, strings, binary data, descriptors, and other nvlists. Each resource must be accompanied by an identifying name that can be used for retrieval.

/*
 * Send the "bind" command to the casper service linked to @chan.
 */
static int
cap_bind(cap_channel_t *chan, int sockfd, const struct sockaddr *addr,
    socklen_t addrlen)
{
    nvlist_t *nvl = nvlist_create(0);
    int error;

    nvlist_add_string(nvl, "cmd", "bind");
    nvlist_add_descriptor(nvl, "sockfd", sockfd);
    nvlist_add_binary(nvl, "addr", addr, addrlen);

    nvl = cap_xfer_nvlist(chan, nvl);
    if (nvl == NULL)
        return (-1);

    error = nvlist_get_number(nvl, "error");
    if (error != 0) {
        nvlist_destroy(nvl);
        errno = error;
        return (-1);
    }

    error = dup2(sockfd, nvlist_get_descriptor(nvl, "sockfd"));
    nvlist_destroy(nvl);

    return (error == -1 ? -1 : 0);
}

The cap_xfer_nvlist(3) function sends the nvl nvlist to the casper service linked to chan. Once the nvlist reaches the casper service, the command string is extracted and the service's command_func is called.

Recall this snippet from net_command():

    if (strcmp(cmd, "bind") == 0)
        return (net_bind(limits, nvlin, nvlout));

The cap_bind() function sends the "bind" command string, so this strcmp() condition is met and net_bind() is called.

/*
 * Simplified version of the net_bind() function.
 * Responsible for extracting arguments from @nvlin, calling bind(2),
 * and then adding the return value to @nvlout.
 */
static int
net_bind(const nvlist_t *limits __unused, nvlist_t *nvlin, nvlist_t *nvlout)
{
    int sockfd;
    const void *addr;
    size_t len;

    addr = nvlist_get_binary(nvlin, "addr", &len);
    sockfd = nvlist_take_descriptor(nvlin, "sockfd");
    if (bind(sockfd, addr, (socklen_t)len) < 0) {
        int serrno = errno;
        close(sockfd);
        return (serrno);
    }
    nvlist_move_descriptor(nvlout, "sockfd", sockfd);

    return (0);
}

When bind(2) succeeds, the socket must be transferred back into the capability sandbox. The nvlist_move_descriptor(9) function moves the socket descriptor into the nvlout nvlist, which takes ownership of the descriptor.

Recall this snippet from cap_bind():

    nvl = cap_xfer_nvlist(chan, nvl);
    if (nvl == NULL)
        return (-1);

The return value of cap_xfer_nvlist(3) is the nvlout nvlist that owns the bound socket descriptor. This descriptor can be borrowed from the returned nvlist using nvlist_get_descriptor(nvl, "sockfd").
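The nvlist interface hides how a descriptor actually crosses a process boundary. On UNIX-like systems, the usual mechanism is an SCM_RIGHTS control message sent over a UNIX-domain socket. The sketch below shows that mechanism directly, without libnv or libcasper, on the assumption that it is similar in spirit to what happens underneath a descriptor transfer. The helpers send_fd() and recv_fd() are hypothetical names.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_control {
    struct cmsghdr hdr;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send descriptor fd over the connected UNIX-domain socket sock. */
int
send_fd(int sock, int fd)
{
    struct msghdr msg;
    struct iovec iov;
    struct cmsghdr *cmsg;
    union fd_control ctl;
    char byte = 0;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    /* At least one byte of ordinary data must accompany the message. */
    iov.iov_base = &byte;
    iov.iov_len = sizeof(byte);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

    return (sendmsg(sock, &msg, 0) == 1 ? 0 : -1);
}

/* Receive one descriptor from sock.  Returns the new fd, or -1. */
int
recv_fd(int sock)
{
    struct msghdr msg;
    struct iovec iov;
    struct cmsghdr *cmsg;
    union fd_control ctl;
    char byte;
    int fd;

    memset(&msg, 0, sizeof(msg));
    iov.iov_base = &byte;
    iov.iov_len = sizeof(byte);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return (-1);
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return (-1);
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
    return (fd);
}
```

The receiving process gets a new descriptor number that refers to the same open file, which is why a socket bound by a service outside the sandbox remains usable once it arrives inside the sandbox.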

Limiting a Casper Service

The interfaces provided by casper services are often too permissive. When using cap_net(3), it is probable that you only need a subset of the functions that the service provides. Most service libraries define their own limitations so developers can disable functions that their program does not use.

/*
 * Use cap_net(3)'s limitations to disable everything except
 * for resolving the address of freebsd.org on port 80.
 *
 * This assumes that the cap_net(3) service has already been
 * opened and is listening on @cap_net.
 */
cap_net_limit_t *limit;
int familylimit;

/* Allow only name resolution (cap_getaddrinfo(3)). */
limit = cap_net_limit_init(cap_net, CAPNET_NAME2ADDR);

/* Limit name resolution to "freebsd.org" on port 80. */
cap_net_limit_name2addr(limit, "freebsd.org", "80");

/* Limit name resolution to IPv4 addresses. */
familylimit = AF_INET;
cap_net_limit_name2addr_family(limit, &familylimit, 1);

/* Apply the limits to cap_net. */
cap_net_limit(limit);

Every service library offers different limitations. There are too many usage patterns to note here, so it is recommended that developers refer to service library manual pages for specifics and examples.

Creating Limitations for a Casper Service

Developers may specify a limit_func function pointer in CREATE_SERVICE(3) to limit a service's interface. When cap_limit_set(3) is called by a program, the provided limits are redirected to the casper service's limit_func, where they are applied accordingly. The flexibility of this system allows services to finely limit their interface.

Most service libraries create a wrapper around cap_limit_set(3) that takes a custom _limit_t type. This custom type is defined by the service library and keeps track of service-specific limitations.

As limitations become more granular, their implementation can quickly become complicated. Services like cap_net(3) offer extensive control over what goes through their interface, forcing their limit functions to handle every edge case. A quick look at cap_net(3)'s limit function reveals separate verification functions for each "limitation mode" to ensure that limits are being correctly applied and enforced.

/*
 * A snippet from cap_net(3)'s net_limit() function.
 * Some code was cut from this routine for clarity.
 * See "lib/libcasper/services/cap_net/cap_net.c" for context.
 */
while ((name = nvlist_next(newlimits, NULL, &cookie)) != NULL) {
        /* ... */
        if (strcmp(name, LIMIT_NV_BIND) == 0) {
                hasbind = true;
                if (!verify_bind_newlimts(oldlimits,
                    cnvlist_get_nvlist(cookie))) {
                        return (ENOTCAPABLE);
                }
        } else if (strcmp(name, LIMIT_NV_CONNECT) == 0) {
                hasconnect = true;
                if (!verify_connect_newlimits(oldlimits,
                    cnvlist_get_nvlist(cookie))) {
                        return (ENOTCAPABLE);
                }
        } else if (strcmp(name, LIMIT_NV_ADDR2NAME) == 0) {
                hasaddr2name = true;
                if (!verify_addr2name_newlimits(oldlimits,
                    cnvlist_get_nvlist(cookie))) {
                        return (ENOTCAPABLE);
                }
        } else if (strcmp(name, LIMIT_NV_NAME2ADDR) == 0) {
                hasname2addr = true;
                if (!verify_name2addr_newlimits(oldlimits,
                    cnvlist_get_nvlist(cookie))) {
                        return (ENOTCAPABLE);
                }
        }
}

Limitation functions are naturally dependent on the service that they limit. For this reason, there is no clear pattern to writing them. Developers interested in creating limitations for casper services should browse FreeBSD's source at lib/libcasper/services for examples from system service libraries.
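As a much smaller illustration than cap_net(3), the toy limit function below models a service whose limits are a bitmask of allowed operations. It assumes, as the verify_*_newlimits(oldlimits, ...) calls above suggest, that a limit function should accept new limits only when they are no broader than the old ones. All demo_ and DEMO_ names are hypothetical, and the EPERM fallback exists only so the sketch builds on systems that do not define ENOTCAPABLE.

```c
#include <errno.h>
#include <stdint.h>

#ifndef ENOTCAPABLE
#define ENOTCAPABLE EPERM /* Stand-in where the errno is undefined. */
#endif

/* Hypothetical operations that a toy service could expose. */
#define DEMO_OP_BIND      (UINT64_C(1) << 0)
#define DEMO_OP_CONNECT   (UINT64_C(1) << 1)
#define DEMO_OP_NAME2ADDR (UINT64_C(1) << 2)

/*
 * Toy limit function: accept newlimits only if every operation it
 * allows is already allowed by *oldlimits, then install it.
 * Returns 0 on success or ENOTCAPABLE if the request would widen
 * the current limits.
 */
int
demo_limit(uint64_t *oldlimits, uint64_t newlimits)
{
    if ((newlimits & ~*oldlimits) != 0)
        return (ENOTCAPABLE);
    *oldlimits = newlimits;
    return (0);
}
```

Starting from all three operations, a program could narrow the set to DEMO_OP_CONNECT | DEMO_OP_NAME2ADDR, after which a later request for DEMO_OP_BIND is refused and the installed limits stay unchanged. Real services track richer state, such as allowed addresses or host names, but the narrowing check has the same shape.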

Recap: Casper Service Components

A casper service is composed of four major components:

Functions prefixed with cap_ that issue commands to a casper service using cap_xfer_nvlist(3).
A command_func that executes command-dependent code outside of the sandbox and returns newly acquired resources.
A limit_func that restricts what the service can be used for.
A CREATE_SERVICE(3) macro that glues the service together.

Although it is technically possible to create a casper service that returns any named resource, that would defeat the point of isolating a program in a sandbox. Well designed casper services have a limited, restrictive interface so they cannot be exploited in this manner.

Detecting Violations

When a program is placed in capability mode, it is not always obvious if it is following the rules of the sandbox. Functions that try to open restricted resources will raise capability violations and return with errno set to ECAPMODE: Not permitted in capability mode. Even with proper error checking, hunting down capability violations can take a lot of time. Luckily, the ktrace(2) kernel tracing utility can find violations for us.

Programs traditionally need to be put into capability mode before they will report violations, but ktrace(2) can record violations when a program is NOT in capability mode. This means that any developer can run capability violation tracing on their program with no modification to see where it is raising violations. Since the program is never actually put into capability mode, it will still acquire resources and execute normally.

Violation tracing using ktrace(2) can be started by adding two function calls at the start of any program:

open("ktrace.out", O_WRONLY | O_CREAT | O_TRUNC, 0600);
ktrace("ktrace.out", KTROP_SET, KTRFAC_CAPFAIL, getpid());

This snippet creates an output file for ktrace(2) and specifies the KTRFAC_CAPFAIL trace point so capability failures are recorded.

NOTE: The ktrace(2) manual page gives a detailed explanation on using the ktrace(2) system call. Enabling other trace points, like KTRFAC_NAMEI to record file name lookups, can help pinpoint the origin of a file system violation.

The cap_violate routine, shown below, attempts to raise every type of violation that ktrace(2) can capture. It is not important to understand what the routine is doing, just that it raises capability violations.

open("ktrace.out", O_WRONLY | O_CREAT | O_TRUNC, 0600);
ktrace("ktrace.out", KTROP_SET, KTRFAC_CAPFAIL, getpid());

cap_rights_init(&rights, CAP_READ);
caph_rights_limit(STDERR_FILENO, &rights);
write(STDERR_FILENO, &val, sizeof(val));

cap_rights_set(&rights, CAP_WRITE);
caph_rights_limit(STDERR_FILENO, &rights);

kinf.kf_structsize = sizeof(struct kinfo_file);
fcntl(STDIN_FILENO, F_KINFO, &kinf);

socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);

addr.sin_family = AF_INET;
addr.sin_port = htons(5000);
addr.sin_addr.s_addr = INADDR_ANY;
bind(socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP),
    (const struct sockaddr *)&addr, sizeof(addr));
sendto(fd, NULL, 0, 0, (const struct sockaddr *)&addr, sizeof(addr));

kill(getppid(), SIGCONT);

openat(AT_FDCWD, "/", O_RDONLY);

CPU_SET(0, &cpuset_mask);
cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, getppid(),
    sizeof(cpuset_mask), &cpuset_mask);

Once a process is being traced, trace data will be recorded until the process exits or the trace point is cleared. After one of these conditions is met, the resulting ktrace(2) dump can be converted into a human-readable format using the kdump(1) program.

# ./cap_violate
# kdump
  1915 cap_violate CAP   operation requires CAP_WRITE, descriptor holds CAP_READ
  1915 cap_violate CAP   attempt to increase capabilities from CAP_READ to CAP_READ,CAP_WRITE
  1915 cap_violate CAP   system call not allowed: fcntl, cmd: F_KINFO
  1915 cap_violate CAP   socket: protocol not allowed: IPPROTO_ICMP
  1915 cap_violate CAP   system call not allowed: bind
  1915 cap_violate CAP   sendto: restricted address lookup: struct sockaddr { AF_INET, 0.0.0.0:5000 }
  1915 cap_violate CAP   kill: signal delivery not allowed: SIGCONT
  1915 cap_violate CAP   openat: restricted VFS lookup: AT_FDCWD
  1915 cap_violate CAP   cpuset_setaffinity: restricted cpuset operation

Every capability violation in the cap_violate program translates to a CAP record in the kdump(1) output. Developers can use this output to find, and replace, code that is raising violations.

Unlike cap_violate, most real-world programs should try to avoid capability violations rather than raise them. The code block below shows the kdump(1) output after tracing archive extraction using unzip(1) (pre-Capsicumization) with the KTRFAC_CAPFAIL and KTRFAC_NAMEI trace points enabled.

# unzip foo.zip
# kdump
  1926 unzip    NAMI  "foo.zip"
  1926 unzip    CAP   openat: restricted VFS lookup: AT_FDCWD
  1926 unzip    CAP   system call not allowed: open
  1926 unzip    NAMI  "/etc/localtime"
  1926 unzip    NAMI  "bar"
  1926 unzip    CAP   fstatat: restricted VFS lookup: AT_FDCWD
  1926 unzip    CAP   system call not allowed: mkdir
  1926 unzip    NAMI  "bar"
  1926 unzip    NAMI  "bar"
  1926 unzip    CAP   fstatat: restricted VFS lookup: AT_FDCWD
  1926 unzip    NAMI  "bar/bar.txt"
  1926 unzip    CAP   fstatat: restricted VFS lookup: AT_FDCWD
  1926 unzip    NAMI  "bar/bar.txt"
  1926 unzip    CAP   openat: restricted VFS lookup: AT_FDCWD
  1926 unzip    NAMI  "baz"
  1926 unzip    CAP   fstatat: restricted VFS lookup: AT_FDCWD
  1926 unzip    CAP   system call not allowed: mkdir
  1926 unzip    NAMI  "baz"
  1926 unzip    NAMI  "baz"
  1926 unzip    CAP   fstatat: restricted VFS lookup: AT_FDCWD
  1926 unzip    NAMI  "baz/baz.txt"
  1926 unzip    CAP   fstatat: restricted VFS lookup: AT_FDCWD
  1926 unzip    NAMI  "baz/baz.txt"
  1926 unzip    CAP   openat: restricted VFS lookup: AT_FDCWD

This output is more akin to what a developer would see from their program. unzip(1) is recreating the file structure contained in the zip archive. All open(2), stat(2), and mkdir(2) calls are covertly translated into their *at() equivalents with AT_FDCWD in place of the relative reference descriptor. This conversion is not done by unzip(1), but by libc. The AT_FDCWD value cannot be used in capability mode, so a violation is raised. These violations can be avoided by opening a current directory descriptor (synonymous to AT_FDCWD) before entering capability mode and passing that descriptor into openat(2), fstatat(2), and mkdirat(2) as a relative reference.

NOTE: Violations always raise errors in capability mode, but they are not treated as errors while tracing, so program behavior may differ.

Violation tracing is just another tool in the developer's toolbox. It only takes a few seconds to run a program under ktrace(2) and the result is almost always a decent starting point for sandboxing your program using Capsicum.

Capabilities

Despite their similar names, capability mode and capabilities are different kernel primitives.

Capability mode is Capsicum's implementation of a security sandbox.
A capability is a file descriptor that has been extended to possess rights.

If a user wants to read(2) a capability descriptor, then the descriptor must possess the CAP_READ right. If the user wants to bind(2) a socket capability descriptor, then the descriptor must possess the CAP_BIND right. There are fine-grained rights for nearly every descriptor operation; see the rights(4) manual page for the full list.

When a file descriptor is created, using open(2), socket(2), etc., it is given a full set of rights. The cap_rights_limit(2) function may be used to limit the rights of a descriptor.

/*
 * Limit a file descriptor to be read-only.
 */
cap_rights_t rights;
int fd;
char buf[1] = { 'x' };

fd = open("/home/jfree/foo", O_RDWR);

cap_rights_init(&rights, CAP_READ);
cap_rights_limit(fd, &rights);

if (read(fd, buf, sizeof(buf)) < 0)
    printf("This will not happen because we have CAP_READ\n");

if (write(fd, buf, sizeof(buf)) < 0)
    printf("This will happen because we are missing CAP_WRITE\n");

Capabilities are designed around the principle of minimizing rights. Once a descriptor's rights have been limited, it should not be able to perform actions outside of its rights. A descriptor can always have its rights limited, but never extended.

/*
 * Attempt to extend a descriptor's rights.
 */
cap_rights_t rights;
int fd;

fd = open("/home/jfree/foo", O_RDWR);

cap_rights_init(&rights, CAP_READ);
cap_rights_limit(fd, &rights);

cap_rights_set(&rights, CAP_WRITE);
if (cap_rights_limit(fd, &rights) < 0)
    printf("Failed to apply rights; rights can never be extended\n");

In cases where sandboxing a program is too restrictive, developers can instead limit capability descriptors. Capabilities grant fine control over what operations are allowed, but programs that use this kind of protection are vulnerable to human error. If a developer forgets to limit a capability, they introduce the possibility of malicious code misusing it.

Developers that want to maximize security can use capabilities inside of capability mode. Capabilities provide granular functionality that capability mode does not offer. For example, restricting a descriptor to only allow read(2) is possible with capabilities, but not capability mode.

Security With Capsicum

Both capability mode and capabilities were designed to make programs safer. Capability mode provides definite security by isolating a program from the rest of the system. Capabilities offer a more flexible, but less rigorous, way to limit a program's rights. If either primitive is properly integrated, a developer can rest assured that their program is safer than it was before introducing Capsicum.

References

¹ "Cost of a Data Breach Report 2023". IBM Corporation. 2023-07-24. Retrieved 2023-08-31.

Relevant Materials

https://www.cl.cam.ac.uk/research/security/capsicum/

https://www.usenix.org/legacy/events/sec10/tech/full_papers/Watson.pdf

https://wiki.freebsd.org/Capsicum