Sandbox Your Program Using FreeBSD's Capsicum
By Jake Freeland on September 1, 2023
In an age where data breaches cost upwards of four million dollars on average, it is surprising that only 51% of companies plan to increase their security investments [1]. This figure can likely be attributed to the high cost of implementing robust security solutions.
A lot of security frameworks are complicated. Capsicum differentiates itself with its simplicity. Developers can integrate Capsicum with relative ease, securing their programs and reducing those high implementation costs.
This article showcases what Capsicum is capable of, and provides a thorough guide on integrating the framework into new and existing programs.
Capsicum
Capsicum is a lightweight security framework that provides primitives for limiting the capabilities of a program. More specifically, Capsicum allows developers to isolate their programs in a security sandbox. The framework is designed around the principle of least privilege, where programs only have access to resources that are required for operation.
The Capability Sandbox
On systems that support Capsicum, a program may be sandboxed using cap_enter(2):
#include <sys/capsicum.h>
#include <stdio.h>
int
main(void)
{
/* Enter Capsicum's capability mode. */
cap_enter();
printf("Hello world from capability mode\n");
return (0);
}
NOTE: Code snippets in this document omit error handling for brevity. In practice, error handling should be done where applicable.
There is no need to link against extra libraries; Capsicum support is bundled into the standard C library (libc).
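Since the snippets in this article skip error checks, here is a minimal sketch of what a checked entry into capability mode might look like. The helper name enter_sandbox() and the __FreeBSD__ guard are assumptions made for illustration; the guard only exists so the file still compiles where <sys/capsicum.h> is unavailable. Treating ENOSYS (a kernel built without capability mode) as non-fatal is a policy choice, not a requirement.

```c
#include <errno.h>
#ifdef __FreeBSD__
#include <sys/capsicum.h>
#endif

/*
 * Enter capability mode and report failure to the caller.
 * Returns 0 on success and -1 on error, with errno left set.
 * ENOSYS is tolerated so the program still runs on kernels
 * that lack capability mode (an assumed policy).
 */
int
enter_sandbox(void)
{
#ifdef __FreeBSD__
	if (cap_enter() < 0 && errno != ENOSYS)
		return (-1);
#endif
	return (0);
}
```

A caller would typically treat a -1 return as fatal and exit, since continuing to run unsandboxed defeats the purpose of the call.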
Capsicum allows developers to sandbox their programs using capability mode.
Once a program enters capability mode using cap_enter(2), it will not be
able to acquire new resources by itself. For example, opening a file using
open(2) will trigger a capability violation, causing the culprit function
to fail and set errno to ECAPMODE: Not permitted in capability mode.
One approach to program Capsicumization is to open all resources before entering capability mode. Resources that belong to the program beforehand may be used inside of the sandbox.
If a program requires access to an unknown number of resources that exist in
a certain subdomain, then the openat(2), mkdirat(2), bindat(2), and other *at()
system calls may be useful. These functions require an additional descriptor
argument that serves as a relative reference to open new resources.
int dirfd;
dirfd = open("/home/jfree", O_RDONLY | O_DIRECTORY);
cap_enter();
/* Open "/home/jfree/foo". */
if (openat(dirfd, "foo", O_RDONLY) < 0)
printf("This will not happen\n");
In this case, dirfd is obtained before entering capability mode. Once the
program enters capability mode, dirfd is used as a relative reference point
to access /home/jfree/foo. It is not possible to access resources outside of
the subdirectory domain provided by a relative reference.
int dirfd;
dirfd = open("/home/jfree", O_RDONLY | O_DIRECTORY);
cap_enter();
/* Open "/home/beastie". */
if (openat(dirfd, "../beastie", O_RDONLY) < 0)
printf("This will happen\n");
The openat(2) call will fail because the ../beastie path leads to a directory
that does not fall under the directory hierarchy provided by dirfd.
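The rule can be illustrated with a small user-space check. This is only a sketch: path_stays_beneath() is a hypothetical helper, it looks at the path text alone (it knows nothing about symbolic links), and it is not how the kernel enforces the restriction. It simply reports whether a relative path could climb above the directory it starts from.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if a lookup of @path can never climb above the
 * directory it starts from. Absolute paths are rejected outright.
 * The check is purely lexical; symbolic links are not considered.
 */
bool
path_stays_beneath(const char *path)
{
	int depth = 0;

	if (path[0] == '/')
		return (false);
	while (*path != '\0') {
		size_t len = strcspn(path, "/");

		if (len == 2 && path[0] == '.' && path[1] == '.') {
			/* ".." moves up one level; below zero escapes. */
			if (--depth < 0)
				return (false);
		} else if (len > 0 && !(len == 1 && path[0] == '.')) {
			/* An ordinary component moves down one level. */
			depth++;
		}
		path += len;
		while (*path == '/')
			path++;
	}
	return (true);
}
```

With this helper, "foo" and "a/../b" are accepted, while "../beastie" and "/etc/passwd" are rejected. A check like this can produce a friendlier error message before the kernel refuses the lookup, but the sandbox remains the actual enforcement point.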
Interprocess communication is also curtailed. A process in capability mode may
not send signals to other processes, and named shared memory objects are
prohibited. Certain system interfaces are not allowed at all, such as reboot(2)
and kldload(2). Fortunately, there are ways to work around all of these
restrictions.
Sandboxing Compartments
Pre-opening resources works great for programs that have predictable resource requirements, but some programs require resources on-demand. In this case, the developer can opt to only sandbox specific parts of their program.
If the entirety of a program will not work in a sandbox, it may be possible to compartmentalize it. Compartmentalization is the act of splitting a program up into compartments, each with their own basic purpose. With a compartmentalized architecture, a developer can keep trusted code outside of the sandbox, but isolate insecure, or dangerous, code inside of a sandboxed compartment. If a security vulnerability is found in the dangerous code, it will be isolated.
Capsicum provides an intuitive interface for sandboxing specific parts of a program. At any point, a program can spawn a new child process that executes dangerous code inside of capability mode. Interprocess communication primitives like pipes and sockets can allow data exchange without raising capability violations.
pid_t pid;
int pipefd[2], result;
pipe(pipefd);
/*
* Create a child process and isolate it in a capability
* sandbox where it can execute dangerous code.
*/
pid = fork();
if (pid == 0) {
close(pipefd[0]);
cap_enter();
result = dangerous_function();
write(pipefd[1], &result, sizeof(result));
exit(0);
}
close(pipefd[1]);
/* Fetch result from sandboxed child. */
read(pipefd[0], &result, sizeof(result));
printf("Result: %d\n", result);
/* Continue normal execution in parent. */
The parent process can live outside of the sandbox while its child executes dangerous code in isolation. If a program is already compartmentalized, then its developer can start by sandboxing each compartment. Most compartments will likely need some refactoring for capability mode, but the developer can pick and choose what needs to be sandboxed. When done right, this is less work than sandboxing the entire program, with a substantial increase in security.
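The fork-and-pipe pattern above can be wrapped in a reusable helper with the error handling filled in. This is a sketch under stated assumptions: run_compartment() and example_worker() are hypothetical names, the __FreeBSD__ guard is only there so the code compiles on systems without Capsicum, and the compartment is assumed to return a single int.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <unistd.h>
#ifdef __FreeBSD__
#include <sys/capsicum.h>
#endif

/* Stand-in for the code that should run inside the sandbox. */
int
example_worker(void)
{
	return (42);
}

/*
 * Run @fn in a child process and store its return value in @out.
 * On FreeBSD the child enters capability mode before calling @fn.
 * Returns 0 on success and -1 if any step fails.
 */
int
run_compartment(int (*fn)(void), int *out)
{
	int pipefd[2], result, status;
	ssize_t n;
	pid_t pid;

	if (pipe(pipefd) < 0)
		return (-1);
	pid = fork();
	if (pid < 0) {
		close(pipefd[0]);
		close(pipefd[1]);
		return (-1);
	}
	if (pid == 0) {
		close(pipefd[0]);
#ifdef __FreeBSD__
		if (cap_enter() < 0 && errno != ENOSYS)
			_exit(1);
#endif
		result = fn();
		n = write(pipefd[1], &result, sizeof(result));
		_exit(n == (ssize_t)sizeof(result) ? 0 : 1);
	}
	close(pipefd[1]);
	/* read(2) returns a byte count; the payload lands in result. */
	n = read(pipefd[0], &result, sizeof(result));
	close(pipefd[0]);
	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR)
			return (-1);
	}
	if (n != (ssize_t)sizeof(result) || !WIFEXITED(status) ||
	    WEXITSTATUS(status) != 0)
		return (-1);
	*out = result;
	return (0);
}
```

A caller would pass its own function, for example run_compartment(example_worker, &value). Checking the child's exit status matters here: if the sandboxed code crashes, the parent sees a failure instead of silently using an uninitialized result.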
Requesting Resources With libcasper(3)
Some programs were not designed to be compartmentalized. Developers of these
programs could rearchitect their software, but this often requires a lot of
time and resources.
Luckily, the libcasper(3) library assists developers that have complex programs
where compartmentalization is not effective. Developers can use the interface
provided by libcasper(3) to acquire new resources while inside of the
capability sandbox.
Using a Casper Service
Before a program enters capability mode, it can open a communication channel with a casper service. Casper services are processes that run outside of the capability sandbox, alongside a calling process. The aforementioned communication channel can be used to request new resources from the casper service.
NOTE: libcasper(3) channels should be opened before entering capability mode, otherwise the casper process will inherit the parent's sandbox.
cap_channel_t *cap_casper, *cap_net;
struct addrinfo *res;
int s;
/* Acquire the capability to access libcasper(3) services. */
cap_casper = cap_init();
/*
* Use the cap_casper capability to open a communication
* channel with the "system.net" casper service.
*/
cap_net = cap_service_open(cap_casper, "system.net");
/*
* We do not have any more casper services to open.
* Close the casper capability.
*/
cap_close(cap_casper);
/*
* Use the "cap_" variant of getaddrinfo(), provided by
* the cap_net(3) library.
*/
cap_getaddrinfo(cap_net, "freebsd.org", "80", NULL, &res);
s = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
/*
* Use the "cap_" variant of connect(), provided by
* the cap_net(3) library.
*/
cap_connect(cap_net, s, res->ai_addr, res->ai_addrlen);
The cap_net(3) library uses libcasper(3) to provide capability-enabled libc
networking functions that would otherwise fail with ECAPMODE. Functions that
use the libcasper(3) interface are conventionally prefixed with cap_ to
indicate that they succeed inside of capability mode. There are several other
casper service libraries available, similar to cap_net(3), that provide
cap_-prefixed libc functions. The full list can be found on the libcasper(3)
manual page.
If a program supports building without libcasper(3), the developer can use
cap_ functions without surrounding them with #ifdef WITH_CASPER. Most casper
services define their cap_ functions as macros that expand to their non-cap_
form when WITH_CASPER is not defined.
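The mechanism can be sketched as follows. In a real program the casper headers supply both branches themselves; the #else branch below is written out by hand only to show the idea, and its exact form is an assumption rather than a copy of the system header. The resolve_numeric() function is a hypothetical caller that compiles identically in either configuration.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <string.h>

#ifdef WITH_CASPER
#include <libcasper.h>
#include <casper/cap_net.h>
#else
/* Hand-written stand-ins; the real headers provide their own. */
typedef struct cap_channel cap_channel_t;
/* Drop the channel argument and call the libc function directly. */
#define	cap_getaddrinfo(chan, host, serv, hints, res)	\
	getaddrinfo(host, serv, hints, res)
#endif

/*
 * Resolve a numeric host and service without any #ifdef at the
 * call site. Returns the getaddrinfo(3) error code (0 on success).
 */
int
resolve_numeric(cap_channel_t *chan, const char *host, const char *serv,
    struct addrinfo **res)
{
	struct addrinfo hints;

	(void)chan;	/* Unused when the macro fallback is active. */
	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;
	hints.ai_socktype = SOCK_STREAM;
	hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;
	return (cap_getaddrinfo(chan, host, serv, &hints, res));
}
```

Because the fallback ignores the channel, a build without casper can pass NULL for it, and the call sites stay free of conditional compilation.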
Creating a Casper Service
Casper service libraries do an excellent job at hiding the libcasper(3)
interface so program developers do not need to interact with it. Sometimes a
program needs access to a resource that is not provided by an existing casper
service library. In this circumstance, program developers can create their
own libcasper(3) service.
All libcasper(3) services are built on top of the CREATE_SERVICE(3) macro:
CREATE_SERVICE(name, limit_func, command_func, flags);
Each argument in CREATE_SERVICE(3) serves an important purpose, but the
command_func function pointer is noteworthy because it is a service's main
routine. When a service is opened using cap_service_open(3), it will wait for
a command. Once a command is received, it is passed into command_func's cmd
argument.
/*
* The command function used by the cap_net(3) casper service
* library.
*/
static int
net_command(const char *cmd, const nvlist_t *limits, nvlist_t *nvlin,
nvlist_t *nvlout)
{
if (strcmp(cmd, "bind") == 0)
return (net_bind(limits, nvlin, nvlout));
else if (strcmp(cmd, "connect") == 0)
return (net_connect(limits, nvlin, nvlout));
else if (strcmp(cmd, "gethostbyname") == 0)
return (net_gethostbyname(limits, nvlin, nvlout));
else if (strcmp(cmd, "gethostbyaddr") == 0)
return (net_gethostbyaddr(limits, nvlin, nvlout));
else if (strcmp(cmd, "getnameinfo") == 0)
return (net_getnameinfo(limits, nvlin, nvlout));
else if (strcmp(cmd, "getaddrinfo") == 0)
return (net_getaddrinfo(limits, nvlin, nvlout));
return (EINVAL);
}
CREATE_SERVICE("system.net", net_limit, net_command, 0);
Most command_funcs compare their cmd string against a set of string literals.
If there is a match, then an associated function is called. The casper service
exists outside of the capability sandbox, so all functions called inside of
command_func will execute with ambient authority, meaning that they can
acquire new resources at will.
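The same dispatch can also be written as a lookup table, which some developers find easier to extend than a chain of strcmp() calls. The sketch below is not taken from any casper service: struct request is a hypothetical stand-in for the nvlist arguments, and the "double" and "negate" commands are invented purely to make the example self-contained.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical request type standing in for the nvlist arguments. */
struct request {
	int value;
};

typedef int (*handler_t)(const struct request *in, struct request *out);

static int
do_double(const struct request *in, struct request *out)
{
	out->value = in->value * 2;
	return (0);
}

static int
do_negate(const struct request *in, struct request *out)
{
	out->value = -in->value;
	return (0);
}

/* Each supported command string maps to exactly one handler. */
static const struct {
	const char *cmd;
	handler_t handler;
} handlers[] = {
	{ "double", do_double },
	{ "negate", do_negate },
};

/*
 * Look up @cmd and run its handler. Unknown commands are rejected
 * with EINVAL, as in the strcmp() chain.
 */
int
dispatch(const char *cmd, const struct request *in, struct request *out)
{
	size_t i;

	for (i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++) {
		if (strcmp(cmd, handlers[i].cmd) == 0)
			return (handlers[i].handler(in, out));
	}
	return (EINVAL);
}
```

Either style works. What matters for security is that only explicitly listed commands reach code running with ambient authority.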
A program and a casper service may exchange resources using the
cap_xfer_nvlist(3) function. The command string and resources necessary for
the given command must be wrapped in an nvlist(9) before being transferred.
Nvlists can store numbers, strings, binary data, descriptors, and other
nvlists. Each resource must be accompanied by an identifying name that can be
used for retrieval.
/*
* Send the "bind" command to the casper service linked to @chan.
*/
static int
cap_bind(cap_channel_t *chan, int sockfd, const struct sockaddr *addr,
socklen_t addrlen)
{
nvlist_t *nvl = nvlist_create(0);
int error;
nvlist_add_string(nvl, "cmd", "bind");
nvlist_add_descriptor(nvl, "sockfd", sockfd);
nvlist_add_binary(nvl, "addr", addr, addrlen);
nvl = cap_xfer_nvlist(chan, nvl);
if (nvl == NULL)
return (-1);
error = nvlist_get_number(nvl, "error");
if (error != 0) {
nvlist_destroy(nvl);
errno = error;
return (-1);
}
error = dup2(sockfd, nvlist_get_descriptor(nvl, "sockfd"));
nvlist_destroy(nvl);
return (error == -1 ? -1 : 0);
}
The cap_xfer_nvlist(3) function sends the nvl nvlist to the casper service
linked to chan. Once the nvlist reaches the casper service, the command string
is extracted and the service's command_func is called.
Recall this snippet from net_command():
if (strcmp(cmd, "bind") == 0)
return (net_bind(limits, nvlin, nvlout));
The cap_bind() function sends the "bind" command string, so this strcmp()
condition is met and net_bind() is called.
/*
* Simplified version of the net_bind() function.
* Responsible for extracting arguments from @nvlin, calling bind(2),
* and then adding the return value to @nvlout.
*/
static int
net_bind(const nvlist_t *limits __unused, nvlist_t *nvlin, nvlist_t *nvlout)
{
int sockfd;
const void *addr;
size_t len;
addr = nvlist_get_binary(nvlin, "addr", &len);
sockfd = nvlist_take_descriptor(nvlin, "sockfd");
if (bind(sockfd, addr, len) < 0) {
int serrno = errno;
close(sockfd);
return (serrno);
}
nvlist_move_descriptor(nvlout, "sockfd", sockfd);
return (0);
}
When bind(2) succeeds, the socket must be transferred back into the capability
sandbox. The nvlist_move_descriptor(9) function moves the socket descriptor
into the nvlout nvlist, which takes ownership of the descriptor.
Recall this snippet from cap_bind():
nvl = cap_xfer_nvlist(chan, nvl);
if (nvl == NULL)
return (-1);
The return value of cap_xfer_nvlist(3) is the nvlout nvlist that owns the
bound socket descriptor. This descriptor can be borrowed from the returned
nvlist using nvlist_get_descriptor(nvl, "sockfd").
Limiting a Casper Service
The interfaces provided by casper services are often too permissive. When using
cap_net(3), it is probable that you only need a subset of the functions that
the service provides. Most service libraries define their own limitations so
developers can disable functions that their program does not use.
/*
* Use cap_net(3)'s limitations to disable everything except
* for resolving the address of freebsd.org on port 80.
*
* This assumes that the cap_net(3) service has already been
* opened and is listening on @cap_net.
*/
cap_net_limit_t *limit;
int familylimit;
/* Allow only name resolution (cap_getaddrinfo(3)). */
limit = cap_net_limit_init(cap_net, CAPNET_NAME2ADDR);
/* Limit name resolution to "freebsd.org" on port 80. */
cap_net_limit_name2addr(limit, "freebsd.org", "80");
/* Limit name resolution to IPv4 addresses. */
familylimit = AF_INET;
cap_net_limit_name2addr_family(limit, &familylimit, 1);
/* Apply the limits to cap_net. */
cap_net_limit(limit);
Every service library offers different limitations. There are too many usage patterns to note here, so it is recommended that developers refer to service library manual pages for specifics and examples.
Creating Limitations for a Casper Service
Developers may specify a limit_func function pointer in CREATE_SERVICE(3) to
limit a service's interface. When cap_limit_set(3) is called by a program, the
provided limits are redirected to the casper service's limit_func, where they
are applied accordingly. The flexibility of this system allows services to
finely limit their interface.
Most service libraries create a wrapper around cap_limit_set(3) that takes a
custom _limit_t type. This custom type is defined by the service library and
keeps track of service-specific limitations.
As limitations become more granular, their implementation can quickly become
complicated. Services like cap_net(3) offer extensive control over what goes
through their interface, forcing their limit functions to handle every edge
case. A quick look at cap_net(3)'s limit function reveals separate verification
functions for each "limitation mode" to ensure that limits are being correctly
applied and enforced.
/*
* A snippet from cap_net(3)'s net_limit() function.
* Some code was cut from this routine for clarity.
* See "lib/libcasper/services/cap_net/cap_net.c" for context.
*/
while ((name = nvlist_next(newlimits, NULL, &cookie)) != NULL) {
/* ... */
if (strcmp(name, LIMIT_NV_BIND) == 0) {
hasbind = true;
if (!verify_bind_newlimts(oldlimits,
cnvlist_get_nvlist(cookie))) {
return (ENOTCAPABLE);
}
} else if (strcmp(name, LIMIT_NV_CONNECT) == 0) {
hasconnect = true;
if (!verify_connect_newlimits(oldlimits,
cnvlist_get_nvlist(cookie))) {
return (ENOTCAPABLE);
}
} else if (strcmp(name, LIMIT_NV_ADDR2NAME) == 0) {
hasaddr2name = true;
if (!verify_addr2name_newlimits(oldlimits,
cnvlist_get_nvlist(cookie))) {
return (ENOTCAPABLE);
}
} else if (strcmp(name, LIMIT_NV_NAME2ADDR) == 0) {
hasname2addr = true;
if (!verify_name2addr_newlimits(oldlimits,
cnvlist_get_nvlist(cookie))) {
return (ENOTCAPABLE);
}
}
}
Limitation functions are naturally dependent on the service that they limit.
For this reason, there is no clear pattern to writing them. Developers
interested in creating limitations for casper services should browse FreeBSD's
source at lib/libcasper/services for examples from system service libraries.
Recap: Casper Service Components
A casper service is composed of four major components:
- Functions prefixed with cap_ that issue commands to a casper service using cap_xfer_nvlist(3).
- A command_func that executes command-dependent code outside of the sandbox and returns newly acquired resources.
- A limit_func that restricts what the service can be used for.
- A CREATE_SERVICE(3) macro that glues the service together.
Although it is technically possible to create a casper service that returns any named resource, that would defeat the point of isolating a program in a sandbox. Well designed casper services have a limited, restrictive interface so they cannot be exploited in this manner.
Detecting Violations
When a program is placed in capability mode, it is not always obvious if
it is following the rules of the sandbox. Functions that try to open restricted
resources will raise capability violations and return with errno set to
ECAPMODE: Not permitted in capability mode. Even with proper error checking,
hunting down capability violations can take a lot of time. Luckily, the
ktrace(2) kernel tracing facility can find violations for us.
Programs traditionally need to be put into capability mode before they will
report violations, but ktrace(2) can record violations when a program is NOT
in capability mode. This means that any developer can run capability violation
tracing on their program with no modification to see where it is raising
violations. Since the program is never actually put into capability mode, it
will still acquire resources and execute normally.
Violation tracing using ktrace(2) can be started by adding two function calls
at the start of any program:
open("ktrace.out", O_RDONLY | O_CREAT | O_TRUNC, 0600);
ktrace("ktrace.out", KTROP_SET, KTRFAC_CAPFAIL, getpid());
This snippet creates an output file for ktrace(2) and specifies the
KTRFAC_CAPFAIL trace point so capability failures are recorded.
NOTE: The ktrace(2) manual page gives a detailed explanation on using the ktrace(2) system call. Enabling other trace points, like KTRFAC_NAMEI to record file name lookups, can help pinpoint the origin of a file system violation.
The cap_violate routine, shown below, attempts to raise every type of
violation that ktrace(2) can capture. It is not important to understand
what the routine is doing, just that it raises capability violations.
open("ktrace.out", O_RDONLY | O_CREAT | O_TRUNC, 0600);
ktrace("ktrace.out", KTROP_SET, KTRFAC_CAPFAIL, getpid());
cap_rights_init(&rights, CAP_READ);
caph_rights_limit(STDERR_FILENO, &rights);
write(STDERR_FILENO, &val, sizeof(val));
cap_rights_set(&rights, CAP_WRITE);
caph_rights_limit(STDERR_FILENO, &rights);
kinf.kf_structsize = sizeof(struct kinfo_file);
fcntl(STDIN_FILENO, F_KINFO, &kinf);
socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
addr.sin_family = AF_INET;
addr.sin_port = htons(5000);
addr.sin_addr.s_addr = INADDR_ANY;
bind(socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP),
(const struct sockaddr *)&addr, sizeof(addr));
sendto(fd, NULL, 0, 0, (const struct sockaddr *)&addr, sizeof(addr));
kill(getppid(), SIGCONT);
openat(AT_FDCWD, "/", O_RDONLY);
CPU_SET(0, &cpuset_mask);
cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, getppid(),
sizeof(cpuset_mask), &cpuset_mask);
Once a process is being traced, trace data will be recorded until the process
exits or the trace point is cleared. After one of these conditions is met,
the resulting ktrace(2) dump can be converted into human-readable format using
the kdump(1) program.
# ./cap_violate
# kdump
1915 cap_violate CAP operation requires CAP_WRITE, descriptor holds CAP_READ
1915 cap_violate CAP attempt to increase capabilities from CAP_READ to CAP_READ,CAP_WRITE
1915 cap_violate CAP system call not allowed: fcntl, cmd: F_KINFO
1915 cap_violate CAP socket: protocol not allowed: IPPROTO_ICMP
1915 cap_violate CAP system call not allowed: bind
1915 cap_violate CAP sendto: restricted address lookup: struct sockaddr { AF_INET, 0.0.0.0:5000 }
1915 cap_violate CAP kill: signal delivery not allowed: SIGCONT
1915 cap_violate CAP openat: restricted VFS lookup: AT_FDCWD
1915 cap_violate CAP cpuset_setaffinity: restricted cpuset operation
Every capability violation in the cap_violate program translates to a CAP
record in the kdump(1) output. Developers can use this output to find and
replace code that is raising violations.
Most real-world programs should try to avoid capability violations instead
of raising them like cap_violate does. The code block below shows the kdump(1)
output after tracing archive extraction using unzip(1) (pre-Capsicumization)
with the KTRFAC_CAPFAIL and KTRFAC_NAMEI trace points enabled.
# unzip foo.zip
# kdump
1926 unzip NAMI "foo.zip"
1926 unzip CAP openat: restricted VFS lookup: AT_FDCWD
1926 unzip CAP system call not allowed: open
1926 unzip NAMI "/etc/localtime"
1926 unzip NAMI "bar"
1926 unzip CAP fstatat: restricted VFS lookup: AT_FDCWD
1926 unzip CAP system call not allowed: mkdir
1926 unzip NAMI "bar"
1926 unzip NAMI "bar"
1926 unzip CAP fstatat: restricted VFS lookup: AT_FDCWD
1926 unzip NAMI "bar/bar.txt"
1926 unzip CAP fstatat: restricted VFS lookup: AT_FDCWD
1926 unzip NAMI "bar/bar.txt"
1926 unzip CAP openat: restricted VFS lookup: AT_FDCWD
1926 unzip NAMI "baz"
1926 unzip CAP fstatat: restricted VFS lookup: AT_FDCWD
1926 unzip CAP system call not allowed: mkdir
1926 unzip NAMI "baz"
1926 unzip NAMI "baz"
1926 unzip CAP fstatat: restricted VFS lookup: AT_FDCWD
1926 unzip NAMI "baz/baz.txt"
1926 unzip CAP fstatat: restricted VFS lookup: AT_FDCWD
1926 unzip NAMI "baz/baz.txt"
1926 unzip CAP openat: restricted VFS lookup: AT_FDCWD
This output is more akin to what a developer would see from their program.
unzip(1) is recreating the file structure contained in the zip archive.
All open(2), fstat(2), and mkdir(2) calls are covertly translated into their
*at() equivalents with AT_FDCWD in place of the relative reference descriptor.
This conversion is not done by unzip(1), but by libc. The AT_FDCWD value cannot
be used in capability mode, so a violation is raised.
These violations can be avoided by opening a current directory descriptor
(synonymous with AT_FDCWD) before entering capability mode and passing that
descriptor into openat(2), fstatat(2), and mkdirat(2) as a relative
reference.
NOTE: Violations always raise errors in capability mode, but they are not treated as errors while tracing, so program behavior may differ.
Violation tracing is just another tool in the developer's toolbox. It only
takes a few seconds to run a program under ktrace(2)
and the result is
almost always a decent starting point for sandboxing your program using
Capsicum.
Capabilities
Despite their similar names, capability mode and capabilities are different kernel primitives.
- Capability mode is Capsicum's implementation of a security sandbox.
- A capability is a file descriptor that has been extended to possess rights.
If a user wants to read(2) a capability descriptor, then the descriptor must
possess the CAP_READ right. If the user wants to bind(2) a socket capability
descriptor, then the descriptor must possess the CAP_BIND right.
There are fine-grained rights for nearly every descriptor operation; see the
rights(4) manual page for the full list.
When a file descriptor is created using open(2), socket(2), etc., it is
given a full set of rights. The cap_rights_limit(2) function may be
used to limit the rights of a descriptor.
/*
* Limit a file descriptor to be read-only.
*/
cap_rights_t rights;
int fd;
char buf[1] = { 'x' };
fd = open("/home/jfree/foo", O_RDWR);
cap_rights_init(&rights, CAP_READ);
cap_rights_limit(fd, &rights);
if (read(fd, buf, sizeof(buf)) < 0)
printf("This will not happen because we have CAP_READ\n");
if (write(fd, buf, sizeof(buf)) < 0)
printf("This will happen because we are missing CAP_WRITE\n");
Capabilities are designed around the principle of minimizing rights. Once a descriptor's rights have been limited, it should not be able to perform actions outside of its rights. A descriptor can always have its rights limited, but never extended.
/*
* Attempt to extend a descriptor's rights.
*/
cap_rights_t rights;
int fd;
fd = open("/home/jfree/foo", O_RDWR);
cap_rights_init(&rights, CAP_READ);
cap_rights_limit(fd, &rights);
cap_rights_set(&rights, CAP_WRITE);
if (cap_rights_limit(fd, &rights) < 0)
printf("Failed to apply rights; rights can never be extended\n");
In cases where sandboxing a program is too restrictive, developers can instead limit capability descriptors. Capabilities grant fine control over what operations are allowed, but programs that use this kind of protection are vulnerable to human error. If a developer forgets to limit a capability, they introduce the possibility of malicious code misusing it.
Developers that want to maximize security can use capabilities inside of
capability mode. Capabilities provide granular functionality that capability
mode does not offer. For example, restricting a descriptor to only allow
read(2) is possible with capabilities, but not capability mode.
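Combining the two primitives could look like the following sketch. The helper name open_reader_sandboxed(), the choice of CAP_READ plus CAP_FSTAT, and the tolerance of ENOSYS are all assumptions made for illustration; the __FreeBSD__ guard exists only so the code compiles on systems without Capsicum, where the function degrades to a plain open(2).

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#ifdef __FreeBSD__
#include <sys/capsicum.h>
#endif

/*
 * Open @path read-only, strip every right except CAP_READ and
 * CAP_FSTAT from the descriptor, then enter capability mode.
 * Returns the descriptor, or -1 on failure.
 */
int
open_reader_sandboxed(const char *path)
{
#ifdef __FreeBSD__
	cap_rights_t rights;
#endif
	int fd;

	/* The file must be opened before the sandbox is entered. */
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return (-1);
#ifdef __FreeBSD__
	cap_rights_init(&rights, CAP_READ, CAP_FSTAT);
	if (cap_rights_limit(fd, &rights) < 0 && errno != ENOSYS) {
		close(fd);
		return (-1);
	}
	if (cap_enter() < 0 && errno != ENOSYS) {
		close(fd);
		return (-1);
	}
#endif
	return (fd);
}
```

After this call the process cannot open anything new, and the one descriptor it holds can only be read and inspected with fstat(2), even though the sandbox alone would have allowed other operations on it.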
Security With Capsicum
Both capability mode and capabilities were designed to make programs safer. Capability mode provides definite security by isolating a program from the rest of the system. Capabilities offer a more flexible, but less rigorous, way to limit a program's rights. If either primitive is properly integrated, a developer can rest assured that their program is safer than it was before introducing Capsicum.
References
[1] "Cost of a Data Breach Report 2023". IBM Corporation. 2023-07-24. Retrieved 2023-08-31.
Relevant Materials
https://www.cl.cam.ac.uk/research/security/capsicum/
https://www.usenix.org/legacy/events/sec10/tech/full_papers/Watson.pdf