seccomp-bpf Usage, eBPF Use Cases in Kubernetes, and Application-Applied LSM

## Limiting syscalls with seccomp-bpf

Linux has more than 300 system calls (read, write, open, close, etc.) available for use, or misuse. Most applications only need a small subset of syscalls to function properly. [seccomp](https://en.wikipedia.org/wiki/Seccomp) is a Linux security facility used to limit the set of syscalls that an application can use, thereby limiting potential misuse.

The original implementation of seccomp was highly restrictive. Once applied, if an application attempted to do anything beyond reading and writing to files it had already opened, seccomp sent a `SIGKILL` signal.

[seccomp-bpf](https://blog.yadutaf.fr/2014/05/29/introduction-to-seccomp-bpf-linux-syscall-filter/) enables more complex filters and a wider range of actions. seccomp-bpf, also known as seccomp mode 2, allows for applying custom filters in the form of BPF programs. When the BPF program is loaded, the filter is applied to each syscall and the appropriate action is taken (Allow, Kill, Trap, etc.).

seccomp-bpf is widely used in Kubernetes tools and exposed in Kubernetes itself. For example, seccomp-bpf is used in Docker to apply custom [seccomp security profiles](https://docs.docker.com/engine/security/seccomp/), in rkt to apply [seccomp isolators](https://github.com/rkt/rkt/blob/5fadf0f1f444cdfde40d57e1d199b6dd6371594c/Documentation/seccomp-guide.md), and in Kubernetes itself in its [Security Context](/docs/tasks/configure-pod-container/security-context/).

But in all of these cases the use of BPF is hidden behind [libseccomp](https://github.com/seccomp/libseccomp). Behind the scenes, libseccomp generates BPF code from rules provided to it. Once generated, the BPF program is loaded and the rules applied.

## Potential Use Cases for eBPF with Kubernetes

eBPF is a relatively new Linux technology. As such, there are many uses that remain unexplored. eBPF itself is also evolving: new features are being added to eBPF that will enable new use cases that aren’t currently possible.
In the following sections, we look at some use cases that have only recently become possible and others that are on the horizon. Our hope is that these features will be leveraged by open source tooling.

## Pod and container level network statistics

BPF socket filtering is nothing new, but BPF socket filtering per cgroup is. Introduced in Linux 4.10, [cgroup-bpf](https://lwn.net/Articles/698073/) allows attaching eBPF programs to cgroups. Once attached, the program is executed for all packets entering or exiting any process in the cgroup.

A [cgroup](http://man7.org/linux/man-pages/man7/cgroups.7.html) is, amongst other things, a hierarchical grouping of processes. In Kubernetes, this grouping is found at the container level. One idea for making use of cgroup-bpf is to install BPF programs that collect detailed per-pod and/or per-container network statistics.

Generally, such statistics are collected by periodically checking the relevant file in Linux's `/sys` directory or by using Netlink. By using BPF programs attached to cgroups instead, we can get much more detailed statistics: for example, how many packets/bytes on TCP port 443, or how many packets/bytes from IP 10.2.3.4. In general, because BPF programs have a kernel context, they can safely and efficiently deliver more detailed information to user space.

To explore the idea, the Kinvolk team implemented a proof-of-concept: [https://github.com/kinvolk/cgnet](https://github.com/kinvolk/cgnet). This project attaches a BPF program to each cgroup and exports the information to [Prometheus](https://prometheus.io/).

There are of course other interesting possibilities, like doing actual packet filtering. But the obstacle currently standing in the way of this is the lack of cgroup v2 support, which cgroup-bpf requires, in [Docker](https://github.com/opencontainers/runc/issues/654) and Kubernetes.
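
To make the "packets/bytes on TCP port 443" and "packets/bytes from IP 10.2.3.4" examples above more concrete, here is a minimal sketch of the counting logic only, modelled in user-space Python. It is not an eBPF program and not how cgnet is implemented: a real cgroup-bpf program would be written in restricted C, attached to a cgroup v2 as a `BPF_PROG_TYPE_CGROUP_SKB` program, and would keep its counters in BPF maps. The names `parse_ipv4_tcp`, `CgroupStats`, and `build_packet` are hypothetical and exist only for this illustration.

```python
import socket
import struct
from collections import defaultdict

IPPROTO_TCP = 6


def parse_ipv4_tcp(packet: bytes):
    """Return (src_ip, dst_port) for an IPv4/TCP packet, or None otherwise."""
    if len(packet) < 20:
        return None
    if packet[0] >> 4 != 4:               # IP version field
        return None
    ihl = (packet[0] & 0x0F) * 4          # IPv4 header length in bytes
    if ihl < 20 or len(packet) < ihl + 4:
        return None
    if packet[9] != IPPROTO_TCP:          # protocol field
        return None
    src_ip = socket.inet_ntoa(packet[12:16])
    _src_port, dst_port = struct.unpack_from("!HH", packet, ihl)
    return src_ip, dst_port


class CgroupStats:
    """User-space stand-in for the per-cgroup counters a BPF map would hold."""

    def __init__(self):
        # key -> [packet count, byte count]
        self.by_port = defaultdict(lambda: [0, 0])
        self.by_src = defaultdict(lambda: [0, 0])

    def observe(self, packet: bytes) -> None:
        """Account one packet; anything that is not IPv4/TCP is ignored."""
        parsed = parse_ipv4_tcp(packet)
        if parsed is None:
            return
        src_ip, dst_port = parsed
        for counters in (self.by_port[dst_port], self.by_src[src_ip]):
            counters[0] += 1
            counters[1] += len(packet)


def build_packet(src_ip: str, dst_ip: str, dst_port: int, payload: bytes = b"") -> bytes:
    """Build a minimal IPv4/TCP packet for the demo (checksums left at zero)."""
    tcp = struct.pack("!HHIIBBHHH", 40000, dst_port, 0, 0, 5 << 4, 0x02, 65535, 0, 0)
    tcp += payload
    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(tcp), 0, 0, 64, IPPROTO_TCP, 0,
        socket.inet_aton(src_ip), socket.inet_aton(dst_ip),
    )
    return ip + tcp


if __name__ == "__main__":
    stats = CgroupStats()
    stats.observe(build_packet("10.2.3.4", "10.2.0.9", 443, b"hello"))
    stats.observe(build_packet("10.9.9.9", "10.2.0.9", 443))
    print("port 443 [packets, bytes]:", stats.by_port[443])
    print("from 10.2.3.4 [packets, bytes]:", stats.by_src["10.2.3.4"])
```

In an actual cgroup-bpf setup the `observe` step would run in the kernel for every packet entering or leaving the cgroup, and a user-space exporter would only read the aggregated counters, which is what makes this kind of fine-grained accounting cheap compared with polling.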
## Application-applied LSM

[Linux Security Modules](https://en.wikipedia.org/wiki/Linux_Security_Modules) (LSM) is a generic framework for implementing security policies in the Linux kernel. [SELinux](https://wiki.centos.org/HowTos/SELinux) and [AppArmor](https://wiki.ubuntu.com/AppArmor) are examples of such modules. Both implement rules at a system-global scope, placing the onus on the administrator to configure the security policies.
