Sandboxing containers
Category: security
Modified: Tue, 2024-Feb-06
Introduction
In a normal setting, a container executes its system calls directly against the host operating system's kernel, largely without restriction. Sandboxing is a technique for improving container security: it isolates the container from other containers and from the host. Container sandboxing is useful when the containers are untrusted or there is a high risk that they will be attacked.
Current implementations
There are a number of sandboxing implementations. KinD/DinD (Kubernetes in Docker or Docker in Docker) can also be considered a form of sandboxing, but it provides only limited protection. It is more commonly used for Kubernetes/Docker development without dedicated physical hardware or a VM.
Google’s gVisor
gVisor is one such sandboxing implementation.
gVisor is written in Golang. It is an application kernel that sits between the application and the host kernel. Only a limited set of system calls passes through gVisor, so not all applications will work under gVisor sandboxing.
Note that gVisor currently only has preliminary (not production) support for the ARM64 platform. Also, it currently only supports ARM64 hosts whose kernel uses a 4 KB page size.
There are basically two types of platform implementations for gVisor.
The first makes use of ptrace system call emulation. gVisor acts as a tracer and keeps track of the memory, registers and syscalls of the containers (the tracees). This mode of operation has a high context switch overhead and is slower, but it works on any platform, even platforms without nested virtualization, which some cloud providers do not offer.
The second makes use of KVM. In this implementation, gVisor leverages Kernel-based Virtual Machine support for performance and isolation. Since it requires hardware-assisted virtualization, it has to run on bare metal with hardware virtualization support or inside a VM with nested virtualization enabled. This can be a problem with cloud providers that only provide VMs without nested virtualization support.
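The platform can also be chosen ad hoc on the runsc command line with its --platform flag; a minimal sketch, assuming runsc is already installed (the installation step follows in the next section):

$ sudo runsc --platform=ptrace do uname -r
$ sudo runsc --platform=kvm do uname -r

The ptrace platform works everywhere but pays the context-switch penalty described above; the kvm platform needs access to /dev/kvm (bare metal or nested virtualization).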
Setup gVisor with K3S
In this example, I use Fedora 35, K3S and QEMU/KVM (amd64). gVisor is included in the official Fedora package distribution, but I found that the package 'golang-gvisor' only provides the runtime binary 'runsc'; the containerd shim binary 'containerd-shim-runsc-v1' is not included. That may be fine for Docker (a rough sketch of wiring runsc into Docker follows below), but in this example I am using Kubernetes (containerd), so let's install gVisor manually.
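As an aside, if Docker is all you need, registering runsc as a Docker runtime looks roughly like the sketch below. The binary path and the wholesale overwrite of /etc/docker/daemon.json are assumptions; merge the 'runtimes' entry into any existing daemon.json instead of replacing it.

sudo tee /etc/docker/daemon.json <<'EOF' > /dev/null
{
  "runtimes": {
    "runsc": { "path": "/usr/local/bin/runsc" }
  }
}
EOF
sudo systemctl restart docker
docker run --rm --runtime=runsc alpine uname -r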
-
Install gVisor
This step is taken directly from the gVisor manual.
(
  set -e
  ARCH=$(uname -m)
  URL=https://storage.googleapis.com/gvisor/releases/release/latest/${ARCH}
  wget ${URL}/runsc ${URL}/runsc.sha512 \
    ${URL}/containerd-shim-runsc-v1 ${URL}/containerd-shim-runsc-v1.sha512
  sha512sum -c runsc.sha512 \
    -c containerd-shim-runsc-v1.sha512
  rm -f *.sha512
  chmod a+rx runsc containerd-shim-runsc-v1
  sudo mv runsc containerd-shim-runsc-v1 /usr/local/bin
)
-
Not supporting cgroup v2
At the time of writing this article, gVisor only supports cgroup v1. Here is a simple test showing that runsc does not work on a Linux system running cgroup v2:
$ sudo runsc do echo 123
creating container: Rel: can't make user.slice/user-1000.slice/session-36.scope relative to /
If gVisor/runsc supports the system (cgroup v1), the output should be '123'. Back to our example with Fedora 35: to use cgroup v1 instead of v2, make sure the kernel boot parameter 'systemd.unified_cgroup_hierarchy' is set to 0. Then reboot and run the above command again; the output should be '123'.
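On Fedora, one way to set that boot parameter is with grubby; a sketch (adjust if you manage kernel arguments differently):

sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"
sudo grubby --info=DEFAULT | grep args
sudo reboot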
-
Update containerd template
This is for K3S; the setup for other Kubernetes/containerd distributions is similar.
# cd /var/lib/rancher/k3s/agent/etc/containerd
# cp -p config.toml{,.tmpl}
Append the following to config.toml.tmpl (note that the file to edit is config.toml.tmpl, not config.toml):
disabled_plugins = ["restart"]
[plugins.linux]
  shim_debug = true
[plugins.cri.containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
[plugins.cri.containerd.runtimes.runsc.options]
  TypeUrl = "io.containerd.runsc.v1.options"
  ConfigPath = "/etc/gvisor-runsc.toml"
-
Create /etc/gvisor-runsc.toml
log_path = "/var/log/runsc/runsc-%ID%-shim.log"
log_level = "info"
[runsc_config]
  platform = "kvm"
  debug = "true"
  debug-log = "/var/log/runsc/runsc-%ID%-gvisor.%COMMAND%.log"
If the system has KVM support (hardware virtualization or nested virtualization), set platform = "kvm" in runsc_config. If it does not, omit the setting and gVisor will fall back to ptrace system call emulation.
-
Create a directory to store the debug logs
$ sudo mkdir /var/log/runsc
-
Restart K3S server/agents
Restart the K3S server/agents to make the changes take effect.
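Assuming the default systemd unit names created by the K3S installer, that is something like:

sudo systemctl restart k3s        # on the server node
sudo systemctl restart k3s-agent  # on agent (worker) nodes, if any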
Create a container with gVisor
-
Enable the runsc runtime
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
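Save the manifest to a file (the file name 'runtimeclass-gvisor.yaml' here is arbitrary), apply it and confirm the RuntimeClass exists:

kubectl apply -f runtimeclass-gvisor.yaml
kubectl get runtimeclass gvisor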
-
Create a pod/stateful set that uses the runsc container runtime
---
apiVersion: v1
kind: Namespace
metadata:
  name: recon
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: recon
  name: recon
  labels:
    app: recon
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: recon
  labels:
    app: recon
  name: recon
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: recon
  serviceName: recon
  template:
    metadata:
      labels:
        app: recon
    spec:
      # (1)
      runtimeClassName: gvisor
      terminationGracePeriodSeconds: 15
      automountServiceAccountToken: false
      serviceAccountName: recon
      containers:
        - name: recon
          image: registry.gitlab.com/patrickdung/pod-recon:v0.2
          imagePullPolicy: "Always"
          resources:
            limits:
              cpu: 50m
              memory: 100M
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - all
            privileged: false
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsGroup: 20000
            runAsUser: 20000
            seccompProfile:
              type: RuntimeDefault
      dnsPolicy: ClusterFirst
-
runtimeClassName specifies the runtime to use
The most important part here is the runtimeClassName (marked with # (1) in the listing above). It specifies the runtime to be used for this pod.
spec:
  runtimeClassName: gvisor
If the pod runs successfully with the sandbox, you should see some processes named 'runsc' running on the host.
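Two quick checks, as a sketch: list the runsc processes on the host, and confirm from the API that the pod picked up the gvisor RuntimeClass:

pgrep -a runsc
kubectl -n recon get pod recon-0 -o jsonpath='{.spec.runtimeClassName}'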
-
Let’s take a look inside the sandbox
[host ~]$ kubectl exec -it -n recon pod/recon-0 -- /bin/bash
-
Sandboxed with gVisor
[debug@recon-0 ~]$ systemd-detect-virt
none
Actually, the container is running behind the gVisor application kernel; 'systemd-detect-virt' is simply unable to determine which container/VM technology is in use.
-
Linux kernel inside the gVisor sandbox
[debug@recon-0 ~]$ uname -r
4.4.0
The sandbox presents an emulated kernel environment to the application; the emulated kernel reports version 4.4.0.
-
The amicontained binary
[debug@recon-0 bin]$ amicontained-amd64
GOARCH: amd64
Effective UID that executes this binary: 20000
Container Runtime: not-found
Has Namespaces:
  pid: true
  user: false
AppArmor Profile: unconfined
Capabilities:
Seccomp: disabled
Blocked Syscalls (45):
  SETUID SETGID SETSID SETREUID SETREGID SETGROUPS SETRESUID SETRESGID SCHED_SETPARAM SCHED_RR_GET_INTERVAL VHANGUP MODIFY_LDT PIVOT_ROOT _SYSCTL ADJTIMEX CHROOT ACCT SETTIMEOFDAY UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE QUOTACTL LOOKUP_DCOOKIE CLOCK_SETTIME KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL IOPRIO_SET IOPRIO_GET MIGRATE_PAGES MOVE_PAGES CLOCK_ADJTIME KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF
-
It shows that there are no capabilities in the container (good!) because all capabilities are dropped in the stateful set's security context.
-
Seccomp filtering is reported as disabled, but some syscalls are still blocked: syscalls such as SWAPON and KCMP are simply not implemented in gVisor. You can cross-check this against the gVisor documentation.
-
Kata Containers
Kata Containers is another sandboxing technology. In 2015, Intel launched the Clear Containers project, which made use of Intel VT-x (Intel Virtualization Technology, hardware virtualization support on the x86 platform) to secure containers. In 2017, Intel transferred the Clear Containers project to the governance of the OpenStack Foundation, and it became the Kata Containers project. The OpenStack Foundation is now called the OpenInfra Foundation.
Kata Containers requires hardware virtualization support. Besides x86_64, it also supports the ARM platform (HYP mode/virtualization extensions), IBM Power Systems and IBM Z mainframes. For hypervisors, it supports QEMU/KVM, ACRN, Cloud Hypervisor and Firecracker. Kata Containers consists of the container runtime (written in Golang) and the agent component (the version 2 agent is rewritten in Rust).
Setup Kata Containers with K3S
In this example, I use Fedora 35, K3S and QEMU/KVM (amd64). Kata Containers is included in the official Fedora package distribution, but at the time of writing it is at version 2.2.3 (with FC36/rawhide, you may get version 2.3.2). I would like to test sandboxing with the Seccomp support that was merged in December 2021, so I compiled version 2.4.0 alpha 2. In short, it is the same as version 2.2.3 but with the additional Seccomp support.
Here are the related steps:
-
Install Kata Containers RPM
sudo dnf install kata-containers.x86_64
-
Check whether the OS and hardware are capable of running Kata Containers
$ kata-runtime kata-check --verbose
INFO[0000] IOMMUPlatform is disabled by default.
INFO[0000] Looking for releases  arch=amd64 name=kata-runtime pid=2427588 source=runtime url="https://api.github.com/repos/kata-containers/kata-containers/releases"
No newer release available
INFO[0000] CPU property found  arch=amd64 description="AMD Architecture CPU" name=AuthenticAMD pid=2427588 source=runtime type=attribute
INFO[0000] CPU property found  arch=amd64 description="Virtualization support" name=svm pid=2427588 source=runtime type=flag
INFO[0000] CPU property found  arch=amd64 description="64Bit CPU" name=lm pid=2427588 source=runtime type=flag
INFO[0000] CPU property found  arch=amd64 description=SSE4.1 name=sse4_1 pid=2427588 source=runtime type=flag
INFO[0000] kernel property found  arch=amd64 description="Host Support for Linux VM Sockets" name=vhost_vsock pid=2427588 source=runtime type=module
INFO[0000] kernel property found  arch=amd64 description="Kernel-based Virtual Machine" name=kvm pid=2427588 source=runtime type=module
INFO[0000] kernel property found  arch=amd64 description="AMD KVM" name=kvm_amd pid=2427588 source=runtime type=module
INFO[0000] kernel property found  arch=amd64 description="Host kernel accelerator for virtio" name=vhost pid=2427588 source=runtime type=module
INFO[0000] kernel property found  arch=amd64 description="Host kernel accelerator for virtio network" name=vhost_net pid=2427588 source=runtime type=module
System is capable of running Kata Containers
-
Load the vhost-vsock module
This module is required for QEMU/KVM. If it is missing, the container/sandbox will have problems starting.
$ sudo -i
# echo "vhost-vsock" >> /etc/modules-load.d/00-kata-container.conf
# modprobe vhost-vsock
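You can verify that the module is loaded (note that lsmod reports the module name with an underscore, vhost_vsock) and that the corresponding device node exists:

lsmod | grep vhost_vsock
ls -l /dev/vhost-vsock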
-
Update containerd template
This is for K3S; the setup for other Kubernetes/containerd distributions is similar.
# cd /var/lib/rancher/k3s/agent/etc/containerd
# cp config.toml{,.tmpl}
Append the following to config.toml.tmpl (note that the file to edit is config.toml.tmpl, not config.toml):
# https://github.com/kata-containers/kata-containers/blob/2.4.0-alpha2/docs/how-to/containerd-kata.md#configure-containerd-to-use-kata-containers
# https://blog.niflheim.cc/posts/kata_containers_raspberry/
[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes.kata]
  runtime_type = "io.containerd.kata.v2"
  privileged_without_host_devices = true
[plugins.cri.containerd.runtimes.kata.options]
  ConfigPath = "/usr/share/kata-containers/defaults/configuration.toml"
-
Update configuration.toml
Here are the changes I made to /usr/share/kata-containers/defaults/configuration.toml:
--- /usr/share/kata-containers/defaults/configuration.toml.orig.2022-02-01	2021-11-09 02:18:22.000000000 +0800
+++ /usr/share/kata-containers/defaults/configuration.toml	2022-02-02 00:00:49.674540236 +0800
@@ -504,7 +509,8 @@
 # machine and applied by the kata agent. If set to true, seccomp is not applied
 # within the guest
 # (default: true)
-disable_guest_seccomp=true
+disable_guest_seccomp=false
 
 # If enabled, the runtime will create opentracing.io traces and spans.
 # (See https://www.jaegertracing.io/docs/getting-started).
@@ -537,7 +543,12 @@
 # The sandbox cgroup path is the parent cgroup of a container with the PodSandbox annotation.
 # The sandbox cgroup is constrained if there is no container type annotation.
 # See: https://godoc.org/github.com/kata-containers/runtime/virtcontainers#ContainerType
-sandbox_cgroup_only=true
+# cgroup v2 support not (yet) done
+# https://github.com/kata-containers/kata-containers/issues/3038
+sandbox_cgroup_only=false
 
 # If specified, sandbox_bind_mounts identifieds host paths to be mounted (ro) into the sandboxes shared path.
 # This is only valid if filesystem sharing is utilized. The provided path(s) will be bindmounted into the shared fs directory.
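A quick way to double-check that both settings ended up as intended; the expected output is disable_guest_seccomp=false and sandbox_cgroup_only=false:

grep -E '^(disable_guest_seccomp|sandbox_cgroup_only)' \
  /usr/share/kata-containers/defaults/configuration.toml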
-
Regarding cgroup v2 support
It looks like cgroup v2 is not (yet) supported by Kata Containers. I need this setup at runtime, and it has to be re-applied after every reboot (see the persistence sketch after the commands):
mkdir /sys/fs/cgroup/systemd
mount -t cgroup -o none,name=systemd cgroup /sys/fs/cgroup/systemd
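To avoid repeating this by hand after every reboot, one option is a small oneshot systemd unit. This is only a sketch: the unit name is made up for this example, and it assumes the k3s systemd unit is present.

sudo tee /etc/systemd/system/cgroup1-systemd-hierarchy.service <<'EOF' > /dev/null
[Unit]
Description=Mount legacy cgroup v1 'systemd' hierarchy for Kata Containers
Before=k3s.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c 'mkdir -p /sys/fs/cgroup/systemd; mountpoint -q /sys/fs/cgroup/systemd || mount -t cgroup -o none,name=systemd cgroup /sys/fs/cgroup/systemd'

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable cgroup1-systemd-hierarchy.service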
-
Restart K3S server/agents
Restart the K3S server/agents to make the changes take effect (as in the gVisor section above).
Create a container with Kata Containers
-
Enable the Kata Containers runtime
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
-
Create a pod/stateful set and use the Kata Containers runtime
If you have already performed the gVisor setup above, just update the runtime configuration inside the stateful set: change 'runtimeClassName: gvisor' to 'runtimeClassName: kata'.
Here is the complete listing if you need it:
---
apiVersion: v1
kind: Namespace
metadata:
  name: recon
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: recon
  name: recon
  labels:
    app: recon
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: recon
  labels:
    app: recon
  name: recon
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: recon
  serviceName: recon
  template:
    metadata:
      labels:
        app: recon
    spec:
      # (1)
      runtimeClassName: kata
      terminationGracePeriodSeconds: 15
      automountServiceAccountToken: false
      serviceAccountName: recon
      containers:
        - name: recon
          image: registry.gitlab.com/patrickdung/pod-recon:v0.2
          imagePullPolicy: "Always"
          resources:
            limits:
              cpu: 50m
              memory: 100M
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - all
            privileged: false
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsGroup: 20000
            runAsUser: 20000
            seccompProfile:
              type: RuntimeDefault
      dnsPolicy: ClusterFirst
      restartPolicy: Always
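If the gVisor version of the stateful set is already deployed, one way to switch it over is a merge patch (a sketch; the rolling update will recreate the pod with the new runtime):

kubectl -n recon patch statefulset recon --type merge \
  -p '{"spec":{"template":{"spec":{"runtimeClassName":"kata"}}}}'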
-
runtimeClassName specifies the runtime to use
The most important part here is the runtimeClassName (marked with # (1) in the listing above). It specifies the runtime to be used for this pod.
spec:
  runtimeClassName: kata
If the pod runs successfully with the sandbox, you should see some processes named 'kata' running on the host.
-
Let’s take a look inside the sandbox
[host ~]$ kubectl exec -it -n recon pod/recon-0 -- /bin/bash
-
Sandboxed with KVM
[debug@recon-0 ~]$ systemd-detect-virt
kvm
Running 'systemd-detect-virt' shows that the container is running inside a KVM sandbox. If there were no sandboxing, 'systemd-detect-virt' would return 'none'.
-
Linux kernel inside the KVM sandbox
[debug@recon-0 ~]$ uname -r
5.16.5-200.fc35.x86_64
The sandbox is running a kernel version that is the same as the host's. Behind the scenes, Kata Containers creates the necessary kernel image and modules via the systemd service 'kata-osbuilder-generate.service' (not enabled by default). However, checking the boot parameters shows that the guest kernel is not booted the same way as the host:
[debug@recon-0 ~]$ cat /proc/cmdline
tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 cryptomgr.notests net.ifnames=0 pci=lastbus=0 debug panic=1 nr_cpus=12 scsi_mod.scan=none agent.log=debug system.unified_cgroup_hierarchy=1 agent.unified_cgroup_hierarchy=1
-
The amicontained binary
[debug@recon-0 ~]$ amicontained-amd64
GOARCH: amd64
Effective UID that executes this binary: 20000
Container Runtime: kube
Has Namespaces:
  pid: true
  user: false
AppArmor Profile: kernel
Capabilities:
Seccomp: filtering
Blocked Syscalls (70):
  MSGRCV PTRACE SYSLOG SETUID SETGID SETSID SETREUID SETREGID SETGROUPS SETRESUID SETRESGID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL CHROOT ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES FUTIMESAT UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT NAME_TO_HANDLE_AT OPEN_BY_HANDLE_AT SETNS PROCESS_VM_READV PROCESS_VM_WRITEV KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD PKEY_MPROTECT PKEY_ALLOC PKEY_FREE
-
It shows that there are no capabilities in the container (good!) because all capabilities are dropped in the stateful set's security context.
-
Seccomp filtering is effective too! This is because I am using the newer version of Kata Containers and enabled Seccomp filtering in the KVM guest (see disable_guest_seccomp=false in /usr/share/kata-containers/defaults/configuration.toml).
-
Sysbox by Nestybox
Sysbox is another container runtime that can be considered a sandboxing technology. It does not require the host to have hardware virtualization support, which is an advantage in cloud environments where nested hardware virtualization is not provided. There is a good blog post about Sysbox and other sandboxing technologies, and Sysbox is said to be getting Arm64 support in its next release. However, at first glance I spotted some limitations:
-
As of December 2021, the supported Kubernetes versions are 1.20 and 1.21; other versions are not supported.
-
The Sysbox Community Edition only supports 16 pods per worker node. The Sysbox Enterprise Edition removes this limitation.
Conclusion
In this article I have gone through some popular container sandboxing solutions. One limitation I noticed is that Falco does not currently support sandboxed containers; I hope this limitation will be lifted soon. Finally, I hope you enjoyed the article, and see you next time.