Sandboxing containers
Category: security
Modified: Tue, 2024-Feb-06
Introduction
In a normal setting, a container executes its system calls directly against the host operating system's kernel, largely without restriction. Sandboxing is a technique for improving container security: it isolates the container from other containers and from the host. Container sandboxing is useful when the containers are untrusted or there is a high risk that they will be attacked.
Current implementations
There are a number of sandboxing implementations. KinD/DinD (Kubernetes in Docker or Docker in Docker) can also be considered a form of sandboxing, but it provides only limited protection. It is more commonly used for Kubernetes/Docker development without dedicated physical hardware or a VM.
Google’s gVisor
gVisor is one such sandboxing implementation.
gVisor is written in Golang. It is an application kernel that sits between the application and the host kernel. Only a limited set of system calls passes through gVisor, so not all applications will work under gVisor sandboxing.
Note that gVisor currently only has preliminary (not production) support for the ARM64 platform. Also, it currently only supports ARM64 hosts whose kernel uses a 4 KB page size.
There are basically two types of platform implementations for gVisor.
The first makes use of ptrace system call emulation. gVisor acts as a tracer and keeps track of the memory, registers and syscalls of the containers (the tracees). This mode of operation has a high context switch overhead and is slower, but it works on any platform, even platforms without nested virtualization, which some cloud providers do not offer.
The second makes use of KVM. In this implementation, gVisor leverages Kernel-based Virtual Machine support for performance and isolation. Since it requires hardware-assisted virtualization, it has to run on bare metal with hardware virtualization support or inside a VM with nested virtualization enabled. This can be a problem with cloud providers that only provide VMs without nested virtualization support.
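The platform can also be chosen ad hoc on the runsc command line with its --platform flag; a minimal sketch, assuming runsc is already installed (the installation step follows in the next section):

$ sudo runsc --platform=ptrace do uname -r
$ sudo runsc --platform=kvm do uname -r

The ptrace platform works everywhere but pays the context-switch penalty described above; the kvm platform needs access to /dev/kvm (bare metal or nested virtualization).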
Setup gVisor with K3S
In this example, I use Fedora 35, K3S and QEMU/KVM (amd64). gVisor is included in the official Fedora package distribution, but I found that the package 'golang-gvisor' only provides the runtime binary 'runsc'; the containerd shim binary 'containerd-shim-runsc-v1' is not included. That may be fine for Docker (a rough sketch of wiring runsc into Docker follows below), but in this example I am using Kubernetes (containerd), so let's install gVisor manually.
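As an aside, if Docker is all you need, registering runsc as a Docker runtime looks roughly like the sketch below. The binary path and the wholesale overwrite of /etc/docker/daemon.json are assumptions; merge the 'runtimes' entry into any existing daemon.json instead of replacing it.

sudo tee /etc/docker/daemon.json <<'EOF' > /dev/null
{
  "runtimes": {
    "runsc": { "path": "/usr/local/bin/runsc" }
  }
}
EOF
sudo systemctl restart docker
docker run --rm --runtime=runsc alpine uname -r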
-
Install gVisor
This step is taken directly from the gVisor manual.
(
  set -e
  ARCH=$(uname -m)
  URL=https://storage.googleapis.com/gvisor/releases/release/latest/${ARCH}
  wget ${URL}/runsc ${URL}/runsc.sha512 \
    ${URL}/containerd-shim-runsc-v1 ${URL}/containerd-shim-runsc-v1.sha512
  sha512sum -c runsc.sha512 \
    -c containerd-shim-runsc-v1.sha512
  rm -f *.sha512
  chmod a+rx runsc containerd-shim-runsc-v1
  sudo mv runsc containerd-shim-runsc-v1 /usr/local/bin
)
-
Not supporting cgroup v2
At the time of writing this article, gVisor only supports cgroup v1. Here is a simple test showing that runsc does not work on a Linux system running cgroup v2:
$ sudo runsc do echo 123
creating container: Rel: can't make user.slice/user-1000.slice/session-36.scope relative to /
If gVisor/runsc supports the system (cgroup v1), the output should be '123'. Back to our example with Fedora 35: to use cgroup v1 instead of v2, make sure the kernel boot parameter 'systemd.unified_cgroup_hierarchy' is set to 0. Then reboot and run the above command again; the output should be '123'.
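On Fedora, one way to set that boot parameter is with grubby; a sketch (adjust if you manage kernel arguments differently):

sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"
sudo grubby --info=DEFAULT | grep args
sudo reboot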
-
Update containerd template
This is for K3S; the setup for other Kubernetes/containerd distributions is similar.
# cd /var/lib/rancher/k3s/agent/etc/containerd
# cp -p config.toml{,.tmpl}
Append the following to config.toml.tmpl (note that the file to edit is config.toml.tmpl, not config.toml):
disabled_plugins = ["restart"]
[plugins.linux]
  shim_debug = true
[plugins.cri.containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"
[plugins.cri.containerd.runtimes.runsc.options]
  TypeUrl = "io.containerd.runsc.v1.options"
  ConfigPath = "/etc/gvisor-runsc.toml"
-
Create /etc/gvisor-runsc.toml
log_path = "/var/log/runsc/runsc-%ID%-shim.log"
log_level = "info"
[runsc_config]
  platform = "kvm"
  debug = "true"
  debug-log = "/var/log/runsc/runsc-%ID%-gvisor.%COMMAND%.log"
If the system has KVM support (hardware virtualization or nested virtualization), set platform = "kvm" in runsc_config. If it does not, omit the setting and gVisor will fall back to ptrace system call emulation.
-
Create a directory to store the debug logs
$ sudo mkdir /var/log/runsc
-
Restart K3S server/agents
Restart the K3S server/agents to make the changes take effect.
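Assuming the default systemd unit names created by the K3S installer, that is something like:

sudo systemctl restart k3s        # on the server node
sudo systemctl restart k3s-agent  # on agent (worker) nodes, if any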
Create a container with gVisor
-
Enable the runsc runtime
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
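Save the manifest to a file (the file name 'runtimeclass-gvisor.yaml' here is arbitrary), apply it and confirm the RuntimeClass exists:

kubectl apply -f runtimeclass-gvisor.yaml
kubectl get runtimeclass gvisor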
-
Create a pod/stateful set that uses the runsc container runtime
---
apiVersion: v1
kind: Namespace
metadata:
  name: recon
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: recon
  name: recon
  labels:
    app: recon
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: recon
  labels:
    app: recon
  name: recon
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: recon
  serviceName: recon
  template:
    metadata:
      labels:
        app: recon
    spec:
      # (1)
      runtimeClassName: gvisor
      terminationGracePeriodSeconds: 15
      automountServiceAccountToken: false
      serviceAccountName: recon
      containers:
        - name: recon
          image: registry.gitlab.com/patrickdung/pod-recon:v0.2
          imagePullPolicy: "Always"
          resources:
            limits:
              cpu: 50m
              memory: 100M
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - all
            privileged: false
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsGroup: 20000
            runAsUser: 20000
            seccompProfile:
              type: RuntimeDefault
      dnsPolicy: ClusterFirst
-
runtimeClassName specifies the runtime to use
The most important part here is the runtimeClassName (marked with # (1) in the listing above). It specifies the runtime to be used for this pod.
spec:
  runtimeClassName: gvisor
If the pod runs successfully with the sandbox, you should see some processes named 'runsc' running on the host.
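Two quick checks, as a sketch: list the runsc processes on the host, and confirm from the API that the pod picked up the gvisor RuntimeClass:

pgrep -a runsc
kubectl -n recon get pod recon-0 -o jsonpath='{.spec.runtimeClassName}'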
-
Let’s take a look inside the sandbox
[host ~]$ kubectl exec -it -n recon pod/recon-0 -- /bin/bash
-
Sandboxed with gVisor
[debug@recon-0 ~]$ systemd-detect-virt
none
Actually, the container is running behind the gVisor application kernel; 'systemd-detect-virt' is simply unable to determine which container/VM technology is in use.
-
Linux kernel inside the gVisor sandbox
[debug@recon-0 ~]$ uname -r
4.4.0
The sandbox presents an emulated kernel environment to the application; the emulated kernel reports version 4.4.0.
-
The amicontained binary
[debug@recon-0 bin]$ amicontained-amd64
GOARCH: amd64
Effective UID that executes this binary: 20000
Container Runtime: not-found
Has Namespaces:
  pid: true
  user: false
AppArmor Profile: unconfined
Capabilities:
Seccomp: disabled
Blocked Syscalls (45):
  SETUID SETGID SETSID SETREUID SETREGID SETGROUPS SETRESUID SETRESGID SCHED_SETPARAM SCHED_RR_GET_INTERVAL VHANGUP MODIFY_LDT PIVOT_ROOT _SYSCTL ADJTIMEX CHROOT ACCT SETTIMEOFDAY UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE QUOTACTL LOOKUP_DCOOKIE CLOCK_SETTIME KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL IOPRIO_SET IOPRIO_GET MIGRATE_PAGES MOVE_PAGES CLOCK_ADJTIME KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF
-
It shows that there are no capabilities in the container (good!) because all capabilities are dropped in the stateful set's security context.
-
Seccomp filtering is reported as disabled, but some syscalls are still blocked: syscalls such as SWAPON and KCMP are simply not implemented in gVisor. You can cross-check this against the gVisor documentation.
-
Kata Containers
Kata Containers is another sandboxing technology. In 2015, Intel launched the Clear Containers project, which made use of Intel VT-x (Intel Virtualization Technology, hardware virtualization support on the x86 platform) to secure containers. In 2017, Intel transferred the Clear Containers project to the governance of the OpenStack Foundation, and it became the Kata Containers project. The OpenStack Foundation is now called the OpenInfra Foundation.
Kata Containers requires hardware virtualization support. Besides x86_64, it also supports the ARM platform (HYP mode/virtualization extensions), IBM Power Systems and IBM Z mainframes. For hypervisors, it supports QEMU/KVM, ACRN, Cloud Hypervisor and Firecracker. Kata Containers consists of the container runtime (written in Golang) and the agent component (the version 2 agent is rewritten in Rust).
Setup Kata Containers with K3S
In this example, I use Fedora 35, K3S and QEMU/KVM (amd64). Kata Containers is included in the official Fedora package distribution, but at the time of writing it is at version 2.2.3 (with FC36/rawhide, you may get version 2.3.2). I would like to test sandboxing with the Seccomp support that was merged in December 2021, so I compiled version 2.4.0 alpha 2. In short, it is the same as version 2.2.3 but with the additional Seccomp support.
Here are the related steps:
-
Install Kata Containers RPM
sudo dnf install kata-containers.x86_64
-
Check whether the OS and hardware are capable of running Kata Containers
$ kata-runtime kata-check --verbose
INFO[0000] IOMMUPlatform is disabled by default.
INFO[0000] Looking for releases  arch=amd64 name=kata-runtime pid=2427588 source=runtime url="https://api.github.com/repos/kata-containers/kata-containers/releases"
No newer release available
INFO[0000] CPU property found  arch=amd64 description="AMD Architecture CPU" name=AuthenticAMD pid=2427588 source=runtime type=attribute
INFO[0000] CPU property found  arch=amd64 description="Virtualization support" name=svm pid=2427588 source=runtime type=flag
INFO[0000] CPU property found  arch=amd64 description="64Bit CPU" name=lm pid=2427588 source=runtime type=flag
INFO[0000] CPU property found  arch=amd64 description=SSE4.1 name=sse4_1 pid=2427588 source=runtime type=flag
INFO[0000] kernel property found  arch=amd64 description="Host Support for Linux VM Sockets" name=vhost_vsock pid=2427588 source=runtime type=module
INFO[0000] kernel property found  arch=amd64 description="Kernel-based Virtual Machine" name=kvm pid=2427588 source=runtime type=module
INFO[0000] kernel property found  arch=amd64 description="AMD KVM" name=kvm_amd pid=2427588 source=runtime type=module
INFO[0000] kernel property found  arch=amd64 description="Host kernel accelerator for virtio" name=vhost pid=2427588 source=runtime type=module
INFO[0000] kernel property found  arch=amd64 description="Host kernel accelerator for virtio network" name=vhost_net pid=2427588 source=runtime type=module
System is capable of running Kata Containers
-
Load the vhost-vsock module
This module is required for QEMU/KVM. If it is missing, the container/sandbox will have problems starting.
$ sudo -i
# echo "vhost-vsock" >> /etc/modules-load.d/00-kata-container.conf
# modprobe vhost-vsock
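You can verify that the module is loaded (note that lsmod reports the module name with an underscore, vhost_vsock) and that the corresponding device node exists:

lsmod | grep vhost_vsock
ls -l /dev/vhost-vsock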
-
Update containerd template
This is for K3S; the setup for other Kubernetes/containerd distributions is similar.
# cd /var/lib/rancher/k3s/agent/etc/containerd
# cp config.toml{,.tmpl}
Append the following to config.toml.tmpl (note that the file to edit is config.toml.tmpl, not config.toml):
# https://github.com/kata-containers/kata-containers/blob/2.4.0-alpha2/docs/how-to/containerd-kata.md#configure-containerd-to-use-kata-containers
# https://blog.niflheim.cc/posts/kata_containers_raspberry/
[plugins.cri.containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes.kata]
  runtime_type = "io.containerd.kata.v2"
  privileged_without_host_devices = true
[plugins.cri.containerd.runtimes.kata.options]
  ConfigPath = "/usr/share/kata-containers/defaults/configuration.toml"
-
Update configuration.toml
Here are the changes I made to /usr/share/kata-containers/defaults/configuration.toml:
--- /usr/share/kata-containers/defaults/configuration.toml.orig.2022-02-01	2021-11-09 02:18:22.000000000 +0800
+++ /usr/share/kata-containers/defaults/configuration.toml	2022-02-02 00:00:49.674540236 +0800
@@ -504,7 +509,8 @@
 # machine and applied by the kata agent. If set to true, seccomp is not applied
 # within the guest
 # (default: true)
-disable_guest_seccomp=true
+disable_guest_seccomp=false
 
 # If enabled, the runtime will create opentracing.io traces and spans.
 # (See https://www.jaegertracing.io/docs/getting-started).
@@ -537,7 +543,12 @@
 # The sandbox cgroup path is the parent cgroup of a container with the PodSandbox annotation.
 # The sandbox cgroup is constrained if there is no container type annotation.
 # See: https://godoc.org/github.com/kata-containers/runtime/virtcontainers#ContainerType
-sandbox_cgroup_only=true
+# cgroup v2 support not (yet) done
+# https://github.com/kata-containers/kata-containers/issues/3038
+sandbox_cgroup_only=false
 
 # If specified, sandbox_bind_mounts identifieds host paths to be mounted (ro) into the sandboxes shared path.
 # This is only valid if filesystem sharing is utilized. The provided path(s) will be bindmounted into the shared fs directory.
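A quick way to double-check that both settings ended up as intended; the expected output is disable_guest_seccomp=false and sandbox_cgroup_only=false:

grep -E '^(disable_guest_seccomp|sandbox_cgroup_only)' \
  /usr/share/kata-containers/defaults/configuration.toml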
-
Regarding cgroup v2 support
It looks like cgroup v2 is not (yet) supported by Kata Containers. I need this setup at runtime, and it has to be re-applied after every reboot (see the persistence sketch after the commands):
mkdir /sys/fs/cgroup/systemd
mount -t cgroup -o none,name=systemd cgroup /sys/fs/cgroup/systemd
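To avoid repeating this by hand after every reboot, one option is a small oneshot systemd unit. This is only a sketch: the unit name is made up for this example, and it assumes the k3s systemd unit is present.

sudo tee /etc/systemd/system/cgroup1-systemd-hierarchy.service <<'EOF' > /dev/null
[Unit]
Description=Mount legacy cgroup v1 'systemd' hierarchy for Kata Containers
Before=k3s.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c 'mkdir -p /sys/fs/cgroup/systemd; mountpoint -q /sys/fs/cgroup/systemd || mount -t cgroup -o none,name=systemd cgroup /sys/fs/cgroup/systemd'

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable cgroup1-systemd-hierarchy.service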
-
Restart K3S server/agents
Restart the K3S server/agents to make the changes take effect (as in the gVisor section above).
Create a container with Kata Containers
-
Enable the Kata Containers runtime
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
-
Create a pod/stateful set and use the Kata Containers runtime
If you have already performed the gVisor setup above, just update the runtime configuration inside the stateful set: change 'runtimeClassName: gvisor' to 'runtimeClassName: kata'.
Here is the complete listing if you need it:
---
apiVersion: v1
kind: Namespace
metadata:
  name: recon
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: recon
  name: recon
  labels:
    app: recon
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: recon
  labels:
    app: recon
  name: recon
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: recon
  serviceName: recon
  template:
    metadata:
      labels:
        app: recon
    spec:
      # (1)
      runtimeClassName: kata
      terminationGracePeriodSeconds: 15
      automountServiceAccountToken: false
      serviceAccountName: recon
      containers:
        - name: recon
          image: registry.gitlab.com/patrickdung/pod-recon:v0.2
          imagePullPolicy: "Always"
          resources:
            limits:
              cpu: 50m
              memory: 100M
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - all
            privileged: false
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            runAsGroup: 20000
            runAsUser: 20000
            seccompProfile:
              type: RuntimeDefault
      dnsPolicy: ClusterFirst
      restartPolicy: Always
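If the gVisor version of the stateful set is already deployed, one way to switch it over is a merge patch (a sketch; the rolling update will recreate the pod with the new runtime):

kubectl -n recon patch statefulset recon --type merge \
  -p '{"spec":{"template":{"spec":{"runtimeClassName":"kata"}}}}'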
-
runtimeClassName specifies the runtime to use
The most important part here is the runtimeClassName (marked with # (1) in the listing above). It specifies the runtime to be used for this pod.
spec:
  runtimeClassName: kata
If the pod runs successfully with the sandbox, you should see some processes named 'kata' running on the host.
-
Let’s take a look inside the sandbox
[host ~]$ kubectl exec -it -n recon pod/recon-0 -- /bin/bash
-
Sandboxed with KVM
[debug@recon-0 ~]$ systemd-detect-virt
kvm
Running 'systemd-detect-virt' shows that the container is running inside a KVM sandbox. If there were no sandboxing, 'systemd-detect-virt' would return 'none'.
-
Linux kernel inside the KVM sandbox
[debug@recon-0 ~]$ uname -r
5.16.5-200.fc35.x86_64
The sandbox is running a kernel version that is the same as the host's. Behind the scenes, Kata Containers creates the necessary kernel image and modules via the systemd service 'kata-osbuilder-generate.service' (not enabled by default). However, checking the boot parameters shows that the guest kernel is not booted the same way as the host:
[debug@recon-0 ~]$ cat /proc/cmdline
tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 cryptomgr.notests net.ifnames=0 pci=lastbus=0 debug panic=1 nr_cpus=12 scsi_mod.scan=none agent.log=debug system.unified_cgroup_hierarchy=1 agent.unified_cgroup_hierarchy=1
-
The amicontained binary
[debug@recon-0 ~]$ amicontained-amd64
GOARCH: amd64
Effective UID that executes this binary: 20000
Container Runtime: kube
Has Namespaces:
  pid: true
  user: false
AppArmor Profile: kernel
Capabilities:
Seccomp: filtering
Blocked Syscalls (70):
  MSGRCV PTRACE SYSLOG SETUID SETGID SETSID SETREUID SETREGID SETGROUPS SETRESUID SETRESGID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL CHROOT ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES FUTIMESAT UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT NAME_TO_HANDLE_AT OPEN_BY_HANDLE_AT SETNS PROCESS_VM_READV PROCESS_VM_WRITEV KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD PKEY_MPROTECT PKEY_ALLOC PKEY_FREE
-
It shows that there are no capabilities in the container (good!) because all capabilities are dropped in the stateful set's security context.
-
Seccomp filtering is effective too! This is because I am using the newer version of Kata Containers and enabled Seccomp filtering in the KVM guest (see disable_guest_seccomp=false in /usr/share/kata-containers/defaults/configuration.toml).
-
Sysbox by Nestybox
Sysbox is another container runtime that can be considered a sandboxing technology. It does not require the host to have hardware virtualization support, which is an advantage in cloud environments where nested hardware virtualization is not provided. There is a good blog post about Sysbox and other sandboxing technologies, and Sysbox is said to be getting Arm64 support in its next release. However, at first glance I spotted some limitations:
-
As of December 2021, the supported Kubernetes versions are 1.20 and 1.21; other versions are not supported.
-
The Sysbox Community Edition only supports 16 pods per worker node. The Sysbox Enterprise Edition removes this limitation.
Conclusion
In this article I have gone through some popular container sandboxing solutions. One limitation I noticed is that Falco does not currently support sandboxed containers; I hope this limitation will be lifted soon. Finally, I hope you enjoyed the article, and see you next time.