CloudNative.Quest

Quest to Cloud Native Computing

Sandboxing containers

Author:


Modified: Thu, 2022-Jun-02

Introduction

In a normal setting, a container can execute system calls against the host kernel without restriction. Sandboxing is a technique for improving container security: it isolates the container from other containers and from the host. Container sandboxing is worth considering when the containers are untrusted or there is a high risk that they will be attacked.

Current implementations

There are a number of sandboxing implementations. KinD/DinD (Kubernetes in Docker or Docker in Docker) can also be considered a form of sandboxing, but it provides limited protection. It is more often used for Kubernetes/Docker development without a dedicated physical machine or VM.

Google’s gVisor

gVisor is one of the sandboxing implementations.

[Diagram: gVisor layers]

gVisor is an application kernel written in Go. It sits between the application and the host kernel, and only a limited set of system calls passes through it. Hence not all applications work under gVisor sandboxing. Check out these links for reference:

[Diagram: gVisor platforms]

In the above diagram, there are basically two types of platform implementations for gVisor.

In the middle and on the right-hand side of the diagram, the implementation makes use of ptrace system call emulation: gVisor acts as a tracer and keeps track of the memory, registers and syscalls of the containers (the tracees). This mode of operation has high context-switch overhead and is slower, but it works on any platform, even platforms without nested virtualization, which some cloud providers do not offer.

On the left of the diagram, the implementation makes use of KVM. Here gVisor leverages Kernel-based Virtual Machine support for performance and isolation. Since it depends on hardware-assisted virtualization, it must run on bare metal with hardware virtualization support or on a VM with nested virtualization enabled. This can be a problem with cloud providers that only offer VMs without nested virtualization support.
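Before picking a platform, it can help to check whether the host actually exposes KVM. Here is a minimal sketch (my own check, assuming a Linux host; the nested-virtualization paths only exist when the corresponding KVM module is loaded):

```shell
#!/bin/sh
# Check whether the gVisor "kvm" platform is likely to work on this host.
if [ -e /dev/kvm ]; then
  echo "KVM device present: platform=kvm should work"
else
  echo "no /dev/kvm: use the default ptrace platform"
fi

# On a VM, nested virtualization must also be enabled on the hypervisor side:
#   cat /sys/module/kvm_intel/parameters/nested   # Intel hosts: Y/1 = enabled
#   cat /sys/module/kvm_amd/parameters/nested     # AMD hosts
```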

Setup gVisor with K3S

In this example, I use Fedora 35, K3S and QEMU/KVM (amd64). gVisor is included in the official Fedora package distribution, but I found that the package 'golang-gvisor' only provides the runtime binary 'runsc'; the containerd shim binary 'containerd-shim-runsc-v1' is not included. That may be fine for use with Docker, but in this example I am using Kubernetes (containerd), so let's install gVisor manually.

  1. Install gVisor

      (
        set -e
        ARCH=$(uname -m)
        URL=https://storage.googleapis.com/gvisor/releases/release/latest/${ARCH}
        wget ${URL}/runsc ${URL}/runsc.sha512 \
          ${URL}/containerd-shim-runsc-v1 ${URL}/containerd-shim-runsc-v1.sha512
        sha512sum -c runsc.sha512 \
          -c containerd-shim-runsc-v1.sha512
        rm -f *.sha512
        chmod a+rx runsc containerd-shim-runsc-v1
        sudo mv runsc containerd-shim-runsc-v1 /usr/local/bin
      )
  2. No cgroup v2 support

    As of the time of writing, gVisor only supports cgroup v1. Here is a simple test showing that runsc does not work on a Linux system running cgroup v2:

            $ sudo runsc do echo 123
    creating container: Rel: can't make user.slice/user-1000.slice/session-36.scope relative to /

    If gVisor/runsc supports the Linux system (i.e. it uses cgroup v1), the output should be '123'. Back to our example on Fedora 35: to use cgroup v1 instead of v2, make sure the kernel boot parameter 'systemd.unified_cgroup_hierarchy' is set to 0. Then reboot and run the above command again; the output should be '123'.
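On Fedora, one way to set this boot parameter is with grubby. This is a sketch of my own; adjust it to your bootloader setup:

```shell
# Append systemd.unified_cgroup_hierarchy=0 to every installed kernel's
# boot parameters, switching the system back to cgroup v1 after a reboot.
sudo grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=0"

# After rebooting, the filesystem type of /sys/fs/cgroup indicates the mode:
#   stat -fc %T /sys/fs/cgroup   # "tmpfs" => cgroup v1, "cgroup2fs" => v2
```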

  3. Update containerd template

    This is for K3S; other Kubernetes/containerd setups are configured similarly.

            # cd /var/lib/rancher/k3s/agent/containerd
    # cp -p config.toml{,.tmpl}

    Update and append the following to config.toml.tmpl (note that the file to edit is config.toml.tmpl, not config.toml):

            disabled_plugins = ["restart"]
    
    [plugins.linux]
      shim_debug = true
    [plugins.cri.containerd.runtimes.runsc]
      runtime_type = "io.containerd.runsc.v1"
    [plugins.cri.containerd.runtimes.runsc.options]
      TypeUrl = "io.containerd.runsc.v1.options"
      ConfigPath = "/etc/gvisor-runsc.toml"
  4. Create /etc/gvisor-runsc.toml

            log_path = "/var/log/runsc/runsc-%ID%-shim.log"
    log_level = "info"
    [runsc_config]
      platform = "kvm"
      debug = "true"
      debug-log = "/var/log/runsc/runsc-%ID%-gvisor.%COMMAND%.log"

    If the system has KVM support (hardware virtualization or nested virtualization), set platform = "kvm" in runsc_config. If it does not, omit the line and gVisor will use ptrace system call emulation.

  5. Create a directory to store the debug logs

            $ sudo mkdir /var/log/runsc
  6. Restart K3S server/agents

    Restart the K3S server/agents to make the changes take effect (e.g. 'systemctl restart k3s' on the server node).

Create a container with gVisor

  1. Enable the runsc runtime

            ---
    apiVersion: node.k8s.io/v1
    kind: RuntimeClass
    metadata:
      name: gvisor
    handler: runsc
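Assuming the manifest above is saved as gvisor-runtimeclass.yaml (the file name is my own choice), it can be applied and verified like this:

```shell
# Register the RuntimeClass, then confirm it lists handler "runsc".
kubectl apply -f gvisor-runtimeclass.yaml
kubectl get runtimeclass gvisor
```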
  2. Create a pod/stateful set and use runsc container runtime

            ---
    apiVersion: v1
    kind: Namespace
    metadata:
      name: recon
    
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      namespace: recon
      name: recon
      labels:
        app: recon
    
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      namespace: recon
      labels:
        app: recon
      name: recon
    spec:
      podManagementPolicy: OrderedReady
      replicas: 1
      revisionHistoryLimit: 10
      updateStrategy:
        rollingUpdate:
          partition: 0
        type: RollingUpdate
      selector:
        matchLabels:
          app: recon
      serviceName: recon
      template:
        metadata:
          labels:
            app: recon
        spec:
          # (1)
          runtimeClassName: gvisor
          terminationGracePeriodSeconds: 15
          automountServiceAccountToken: false
          serviceAccountName: recon
          containers:
            - name: recon
              image: registry.gitlab.com/patrickdung/pod-recon:v0.2
              imagePullPolicy: "Always"
              resources:
                limits:
                  cpu: 50m
                  memory: 100M
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  drop:
                    - all
                privileged: false
                readOnlyRootFilesystem: true
                runAsNonRoot: true
                runAsGroup: 20000
                runAsUser: 20000
                seccompProfile:
                  type: RuntimeDefault
          dnsPolicy: ClusterFirst
    1. runtimeClassName specifies the runtime to use

    The most important thing here is the runtimeClassName (marked with # (1) in the listing above). It specifies the runtime to be used for this container.

        spec:
          runtimeClassName: gvisor

    If the pod runs successfully with the sandbox, you should see some processes named 'runsc' running on the host.
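A quick way to confirm this from the host (a sketch; exact process names may vary with the gVisor version):

```shell
# List gVisor sandbox processes; the bracket trick stops grep matching itself.
ps -ef | grep '[r]unsc' || echo "no runsc processes found"
```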

  3. Let’s take a look inside the sandbox

    [host ~]$ kubectl exec -it -n recon pod/recon-0 -- /bin/bash
  4. Sandboxed with gVisor

    [debug@recon-0 ~]$ systemd-detect-virt
    none

    The container is actually running on top of the gVisor application kernel, but 'systemd-detect-virt' is unable to determine which container/VM technology is in use.

  5. Linux kernel inside the gVisor sandbox

    [debug@recon-0 ~]$ uname -r
    4.4.0

    The sandbox is running inside an emulated environment; gVisor's application kernel reports itself as Linux 4.4.0.

  6. The amicontained binary

    [debug@recon-0 bin]$ amicontained-amd64
    GOARCH: amd64
    Effective UID that executes this binary: 20000
    Container Runtime: not-found
    Has Namespaces:
            pid: true
            user: false
    AppArmor Profile: unconfined
    Capabilities:
    Seccomp: disabled
    Blocked Syscalls (45):
            SETUID SETGID SETSID SETREUID SETREGID SETGROUPS SETRESUID SETRESGID SCHED_SETPARAM SCHED_RR_GET_INTERVAL VHANGUP MODIFY_LDT PIVOT_ROOT _SYSCTL ADJTIMEX CHROOT ACCT SETTIMEOFDAY UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE QUOTACTL LOOKUP_DCOOKIE CLOCK_SETTIME KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL IOPRIO_SET IOPRIO_GET MIGRATE_PAGES MOVE_PAGES CLOCK_ADJTIME KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF
    • It shows that there are no capabilities in the container (good!), because all capabilities are dropped in the stateful set.

    • Seccomp filtering is reported as disabled, yet some syscalls are still blocked: certain syscalls (e.g. SWAPON and KCMP) are simply not implemented in gVisor. You can cross-check this against the gVisor documentation.

Kata Containers

Kata Containers is another type of sandboxing technology. In 2015, Intel launched the Clear Containers project, which made use of Intel VT-x (Intel Virtualization Technology, hardware virtualization support on the x86 platform) to secure containers. In 2017, Intel moved the Clear Containers project under the governance of the OpenStack Foundation, where it became the Kata Containers project. The OpenStack Foundation is now called the OpenInfra Foundation.

Kata Containers requires hardware virtualization support. Besides x86_64, it also supports the ARM platform (HYP mode/virtualization extensions), IBM Power Systems and IBM Z mainframes. For hypervisors, it supports QEMU/KVM, ACRN, Cloud Hypervisor and Firecracker. Kata Containers consists of the container runtime (written in Go) and the agent component (agent version 2 is rewritten in Rust).

Setup Kata Containers with K3S

In this example, I use Fedora 35, K3S and QEMU/KVM (amd64). Kata Containers is included in the official Fedora package distribution, but at the time of writing it is at version 2.2.3 (with FC36/rawhide you may get 2.3.2). Since I would like to test sandboxing with the Seccomp support merged in Dec 2021, I compiled version 2.4.0 alpha 2. In short, it is the same as version 2.2.3 but with additional Seccomp support.

Here are the related steps:

  1. Install Kata Containers RPM

    sudo dnf install kata-containers.x86_64
  2. Check if the OS and hardware are capable of running Kata Containers

    $ kata-runtime kata-check --verbose
    
    INFO[0000] IOMMUPlatform is disabled by default.
    INFO[0000] Looking for releases                          arch=amd64 name=kata-runtime pid=2427588 source=runtime url="https://api.github.com/repos/kata-containers/kata-containers/releases"
    No newer release available
    INFO[0000] CPU property found                            arch=amd64 description="AMD Architecture CPU" name=AuthenticAMD pid=2427588 source=runtime type=attribute
    INFO[0000] CPU property found                            arch=amd64 description="Virtualization support" name=svm pid=2427588 source=runtime type=flag
    INFO[0000] CPU property found                            arch=amd64 description="64Bit CPU" name=lm pid=2427588 source=runtime type=flag
    INFO[0000] CPU property found                            arch=amd64 description=SSE4.1 name=sse4_1 pid=2427588 source=runtime type=flag
    INFO[0000] kernel property found                         arch=amd64 description="Host Support for Linux VM Sockets" name=vhost_vsock pid=2427588 source=runtime type=module
    INFO[0000] kernel property found                         arch=amd64 description="Kernel-based Virtual Machine" name=kvm pid=2427588 source=runtime type=module
    INFO[0000] kernel property found                         arch=amd64 description="AMD KVM" name=kvm_amd pid=2427588 source=runtime type=module
    INFO[0000] kernel property found                         arch=amd64 description="Host kernel accelerator for virtio" name=vhost pid=2427588 source=runtime type=module
    INFO[0000] kernel property found                         arch=amd64 description="Host kernel accelerator for virtio network" name=vhost_net pid=2427588 source=runtime type=module
    System is capable of running Kata Containers
  3. Load the vhost-vsock module

    This module is required for QEMU/KVM. If it is missing, the containers/sandboxes will fail to start.

            $ sudo -i
    # echo "vhost-vsock" >> /etc/modules-load.d/00-kata-container.conf
    # modprobe vhost-vsock
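You can verify the module is loaded before moving on (assuming /proc/modules is available on the host):

```shell
# Prints the module entry if loaded, otherwise a warning.
grep vhost_vsock /proc/modules || echo "vhost_vsock is not loaded"
```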
  4. Update containerd template

    This is for K3S; other Kubernetes/containerd setups are configured similarly.

    # cd /var/lib/rancher/k3s/agent/containerd
    # cp config.toml{,.tmpl}

    Update and append the following to config.toml.tmpl (note that the file to edit is config.toml.tmpl, not config.toml):

            # https://github.com/kata-containers/kata-containers/blob/2.4.0-alpha2/docs/how-to/containerd-kata.md#configure-containerd-to-use-kata-containers
    # https://blog.niflheim.cc/posts/kata_containers_raspberry/
    
    [plugins.cri.containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"
    
    [plugins.cri.containerd.runtimes.kata]
      runtime_type = "io.containerd.kata.v2"
      privileged_without_host_devices = true
      [plugins.cri.containerd.runtimes.kata.options]
        ConfigPath = "/usr/share/kata-containers/defaults/configuration.toml"
  5. Update configuration.toml

    Here’s the changes I made for the /usr/share/kata-containers/defaults/configuration.toml

            --- /usr/share/kata-containers/defaults/configuration.toml.orig.2022-02-01	2021-11-09 02:18:22.000000000 +0800
    +++ /usr/share/kata-containers/defaults/configuration.toml	2022-02-02 00:00:49.674540236 +0800
    @@ -504,7 +509,8 @@
     # machine and applied by the kata agent. If set to true, seccomp is not applied
     # within the guest
     # (default: true)
    -disable_guest_seccomp=true
    +disable_guest_seccomp=false
     
     # If enabled, the runtime will create opentracing.io traces and spans.
     # (See https://www.jaegertracing.io/docs/getting-started).
    @@ -537,7 +543,12 @@
     # The sandbox cgroup path is the parent cgroup of a container with the PodSandbox annotation.
     # The sandbox cgroup is constrained if there is no container type annotation.
     # See: https://godoc.org/github.com/kata-containers/runtime/virtcontainers#ContainerType
    -sandbox_cgroup_only=true
    +# cgroup v2 support not (yet) done
    +# https://github.com/kata-containers/kata-containers/issues/3038
    +sandbox_cgroup_only=false
     
     # If specified, sandbox_bind_mounts identifieds host paths to be mounted (ro) into the sandboxes shared path.
     # This is only valid if filesystem sharing is utilized. The provided path(s) will be bindmounted into the shared fs directory.
  6. Regarding cgroup v2 support

    It looks like cgroup v2 is not (yet) supported by Kata Containers. I need the following setup at runtime (and have to apply it again after every reboot):

            mkdir /sys/fs/cgroup/systemd
    mount -t cgroup -o none,name=systemd cgroup /sys/fs/cgroup/systemd
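To avoid doing this by hand after each reboot, one option is a small oneshot systemd unit. This is my own sketch: the unit name, the Before=k3s.service ordering, and the paths are assumptions for this K3S-on-Fedora setup.

```shell
# Create a oneshot unit that mounts the named systemd cgroup v1 hierarchy.
# The "-" prefix on the mount line tells systemd to ignore a failure
# (e.g. when the hierarchy is already mounted).
sudo tee /etc/systemd/system/kata-cgroup-v1.service >/dev/null <<'EOF'
[Unit]
Description=Mount named systemd cgroup v1 hierarchy for Kata Containers
Before=k3s.service

[Service]
Type=oneshot
ExecStart=/usr/bin/mkdir -p /sys/fs/cgroup/systemd
ExecStart=-/usr/bin/mount -t cgroup -o none,name=systemd cgroup /sys/fs/cgroup/systemd
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now kata-cgroup-v1.service
```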
  7. Restart K3S server/agents

Restart the K3S server/agents to make the changes take effect.

Create a container with Kata Containers

  1. Enable the Kata Containers runtime

            ---
    apiVersion: node.k8s.io/v1
    kind: RuntimeClass
    metadata:
      name: kata
    handler: kata
  2. Create a pod/stateful set and use the Kata Containers runtime

    If you have performed the gVisor setup above, just update the runtime configuration inside the stateful set from 'runtimeClassName: gvisor' to 'runtimeClassName: kata'.

    Here is the complete listing:
            ---
    apiVersion: v1
    kind: Namespace
    metadata:
      name: recon
    
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      namespace: recon
      name: recon
      labels:
        app: recon
    
    ---
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      namespace: recon
      labels:
        app: recon
      name: recon
    spec:
      podManagementPolicy: OrderedReady
      replicas: 1
      revisionHistoryLimit: 10
      updateStrategy:
        rollingUpdate:
          partition: 0
        type: RollingUpdate
      selector:
        matchLabels:
          app: recon
      serviceName: recon
      template:
        metadata:
          labels:
            app: recon
        spec:
          # (1)
          runtimeClassName: kata
          terminationGracePeriodSeconds: 15
          automountServiceAccountToken: false
          serviceAccountName: recon
          containers:
            - name: recon
              image: registry.gitlab.com/patrickdung/pod-recon:v0.2
              imagePullPolicy: "Always"
              resources:
                limits:
                  cpu: 50m
                  memory: 100M
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  drop:
                    - all
                privileged: false
                readOnlyRootFilesystem: true
                runAsNonRoot: true
                runAsGroup: 20000
                runAsUser: 20000
                seccompProfile:
                  type: RuntimeDefault
          dnsPolicy: ClusterFirst
          restartPolicy: Always
    1. runtimeClassName specifies the runtime to use

    The most important thing here is the runtimeClassName (marked with # (1) in the listing above). It specifies the runtime to be used for this container.

        spec:
          runtimeClassName: kata

    If the pod runs successfully with the sandbox, you should see some processes named 'kata' running on the host.
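From the host, the Kata shim and hypervisor processes can be spotted like this (a sketch; the exact names depend on the configured hypervisor):

```shell
# List Kata shim and QEMU processes; the bracket trick stops grep matching itself.
ps -ef | grep -E '[c]ontainerd-shim-kata|[q]emu' || echo "no kata/qemu processes found"
```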

  3. Let’s take a look inside the sandbox

    [host ~]$ kubectl exec -it -n recon pod/recon-0 -- /bin/bash
  4. Sandboxed with KVM

    [debug@recon-0 ~]$ systemd-detect-virt
    kvm

    Running 'systemd-detect-virt' shows that the container is running inside a KVM sandbox. Without sandboxing, 'systemd-detect-virt' would return 'none'.

  5. Linux kernel inside the KVM sandbox

    [debug@recon-0 ~]$ uname -r
    5.16.5-200.fc35.x86_64

    The sandbox runs a kernel version identical to the host's. Behind the scenes, Kata Containers creates the necessary kernel image and modules via the systemd service 'kata-osbuilder-generate.service' (not enabled by default). Checking the boot parameters, however, shows that it is not the same kernel as the host:

    [debug@recon-0 ~]$ cat /proc/cmdline
    tsc=reliable no_timer_check rcupdate.rcu_expedited=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 i8042.noaux=1 noreplace-smp reboot=k console=hvc0 console=hvc1 cryptomgr.notests net.ifnames=0 pci=lastbus=0 debug panic=1 nr_cpus=12 scsi_mod.scan=none agent.log=debug system.unified_cgroup_hierarchy=1 agent.unified_cgroup_hierarchy=1
  6. The amicontained binary

    [debug@recon-0 ~]$ amicontained-amd64
    GOARCH: amd64
    Effective UID that executes this binary: 20000
    Container Runtime: kube
    Has Namespaces:
            pid: true
            user: false
    AppArmor Profile: kernel
    Capabilities:
    Seccomp: filtering
    Blocked Syscalls (70):
            MSGRCV PTRACE SYSLOG SETUID SETGID SETSID SETREUID SETREGID SETGROUPS SETRESUID SETRESGID USELIB USTAT SYSFS VHANGUP PIVOT_ROOT _SYSCTL CHROOT ACCT SETTIMEOFDAY MOUNT UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME IOPL IOPERM CREATE_MODULE INIT_MODULE DELETE_MODULE GET_KERNEL_SYMS QUERY_MODULE QUOTACTL NFSSERVCTL GETPMSG PUTPMSG AFS_SYSCALL TUXCALL SECURITY LOOKUP_DCOOKIE CLOCK_SETTIME VSERVER MBIND SET_MEMPOLICY GET_MEMPOLICY KEXEC_LOAD ADD_KEY REQUEST_KEY KEYCTL MIGRATE_PAGES FUTIMESAT UNSHARE MOVE_PAGES PERF_EVENT_OPEN FANOTIFY_INIT NAME_TO_HANDLE_AT OPEN_BY_HANDLE_AT SETNS PROCESS_VM_READV PROCESS_VM_WRITEV KCMP FINIT_MODULE KEXEC_FILE_LOAD BPF USERFAULTFD PKEY_MPROTECT PKEY_ALLOC PKEY_FREE
    • It shows that there are no capabilities in the container (good!), because all capabilities are dropped in the stateful set.

    • Seccomp filtering is effective too! This is because I am using the newest version of Kata Containers with Seccomp filtering enabled in the KVM guest (see disable_guest_seccomp=false in /usr/share/kata-containers/defaults/configuration.toml).

Sysbox by Nestybox

Consider Sysbox as another container runtime. It does not require hardware virtualization support on the host, which can otherwise be a problem in cloud environments where nested virtualization is not provided. There is a good blog post about Sysbox and other sandboxing technologies. It is said that Sysbox will support Arm64 in its next release. However, at first glance, I spotted some limitations in Sysbox.

Conclusion

In this article, I have gone through some popular container sandboxing solutions. One limitation I have noticed is that Falco does not currently support sandboxed containers; I hope this limitation will be lifted soon. Finally, I hope you enjoyed the article, and see you next time.
