Modified: Sat, 2022-Nov-26
I have a K3S cluster running on Oracle Cloud Infrastructure (OCI). Recently, Red Hat released RHEL 8.7 and RHEL 9.1. Oracle followed a while later with Oracle Linux 8.7 and Oracle Linux 9.1 (OL 8.7 and OL 9.1 in this article). Originally, all my K3S nodes ran OL8. I had already removed one OL8 node and replaced it with an OL9 node; that was simply a fresh cloud instance installation that then joined the K3S cluster.
In this article, I use Leapp to perform an in-place upgrade from OL 8.7 to OL 9.1 on a K3S worker node. Leapp is an application framework for updating and upgrading operating systems and applications. Leapp is the upstream project; most enterprise Linux distributions (e.g. RHEL and Oracle Linux) ship their own modified versions of it.
Please read your Linux vendor's release notes for the support matrix and the supported upgrade paths. As far as I know, a major OS upgrade of a node running K3S (consider the CNI and storage plugins too) is not officially supported by K3S or by Cilium (the CNI in my case). If you follow this article to upgrade a system, you do so at your own risk.
Upgrade from OL 8.7 to OL 9.1 on OCI
The node to upgrade is called node02. Before the upgrade, it is a worker node running on OL 8.6.
Perform full data backup
Upgrade node02 from OL 8.6 to OL 8.7 with the latest RPMs: run 'dnf -y update' and then 'reboot'
The next step is the Leapp pre-upgrade check
# On the master node of control plane
# Cordon and drain existing resources on node02
# kubectl cordon node02
# kubectl drain node02 --ignore-daemonsets --delete-emptydir-data --disable-eviction=true

# dnf install leapp-upgrade --enablerepo=ol8_leapp,ol8_appstream,ol8_baseos_latest

# --oci is correct for my case
# leapp preupgrade --oci

## --oraclelinux is for on-premise, using this on *OCI* may render your node un-bootable
## leapp preupgrade --oraclelinux
Two problems are reported:
1) Kernel running with 64KB pages
Because this node (ARM platform) was installed from the OL8 platform image, its kernel uses a 64KB page size. The OL9 kernel for ARM reverts from a 64KB page size to a 4KB page size. To fix this, install UEK7 (Oracle's Unbreakable Enterprise Kernel Release 7) on OL8 before the upgrade.
Before changing the kernel page size from 64KB to 4KB, please check that your applications and system are compatible with it. See, for example, this manual page about the Oracle UEK7 kernel.
# dnf config-manager --set-disabled ol8_UEKR6
# dnf config-manager --set-enabled ol8_UEKR7
# dnf update -y
# reboot
# It should return 4096 instead of 65536 after reboot
# getconf PAGESIZE
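The page size check can be wrapped in a small script so it is easy to repeat on each node before running Leapp. This is only a minimal sketch: the function name `page_size_status` is my own invention for illustration, and it does nothing more than compare the value from `getconf PAGESIZE` against 4096.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Leapp or Oracle Linux):
# classify a page size in bytes as ready for the OL9 ARM kernel or not.
page_size_status() {
    local size="$1"
    if [ "$size" -eq 4096 ]; then
        echo "ok: 4 KB pages"
    else
        echo "not ready: page size is $size bytes, expected 4096"
        return 1
    fi
}

# Report the running kernel and its page size.
# '|| true' keeps the script going, since this is only a report.
echo "kernel: $(uname -r)"
page_size_status "$(getconf PAGESIZE)" || true
```

On a node still booted into the 64KB-page kernel, the last line prints the "not ready" message, which is a hint that the UEK7 switch or the reboot has not taken effect yet.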
2) Leapp complained that firewalld zone drifting should be off
To fix this, run:
# sed -i "s/^AllowZoneDrifting=.*/AllowZoneDrifting=no/" /etc/firewalld/firewalld.conf
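One caveat with the plain sed command is that it silently does nothing when the file has no AllowZoneDrifting line at all. The sketch below is one way to handle both cases and verify the result. The function name `disable_zone_drifting` is hypothetical, and I assume GNU sed (for `-i`), which is what Oracle Linux ships.

```shell
#!/usr/bin/env bash
# Hypothetical helper: force AllowZoneDrifting=no in a firewalld.conf-style file.
disable_zone_drifting() {
    local conf="$1"
    if grep -q '^AllowZoneDrifting=' "$conf"; then
        # Same substitution as the one-liner above.
        sed -i 's/^AllowZoneDrifting=.*/AllowZoneDrifting=no/' "$conf"
    else
        # The key is missing, so sed would change nothing; append it instead.
        echo 'AllowZoneDrifting=no' >> "$conf"
    fi
    # Succeed only if the setting is now present exactly as expected.
    grep -qx 'AllowZoneDrifting=no' "$conf"
}

# Example (as root, after backing up the file):
#   cp -a /etc/firewalld/firewalld.conf /etc/firewalld/firewalld.conf.bak
#   disable_zone_drifting /etc/firewalld/firewalld.conf
```

Keeping a backup copy is worthwhile here, because this setting had to be touched again after the upgrade in my case.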
Then run the pre-upgrade check again and review the reports in /var/log/leapp/:
# leapp preupgrade --oci
It should pass the pre-upgrade checks.
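To see quickly whether anything still blocks the upgrade, the text report can be scanned for inhibitor entries. This is a rough sketch under assumptions: I assume the report is /var/log/leapp/leapp-report.txt and that each inhibitor entry has a "Risk Factor: ... (inhibitor)" line followed by a "Title: ..." line. Please check the layout of your own report first. The function name `list_inhibitors` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the title of every inhibitor entry in a Leapp text report.
# Assumption: an inhibitor entry looks like
#   Risk Factor: high (inhibitor)
#   Title: <some title>
list_inhibitors() {
    local report="$1"
    awk '
        # Remember that the next Title line belongs to an inhibitor.
        /^Risk Factor:.*\(inhibitor\)/ { want = 1; next }
        # Any other Risk Factor line starts a non-inhibitor entry.
        /^Risk Factor:/                { want = 0; next }
        want && /^Title:/ { sub(/^Title:[ ]*/, ""); print; want = 0 }
    ' "$report"
}

# Example:
#   list_inhibitors /var/log/leapp/leapp-report.txt
```

An empty output would suggest no inhibitors remain, but the full report is still worth reading because medium and low risk entries are not listed by this sketch.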
Next step is the actual upgrade:
# The instance is on OCI
# leapp upgrade --oci
# Use --oraclelinux for on-premises
# Verify the output in /var/log/leapp/
# reboot
# kubectl uncordon node02
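Once the node is back up, a quick look at /etc/os-release confirms which major release it is running. The following is a minimal sketch; `os_major_version` is a hypothetical name, and it relies only on the standard VERSION_ID field of os-release.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the major version from an os-release style file.
os_major_version() {
    local file="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak into the caller.
    ( . "$file" && echo "${VERSION_ID%%.*}" )
}

# Print the major version of the running system, if the file is readable.
if [ -r /etc/os-release ]; then
    echo "major version: $(os_major_version)"
fi
```

After a successful upgrade from OL 8.7 to OL 9.1, this should report major version 9.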
Problem with K3S after OS upgrade
After the reboot, the node runs OL 9.1 and K3S starts. However, the firewalld system service failed to start, and Cilium (the CNI) reported problems: the Cilium pod on the upgraded node kept crashing and could not start.
1) I had to comment out this line in '/etc/firewalld/firewalld.conf' and then restart the firewalld system service:
#AllowZoneDrifting=no
level=debug msg="Greeting failed" error="Get \"http://10.42.3.20:4240/hello\": dial tcp 10.42.3.20:4240: connect: no route to host" host="http://10.42.3.20:4240" ipAddr=10.42.3.20 nodeName=node02 path="Via L3" subsys=health-server
level=debug msg="Failed to probe: Get \"http://10.42.3.20:4240/hello\": dial tcp 10.42.3.20:4240: connect: no route to host" ipAddr=10.42.3.20 nodeName=node02 port=4240 subsys=health-server
The node clearly has network communication problems, but I do not know the proper resolution at the moment. I tried rebooting the upgraded node, deleting the problem pod, re-deploying the Helm chart, and deleting the node from K3S and letting it join the cluster again. None of these solved the problem.
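When the Cilium log is long, it helps to reduce it to the list of addresses that cannot be reached. The sketch below does that for log lines shaped like the ones above. `unreachable_ips` is a hypothetical name, and the kubectl example in the comment assumes Cilium runs in the kube-system namespace, which may not match your installation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Cilium health-server log text on stdin and print
# each distinct ipAddr that appears on a "no route to host" line.
unreachable_ips() {
    grep 'no route to host' \
        | sed -n 's/.* ipAddr=\([0-9.]*\).*/\1/p' \
        | sort -u
}

# Example (pod name is a placeholder, namespace is an assumption):
#   kubectl -n kube-system logs <cilium-pod> | unreachable_ips
```

For the two log lines shown above, this prints the single address 10.42.3.20, the health endpoint on node02.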
I then attempted to solve the problem with:
# *** Run at your own risk! ***
# On the master node of control plane
# kubectl drain node02 --ignore-daemonsets --delete-emptydir-data --disable-eviction=true
# kubectl delete node node02
# On node02:
# systemctl stop k3s-agent
# mv /opt/cni /opt/cni.old
# mv /var/lib/rancher/k3s /var/lib/rancher/k3s.old
# reboot
After the reboot, the node re-joins the K3S cluster because the k3s-agent service is configured to start at boot. The Ceph / Rook storage also resumes automatically without problems. After confirming that everything is fine, you may want to remove the '/opt/cni.old' and '/var/lib/rancher/k3s.old' directories.
Finally, please note that I did not find Leapp to work reliably with OCI on ARM64. I tried it on three nodes, and two of them became unbootable after the upgrade. In the next article, I will describe what happened when the K3S master node failed to upgrade.
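Before removing the old directories, it is worth confirming that the node is really Ready and schedulable again. The sketch below parses one line of `kubectl get node ... --no-headers` output, whose second column is the node status. The function name `node_is_ready` is hypothetical, and the check is deliberately strict: a cordoned node (Ready,SchedulingDisabled) counts as not ready.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one line of `kubectl get node NAME --no-headers`
# on stdin and succeed only if the STATUS column is exactly "Ready".
node_is_ready() {
    local name status rest
    # Columns are NAME STATUS ROLES AGE VERSION; only the first two matter here.
    read -r name status rest || [ -n "$name" ] || return 1
    [ "$status" = "Ready" ]
}

# Example, on the master node of control plane:
#   kubectl get node node02 --no-headers | node_is_ready && echo "node02 is Ready"
```

If the node shows Ready,SchedulingDisabled, it is probably still cordoned and needs 'kubectl uncordon node02'.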