Perform backup and restore of a K3S single master node due to a problem with Leapp
Category: kubernetes
Modified: Sun, 2022-Nov-27
Introduction
I have a K3S cluster running on Oracle Cloud Infrastructure (OCI). All my K3S worker nodes have now been upgraded to Oracle Linux 9.1; please refer to my previous article about upgrading a worker node.
But the single master node (control plane) is still running on Oracle Linux 8.7.
[root@control01 /]# kubectl \
    get node control01 \
    -o yaml | grep -i kernel
  kernelVersion: 5.4.17-2136.313.6.el8uek.aarch64
I tried to use Leapp to perform an in-place upgrade from OL8 to OL9 on the K3S single master node. When I set up this testing K3S cluster, I did not set up the control plane as a three-node (HA) cluster. The outcome was that the master node was not bootable after the Leapp upgrade.
For Leapp, please read the Linux vendor's release notes for the support matrix and the supported upgrade paths. As far as I know, a major OS upgrade of a node running K3S (also consider the CNI and storage plugins) is not officially supported by K3S or by Cilium (the CNI in my case).
Finally, if you use this article to perform a system upgrade, you do so at your own risk.
Upgrade from OL8 to OL9 on OCI
- The master node to upgrade is called control01. Before the upgrade, it is running OL 8.7.
- Since there is only a single master node, draining it causes a service interruption to the services provided by the cluster. As a post-implementation note, I would suggest draining the pods on all the other worker nodes too.
- Perform a full data backup.
- Since I recently upgraded an OCI instance (ARM64) from OL 8.7 to OL 9.1, I use the updated procedure below.
- Please check your applications' and system's compatibility with the kernel page size changing from 64KB to 4KB; see, for example, the manual page about the Oracle UEK7 kernel.
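As a baseline (my addition, not part of the original steps), you can check the current page size before the upgrade; the UEK6 aarch64 kernel reports 65536 (64KB) here, and it should become 4096 after the move to UEK7:
# getconf PAGESIZE
65536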
# On the master node of control plane
# Cordon and drain existing resources on itself
# kubectl cordon control01
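# Note: --disable-eviction=true in the next command deletes pods directly instead
# of evicting them, bypassing PodDisruptionBudgets that could block the drain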
# kubectl drain control01 --ignore-daemonsets --delete-emptydir-data --disable-eviction=true
# systemctl stop k3s
# systemctl disable k3s
# Backup K3S server status and settings
# tar -czvf /root/backup-k3s.tar.gz /var/lib/rancher/k3s/server/ /etc/rancher/
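# (Optional, my addition) List the archive content to verify the backup
# tar -tzf /root/backup-k3s.tar.gz | head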
# dnf install leapp-upgrade --enablerepo=ol8_leapp,ol8_appstream,ol8_baseos_latest
# dnf config-manager --set-disabled ol8_UEKR6
# dnf config-manager --set-enabled ol8_UEKR7
# dnf update -y
# reboot
# It should return 4096 instead of 65536 after reboot
# getconf PAGESIZE
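Besides the page size, you can also confirm the running kernel itself; after switching to UEKR7 it should be a 5.15-based el8uek build:
# uname -r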
- Zone Drifting should be off
# sed -i "s/^AllowZoneDrifting=.*/AllowZoneDrifting=no/" /etc/firewalld/firewalld.conf
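A quick sanity check of mine, not part of the original steps, to confirm the setting is now off:
# grep AllowZoneDrifting /etc/firewalld/firewalld.conf
AllowZoneDrifting=no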
- Run the preupgrade
# --oci is correct for my case
# leapp preupgrade --oci
## --oraclelinux is for on-premises; using it on *OCI* may render your node un-bootable
## leapp preupgrade --oraclelinux
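Leapp writes its findings to a report; review it and resolve any inhibitors before running the actual upgrade:
# cat /var/log/leapp/leapp-report.txt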
- Next is the actual upgrade process
# The instance is on OCI
# leapp upgrade --oci
# Use --oraclelinux for on-premises
# Verify the output in /var/log/leapp/
# reboot
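After the reboot, a quick way to confirm the node is now on OL 9.1 (the exact string depends on the point release):
# cat /etc/oracle-release
Oracle Linux Server release 9.1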
The master node is not bootable
BAM! There was no response from the master node. Unlike in my previous article, the upgrade failed.
These are the messages displayed when the node rebooted. After a quick search on the web, it seems to be related to the hardware firmware, UEFI support and Secure Boot.
EFI stub: Booting Linux Kernel...
EFI stub: EFI_RNG_PROTOCOL unavailable
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...
I decided to delete this instance and install a new one with the same hostname.
Reinstall the master node
I terminated the node in the OCI console and re-created it with the same hostname as before. The platform installation image is still OL 9.0, so after the instance was provisioned, I upgraded it to OL 9.1.
- Restore the necessary files for K3S from the backup (see the sketch after this list):
- /etc/rancher
- /etc/systemd/system/k3s*
- /var/lib/rancher/k3s/server
- /usr/local/bin/k3s
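A minimal restore sketch, assuming the tar archive from the earlier backup step was copied to the new node, and that the k3s binary and systemd unit files came from the full data backup (the /root/backup-restore staging path is a placeholder of mine):
# GNU tar strips the leading "/" on create, so extracting with -C / restores the original paths
# tar -xzvf /root/backup-k3s.tar.gz -C /
# cp -a /root/backup-restore/usr/local/bin/k3s /usr/local/bin/
# cp -a /root/backup-restore/etc/systemd/system/k3s* /etc/systemd/system/
# chmod +x /usr/local/bin/k3s
# systemctl daemon-reload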
- Then start the K3S server
# systemctl enable --now k3s
# systemctl status k3s
Output of systemctl status k3s:
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: disabled)
     Active: active (running) since Sat 2022-11-26 12:53:30 GMT; 5h 12min ago
       Docs: https://k3s.io
    Process: 2348 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
    Process: 2350 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 2355 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 2359 (k3s-server)
      Tasks: 309
     Memory: 1.3G
        CPU: 38min 56.002s
     CGroup: /system.slice/k3s.service
             ├─2359 "/usr/local/bin/k3s server"
             ├─2387 containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd -->
             ├─2571 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─2693 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─2723 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─3049 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─4302 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─4339 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─4372 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─4509 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─4580 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─4841 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─5034 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─5373 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─5433 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─5531 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─5793 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─6144 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─6511 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             └─6669 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >

Nov 26 18:05:37 control01 k3s[2359]: time="2022-11-26T18:05:37Z" level=debug msg="cgroupv2 io stats: skipping over unmappable dbytes=0 entry"
Nov 26 18:05:37 control01 k3s[2359]: time="2022-11-26T18:05:37Z" level=debug msg="cgroupv2 io stats: skipping over unmappable dios=0 entry"
Nov 26 18:05:37 control01 k3s[2359]: time="2022-11-26T18:05:37Z" level=debug msg="cgroupv2 io stats: skipping over unmappable dbytes=0 entry"
Nov 26 18:05:37 control01 k3s[2359]: time="2022-11-26T18:05:37Z" level=debug msg="cgroupv2 io stats: skipping over unmappable dios=0 entry"
# kubectl get nodes
NAME        STATUS                     ROLES                  AGE     VERSION
node07      Ready                      <none>                 8d      v1.24.8+k3s1
control01   Ready,SchedulingDisabled   control-plane,master   507d    v1.24.8+k3s1
node02      NotReady                   <none>                 7h33m   v1.24.8+k3s1
node06      NotReady                   <none>                 56d     v1.24.8+k3s1
# kubectl uncordon control01
node02 and node06 were not responding via SSH, so I performed a force reboot on those instances. The K3S cluster automagically recovered to its previous state.
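One possible post-recovery check (my addition, not part of the original write-up) is to list any pod that is not in a Running or Completed state; an empty result, apart from the header line, means the workloads came back cleanly:
# kubectl get pods -A | grep -vE 'Running|Completed'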