
Perform backup and restore of a K3S single master node due to a problem with Leapp

Modified: Sun, 2022-Nov-27

Introduction

I have a K3S cluster running on Oracle Cloud Infrastructure (OCI). All my K3S worker nodes have now been upgraded to Oracle Linux 9.1; please refer to my previous article about upgrading a worker node.

But the single master node (control plane) is still running on Oracle Linux 8.7.

[root@control01 /]# kubectl \
  get node control01 \
  -o yaml | grep -i kernel
    kernelVersion: 5.4.17-2136.313.6.el8uek.aarch64

I tried to use Leapp to perform an in-place upgrade from OL8 to OL9 on the K3S single master node. When I set up this testing K3S cluster, I did not set up the control plane as a three-node cluster. The outcome: the master node was no longer bootable after the Leapp upgrade.

For Leapp, please read the Linux vendor release notes for the support matrix and supported upgrade paths. As for a major OS upgrade of a node running K3S (also considering the CNI and storage plugins), I believe it is not officially supported by K3S or Cilium (the CNI in my case).

Finally, if you use this article to perform a system upgrade, you do so at your own risk.

Upgrade from OL8 to OL9 on OCI

  • The master node to upgrade is called control01. Before the upgrade, it is running on OL 8.7.

  • Since it is a single master node, draining it interrupts the services provided by the cluster. As a post-implementation note, I would suggest draining the pods on all the other worker nodes too (see the sketch after the command block below).

  • Perform a full data backup

  • Since I had recently upgraded an OCI instance (ARM64) from OL 8.7 to OL 9.1, I use the updated procedure below.

  • Please check your application and system compatibility with the kernel page size change from 64 KB to 4 KB, such as this manual page about the Oracle UEK7 kernel.

# On the master node of control plane
# Cordon and drain existing resources on itself
# kubectl cordon control01
# kubectl drain control01 --ignore-daemonsets --delete-emptydir-data --disable-eviction=true
# systemctl stop k3s
# systemctl disable k3s

# Backup K3S server status and settings
# tar -czvf /root/backup-k3s.tar.gz /var/lib/rancher/k3s/server/ /etc/rancher/
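
# The restore list later in this article also needs the systemd unit files and
# the k3s binary; consider archiving them too (the archive name is an example)
# tar -czvf /root/backup-k3s-bin.tar.gz /etc/systemd/system/k3s* /usr/local/bin/k3s
# Optionally verify the archive contents before proceeding
# tar -tzf /root/backup-k3s.tar.gz | head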

# dnf install leapp-upgrade --enablerepo=ol8_leapp,ol8_appstream,ol8_baseos_latest

# dnf config-manager --set-disabled ol8_UEKR6
# dnf config-manager --set-enabled ol8_UEKR7
# dnf update -y
# reboot
# It should return 4096 instead of 65536 after reboot
# getconf PAGESIZE
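
As a sketch of the suggestion above about draining the other worker nodes too (node names are from my cluster; adjust them to yours):

# Repeat for each remaining worker node
# kubectl cordon node02
# kubectl drain node02 --ignore-daemonsets --delete-emptydir-data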
  • Zone Drifting should be off

        # sed -i "s/^AllowZoneDrifting=.*/AllowZoneDrifting=no/" /etc/firewalld/firewalld.conf
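
A quick check that the setting was applied:

# grep AllowZoneDrifting /etc/firewalld/firewalld.conf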
  • Run the preupgrade

        # --oci is correct for my case
# leapp preupgrade --oci
## --oraclelinux is for on-premises; using it on *OCI* may render your node un-bootable
## leapp preupgrade --oraclelinux
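
On the Leapp version I used, the preupgrade findings are written under /var/log/leapp/; reviewing the report and resolving any inhibitors before the actual upgrade avoids a failed run:

# less /var/log/leapp/leapp-report.txt
# Some checks record questions that must be answered before the upgrade
# cat /var/log/leapp/answerfile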
  • Next is the actual upgrade process

        # The instance is on OCI
# leapp upgrade --oci
# Use --oraclelinux for on-premises deployments

# Verify the output in /var/log/leapp/

# reboot

The master node is not bootable

BAM! There is no response from the master node. Unlike in my previous article, this upgrade failed.

These are the messages displayed when the node rebooted. After a quick search on the web, the failure seems related to hardware firmware, UEFI support and Secure Boot.

EFI stub: Booting Linux Kernel...
EFI stub: EFI_RNG_PROTOCOL unavailable
EFI stub: Using DTB from configuration table
EFI stub: Exiting boot services...

I decided to delete this instance and install a new one with the same hostname.

Reinstall the master node

Then I terminated the node in the OCI console and re-created it with the same hostname as before. The platform installation image is still OL 9.0; after it was provisioned, I upgraded it to OL 9.1.
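
A minimal sketch of bringing the freshly provisioned node from OL 9.0 to 9.1:

# dnf update -y
# reboot
# Verify the release after the reboot
# cat /etc/oracle-release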

  • Restore the necessary files from the backup for K3S (a restore sketch follows this list)

    • /etc/rancher

    • /etc/systemd/system/k3s*

    • /var/lib/rancher/k3s/server

    • /usr/local/bin/k3s
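
A minimal restore sketch, assuming the tarball from the backup step has been copied to the new node, and that the systemd unit files and the k3s binary were saved separately (the tarball created earlier does not contain them):

# The archive stores paths relative to /, so extract it at /
# tar -xzvf /root/backup-k3s.tar.gz -C /
# Put back /etc/systemd/system/k3s* and /usr/local/bin/k3s from your separate
# copy, then make sure the binary is executable and reload systemd
# chmod +x /usr/local/bin/k3s
# systemctl daemon-reload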

Then start the K3S server

        # systemctl enable --now k3s
# systemctl status k3s
Output of systemctl status k3s:
● k3s.service - Lightweight Kubernetes
     Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: disabled)
     Active: active (running) since Sat 2022-11-26 12:53:30 GMT; 5h 12min ago
       Docs: https://k3s.io
    Process: 2348 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
    Process: 2350 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 2355 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
   Main PID: 2359 (k3s-server)
      Tasks: 309
     Memory: 1.3G
        CPU: 38min 56.002s
     CGroup: /system.slice/k3s.service
             ├─2359 "/usr/local/bin/k3s server"
             ├─2387 containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd -->
             ├─2571 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─2693 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─2723 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─3049 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─4302 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─4339 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─4372 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─4509 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─4580 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─4841 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─5034 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─5373 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─5433 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─5531 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─5793 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─6144 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             ├─6511 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >
             └─6669 /var/lib/rancher/k3s/data/03319a42bd191a541dd2fb18e572bf84e43905984afb83f1aca41e70cf220067/bin/containerd-shim-runc-v2 -namespace k8s.io >

Nov 26 18:05:37 control01 k3s[2359]: time="2022-11-26T18:05:37Z" level=debug msg="cgroupv2 io stats: skipping over unmappable dbytes=0 entry"
Nov 26 18:05:37 control01 k3s[2359]: time="2022-11-26T18:05:37Z" level=debug msg="cgroupv2 io stats: skipping over unmappable dios=0 entry"
Nov 26 18:05:37 control01 k3s[2359]: time="2022-11-26T18:05:37Z" level=debug msg="cgroupv2 io stats: skipping over unmappable dbytes=0 entry"
Nov 26 18:05:37 control01 k3s[2359]: time="2022-11-26T18:05:37Z" level=debug msg="cgroupv2 io stats: skipping over unmappable dios=0 entry"
        # kubectl get nodes
NAME        STATUS                     ROLES                  AGE     VERSION
node07      Ready                      <none>                 8d      v1.24.8+k3s1
control01   Ready,SchedulingDisabled   control-plane,master   507d    v1.24.8+k3s1
node02      NotReady                   <none>                 7h33m   v1.24.8+k3s1
node06      NotReady                   <none>                 56d     v1.24.8+k3s1

# kubectl uncordon control01

node02 and node06 were not responding via SSH, so I performed a force reboot on those instances. The K3S cluster automagically recovered to its previous state.
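
To double-check such a recovery, it is usually enough to confirm that every node is Ready and that no pod is stuck outside the Running phase (completed Jobs will also show up in the second command):

# kubectl get nodes
# kubectl get pods -A --field-selector=status.phase!=Running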

