KubeZero 1.24

TODO

Cilium added as a second CNI to prepare for the full migration to Cilium with the 1.24 upgrade
support for Nvidia g5 instances incl. pre-installed kernel drivers, CUDA toolchain and CRI integration
updated inf1 Neuron drivers
ExtendedResourceToleration AdmissionController and auto-taints allowing Neuron and Nvidia pods ONLY to be scheduled on dedicated workers
full Cluster-Autoscaler integration
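
To see which workers are reserved for Neuron or Nvidia workloads, one option is to list the taint keys on each node. The sketch below is only an illustration: the function name `list_node_taints` is hypothetical, and it assumes `kubectl` is already pointed at the cluster. The upstream ExtendedResourceToleration admission plugin adds tolerations for taints whose key is an extended resource name, so dedicated workers should show such a key.

```shell
#!/bin/sh
# Sketch only: print every node together with the keys of its taints.
# list_node_taints is a hypothetical helper, not part of KubeZero.
list_node_taints() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,TAINT_KEYS:.spec.taints[*].key'
}
```

Calling `list_node_taints` prints one line per node; nodes without taints show `<none>` in the second column.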

(No, really, you MUST read this before you upgrade)

Enable containerProxy for NAT instances and upgrade the NAT instances using the new V2 Pulumi stacks
Review the CFN config for controller and workers (enable containerProxy, remove legacy version settings, etc.)
Upgrade CFN stacks for the control plane and all worker groups
Trigger fully-automated cluster upgrade:
./admin/upgrade_cluster.sh <path to the argocd app kubezero yaml for THIS cluster>
Reboot the controller(s) one by one
Each time, wait for the controller to rejoin and for all pods to be running. This might take a while ...
Launch a new set of workers, e.g. by doubling the desired capacity of each worker ASG
Once the new workers are ready, cordon and drain all old workers
The cluster-autoscaler will remove the old workers automatically after about 10 min!
If all looks good, commit the ArgoApp resource for KubeZero before re-enabling ArgoCD itself.
git add / commit / push <cluster/env/kubezero/application.yaml>
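
The worker rotation in the steps above (doubling the desired capacity, then cordoning and draining the old workers) could be sketched roughly as follows. This is an illustration, not part of the KubeZero tooling: the function names and the ASG and node names in the usage note are hypothetical. It assumes the `aws` CLI and `kubectl` are configured for the cluster and that each ASG's maximum size allows the doubled capacity.

```shell
#!/bin/sh
# Sketch only: helpers for rotating workers during the upgrade.
# Both function names are hypothetical and not shipped with KubeZero.

# Double the desired capacity of one worker ASG.
# Assumption: the ASG's MaxSize is at least twice the current desired capacity,
# otherwise set-desired-capacity is rejected by AWS.
double_desired_capacity() {
  asg="$1"
  current=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$asg" \
    --query 'AutoScalingGroups[0].DesiredCapacity' \
    --output text) || return 1
  aws autoscaling set-desired-capacity \
    --auto-scaling-group-name "$asg" \
    --desired-capacity $((current * 2))
}

# Cordon all given (old) nodes first, then drain them one by one.
# Cordoning everything up front keeps evicted pods from landing on
# another old node that is about to be drained as well.
drain_old_workers() {
  for node in "$@"; do
    kubectl cordon "$node" || return 1
  done
  for node in "$@"; do
    # DaemonSet pods are left in place; emptyDir data on the node is discarded.
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
  done
}
```

A possible invocation, with placeholder names: record the old node names first (for example from `kubectl get nodes`), run `double_desired_capacity my-worker-asg`, wait until the new workers are Ready, then run `drain_old_workers old-node-1 old-node-2`.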