docs: upgrade docs for 1.23

Stefan Reimer 2022-09-29 20:54:55 +02:00
parent f66bc6bfa0
commit 7dd5efb571
5 changed files with 29 additions and 99 deletions


@@ -38,7 +38,8 @@ clean: rm-test-image rm-image
 .PHONY: rm-remote-untagged
 rm-remote-untagged:
 	@echo "Removing all untagged images from $(IMAGE) in $(REGION)"
-	@aws ecr-public batch-delete-image --repository-name $(IMAGE) --region $(REGION) --image-ids $$(for image in $$(aws ecr-public describe-images --repository-name $(IMAGE) --region $(REGION) --output json | jq -r '.imageDetails[] | select(.imageTags | not ).imageDigest'); do echo -n "imageDigest=$$image "; done)
+	@IMAGE_IDS=$$(for image in $$(aws ecr-public describe-images --repository-name $(IMAGE) --region $(REGION) --output json | jq -r '.imageDetails[] | select(.imageTags | not ).imageDigest'); do echo -n "imageDigest=$$image "; done) ; \
+	[ -n "$$IMAGE_IDS" ] && aws ecr-public batch-delete-image --repository-name $(IMAGE) --region $(REGION) --image-ids $$IMAGE_IDS || echo "Nothing to remove"
 .PHONY: rm-image
 rm-image:
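For reference, a minimal standalone sketch of the same guard outside of Make (repository name and region are placeholders): `batch-delete-image` is only invoked when untagged digests were actually found, instead of failing on an empty `--image-ids` argument.

```bash
#!/bin/sh
# Placeholders, not the real values used by the Makefile
IMAGE=my-repo
REGION=us-east-1

# Collect digests of all untagged images as "imageDigest=<sha> ..." pairs
IMAGE_IDS=$(aws ecr-public describe-images --repository-name "$IMAGE" --region "$REGION" --output json \
  | jq -r '.imageDetails[] | select(.imageTags | not).imageDigest' \
  | sed 's/^/imageDigest=/' | tr '\n' ' ')

# Only call batch-delete-image if there is anything to delete;
# $IMAGE_IDS is intentionally unquoted so each imageDigest=... becomes its own argument
if [ -n "$IMAGE_IDS" ]; then
  aws ecr-public batch-delete-image --repository-name "$IMAGE" --region "$REGION" --image-ids $IMAGE_IDS
else
  echo "Nothing to remove"
fi
```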


@@ -137,6 +137,9 @@ function _helm() {
   local action=$1
   local module=$2
+  # check if module is even enabled and return if not
+  [ ! -f $WORKDIR/kubezero/templates/${module}.yaml ] && { echo "Module $module disabled. No-op."; return 0; }
   local chart="$(yq eval '.spec.source.chart' $WORKDIR/kubezero/templates/${module}.yaml)"
   local namespace="$(yq eval '.spec.destination.namespace' $WORKDIR/kubezero/templates/${module}.yaml)"
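For illustration, what those `yq` lookups return against a stripped-down Argo CD Application template (file name, chart and namespace below are made up for the example):

```bash
# Hypothetical module template matching the structure the function reads
cat > /tmp/istio.yaml <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
spec:
  destination:
    namespace: istio-system
  source:
    chart: kubezero-istio
EOF

yq eval '.spec.source.chart' /tmp/istio.yaml          # -> kubezero-istio
yq eval '.spec.destination.namespace' /tmp/istio.yaml # -> istio-system

# A missing template file now simply means "module disabled"
[ ! -f /tmp/does-not-exist.yaml ] && echo "Module disabled. No-op."
```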


@@ -2,7 +2,7 @@ apiVersion: v2
 name: kubezero-sql
 description: KubeZero umbrella chart for SQL databases like MariaDB, PostgreSQL
 type: application
-version: 0.2.0
+version: 0.2.1
 home: https://kubezero.com
 icon: https://cdn.zero-downtime.net/assets/kubezero/logo-small-64.png
 keywords:
@@ -17,7 +17,7 @@ dependencies:
     version: ">= 0.1.5"
     repository: https://cdn.zero-downtime.net/charts/
   - name: mariadb-galera
-    version: 7.4.2
+    version: 7.4.3
     repository: https://charts.bitnami.com/bitnami
     condition: mariadb-galera.enabled
 kubeVersion: ">= 1.20.0"
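After bumping the pinned dependency, the chart lock typically needs to be refreshed as well; a quick sketch (the chart path is inferred from the chart name, not shown in this diff):

```bash
# Pull the updated mariadb-galera 7.4.3 dependency and regenerate Chart.lock
helm repo add bitnami https://charts.bitnami.com/bitnami
helm dependency update charts/kubezero-sql
```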


@@ -1,5 +1,5 @@
 mariadb-galera:
-  enabled: true
+  enabled: false
   replicaCount: 2
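Since the sub-chart is now disabled by default, clusters that still want MariaDB Galera have to opt back in explicitly; a hedged example (release name, repo alias and namespace are assumptions, not taken from this repo):

```bash
# Re-enable the mariadb-galera sub-chart at install/upgrade time
helm upgrade --install kubezero-sql kubezero/kubezero-sql \
  --namespace mariadb --create-namespace \
  --set mariadb-galera.enabled=true \
  --set mariadb-galera.replicaCount=2
```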


@@ -2,115 +2,41 @@
 ## What's new - Major themes
-- update inf1 neuron drivers incl. node auto-taints
-- support for Nvidia g5 instances incl. the whole toolchain up to device drivers etc, auto node taints
-- ExtendedResourceToleration AdmissionController enabled to auto tolerate INF1 and Nvidia pods
-- Cluster-Autoscaler
+- Cilium added as a second CNI to prepare the full migration to Cilium with the 1.24 upgrade
+- support for Nvidia g5 instances incl. pre-installed kernel drivers, CUDA toolchain and CRI integration
+- updated inf1 neuron drivers
+- ExtendedResourceToleration AdmissionController and auto-taints, allowing Neuron and Nvidia pods ONLY to be scheduled on dedicated workers
+- full Cluster-Autoscaler integration
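As a rough illustration of the auto-taint plus ExtendedResourceToleration combination mentioned above (node name and taint key are hypothetical; the exact taints KubeZero applies may differ): dedicated workers get tainted with the extended resource name, and the admission plugin adds a matching toleration only to pods that actually request that resource.

```bash
# Hypothetical example: a GPU worker tainted with its extended resource name.
# Ordinary pods are repelled by the taint; a pod requesting nvidia.com/gpu
# gets a toleration for this key added automatically by the
# ExtendedResourceToleration admission plugin.
kubectl taint nodes ip-10-0-1-23.ec2.internal nvidia.com/gpu=present:NoSchedule

# Inspect the result
kubectl describe node ip-10-0-1-23.ec2.internal | grep -A1 Taints
```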
-### Alpine - Custom AMIs
-Starting with 1.22, all KubeZero nodes boot using custom AMIs. These AMIs will be provided and shared by the Zero Down Time for all customers. As always, all sources incl. the build pipeline are freely available [here](https://git.zero-downtime.net/ZeroDownTime/alpine-zdt-images).
-This eliminates *ALL* dependencies at boot time other than container registries. Gone are the days when Ubuntu, SuSE or Github decided to ruin your morning coffee.
-KubeZero migrates from Ubuntu 20.04 LTS to [Alpine v3.15](https://www.alpinelinux.org/releases/) as its base OS.
-#### Highlights:
-- minimal attack surface by removing all unnecessary bloat,
-  like all things SystemD, Ubuntu's snap, etc
-- reduced root file system size from 8GB to 2GB
-- minimal memory consumption of about 12MB fully booted
-*Minimal* fully booted instance incl. SSH and Monit:
-| | Ubuntu | Alpine |
-|-|--------|--------|
-| Memory used | 60MB | 12 MB |
-| RootFS used | 1.1GB | 330 MB |
-| RootFS encrypted | no | yes |
-| Kernel | 5.11 | 5.15 |
-| Init | Systemd | OpenRC |
-| AMI / EBS size | 8GB | 1GB |
-| Boot time | ~120s | ~45s |
-- Encrypted AMIs:
-  This closes the last gaps you might have in achieving *full encryption at rest* for every volume within a default KubeZero deployment.
-### Etcd
-On AWS a new dedicated GP3 EBS volume gets provisioned per controller and is used as persistent etcd storage. These volumes will persist for the life time of the cluster and reused by future controller nodes in each AZ.
-This ensure no data loss during upgrade or restore situations of single controller clusters. The hourly backup on S3 will still be used as fallback / disaster recovery option in case the file system gets corrupted etc.
-### DNS
-The [external-dns](https://github.com/kubernetes-sigs/external-dns) controller got integrated and is used to provide DNS based loadbalacing for the apiserver itself. This allows high available control planes on AWS as well as bare-metal in combination with various DNS providers.
-Further usage of this controller to automate any DNS related configurations, like Ingress etc. is planned for following releases.
-### Container runtime
-Cri-o now uses crun rather than runc, which reduces the memory overhead *per pod* from 16M to 4M, details at [crun intro](https://www.redhat.com/sysadmin/introduction-crun)
-With 1.22 and the switch to crun, support for [CgroupV2](https://www.kernel.org/doc/Documentation/cgroup-v2.txt) has been enabled.
-### AWS Neuron INF support
-Initial support for [Inf1 instances](https://aws.amazon.com/ec2/instance-types/inf1/) part of [AWS Neuron](https://aws.amazon.com/machine-learning/neuron/).
-Workers automatically load the custom kernel module on these instance types and expose the `/dev/neuron*` devices.
 ## Version upgrades
-- Istio to 1.13.3 using the new Helm [gateway charts](https://istio.io/latest/docs/setup/additional-setup/gateway/)
-- Logging: ECK operator upgraded from 1.6 to 2.1, fluent-bit 1.9.3
-- Metrics: Prometheus and all Grafana charts to latest to match V1.22
-- ArgoCD to V2.2.5
+- Istio to 1.14.4
+- Logging: ECK operator to 2.4, fluent-bit 1.9.8
+- Metrics: Prometheus and all Grafana charts to latest to match V1.23
+- ArgoCD to V2.4 ( access to pod via shell disabled by default )
 - AWS EBS/EFS CSI drivers to latest versions
-- cert-manager to V1.8
-- aws-termination-handler to 1.16
-- aws-iam-authenticator to 0.5.7, required for >1.22 which allows using the latest version on the client side again
-## Misc
-- new metrics and dashboards for openEBS LVM CSI drivers
-- new node label `node.kubernetes.io/instance-type` for all nodes containing the EC2 instance type
-- kubelet root moved to `/var/lib/containers` to ensure ephemeral storage is allocated from the configurable volume rather than the root fs of the worker
+- cert-manager to V1.9.1
 # Upgrade
 `(No, really, you MUST read this before you upgrade)`
 - Ensure your Kube context points to the correct cluster !
-- Ensure any usage of Kiam has been migrated to OIDC providers as any remaining Kiam components will be deleted as part of the upgrade
-1. Migrate ArgoCD KubeZero config:
-   `cat <cluster/env/kubezero/application.yaml> | ./releases/v1.22/migrate_agro.py` and adjust if needed and replace the original. Do NOT commit yet !
-2. Upgrade `logging` and `metrics` module
-   - `kubectl get crd elasticsearches.elasticsearch.k8s.elastic.co && kubectl replace -f https://download.elastic.co/downloads/eck/2.1.0/crds.yaml` CRDs for logging
-   - `./bootstrap.sh apply logging <env>` logging module to support new OS coming with 1.22
-   - `./bootstrap.sh crds metrics <env>` CRDs for metrics
-   - `./bootstrap.sh apply metrics <env>` to get new exporters in place to support 1.22
-3. Trigger the cluster upgrade:
-   `./release/v1.22/upgrade_cluster.sh`
-4. Upgrade CFN stacks for the control plane and all worker groups
-   Change Kubernetes version in controller config from `1.21.9` to `1.22.8`
+1. Enable `containerProxy` for NAT instances and upgrade the NAT instance using the new V2 Pulumi stacks
+2. Review CFN config for controller and workers ( enable containerProxy, remove legacy version settings etc. )
+3. Upgrade CFN stacks for the control plane and all worker groups
+4. Trigger the fully-automated cluster upgrade:
+   `./admin/upgrade_cluster.sh <path to the argocd app kubezero yaml for THIS cluster>`
 5. Reboot controller(s) one by one
    Wait each time for the controller to join and all pods to be running.
    Might take a while ...
-6. Launch new set of workers, at least enough to host new Istio Ingress gateways due to Kernel requirements
-   Eg. by doubling `desired` for each worker ASG,
+6. Launch a new set of workers, e.g. by doubling `desired` for each worker ASG
+   once the new workers are ready, cordon and drain all old workers
+   The cluster-autoscaler will remove the old workers automatically after about 10min !
-7. Upgrade via boostrap.sh
-   As the changes around Istio are substantial in this release we need to upgrade some parts step by step to prevent service outages, especially for private-ingress.
-   - `./bootstrap.sh crds all <env>` to deploy all new CRDs first
-   - `./bootstrap.sh apply cert-manager <env>` to update cert-manager, required for Istio
-   - `./bootstrap.sh apply istio <env>` to update the Istio control plane
-   - `./bootstrap.sh apply istio-private-ingress <env>` to deploy the new private-ingress gateways first
-   - `./bootstrap.sh apply istio-ingress <env>` to update the public ingress and also remove the 1.21 private-ingress gateways
-8. Finalize via ArgoCD
-   git add / commit / push `<cluster/env/kubezero/application.yaml>` and watch ArgoCD do its work.
-9. Drain old workers
-   Drain one by one and reset each ASG to initial "desired" value.
+7. If all looks good, commit the ArgoApp resource for KubeZero before re-enabling ArgoCD itself.
+   git add / commit / push `<cluster/env/kubezero/application.yaml>`
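For step 6, a rough sketch of the cordon-and-drain loop over the old workers (the node names are placeholders; identify the old workers however fits your setup), relying on the cluster-autoscaler to remove the drained nodes afterwards:

```bash
# Placeholder node names: substitute the actual old workers
OLD_NODES="ip-10-0-1-23.ec2.internal ip-10-0-2-42.ec2.internal"

for node in $OLD_NODES; do
  kubectl cordon "$node"
  # usual flags needed to evict everything that can be evicted
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
done
```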