Kubernetes Upgrades Are Officially Boring
In the olden days (a few years ago), upgrading a Kubernetes cluster was a stressful task that could bring the system down, or at the very least cause major disruption, a deployment moratorium, and significant toil for most of the engineering team. I'm happy to report that this is no longer the case.
Present day - California
Yesterday I upgraded multiple EKS clusters from version 1.28 to version 1.31. It was the most uneventful upgrade ever. It just worked!
I upgraded the control plane of each cluster from 1.28 → 1.29 → 1.30 → 1.31, and then upgraded all node groups from 1.28 → 1.31 in one fell swoop.
I ran Pluto before I started to ensure we weren't using any deprecated or removed resources and APIs. We didn't have any! This is not a stroke of luck. Kubernetes has become more stable. There are always Alpha and Beta versions of new resources, but the fundamental resources have been generally available for years. The innovation happens at the edges or remains backward-compatible.
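For illustration, here is a toy version of the kind of check a tool like Pluto performs: match each manifest's apiVersion/kind against a table of removed APIs. The table below is a small hand-picked subset for the sketch, not Pluto's actual data, and the parsing is deliberately naive.

```python
# Toy sketch of a removed-API check, in the spirit of Pluto.
# REMOVED_IN maps (apiVersion, kind) to the Kubernetes version that
# stopped serving it. This is a small illustrative subset only.
REMOVED_IN = {
    ("apps/v1beta1", "Deployment"): "1.16",
    ("apps/v1beta2", "Deployment"): "1.16",
    ("extensions/v1beta1", "NetworkPolicy"): "1.16",
    ("policy/v1beta1", "PodSecurityPolicy"): "1.25",
}

def find_removed(manifest, target):
    """Return warnings for resources no longer served at the target version."""
    api_version = kind = None
    # Naive top-level field scan; a real tool parses the YAML properly.
    for line in manifest.splitlines():
        if line.startswith("apiVersion:"):
            api_version = line.split(":", 1)[1].strip()
        elif line.startswith("kind:"):
            kind = line.split(":", 1)[1].strip()
    removed = REMOVED_IN.get((api_version, kind))
    if removed and tuple(map(int, target.split("."))) >= tuple(map(int, removed.split("."))):
        return [f"{kind} {api_version} was removed in {removed}"]
    return []

print(find_removed("apiVersion: apps/v1beta1\nkind: Deployment\n", "1.16"))
# → ['Deployment apps/v1beta1 was removed in 1.16']
```

In practice you would just run the Pluto CLI against your manifests or cluster; the point here is only that the check is a static lookup, which is why it is fast and safe to run before every upgrade.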
Kubernauts don't have to “live in interesting times” anymore.
The Olden days (Circa 2020)
Kubernetes 1.16 was released on September 18, 2019. This was a major release since Kubernetes stopped serving several important API groups, such as:
- extensions/v1beta1
- apps/v1beta1
- apps/v1beta2
You may not be on a first-name basis with API groups from 15 versions ago, so let me bring you up to speed. All these critical resources were served from those API groups:
- Deployment
- StatefulSet
- ReplicaSet
- DaemonSet
- NetworkPolicy
- PodSecurityPolicy
If you upgraded from 1.15 to 1.16 and tried to apply an apps/v1beta1 Deployment, Kubernetes would return an error like this:
echo 'apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: example
  labels:
    app: example
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: example
        image: nginx
        ports:
        - containerPort: 80
  selector:
    matchLabels:
      app: example' | kubectl apply -f -
error: resource mapping not found for name: "example" namespace: "" from "STDIN": no matches for
kind "Deployment" in version "apps/v1beta1" ensure CRDs are installed first
At the time, I worked for a company managing tens of clusters and hundreds of workloads. Engineers constantly pushed changes, fiddling with manifests.
Kubernetes clusters can only move forward. Once upgraded, there is no way to roll back, so getting it right the first time was critical. Tools similar to Pluto existed, but the stakes were too high and I didn't trust them completely. We ran on GKE with lots of CRDs and operators generating resources dynamically.
So, I developed a tool called “Upgradanator” to reduce upgrade risks. The tool replicated a live cluster and all its resources, then tested for incompatibilities.
Enter the Upgradanator
The Upgradanator provisioned a new 1.16 GKE cluster without node pools. It then generated YAML manifests for every resource that required upgrading.
Gigi, you may ask, how did it generate the YAML? Using kubectl get all? Absolutely not! That command is limited and doesn't return all resources, especially not CRDs. Instead, I used a neat kubectl plugin called ketall for the job.
Once the Upgradanator collected all the YAML manifests, it applied them in dry-run mode on the 1.16 cluster, which had no nodes, saving costs. Dry-run mode provides a complete compatibility check from the API server without provisioning any resources.
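Schematically, that flow reduces to: export everything, replay each manifest against the new API server in dry-run mode, and collect the rejections. Here is a sketch of that loop with the ketall and kubectl invocations stubbed out as injected functions (the names are mine, not the original tool's):

```python
# Schematic of the Upgradanator's flow. The real tool shelled out to
# ketall (to export manifests) and kubectl apply in dry-run mode; here
# those are injected as plain functions so the loop itself is runnable.
def upgradanator(export_all, dry_run_apply):
    """export_all() -> list of YAML manifests from the live cluster.
    dry_run_apply(manifest) -> error string, or None if the new API
    server accepts the resource. Returns all incompatibilities found."""
    failures = []
    for manifest in export_all():
        error = dry_run_apply(manifest)
        if error:
            failures.append(error)
    return failures

# Stubbed demo: a manifest the 1.16 API server would reject.
manifests = ["apiVersion: apps/v1beta1\nkind: Deployment"]
reject = lambda m: f"no matches for {m.splitlines()[0]}" if "v1beta1" in m else None
print(upgradanator(lambda: manifests, reject))
# → ['no matches for apiVersion: apps/v1beta1']
```

The useful property of this shape is that the expensive, risky part (a real upgrade) is replaced by cheap read-only questions to the new API server, so the whole cluster's inventory can be validated before anything moves.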
These days, we can take it further using vCluster, creating virtual clusters for testing instead of real ones.
Shout out to everyone involved in reliably and predictably releasing rock-solid versions of Kubernetes every 3 months.