What is the process performed by Rancher v2.x when upgrading a Rancher managed Kubernetes cluster?
- Rancher v2.0.x - v2.3.x. Note: the Kubernetes upgrade process changes in v2.4.x; see the links at the end of this article.
- RKE CLI v0.2.x+
Rancher can upgrade, through either the UI or the API, a Kubernetes cluster that was provisioned using the "Custom" option or on cloud infrastructure such as AWS EC2 or Azure. To do so, edit the cluster and select the desired Kubernetes version. Clusters provisioned with the RKE CLI can also be upgraded by editing the kubernetes_version key in the cluster YAML file and then running rke up. Either method triggers an update of all the Kubernetes components in the order listed below:
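For RKE CLI clusters, the edit to the cluster YAML file can be scripted. The following is a minimal sketch only: it treats the file as plain text and rewrites the top-level kubernetes_version line. The helper name set_kubernetes_version, the node address, and the version strings are illustrative assumptions; the versions you can actually select depend on your RKE release.

```python
import re


def set_kubernetes_version(cluster_yaml: str, version: str) -> str:
    """Return the cluster YAML text with the top-level kubernetes_version set."""
    line = f"kubernetes_version: {version}"
    pattern = re.compile(r"^kubernetes_version:.*$", flags=re.MULTILINE)
    if pattern.search(cluster_yaml):
        # Replace the existing key; a function is used so the replacement
        # text is never interpreted as a regex template.
        return pattern.sub(lambda _match: line, cluster_yaml, count=1)
    # Key not present yet: append it as a new top-level key.
    return cluster_yaml.rstrip("\n") + "\n" + line + "\n"


# Example values only; replace with your own nodes and target version.
example = """nodes:
  - address: 10.0.0.1
    user: rancher
    role: [controlplane, etcd, worker]
kubernetes_version: v1.16.8-rancher1-1
"""

updated = set_kubernetes_version(example, "v1.17.4-rancher1-1")
print(updated)
```

After saving the modified file, the upgrade itself is still started by running rke up against it.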
Each etcd container is updated, one node at a time. If the etcd version has not changed between versions of Kubernetes, no action is taken. The process consists of:
- Downloading etcd image
- Stopping and renaming old etcd container (backend datastore is preserved on host)
- Creating and starting new etcd container
- Running etcd health check
- Removing old etcd container
For RKE CLI provisioned clusters, the etcd-rolling-snapshot container is also upgraded if a new version is available.
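The etcd steps above can be pictured with a small in-memory simulation. This is a sketch for illustration, not Rancher's or RKE's implementation: nodes are plain dictionaries, no container runtime is called, and the function names, the old- prefix, and the image names are assumptions made for the example.

```python
def upgrade_container(node, name, new_image, healthy):
    """Replace one container on one node and return the steps taken."""
    steps = []
    current = node["containers"].get(name)
    if current == new_image:
        return steps  # image unchanged: no action is taken

    steps.append(f"download {new_image}")
    old_name = f"old-{name}"
    if current is not None:
        # Keep the old container around under a new name until the
        # replacement has proven healthy.
        node["containers"][old_name] = node["containers"].pop(name)
        steps.append(f"stop and rename {name} to {old_name}")

    node["containers"][name] = new_image
    steps.append(f"create and start {name}")

    if not healthy(node, name):
        raise RuntimeError(f"{name} failed its health check on {node['address']}")
    steps.append(f"health check passed for {name}")

    if old_name in node["containers"]:
        del node["containers"][old_name]
        steps.append(f"remove {old_name}")
    return steps


def upgrade_etcd(nodes, new_image, healthy=lambda node, name: True):
    """Upgrade etcd one node at a time, in list order."""
    log = {}
    for node in nodes:
        log[node["address"]] = upgrade_container(node, "etcd", new_image, healthy)
    return log


# Placeholder addresses and image names.
etcd_nodes = [
    {"address": "10.0.0.1", "containers": {"etcd": "etcd:old"}},
    {"address": "10.0.0.2", "containers": {"etcd": "etcd:old"}},
]
for address, steps in upgrade_etcd(etcd_nodes, "etcd:new").items():
    print(address, steps)
```

In this sketch a failed health check raises before the old container is removed, so later nodes are never touched and the renamed container is still present on the failing node.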
Every Kubernetes upgrade requires the control plane components to be updated. All control plane nodes are updated in parallel. The process consists of:
- Downloading hyperkube image, which is used by all control plane components.
- Stopping and renaming old kube-apiserver container
- Creating and starting new kube-apiserver container
- Running kube-apiserver health check
- Removing old kube-apiserver container
- Stopping and renaming old kube-controller-manager container
- Creating and starting new kube-controller-manager container
- Running kube-controller-manager health check
- Removing old kube-controller-manager container
- Stopping and renaming old kube-scheduler container
- Creating and starting new kube-scheduler container
- Running kube-scheduler health check
- Removing old kube-scheduler container
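To make the ordering concrete, the sketch below builds the step list for each control plane node in a thread pool: nodes are handled in parallel, while the three components on any single node are replaced in the order listed above. It only produces strings and performs no real work; the function names and addresses are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

CONTROL_PLANE_COMPONENTS = [
    "kube-apiserver",
    "kube-controller-manager",
    "kube-scheduler",
]


def control_plane_steps(address):
    """Return the ordered upgrade steps for a single control plane node."""
    steps = [f"{address}: download hyperkube image"]
    for component in CONTROL_PLANE_COMPONENTS:
        steps += [
            f"{address}: stop and rename old {component}",
            f"{address}: create and start new {component}",
            f"{address}: run {component} health check",
            f"{address}: remove old {component}",
        ]
    return steps


def upgrade_control_plane(addresses):
    """Process all control plane nodes in parallel, one worker per node."""
    with ThreadPoolExecutor(max_workers=max(1, len(addresses))) as pool:
        # pool.map returns results in the same order as the input addresses.
        return dict(zip(addresses, pool.map(control_plane_steps, addresses)))


plan = upgrade_control_plane(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
for address, steps in plan.items():
    print(address, len(steps), "steps")
```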
Every Kubernetes upgrade requires the worker components to be updated. These components run on all nodes, including control plane and etcd nodes. Nodes are updated in parallel. The process consists of:
- Downloading hyperkube image (if not already present)
- Stopping and renaming old kubelet container
- Creating and starting new kubelet container
- Running kubelet health check
- Removing old kubelet container
- Stopping and renaming old kube-proxy container
- Creating and starting new kube-proxy container
- Running kube-proxy health check
- Removing old kube-proxy container
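Because the worker components run on every node, the set of containers replaced on a given node depends on its roles. The following sketch summarizes this using only the components named in this article; the role names follow the RKE cluster YAML (etcd, controlplane, worker), and the function name is an assumption for illustration.

```python
# Components named in this article, grouped by the role that adds them.
COMPONENTS_BY_ROLE = {
    "etcd": ["etcd"],
    "controlplane": [
        "kube-apiserver",
        "kube-controller-manager",
        "kube-scheduler",
    ],
}
# Worker components run on all nodes regardless of role.
WORKER_COMPONENTS = ["kubelet", "kube-proxy"]


def components_for(roles):
    """Return the components upgraded on a node, in upgrade order."""
    components = []
    for role in ("etcd", "controlplane"):  # etcd first, then control plane
        if role in roles:
            components += COMPONENTS_BY_ROLE[role]
    return components + WORKER_COMPONENTS  # worker components come last


print(components_for(["worker"]))
print(components_for(["etcd", "controlplane", "worker"]))
```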
Addons & user workloads
Once the Kubernetes etcd, control plane, and worker components have been updated, the latest manifests for addons are applied. This includes, but is not limited to, KubeDNS/CoreDNS, Nginx Ingress, Metrics Server, and the CNI plugin (Calico, Weave, Flannel, or Canal). Depending on the manifest deltas and the upgrade strategy defined in each manifest, pods and their corresponding containers may or may not be removed and recreated. Some of these addons are critical for your cluster to operate correctly, and you may experience brief outages if these workloads are restarted. For example, when KubeDNS/CoreDNS is restarted, you could have issues resolving hostnames to IP addresses. When the Nginx Ingress is restarted, layer 7 HTTP/HTTPS traffic from outside your cluster to your workloads may be interrupted. When your CNI plugin is restarted on a node, the workloads running on that node may temporarily be unable to reach workloads running on other nodes. The best way to minimize outages or disruptions is to make sure you have proper fault tolerance in your cluster.
The kubelet automatically destroys and recreates a user workload pod when its spec hash value changes. This value will change for a pod if the Kubernetes upgrade involves any field changes in the pod manifest, such as a new field or the removal of a deprecated field. As a best practice, assume that all your pods and containers will be destroyed and recreated during a Kubernetes upgrade. This is more likely for major and minor releases than for patch releases.
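The effect of a field change on a hash can be illustrated with a short sketch. It is not the kubelet's algorithm: it assumes, for illustration only, that the hash is a SHA-256 digest of the spec serialized as JSON, and the field name newField is hypothetical. The point is simply that adding or removing any field produces a different hash even when nothing the user wrote has changed.

```python
import hashlib
import json


def spec_hash(spec):
    """Hash a spec via a canonical JSON encoding (illustrative only)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


def needs_restart(old_spec, new_spec):
    """A changed hash means the container would be recreated."""
    return spec_hash(old_spec) != spec_hash(new_spec)


before_upgrade = {"name": "web", "image": "nginx:1.17"}
# Same user-supplied values, but the newer version adds a defaulted field.
after_upgrade = {"name": "web", "image": "nginx:1.17", "newField": "default"}

print(needs_restart(before_upgrade, dict(before_upgrade)))
print(needs_restart(before_upgrade, after_upgrade))
```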
Further Reading
Upgrade refactor in v2.4: https://github.com/rancher/rancher/issues/23038
Kubeadm upgrades: https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/