How to perform a rolling change to nodes


Task

In a Kubernetes cluster, nodes can be treated as ephemeral building blocks that provide the resources necessary for all workloads. Managing nodes in an immutable way is particularly common in cloud environments.

In an on-premises environment, however, nodes are often recycled and updated; in general, nodes there have a longer lifecycle.

Nodes may need significant changes over time, for example: new IP addresses, storage/filesystem migrations to other hypervisors or data centers, large OS updates, or even migration between clusters.

This article provides example steps to apply large changes like these safely, in a rolling fashion.

Prerequisites

  • A custom or imported cluster managed by Rancher, or an RKE/k3s cluster
  • Access to the nodes in the cluster with sudo/root
  • Permission to perform drain and delete actions on the nodes

If any workloads run with a single replica, it is ideal to configure at least 2 replicas wherever possible, so they remain available during the rolling change. The replicas are best scheduled on separate nodes; a preferred pod anti-affinity rule can help with this.
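As a sketch, a Deployment configured this way might look like the following; the name my-service, its labels, and the image are placeholders rather than anything from this article:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service            # hypothetical workload name
spec:
  replicas: 2                 # at least 2 replicas for availability during the change
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" rather than "required": spread replicas across nodes
          # when possible, but do not block scheduling if only one node fits
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: my-service
                topologyKey: kubernetes.io/hostname
      containers:
        - name: my-service
          image: nginx:1.25   # placeholder image
```

Using a preferred (rather than required) anti-affinity lets the scheduler co-locate replicas if no other node is available, which avoids blocking a drain entirely.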

Steps

While performing a rolling change to nodes you will need to determine a batch size: how many nodes you wish to take out of service at a time. It is recommended to start by performing the change on a single node as a canary, and to test that the change has the desired outcome before changing more nodes at once.

  1. If you wish to maintain the number of nodes in the cluster while performing the rolling change, you may wish to add new nodes at this point. This ensures that while nodes are out of service, the cluster retains at least the original number of available nodes.

  2. Drain the node. This can be done with kubectl drain <node>, or from the Rancher UI.

    This is particularly important to avoid disruption to services: by draining first, the node's pods are removed from service endpoints, stopped, started on other nodes in the cluster, and added back to their services safely.

    If there are pods using local storage (commonly emptyDir volumes) that should be evicted during the drain, the --delete-local-data flag (renamed to --delete-emptydir-data in kubectl v1.20+) will be needed. Beware: the data in these volumes will be lost.
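For example, the drain for a single node could be wrapped in a small helper; worker-1 is a placeholder node name, and the flags shown assume a recent kubectl (older versions use --delete-local-data instead of --delete-emptydir-data):

```shell
#!/bin/sh
# Build the drain command for one node so it can be reviewed before running.
drain_cmd() {
  # --ignore-daemonsets: DaemonSet pods cannot be evicted and are skipped.
  # --delete-emptydir-data: required when pods use emptyDir volumes (data is lost).
  # --timeout: give up if eviction has not finished within this window.
  echo "kubectl drain $1 --ignore-daemonsets --delete-emptydir-data --timeout=300s"
}

# Print the command for the canary node; execute it manually once reviewed.
drain_cmd "worker-1"
```

Printing the command first makes it easy to review (or log) exactly what will run against the cluster.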

  3. Optional: Delete the node from the cluster. This can be done with kubectl delete node <node>. It is needed for changes that cannot be performed on an existing node, such as changing its IP address or hostname, moving it to another cluster, or large configuration updates. Any pods and Kubernetes components running on the node will be removed.

    Note: if this is an etcd node, ensure that the cluster will retain quorum, with at least two etcd nodes remaining to maintain HA, before performing this step.
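A minimal sketch of such a guard, assuming the Rancher/RKE convention of labelling etcd nodes with node-role.kubernetes.io/etcd=true; the quorum check itself is a pure helper so it can be reasoned about on its own:

```shell
#!/bin/sh
# Succeeds only if deleting one etcd node still leaves at least two.
enough_etcd_remaining() {
  # $1 = number of etcd nodes before the deletion
  [ $(( $1 - 1 )) -ge 2 ]
}

# Guard the deletion with the check above. Uses the (assumed) etcd node label;
# "worker-1" is a placeholder node name. Not executed here:
#   count=$(kubectl get nodes -l node-role.kubernetes.io/etcd=true --no-headers | wc -l)
#   enough_etcd_remaining "$count" && kubectl delete node worker-1
```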

  4. Optional: If the node was deleted in step 3, cleaning the node is important to ensure that all previous state from the cluster (CNI devices, volumes, and containers) is removed. This is especially important if the node is to be re-used in another cluster.
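A sketch of such a clean-up, based on the directories Rancher documents for extended node clean-up; the exact list depends on the distribution and CNI in use, so treat it as a starting point. The root parameter exists so the function can be exercised against a scratch directory rather than a live node:

```shell
#!/bin/sh
# Remove cluster state directories under $1. Pass "/" on a real node (as root),
# or a scratch directory when trying the function out safely.
clean_node() {
  root="$1"
  [ -n "$root" ] || { echo "usage: clean_node <root>" >&2; return 1; }
  for dir in etc/cni etc/kubernetes opt/cni opt/rke \
             var/lib/calico var/lib/cni var/lib/etcd var/lib/kubelet \
             var/lib/rancher var/run/calico; do
    rm -rf "${root%/}/${dir}"
  done
}

# On the node itself, after stopping the kubelet and container runtime:
# clean_node /
```

On a real node, containers, images, and leftover CNI network interfaces need to be removed separately; the directory list above only covers on-disk state.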

  5. Perform the changes to the node. This could be automated with configuration management, scripted, or done with manual steps.

  6. Once step 5 is complete, add the node back to the desired cluster.

    • In a custom cluster, this can be done with the docker run command supplied in the Rancher UI
    • For an imported cluster, the steps differ:
      • RKE: add the node back by configuring it in the cluster.yml file, followed by an rke up
      • k3s: re-install k3s using the correct flags/variables
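For the k3s case, the documented install script can be re-run with the server URL and node token; the values below are placeholders, and the helper only prints the command rather than executing it:

```shell
#!/bin/sh
# Build the k3s agent re-install command. K3S_URL and K3S_TOKEN are the
# documented environment variables for joining an agent to a server.
k3s_join_cmd() {
  # $1 = server URL, $2 = node token
  #      (found at /var/lib/rancher/k3s/server/node-token on the server)
  echo "curl -sfL https://get.k3s.io | K3S_URL=$1 K3S_TOKEN=$2 sh -"
}

# Placeholder values; run the printed command on the node being re-added.
k3s_join_cmd "https://k3s-server:6443" "<node-token>"
```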
  7. Test the node with running workloads, and monitor before proceeding with the next node or a larger batch of nodes.
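One way to automate part of that check is to wait until the node reports Ready before uncordoning it (if it was only drained) and moving on; the status parser is a separate helper so it can be tested without a cluster, and worker-1 is a placeholder:

```shell
#!/bin/sh
# Succeeds if a `kubectl get nodes --no-headers` line shows STATUS "Ready".
node_status_ready() {
  # $1 = one line of output, e.g. "worker-1   Ready   worker   5m   v1.26.4"
  [ "$(echo "$1" | awk '{print $2}')" = "Ready" ]
}

# Poll until the node is Ready, then allow scheduling again. Not executed here:
#   until node_status_ready "$(kubectl get node worker-1 --no-headers)"; do sleep 5; done
#   kubectl uncordon worker-1
```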

  8. If additional nodes were added in step 1, these can be removed from the cluster at this point by following steps 2, 3, and 4.
