How to recover a cluster when all control plane nodes have failed

Task

In a disaster recovery scenario, the control plane and etcd nodes managed by Rancher in a downstream cluster may no longer be available or functioning. The cluster can be rebuilt by adding control plane and etcd nodes again, followed by restoring from an available snapshot.

Prerequisites

  • A cluster built by Rancher v2.x or the Rancher Kubernetes Engine CLI (RKE)
  • Nodes to add to the cluster with control plane and etcd roles with adequate resources
  • An offline copy of a snapshot to be used as the recovery point, often stored in S3 or copied off node filesystems to a backup location

Note: This article assumes that all control plane and etcd nodes are no longer functional and/or cannot be repaired via any other means, like a VM snapshot restore.

Steps

To recover the downstream cluster, any existing nodes with the control plane and/or etcd roles must be removed. Worker nodes can remain in the cluster, and these may continue to operate with running workloads.

Use the following steps as a guideline to recover the cluster. From this point on, the cluster that experienced the disaster is referred to as the downstream cluster.

  1. As a precaution, it's recommended to take a snapshot of the Rancher local cluster. Please see the documentation for the appropriate way to take a snapshot for the Rancher installation.

    Alternatively the rancher-backup operator can be used to backup all of the related objects for restoration.
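    As a sketch of the rancher-backup alternative, assuming the operator is already installed in the Rancher local cluster, a one-time backup can be created with a Backup object. The backup name below is illustrative:

    ```shell
    # Create a one-time backup with the rancher-backup operator.
    # Assumes kubectl is pointed at the Rancher local cluster and the
    # operator is installed; "pre-recovery-backup" is a placeholder name.
    kubectl apply -f - <<EOF
    apiVersion: resources.cattle.io/v1
    kind: Backup
    metadata:
      name: pre-recovery-backup
    spec:
      resourceSetName: rancher-resource-set
    EOF

    # Watch for the backup to complete
    kubectl get backups.resources.cattle.io pre-recovery-backup
    ```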

  2. Delete all nodes with the control plane and/or etcd roles from the downstream cluster in the Rancher UI.

    The delete action can fail while the downstream cluster is in this condition. If a node is not removed, follow these steps to remove it from the cluster:
      1. Click on the node, select View in API, and click the delete button for the object.
      2. If this does not succeed, use kubectl or the Cluster Explorer for the Rancher local cluster to edit the corresponding nodes.management.cattle.io object, in the namespace that matches the downstream cluster ID, and remove the finalizers field.
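    The finalizer removal can be sketched with kubectl as follows, assuming a kubeconfig for the Rancher local cluster; the cluster ID `c-abc123` and node object name `m-xyz789` are placeholders to look up with the first command:

    ```shell
    # List the node objects for the downstream cluster (namespace = cluster ID)
    kubectl get nodes.management.cattle.io -n c-abc123

    # Clear the finalizers on the stuck node object so the delete can complete
    kubectl patch nodes.management.cattle.io m-xyz789 -n c-abc123 \
      --type=merge -p '{"metadata":{"finalizers":[]}}'
    ```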

  3. Add a clean node back to the cluster with the all role (control plane, etcd, worker). The IP address does not have to match any of the previous nodes. If the node has previously been used in a cluster, use the extended cleanup script steps to remove any previous configuration.

    The newly added node will fail to fully register with the downstream cluster and will not proceed past "Waiting to register with Kubernetes"; this is expected at this stage.

  4. Copy the snapshot into place on the new node, under the /opt/rke/etcd-snapshots directory structure.

    The filename must match a snapshot name in the list of snapshots shown in the Rancher UI for the downstream cluster. Any snapshot should be usable; if the name differs, rename the file to match one of the known snapshots in the list.
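    Copying the snapshot into place might look like the following, assuming SSH access to the new node; the hostname, local filename, and target snapshot name are placeholders for illustration:

    ```shell
    # Copy the offline snapshot to the new node
    scp ./recovery-snapshot.zip user@new-node:/tmp/

    # On the new node: create the directory and rename the file to match
    # a snapshot name known to Rancher for this cluster
    sudo mkdir -p /opt/rke/etcd-snapshots
    sudo mv /tmp/recovery-snapshot.zip \
      "/opt/rke/etcd-snapshots/<known-snapshot-name-from-rancher-ui>"
    ```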

  5. Initiate a snapshot restore from Rancher UI using the same snapshot name used in the previous step.

  6. Monitor the Rancher pod logs for progress.

    To follow all pod logs at once, use a kubeconfig for the Rancher local cluster with this kubectl command:

      kubectl logs -n cattle-system -l app=rancher -f -c rancher

  7. Once the new node reaches the Active state, check the cluster health, then add additional nodes by repeating step 3 when ready. The additional nodes can be added with only the control plane and etcd roles if desired.
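    A quick health check from the recovered downstream cluster can be sketched as follows, assuming a downstream kubeconfig downloaded from the Rancher UI:

    ```shell
    # All nodes should report Ready, including the surviving workers
    kubectl get nodes -o wide

    # Core components (etcd, kube-apiserver, coredns, etc.) should be Running
    kubectl get pods -n kube-system
    ```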

As a follow-up, once all desired nodes are added and the cluster is healthy, the control plane and etcd node roles can be adjusted as needed. For example, if the all role is no longer needed, replace each such node by removing it and adding it again with the desired roles, in a rolling fashion.
