How to recover an RKE v0.2.x, v0.3.x or v1.x.x cluster after restoration with an incorrect or missing rkestate file

Follow
Table of Contents

Issue

When using RKE (Rancher Kubernetes Engine) v0.2.x, v0.3.x, v1.0.x or v1.1.0, if you have restored a cluster with the incorrect or missing rkestate file you will end up in a state where your infrastructure pods will not start. This includes all pods in the kube-system, cattle-system and ingress-nginx namespaces. As a result of these stopped infrastructure pods, workload pods will not function correctly. If you find yourself in this situation you can use the directions below to fix the cluster. For more information about the cluster state file, please see RKE documentation on Kubernetes Cluster State.

Pre-requisites

  • RKE v0.2.x, v0.3.x, v1.0.x or v1.1.0
  • A cluster restoration performed with the incorrect or missing rkestate file

Workaround

  1. Delete all service-account-token secrets in kube-system, cattle-system and ingress-nginx namespaces:

    kubectl get secret -n cattle-system | awk '{ if ($2 == "kubernetes.io/service-account-token") system("kubectl -n cattle-system delete secret " $1) }'
    kubectl get secret -n kube-system | awk '{ if ($2 == "kubernetes.io/service-account-token") system("kubectl -n kube-system delete secret " $1) }'
    kubectl get secret -n ingress-nginx | awk '{ if ($2 == "kubernetes.io/service-account-token") system("kubectl -n ingress-nginx delete secret " $1) }'
    kubectl get secret -n cert-manager | awk '{ if ($2 == "kubernetes.io/service-account-token") system("kubectl -n cert-manager delete secret " $1) }'
  2. Restart Docker on all nodes in the cluster currently:

    systemctl restart docker
  3. Force delete all pods stuck in a CrashLoopBackOff, Terminating, Error and Evicted state:

    kubectl get po --all-namespaces | awk '{ if ($4 =="CrashLoopBackOff") system("kubectl delete po --force --grace-period=0 -n " $1 " " $2) }'
    kubectl get po --all-namespaces | awk '{ if ($4 =="Terminating") system("kubectl delete po --force --grace-period=0 -n " $1 " " $2) }'
    kubectl get po --all-namespaces | awk '{ if ($4 =="Error") system("kubectl delete po --force --grace-period=0 -n " $1 " " $2) }'
    kubectl get po --all-namespaces | awk '{ if ($4 =="Evicted") system("kubectl delete po --force --grace-period=0 -n " $1 " " $2) }'
  4. Once your force delete has finished, restart Docker again to clear out any stale containers from the above force delete command:

    systemctl restart docker
  5. You may have to delete service account tokens more than once or delete pods more than once. After you go through the guide once, monitor pod statuses with a watch command in one terminal as shown below.

    watch -n1 'kubectl get po --all-namespaces | grep -i  "cattle-system\|kube-system\|ingress-nginx\|cert-manager"'

    If you see any pods still in an error state, you can describe them to get idea of what is wrong. Most likely you'll see an error like the following which indicates that you need to delete its service account tokens again.

    Warning  FailedMount  7m23s (x126 over 4h7m)  kubelet, 18.219.82.148  MountVolume.SetUp failed for volume "rancher-token-tksxr" : secret "rancher-token-tksxr" not found
    Warning  FailedMount  114s (x119 over 4h5m)   kubelet, 18.219.82.148  Unable to attach or mount volumes: unmounted volumes=[rancher-token-tksxr], unattached volumes=[rancher-token-tksxr]: timed out waiting for the condition

    Delete the service account tokens again for that one namespace so that pods in other namespaces don't have to be disturbed if they are good. Once the service account tokens are deleted, run a delete pod command for just the namespace with pods still in an error state. cattle-node-agent and cattle-cluster-agent depend on the Rancher pod to be online, so you can ignore those until the very end. Once Rancher pods are stable, go back in and delete all the agents again to get them to restart more quickly.

Resolution

An update to enable successful restoration of an RKE provisioned cluster without the correct rkestate file is targetted for an RKE v1.1.x patch release. For more information please see RKE GitHub issue #1336.

Further Reading

Was this article helpful?
0 out of 0 found this helpful

Comments

0 comments

Please sign in to leave a comment.