Issue - Nodes are not added to Rancher or are not provisioned correctly
The following article should help Rancher administrators diagnose and troubleshoot cases where a node is not added to Rancher or is not provisioned correctly. It outlines the process nodes go through when they are added to a cluster.
First, the scope: this document applies to custom clusters and to clusters launched with a node driver. "Node driver" here is synonymous with the 'With RKE and new nodes in an infrastructure provider' option in the Rancher UI.
Tracing the steps during the bootstrapping of a node.
Whether you're creating a custom cluster or a cluster launched with a node driver, nodes are added to the cluster by executing a docker run command generated for that cluster. For a custom cluster, the command is generated and displayed on the final step of cluster creation. For a cluster launched with a node driver, the command is generated and executed as the final command after the node is created and Docker is installed.
Note: not all roles may be present in the generated command, depending on what role(s) is/are selected.
sudo docker run -d \
  --privileged \
  --restart=unless-stopped \
  --net=host \
  -v /etc/kubernetes:/etc/kubernetes \
  -v /var/run:/var/run \
  rancher/rancher-agent:<version> \
  --server https://<server_url> \
  --token <token> \
  --ca-checksum <checksum_value> \
  --etcd \
  --controlplane \
  --worker
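Because the role flags vary with the roles selected, it can help to see how the command is put together. The following is a minimal sketch, not the code Rancher uses to generate the command: the function name build_agent_command and its argument order are hypothetical, and only the flags shown in the command above are used.

```shell
#!/bin/sh
# Hypothetical helper (not part of Rancher): prints the registration command
# for the given roles.
# Arguments: server URL, token, CA checksum, agent version, then one or more roles.
build_agent_command() {
    server=$1; token=$2; checksum=$3; version=$4
    shift 4
    role_flags=""
    for role in "$@"; do
        case "$role" in
            # Only the three role flags from the generated command are accepted.
            etcd|controlplane|worker) role_flags="$role_flags --$role" ;;
            *) echo "unknown role: $role" >&2; return 1 ;;
        esac
    done
    if [ -z "$role_flags" ]; then
        echo "at least one role is required" >&2
        return 1
    fi
    # role_flags already starts with a space, so it is appended directly.
    echo "sudo docker run -d --privileged --restart=unless-stopped --net=host" \
         "-v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run" \
         "rancher/rancher-agent:$version --server $server --token $token" \
         "--ca-checksum $checksum$role_flags"
}
```

Calling the function with only the worker role prints a command that ends in --worker and contains neither --etcd nor --controlplane, which is the shape to expect when comparing against the command shown in the Rancher UI.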
What happens next:
The docker run command launches a bootstrap agent container, which is identified by a randomly generated name:
- The entrypoint is a shell script which parses the flags and runs some validation tests on said flags and their provided values
- A token is then used to authenticate against your Rancher server in order to interact with it.
- The agent retrieves the CA certificate from the Rancher server and places it in /etc/kubernetes/ssl/certs/serverca. The checksum is then used to validate that the CA certificate retrieved from Rancher matches. This only applies when a self-signed certificate is in use.
- The script runs an agent binary, which connects to Rancher using a WebSocket connection
- The agent then checks in with the Rancher server to see if the node is unique, and gets a node plan
- The agent executes the node plan provided by the Rancher server
- The docker run command will create the path /etc/kubernetes if it doesn't exist
- Rancher will run cluster provisioning/reconcile based on the desired role for the node being added (etcd and control plane nodes only). This process will copy certificates down from the server via the built-in RKE cluster provisioning.
- On worker nodes, the process is slightly different. The agent requests a node plan from the Rancher server. The Rancher server generates the node config and sends it back down to the agent. The agent then executes the plan contained in the node config. This involves certificate generation for the Kubernetes components and the container create commands for the following services: kubelet, kube-proxy, and nginx-proxy.
- The Rancher agent uses the node plan to write out a cloud-config to configure cloud provider settings.
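When the checksum validation step fails, it can be useful to repeat the comparison by hand. The sketch below assumes the value passed to --ca-checksum is the SHA-256 digest of the CA certificate file; that is an assumption to verify against your Rancher version. The function names ca_checksum and ca_checksum_matches are hypothetical.

```shell
#!/bin/sh
# Prints the SHA-256 digest of a file (assumed here to hold the CA certificate).
ca_checksum() {
    sha256sum "$1" | awk '{print $1}'
}

# Returns 0 when the digest of the file equals the expected checksum,
# 1 when it differs or the file cannot be read.
ca_checksum_matches() {
    [ -r "$1" ] || return 1
    actual=$(ca_checksum "$1")
    [ "$actual" = "$2" ]
}
```

On a node, this could be run as ca_checksum_matches /etc/kubernetes/ssl/certs/serverca "<checksum_value>", using the value from the generated docker run command. A mismatch suggests the node retrieved a different CA certificate than the one the command was generated for.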
If provisioning of the node succeeds, the node registers with the Kubernetes cluster and a cattle-node-agent DaemonSet pod is scheduled to the node. That pod then removes and replaces the agent container that was created via the docker run command.
share-mnt binary (aka bootstrap phase 2)
- The share-mnt container runs share-root.sh, which creates filesystem resources that other containers end up using: certificate folders, configuration files, etc.
- The container spins up another container that runs a share mount binary. This container makes sure /var/lib/kubelet or /var/lib/rancher have the right share permissions for systems like boot2docker.
Note: All Kubernetes control plane components talk directly with the Kubernetes API server that's housed on the same node. The nginx-proxy is configured to front all Kubernetes API servers within the cluster. Its nginx.conf should reflect that.
- If all goes well, the share-mnt bootstrap and share-root containers exit and the share-root container gets removed. The kubelet starts and registers with Kubernetes, and the cattle-node-agent DaemonSet schedules a pod. The pod should then take over the WebSocket connection to the Rancher server. This should end our provisioning journey and hopefully lead to a functional, happy cluster.
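To see how far a worker node got through the steps above, one option is to check whether the services created by the node plan are running. The sketch below is a hypothetical helper, not a Rancher tool. It assumes the containers are named exactly kubelet, kube-proxy, and nginx-proxy, matching the service names listed earlier.

```shell
#!/bin/sh
# Reads container names from stdin (one per line) and prints each expected
# worker service that is not among them. Returns 1 if any service is missing.
missing_worker_services() {
    names=$(cat)
    status=0
    for svc in kubelet kube-proxy nginx-proxy; do
        # -x matches whole lines only, -F treats the name as a fixed string.
        if ! printf '%s\n' "$names" | grep -qxF -- "$svc"; then
            echo "$svc"
            status=1
        fi
    done
    return $status
}
```

On the node, the names can be supplied with docker ps --format '{{.Names}}' | missing_worker_services. Any name printed points at the service whose container create command did not complete, which narrows down where to look in the agent container's logs.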