Many rancher-agent containers running on a Rancher v2.x provisioned Kubernetes cluster, where stopped containers are regularly deleted on hosts

Issue

On a Rancher v2.x provisioned cluster, a host shows a large number of containers running the rancher-agent image, per the following output of docker ps | grep rancher-agent:

$ docker ps | grep rancher-agent
...
aeffe9725521        rancher/rancher-agent:v2.3.3         "run.sh --server htt…"   About a minute ago   Up About a minute                       sleepy_hopper
130120f49b71        rancher/rancher-agent:v2.3.3         "run.sh --server htt…"   6 minutes ago        Up 6 minutes                            stoic_hypatia
498b923d9b6e        rancher/rancher-agent:v2.3.3         "run.sh --server htt…"   11 minutes ago       Up 11 minutes                           laughing_elbakyan
3453865e5f70        rancher/rancher-agent:v2.3.3         "run.sh --server htt…"   16 minutes ago       Up 16 minutes                           wonderful_gagarin
f925209cd16a        rancher/rancher-agent:v2.3.3         "run.sh --server htt…"   21 minutes ago       Up 21 minutes                           silly_shannon
7d7fb5d4bf04        rancher/rancher-agent:v2.3.3         "run.sh --server htt…"   26 minutes ago       Up 26 minutes                           gifted_elgamal
...

A docker inspect <container_id> for these containers shows the Path and Args are of the following format:

"Path": "run.sh",
"Args": [
    "--server",
    "https://167.172.96.240",
    "--token",
    "gwrp7zlnwvsnzh2nhbvwcgdw45ccv6cq9pztzdd92j6xlv69xxhvnp",
    "--ca-checksum",
    "bbc8c7ca05c87a7140154554fa1a516178852f2710538c57718f4c874c29533c",
    "--no-register",
    "--only-write-certs"
],
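
To reproduce this check and extract just these two fields, a Go template can be passed to docker inspect, for example (a sketch; substitute one of the container IDs from the docker ps output above):

$ docker inspect --format '{{.Path}} {{json .Args}}' <container_id>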

Pre-requisites

  • A Rancher v2.x provisioned Kubernetes cluster, using either custom nodes or nodes hosted in an infrastructure provider.
  • Repeated deletion of stopped containers on hosts in the cluster, e.g. use of docker system prune, either manually or as part of an automated process such as a cronjob (an example is shown below).
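
For illustration, a scheduled prune of the following form (a hypothetical crontab entry, not taken from any particular environment) would repeatedly delete stopped containers and so trigger the behaviour described below:

# Hypothetical example: a nightly prune at 02:00 removes all stopped containers
0 2 * * * /usr/bin/docker system prune -f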

Root cause

This behaviour is a result of the issue tracked in Rancher GitHub issue #15364.

On a Rancher provisioned Kubernetes cluster, a share-mnt container is created on each host; it exits upon completion, but is not removed, so that it can be invoked again.

If the share-mnt container is removed, the Rancher node-agent Pod on the host spawns a new one. Upon starting, the share-mnt process spawns a rancher-agent container to write certificates. This agent container runs indefinitely, until the node-agent is triggered to reconnect to the Rancher server or the node-agent process is restarted.

As a result, where the share-mnt container on a host is removed repeatedly, whether manually or by an automated process, multiple running rancher-agent containers accumulate.
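
The presence of the share-mnt container on a host can be checked with a name filter, along the following lines (a sketch; the container is typically named share-mnt, but confirm the name on your hosts):

$ docker ps -a --filter "name=share-mnt"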

Workaround

To trigger automatic removal of the rancher-agent containers, restart the node-agent container on the host: identify the running agent container with docker ps | grep k8s_agent_cattle-node, then restart it with docker restart <container_id>.
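
As a one-line sketch, assuming exactly one node-agent container is running on the host, the two steps can be combined as follows (the name filter matches the k8s_agent_cattle-node naming shown above):

$ docker restart $(docker ps --filter "name=k8s_agent_cattle-node" --format "{{.ID}}")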

In addition, you can prevent further creation of multiple rancher-agent container instances by removing whichever process is triggering the deletion of stopped containers.
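
For example, if a cronjob is suspected as the trigger, the host's cron configuration can be searched for prune commands along these lines (a sketch; adjust the paths for your distribution and scheduler):

$ crontab -l | grep -i "docker system prune"
$ grep -ri "docker system prune" /etc/cron.d /etc/crontab /etc/cron.daily 2>/dev/null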

Resolution

An enhancement request to prevent the creation of multiple long-running rancher-agent containers in the event of repeated deletion of the share-mnt container is tracked in Rancher GitHub issue #15364.
