Rancher log forwarding to an Elasticsearch endpoint stops functioning as a result of connection reload behaviour in Rancher v2.0 - v2.2, v2.3 prior to v2.3.8, and v2.4 prior to v2.4.4


Issue

In Rancher v2.0 - v2.2, v2.3 prior to v2.3.8, and v2.4 prior to v2.4.4, a previously functioning log forwarding configuration to an Elasticsearch instance can stop forwarding logs, without any configuration change and whilst the Elasticsearch endpoint is still available. When this happens, the logs of the rancher-logging-fluentd Pod(s) in the cattle-logging Namespace of the affected cluster contain messages of the following format:

failed to flush the buffer, retry_time=0, next_retry_seconds=2019-07-24 07:07:31 +0000, chunk=58e67cdcd7d1406de13fe55a26fe6cad, error_class=Fluent::Plugin::ElasticSearchOutput::RecoverableRequestFailure error="could not push logs to ElasticSearch cluster ({:host=>elasticsearch.example.com, :port=>443, :scheme=>\"https\"}): connect_write timeout reached"

or

failed to flush the buffer. retry_time=10 next_retry_seconds=2019-07-24 07:07:31 +0000 chunk="58e67cdcd7d1406de13fe55a26fe6cad" error_class=Elasticsearch::Transport::Transport::Error error="Cannot get new connection from pool."
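
To check a larger log capture for these two messages, a small script can help. The following is a minimal sketch in Python, not part of Rancher or fluentd. The function name find_reload_failures and the signature labels are hypothetical, and the matched substrings are taken directly from the two example messages above.

```python
# Substrings copied from the two example fluentd messages in this article.
# The dictionary keys are hypothetical labels chosen for this sketch.
SIGNATURES = {
    "connect_write_timeout": "connect_write timeout reached",
    "connection_pool_exhausted": "Cannot get new connection from pool.",
}


def find_reload_failures(log_text):
    """Return a list of (line_number, label) for each matching log line.

    Only lines that report a failed buffer flush are considered, so that
    unrelated messages containing similar text are ignored.
    """
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if "failed to flush the buffer" not in line:
            continue
        for label, needle in SIGNATURES.items():
            if needle in line:
                hits.append((number, label))
    return hits
```

The function takes the log text as a string, for example the saved output of kubectl logs for one of the rancher-logging-fluentd Pods in the cattle-logging Namespace. An empty result does not rule out other logging failures; it only means neither of the two messages above was found.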

Pre-requisites

  • A Rancher v2.x instance, running Rancher v2.0 - v2.2, v2.3 prior to v2.3.8, or v2.4 prior to v2.4.4
  • A Rancher managed cluster with Rancher log forwarding configured to an Elasticsearch endpoint

Root cause

By default, the fluent-plugin-elasticsearch fluentd plugin attempts to reload the host list from Elasticsearch after 10,000 requests. This behaviour is a result of default functionality in the elasticsearch-ruby gem, as documented in the plugin's FAQ. The reload is not compatible with all Elasticsearch environments, and when it fails the plugin stops forwarding further log events.

Resolution

From Rancher v2.3.8 in the v2.3 line, and from v2.4.4 in the v2.4 line, the Rancher log forwarding configuration for Elasticsearch endpoints includes the option reload_connections false. This disables the default connection reload behaviour, preventing occurrences of this issue.
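
To confirm that a given fluentd configuration carries this option, the configuration text can be inspected for the reload_connections directive. The following is a minimal sketch in Python under stated assumptions: the function names are hypothetical, the SAMPLE configuration is an illustrative fragment rather than the exact configuration Rancher generates, and how the configuration text is obtained from the cluster is left to the reader.

```python
import re

# Matches an uncommented "reload_connections <value>" directive on its own line.
DIRECTIVE = re.compile(r"^\s*reload_connections\s+(\S+)\s*$", re.MULTILINE)

# Illustrative fragment only; not the exact configuration Rancher generates.
SAMPLE = """
<match **>
  @type elasticsearch
  host elasticsearch.example.com
  port 443
  scheme https
  reload_connections false
</match>
"""


def reload_connections_values(config_text):
    """Return every value given to reload_connections, lower-cased."""
    return [m.group(1).lower() for m in DIRECTIVE.finditer(config_text)]


def reload_disabled(config_text):
    """Return True only if the directive is present and always set to false."""
    values = reload_connections_values(config_text)
    return bool(values) and all(value == "false" for value in values)
```

This check is deliberately simple: it does not verify which output block the directive belongs to, so a configuration with several outputs still needs a manual review.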
