I have the alert 'Node Disk is Running Full in 24 Hours', what does this mean?


Issue

When using Rancher Alerting you may see the alert 'Node Disk is Running Full in 24 Hours' and be unsure what it means and what its impact is.

Background

The alert is an early warning for potentially more serious issues with the disk space on a node becoming full.

It uses the following equation:

predict_linear(node_filesystem_free_bytes{mountpoint!~"^/etc/(?:resolv.conf|hosts|hostname)$"}[6h], 3600 * 24) < 0

predict_linear fits a linear trend to the last 6 hours of free-space samples and extrapolates 24 hours (3600 * 24 seconds) into the future; the alert fires when the predicted free space drops below zero, i.e. when, at the current rate of consumption, the disk would have no available space left within 24 hours.
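To make the arithmetic concrete, here is a minimal shell sketch of the same linear extrapolation that predict_linear performs, simplified to two free-space samples taken one hour apart (the byte values are made up for illustration; the real rule fits a regression over the full 6-hour window):

```shell
# Two hypothetical samples of node_filesystem_free_bytes, one hour apart.
free_then=50000000000   # free bytes 1 hour ago (example value)
free_now=48000000000    # free bytes now (example value)

# Rate of loss per second over the sampling window (3600 s).
loss_per_sec=$(( (free_then - free_now) / 3600 ))

# Extrapolate 24 hours (3600 * 24 seconds) ahead, as the alert rule does.
predicted=$(( free_now - loss_per_sec * 3600 * 24 ))
echo "predicted free bytes in 24h: $predicted"

# The alert fires when the prediction drops below zero.
if [ "$predicted" -lt 0 ]; then
    echo "alert would fire"
else
    echo "alert would not fire"
fi
```

With these example numbers the node is losing roughly 2GB per hour, so it would run dry in about 24 hours; the prediction lands just above zero and the alert would be on the verge of firing.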

It is important to investigate the affected node for potential causes such as excessive logging, under-provisioned disk space, or another runaway consumer of disk.

Impact

If the node does run out of disk space, this will trigger a 'DiskPressure' condition on the node.

When a node experiences DiskPressure, the kubelet begins evicting running pods to reclaim space, and the node becomes unschedulable until the requisite disk space is freed.

This is a serious situation, especially if the cause is a rogue container logging excessively.

If an over-logging or failing workload is rescheduled onto another node, the issue follows it, and nodes can become unschedulable one after another.

In a typical Kubernetes installation, DiskPressure is triggered when a node filesystem's available space drops below 10%, the kubelet's default hard eviction threshold (nodefs.available < 10%).
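As a quick sanity check, you can compare a filesystem's available percentage against that 10% default threshold yourself. A hedged sketch, assuming GNU coreutils df and checking the root filesystem:

```shell
# Compute the percentage of available space on / and compare it to the
# kubelet's default hard eviction threshold (nodefs.available < 10%).
# Requires GNU coreutils df for the --output option.
avail_pct=$(df --output=avail,size -k / | tail -1 | awk '{printf "%d", $1 * 100 / $2}')
echo "available on /: ${avail_pct}%"
if [ "$avail_pct" -lt 10 ]; then
    echo "below the default eviction threshold"
else
    echo "above the default eviction threshold"
fi
```

Note that the kubelet evaluates its own filesystems (e.g. where container images and writable layers live), which may be a different mount than /.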

Investigative Steps

The cause of this issue can vary heavily across different environments, but as the alert is node-specific you should start your investigation there.

The first place to look is df -h, which shows the percentage of disk space used on each of the node's filesystems; you may immediately spot a mount where space is running out.

This is the fastest way to assess the more urgent issues. Once you have identified the disk that is nearly full, you can take immediate precautions, such as clearing out old log files or increasing the size of the disk.
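A typical triage session might look like the following; the paths are common culprits on container hosts but will vary by distribution and container runtime:

```shell
# Show filesystem usage; look for any mount near 100%.
df -h

# Drill into common space consumers. /var/log and /var/lib/docker are
# frequent culprits on Docker-based nodes; adjust paths for your runtime.
du -sh /var/log/* 2>/dev/null | sort -h | tail -n 5 || true
du -sh /var/lib/docker 2>/dev/null || true
```

Running du -sh on progressively deeper directories quickly narrows the growth down to a single workload or log file.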

If you use Rancher Monitoring, you can also look at node-specific statistics graphed over time and assess when the issue began, correlating it with running workloads and other node logs.

During normal operation, an increase in logging or a gradual increase in storage used is often expected.

For example, if the alert fired because a specific workload hit issues, logged more, but then recovered and cleaned up its logs, you may no longer have a problem on your hands; you could still consider reducing the workload's logging to prevent further alerts.

'Node Disk is Running Full in 24 Hours' is a preemptive alert that always warrants investigation and understanding to ensure the best operational health of your nodes and clusters.

Short-Term Solutions

If you can identify the reason for this alert, the solution should hopefully be straightforward. You may need to delete some old files on a node, or reduce logging, for example if debug logging was left enabled.

If you do hit a DiskPressure event and need to recover, access the node directly and free up space manually. Once the requisite space has been reclaimed, restart either the node or Docker on the node so Kubernetes marks it schedulable again.
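When freeing space manually, truncating a large log file in place is often safer than deleting it: a process that still holds the file open would otherwise keep the space allocated until it restarts. A minimal demonstration of the pattern on a throwaway file (the real target would be something like an oversized container log under /var/lib/docker/containers):

```shell
# Create a throwaway file to stand in for an oversized log.
demo_log=$(mktemp)
head -c 1048576 /dev/zero > "$demo_log"   # 1 MiB of stand-in data
size_before=$(wc -c < "$demo_log")
echo "before: $size_before bytes"

# Truncate in place: the file stays present (and open file handles stay
# valid), but its contents are discarded and the space is freed.
truncate -s 0 "$demo_log"
size_after=$(wc -c < "$demo_log")
echo "after: $size_after bytes"

rm -f "$demo_log"
```

The same truncate -s 0 pattern applied to a real log frees the space immediately without disturbing the process writing to it.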

Long-Term Solutions

There are different solutions that will mitigate this alert:

  1. Having larger disks

While Rancher does not have specific disk-space requirements for a node, provisioning at least 30GB is recommended to reduce how often this alert occurs.

  2. Exporting container logs

Rancher provides a logging deployment you can configure to export your container logs, for example, to an in-house Elasticsearch Cluster.

While these logs are buffered locally, they are then shipped to the remote endpoint, reducing the amount of log data that accumulates on the node over time.

  3. Running regular cleanups on System Logs

While outside the scope of Rancher, a large amount of system logs can also contribute to this alert. You are encouraged to manage logging at the OS level, either with logrotate or by exporting logs.
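As an illustration, a logrotate policy for a hypothetical application log directory might look like the stanza below. The path, size limit and rotation count are example values to adapt; on a real node the policy would live under /etc/logrotate.d/ rather than a temporary file:

```shell
# Write an example logrotate policy to a temporary file and display it.
# The path and limits are illustrative only.
conf=$(mktemp)
cat > "$conf" <<'EOF'
/var/log/myapp/*.log {
    # rotate once a file exceeds 100 MB
    size 100M
    # keep at most 4 rotated copies
    rotate 4
    # gzip rotated logs to save space
    compress
    # skip silently if no logs are present
    missingok
    # do not rotate empty files
    notifempty
}
EOF
config_text=$(cat "$conf")
echo "$config_text"
rm -f "$conf"
```

For systemd-journald logs, capping the journal (for example with journalctl --vacuum-size=500M, or SystemMaxUse= in journald.conf) achieves a similar effect.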

Further reading

Kubernetes Out of Resource Handling Documentation: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
