This article provides steps to gather and analyse data to help troubleshoot ingress performance issues. It applies to the following environment:
- A cluster built by Rancher v2.x or Rancher Kubernetes Engine (RKE)
- Use of the Nginx Ingress Controller
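Before following the steps below, it can help to confirm the ingress-nginx pods are running and carry the `app=ingress-nginx` label used by the log commands in this article (the label may differ if the controller was deployed by other means):
kubectl -n ingress-nginx get pods -o wide --show-labels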
Review request times
To narrow down which requests are taking the longest, analysing the ingress-nginx logs is very helpful.
Retrieve requests with a high (>2s) `upstream_response_time`. This log field represents the time taken for the response from the upstream target - the pod endpoint in the service.
kubectl logs -n ingress-nginx -l app=ingress-nginx -f --tail=2000 | awk '/- -/ && $(NF-2)>2.0'
The same can be done for `request_time`, which represents the time taken to complete the entire request, including the `upstream_response_time` above:
kubectl logs -n ingress-nginx -l app=ingress-nginx -f --tail=2000 | awk '/- -/ && $(NF-7)>2.0'
Please adjust the time to suit, where `>2.0` will filter for any times greater than 2.0 seconds.
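The awk field positions above assume the default ingress-nginx log-format-upstream; if a custom log format has been configured on the controller, adjust the field indices to match. For reference, the default format in recent ingress-nginx versions is similar to the following, which places `request_time` at `$(NF-7)` and `upstream_response_time` at `$(NF-2)`:
$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_length $request_time [$proxy_upstream_name] [$proxy_alternative_upstream_name] $upstream_addr $upstream_response_length $upstream_response_time $upstream_status $req_id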
Comparing the difference in timings between `request_time` and `upstream_response_time` can help to understand the issue further:
- Locate any potential upstream targets (pods), or the nodes these may be running on, that are frequently associated with a higher `upstream_response_time`
- If all upstream targets in a particular ingress/service are experiencing higher response times:
  - What dependencies does the application have? For example, external APIs, databases, other services, etc.
  - Investigate the application logs
  - Simulate the same requests directly to the pods to bypass ingress-nginx (see the sketch after this list); are they also slow?
- If the `upstream_response_time` is much lower than `request_time`, the time is being spent elsewhere; check for any tuning, performance, or resource issues on the nodes
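To see where the largest gap is, the two fields can be compared directly. A minimal sketch, assuming the default log format as above, that only prints requests where more than one second is spent outside of the upstream response:
kubectl logs -n ingress-nginx -l app=ingress-nginx --tail=2000 | awk '/- -/ && ($(NF-7) - $(NF-2)) > 1.0'
To simulate a request directly against a pod and bypass ingress-nginx, a port-forward can be used. A minimal sketch, where `<namespace>`, `<pod-name>`, `<pod-port>` and `/path` are placeholders for the affected workload:
kubectl -n <namespace> port-forward pod/<pod-name> 8080:<pod-port>   # run in a separate terminal
curl -o /dev/null -s -w 'total: %{time_total}s\n' http://localhost:8080/path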
The `request_time` metric is also used to create the ingress controller graphs when Cluster Monitoring is enabled.
Review request details
Along with the output in the previous step, it is also useful to analyse the request details for common patterns, such as the request itself, the source/destination IP addresses, the response code, the user agent, and the unique name of the ingress.
You may need to review these with the related application teams. For example, a request that retrieves a large amount of data or performs a complex query may genuinely take a long time; these can potentially be ignored.
Some requests may be opening a websocket; if the service scales up/down regularly, a small number of upstream targets could hold long-running connections, causing an unfair distribution of load on these targets.
It's also worthwhile to consider the time when the issue occurs, the number of pods in the service, performance metrics, and the resource requests/limits in place. For example, do the requests occur during a peak load time? Is HPA configured to scale the deployment? Is monitoring data available to identify trends and correlate with the logs?
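As a starting point, the following commands show the current pod count, HPA status, and live resource usage for the affected workload; `<namespace>` is a placeholder, and `kubectl top` requires metrics-server to be available in the cluster:
kubectl -n <namespace> get pods -o wide   # pod count and node placement
kubectl -n <namespace> get hpa            # HPA status and current/target replicas
kubectl -n <namespace> top pods           # live CPU/memory usage (requires metrics-server)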
Check ingress-nginx logs
Having focused on the requests themselves so far, it is also useful to exclude the access logs and check that there are no fundamental issues with ingress-nginx itself.
The following command should exclude all access.log output, retrieving output from the ingress controller and the nginx error.log only.
kubectl logs -n ingress-nginx -l app=ingress-nginx -f --tail=100 | awk '!/- -/'
Please adjust the `--tail` flag as needed; this example retrieves the last 100 lines from each ingress-nginx pod.
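To narrow this output down further, the non-access-log output can be filtered for common problem keywords; this is only a rough filter, so adjust the pattern to suit:
kubectl logs -n ingress-nginx -l app=ingress-nginx --tail=2000 | awk '!/- -/' | grep -iE 'error|warn|timeout|failed'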
Real-time view of all requests
kubectl logs -f -n ingress-nginx -l app=ingress-nginx --tail=2000 | goaccess --log-format="%h - - [%d:%t] \"%m %r %H\" %s %b \"%R\" \"%u\" %^ %T [%v]" --time-format '%H:%M:%S %z' --date-format "%d/%b/%Y"
Please adjust the history of logs with the `--tail` flag as needed. Note that this pipes the access logs into goaccess, which needs to be installed wherever the command is run.
Measure requests to ingress-nginx
If you have ruled out the areas covered so far, it might be worthwhile to focus on the Load Balancer or network devices that provide client access to ingress-nginx.
The following articles contain curl commands to perform SNI-compliant requests and measure statistics. These requests can also be compared against the ingress-nginx logs (as above) to understand what portion of the time was spent with ingress-nginx handling the request.
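As a minimal sketch of such a request, the following measures the main phases of a single HTTPS request; `www.example.com` and `<load-balancer-ip>` are placeholders for your ingress hostname and entry point, and `--resolve` sends the request to the given IP while still presenting the correct SNI hostname:
curl -o /dev/null -s --resolve www.example.com:443:<load-balancer-ip> -w 'dns: %{time_namelookup}s connect: %{time_connect}s tls: %{time_appconnect}s ttfb: %{time_starttransfer}s total: %{time_total}s\n' https://www.example.com/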
You may also be able to obtain metrics from your Load Balancer or infrastructure to troubleshoot this further.