Troubleshoot 1: Unable to pinpoint exactly what is causing disk pressure on an AKS node.

In our AKS Development environment, we use the Standard_B4ms virtual machine (https://cloudprice.net/vm/Standard_B4ms) as the underlying machine for the nodes in our node pool named Platform.

Currently, we are seeing that we have disk pressure on this node.

As can be seen from the photos, disk usage is currently around 25 GB. The node's disk capacity is 32 GiB, which is about 34.36 GB (gigabytes).

That puts disk usage at approximately 72.76% (25 GB out of 34.36 GB), which leaves 27.24% of the disk space available. In the picture above, however, AKS states that disk usage is 83%.
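To double-check the arithmetic, here is a minimal sketch in shell. The function names `gib_to_gb` and `usage_percent` are my own helpers made up for illustration, not part of any tool, and the inputs are simply the 25 GB and 32 GiB read off the charts.

```shell
# Hypothetical helpers, only here to verify the numbers above.

# Convert GiB (2^30 bytes) to GB (10^9 bytes).
gib_to_gb() {
  awk -v gib="$1" 'BEGIN { printf "%.2f\n", gib * 1024^3 / 1e9 }'
}

# Percentage used, given usage in GB and capacity in GiB.
usage_percent() {
  awk -v used_gb="$1" -v cap_gib="$2" 'BEGIN {
    cap_gb = cap_gib * 1024^3 / 1e9
    printf "%.2f\n", used_gb / cap_gb * 100
  }'
}

gib_to_gb 32          # prints 34.36
usage_percent 25 32   # prints 72.76
```

So the 72.76% follows from the numbers on the chart. The 83% must come from a different used or total figure than the two I plugged in here.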

Why this difference? I am not completely sure, to be honest.

That's the first thing I am going to find out.

5 minutes later: The reason for this discrepancy is still not clear, but for now I am going to assume there are just additional processes that Azure doesn't take into account when drawing that disk chart. Also, our Prometheus instance shows disk usage at 83%, so we are going with that number.

Now, let the real troubleshooting commence.

We have to find out which pod uses the most disk space in our Development environment, since I want to kill... sorry, destroy... still too extreme... stop that pod.

A quick background on our setup at Sensey:

Note: We like to use other people's open source software for our own needs, while we are an evil company that only sells SaaS and doesn't contribute to open source.

With this additional note in mind, this is our software stack:

Rancher (For shared Kubernetes usage. Local access is disabled. Everything must go through our Rancher instance.)
pgAdmin (Database movements. Local connections are closed in Development, Staging, and Production.)
Prometheus/Grafana Stack (Metric Analysis)
Elastic (Logging)
Keycloak (Our Identity Broker)
Traefik (Our Ingress Controller)

Now we must find the application pod that uses most of this disk space. But... how?

Assume, again, that we are not paying for any SaaS and that we have no visibility into this. How do we approach it?

We can SSH into the node.

I have SSH’ed into the node with the Lens IDE, and this is my output:

In the photo above, you can see that I immediately ran the command:

df -h

The pod filesystems can be found with an easy grep:

df -h | grep overlay
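
A plain grep also matches any line that merely contains the word "overlay", so here is a slightly stricter sketch. It assumes the POSIX output format of `df -P` (six columns, mount point last, no spaces in the path). The function name `overlay_mounts` is made up for this post.

```shell
# Hypothetical helper: reads `df -P` output on stdin and prints
# "<use%> <mount point>" for every row whose filesystem column is
# exactly "overlay", highest usage first.
overlay_mounts() {
  awk '$1 == "overlay" { print $5, $6 }' | sort -rn
}

# On the node itself this would be used as:
#   df -P | overlay_mounts
```

One caveat, as far as I know: an overlay mount tends to report the numbers of the filesystem that backs it, so seeing many overlay rows with the same percentage is expected. It tells us where the container filesystems are mounted, not yet which pod is the heavy one.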