A Tip for Deploying Prometheus

Over the last few months I’ve been deploying Prometheus and Grafana to monitor the production environment our customers use. The picture above shows the endpoints of the cloud services we depend on, each running a Prometheus exporter for metrics data.

The hardest part has been understanding the values.yaml file that ships with the Prometheus Helm chart. Here are a few tips I’ve picked up while deploying monitoring that could have saved me a ton of time from the get-go.

Alerting Rules to Live By

If an alert is not actionable, delete it.

If an alert is transient, tune the thresholds until it only fires on an actionable state, meaning something is REALLY down (5 consecutive failed checks in 5 minutes, or similar).
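
As a rough sketch of the "5 failed checks in 5 minutes" idea, a Prometheus alerting rule along these lines would do it. It assumes a 1-minute scrape interval, so `for: 5m` works out to roughly five consecutive failed scrapes. The group name, the `my-service` job, the `team` label, and the runbook URL are all placeholders, and where the rule goes in values.yaml depends on which Helm chart you deployed.

```yaml
groups:
  - name: availability              # placeholder group name
    rules:
      - alert: ServiceDown
        # Prometheus sets `up` to 0 whenever a scrape of the target fails.
        expr: up{job="my-service"} == 0   # "my-service" is a placeholder job
        # Only fire once the condition has held continuously for 5 minutes,
        # so a single failed scrape never pages anyone.
        for: 5m
        labels:
          severity: critical
          team: backend             # placeholder label, useful for routing
        annotations:
          summary: "{{ $labels.instance }} has been unreachable for 5 minutes"
          runbook_url: "https://example.com/runbooks/service-down"  # placeholder
```

Raising or lowering `for` is usually the first knob to turn when an alert flaps.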


If an alert is actionable:
  Is it a priority at this hour? Minor alerts can wait until working hours.
  Automatically send it to the team responsible (hopefully not you).
  If you see it more than once a month:
    Automate the solution if you can.
    Otherwise, automatically assign it to the team that needs to fix it, and page them first until it’s fixed.
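
Most of the rules for actionable alerts can be expressed in the Alertmanager routing config. The sketch below is one possible layout, not a drop-in file: it assumes alerts carry `team` and `severity` labels, and the receiver names, webhook URL, and integration key are placeholders. `active_time_intervals` needs a reasonably recent Alertmanager (0.24 or later), and times are in UTC unless you set a location.

```yaml
route:
  receiver: default                 # catch-all for anything unmatched
  routes:
    # Send each alert straight to the team that owns it.
    - matchers:
        - team="backend"
      receiver: backend-pager
      routes:
        # Minor alerts only notify during working hours.
        - matchers:
            - severity="minor"
          receiver: backend-chat
          active_time_intervals:
            - working-hours

time_intervals:
  - name: working-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '17:00'

receivers:
  - name: default
  - name: backend-pager
    pagerduty_configs:
      - routing_key: "<pagerduty-integration-key>"       # placeholder
  - name: backend-chat
    webhook_configs:
      - url: "https://example.com/hooks/backend-chat"    # placeholder
```

A minor alert that is still firing when the working-hours window opens is delivered then, which is the "it can wait until morning" behavior.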


Have a sane on-call rotation. One week on, one week off is not sane. Three people minimum.
Include your manager in the rotation.


This probably requires you to make runbooks, which you should have.


Opsgenie and PagerDuty are fine.
