When a new service was deployed, an operational test was added to the legacy rTESTSPRODENV "prod-environment-behaves-correctly" test collection.
The plan is to switch to NRPE checks for a Nagios-compatible monitoring solution.
Documentation and architecture choices are needed to guide monitoring contributions for new services.
Some questions to solve
| Question | Plan |
|---|---|
| Check scripts provisioning | Each Salt role has a monitoring unit to deploy the scripts. |
| Check scripts location | Somewhere under libexec, such as /usr/local/libexec/monitoring. Open question: can a new entry in /map.jinja provide a common path across servers? |
| Check format | Nagios NRPE with exit codes 0, 1, 2 or 3. |
| Common library | Overkill for portable NRPE checks: the library would contain only the exit codes, and it would need to be maintained in several languages, such as Bash and Python. |
| Test suite for checks | Spawn a container with a specific scenario, then check the output with bats. |
| Minimal expected deliverable right now | Add monitoring/files/check_* so the idea for the check isn't lost; the file.managed deployment logic will be added later. |
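
To make the "Common library" answer concrete, here is a minimal sketch of what such a library would amount to in Python. The names `Status` and `format_status` are hypothetical and exist only for illustration; nothing like this is planned.

```python
from enum import IntEnum


class Status(IntEnum):
    """Exit codes a check returns to NRPE (hypothetical shared module)."""
    SUCCESS = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def format_status(status: Status, message: str) -> str:
    """Build the single line a check prints before exiting (illustrative helper)."""
    return f"{status.name} - {message}"
```

Since the whole module boils down to four constants, each check can declare them itself and stay portable, which is the reasoning behind calling a shared library overkill.
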
NRPE exit codes
| Exit code | Status | Meaning |
|---|---|---|
| 0 | SUCCESS | |
| 1 | WARNING | An error to look into once the critical ones are handled. |
| 2 | CRITICAL | If the service is in production, some feature is broken. |
| 3 | UNKNOWN | Something is missing for the check to run properly and determine success or failure. |
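
As an illustration of how the four codes map onto a real check, the sketch below measures disk usage. The function names `classify` and `check_disk` and the 80% and 90% thresholds are assumptions made for this example, not part of the plan above.

```python
import shutil

# Exit codes from the table above, declared locally so the check stays portable.
SUCCESS, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to an exit code (thresholds are assumptions)."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return SUCCESS


def check_disk(path="/"):
    """Return (exit_code, message) for the filesystem that contains path."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as error:
        # The check could not run, so success or failure cannot be determined.
        return UNKNOWN, f"cannot read usage for {path}: {error}"

    if usage.total == 0:
        # Some pseudo filesystems report no capacity at all.
        return UNKNOWN, f"{path} reports a total size of zero"

    used_percent = 100.0 * usage.used / usage.total
    return classify(used_percent), f"{path} is {used_percent:.1f}% full"
```

A check script built on this would print the returned message and pass the returned code to `sys.exit`, so NRPE receives both the status line and the exit code.
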