
Document monitoring checks
Open, High, Public

Description

Whenever a new service was deployed, an operational test was added to the legacy rTESTSPRODENV "prod-environment-behaves-correctly" test collection.

The plan is to switch to NRPE checks, to get a Nagios-compatible monitoring solution.

Documentation and architecture choices are needed to guide monitoring contributions for new services.

Some questions to solve

| Question | Plan |
| --- | --- |
| Check scripts provisioning | Each Salt role has a monitoring unit to deploy the scripts |
| Check scripts location | Somewhere in libexec like /usr/local/libexec/monitoring -> can we use a new entry in /map.jinja for a common path across servers? |
| Check format | Nagios NRPE with exit codes 0, 1, 2 or 3 |
| Common library | Overkill for portable NRPE checks: the library would hold just the exit codes, and would need to be maintained in different languages like Bash and Python |
| Test suite for checks | Spawn a container with a specific scenario, check output with bats |
| Minimal expected deliverable right now | monitoring/files/check_* so the check idea isn't lost; we'll add the file.managed deploy logic later |
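
To make the "Test suite for checks" row more concrete, here is a minimal sketch of the kind of assertion such a suite would make: run a check, then verify its exit code and first output line. It is written in Python as a stand-in for bats and leaves out the container part. The names run_check, assert_check and fake_check are hypothetical, and fake_check is only a placeholder for a real monitoring/files/check_* script.

```python
import subprocess
import sys

# Exit codes a check is allowed to return (Nagios NRPE convention).
VALID_EXIT_CODES = {0, 1, 2, 3}


def run_check(command, timeout=10):
    """Run a check command and return (exit_code, first_output_line)."""
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout
    )
    lines = result.stdout.splitlines()
    return result.returncode, (lines[0] if lines else "")


def assert_check(command, expected_code):
    """Fail if the check misbehaves for the scenario under test."""
    code, line = run_check(command)
    assert code in VALID_EXIT_CODES, f"unexpected exit code {code}"
    assert code == expected_code, f"expected {expected_code}, got {code}"
    assert line, "check printed nothing on stdout"
    return line


# Placeholder for a real check script: prints one line, exits with WARNING.
fake_check = [
    sys.executable,
    "-c",
    "print('WARNING - queue is growing'); raise SystemExit(1)",
]
```

In a real suite, each scenario container would call something like assert_check with the path of the deployed check and the exit code that scenario is supposed to trigger.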

NRPE exit codes

| Exit code | Status | Meaning |
| --- | --- | --- |
| 0 | SUCCESS | |
| 1 | WARNING | Any error to check after the critical ones. |
| 2 | CRITICAL | If the service is in production, some feature is broken. |
| 3 | UNKNOWN | Something is missing for the check to run properly and determine success or failure. |
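
As an illustration of these exit codes, here is a minimal sketch of what a check could look like in Python. The disk usage scenario, the 80% and 90% thresholds, and the names Status, evaluate and main are assumptions made for the example, not something decided in this task.

```python
import shutil
from enum import IntEnum


class Status(IntEnum):
    """NRPE exit codes used by the checks."""
    SUCCESS = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def evaluate(percent_used, warning=80.0, critical=90.0):
    """Map a usage percentage to a status and a one-line message.

    The thresholds are hypothetical defaults for this example.
    """
    if percent_used >= critical:
        return Status.CRITICAL, f"CRITICAL - {percent_used:.1f}% used"
    if percent_used >= warning:
        return Status.WARNING, f"WARNING - {percent_used:.1f}% used"
    return Status.SUCCESS, f"SUCCESS - {percent_used:.1f}% used"


def main(path="/"):
    """Check one filesystem, print one status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as error:
        # The check itself cannot run: report UNKNOWN, not CRITICAL.
        print(f"UNKNOWN - cannot read {path}: {error}")
        return Status.UNKNOWN
    status, message = evaluate(100.0 * usage.used / usage.total)
    print(message)
    return status


# A deployed script would end with: import sys; sys.exit(main())
```

Keeping the decision logic in evaluate, separate from the code that reads the system state, means the thresholds can be tested without a container, while main covers the UNKNOWN case where the check cannot gather its data.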

Event Timeline

dereckson triaged this task as High priority. Jul 21 2024, 13:39
dereckson created this task.

In D2648, the NRPE directory has been set to dirs.share + "/monitoring/checks/nrpe", which resolves on FreeBSD to /usr/local/share/monitoring/checks/nrpe.

We need to either adopt this path or edit https://devcentral.nasqueron.org/source/operations/browse/main/roles/core/monitoring/checks.sls$9