Maniphest T1978

Document monitoring checks
Open, High, Public

Assigned To

None

Authored By

	dereckson
	Jul 21 2024, 13:39

Description

Until now, when a new service was deployed, an operational test was added to the legacy rTESTSPRODENV "prod-environment-behaves-correctly" tests collection.

The plan is to switch to NRPE checks, so we get a Nagios-compatible monitoring solution.

Documentation and architecture choices are needed to guide monitoring contributions for new services.

Some questions to solve

Question	Plan
Check scripts provisioning	Each Salt role has a monitoring unit to deploy the scripts
Check scripts location	Somewhere under libexec, such as /usr/local/libexec/monitoring. Can we use a new entry in /map.jinja to get a common path across servers?
Check format	Nagios NRPE with exit codes 0, 1, 2 or 3
Common library	Overkill for portable NRPE checks: the library would hold only the exit codes, and we would need to maintain it in several languages, such as Bash and Python
Test suite for checks	Spawn a container with a specific scenario, then check the output with bats
Minimal expected deliverable right now	monitoring/files/check_* so the idea behind each check isn't lost; we'll add the file.managed deploy logic later

NRPE exit codes

0	SUCCESS
1	WARNING	A non-critical error, to review after the critical ones.
2	CRITICAL	If the service is in production, some feature is broken.
3	UNKNOWN	Something is missing for the check to run properly and determine success or failure.
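
To make the exit code convention concrete, here is a minimal sketch of a check in Python. It only illustrates the table above: the names `Status`, `evaluate`, `report` and `main`, the thresholds and the hard-coded measurement are all hypothetical, not part of any existing Nasqueron check.

```python
import sys
from enum import IntEnum


class Status(IntEnum):
    """Exit codes, as listed in the table above."""
    SUCCESS = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def evaluate(value, warning, critical):
    """Map a measured value to a status using two thresholds.

    A missing value (None) means the check could not run properly.
    """
    if value is None:
        return Status.UNKNOWN
    if value >= critical:
        return Status.CRITICAL
    if value >= warning:
        return Status.WARNING
    return Status.SUCCESS


def report(status, message):
    """Print a one-line result and return the matching exit code."""
    print(f"{status.name}: {message}")
    return int(status)


def main():
    # Hypothetical measurement: a real check would probe the service here.
    usage = 91
    status = evaluate(usage, warning=80, critical=90)
    return report(status, f"disk usage at {usage}%")

# A deployed script would end with: sys.exit(main())
```

Because the whole convention fits in four constants, each check can define them itself, which is consistent with treating a shared library as overkill.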

Related Objects

Mentioned Here: D2648: Compare install and running kernel on FreeBSD

Event Timeline

dereckson triaged this task as High priority. Jul 21 2024, 13:39

dereckson created this task.

In D2648, the NRPE directory has been set to dirs.share + "/monitoring/checks/nrpe", which resolves on FreeBSD to the /usr/local/share/monitoring/checks/nrpe directory.

We need to either adopt this path or edit https://devcentral.nasqueron.org/source/operations/browse/main/roles/core/monitoring/checks.sls$9
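
For illustration, the path resolution described in this comment could be sketched as follows. Only the FreeBSD value comes from D2648. The `SHARE_DIRS` table, the Linux entry and the `nrpe_checks_dir` helper are assumptions made for the example, and do not reflect the actual content of map.jinja.

```python
# Per-OS share directory. Only the FreeBSD value is stated in this task.
SHARE_DIRS = {
    "FreeBSD": "/usr/local/share",
    "Linux": "/usr/share",  # assumption, not confirmed by the task
}


def nrpe_checks_dir(os_family):
    """Return the NRPE checks directory for an OS family (hypothetical helper)."""
    return SHARE_DIRS[os_family] + "/monitoring/checks/nrpe"


print(nrpe_checks_dir("FreeBSD"))
```

Keeping the per-OS prefix in one table means a check location is defined once, whichever directory (share or libexec) is finally chosen.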

dereckson moved this task from Backlog to Checks on the Monitoring and reporting board. Jul 23 2024, 20:58
