
Configure HTTP health checks monitoring for Docker engines
Closed, Public

Authored by dereckson on Mar 19 2022, 14:53.

Details

Summary

Provide a platform-checks configuration file for HTTP health checks.

The list is extracted from the containers in pillar matching two conditions:

  1. The service has one or more containers running on that engine
  2. The service has a health check URL set in docker_containers_monitoring
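
The two conditions above can be sketched as a small filter. This is only an illustration: the function name select_health_checks and the shape of both dictionaries are assumptions made for the example, not the actual pillar layout or the code of paas_docker.get_health_checks().

```python
def select_health_checks(containers, monitoring):
    """Keep the checks of services that run here and declare a health URL.

    Hypothetical input shapes, assumed for this sketch:
      containers: {service: [container names running on this engine]}
      monitoring: {check_type: {service: health check URL}}
    """
    checks = {}
    for check_type, services in monitoring.items():
        for service, url in services.items():
            # Condition 1: at least one container runs on this engine.
            if not containers.get(service):
                continue
            # Condition 2: a health check URL is set for the service.
            if not url:
                continue
            checks.setdefault(check_type, {})[service] = url
    return checks
```

A service listed in the monitoring data but without a container on the engine is skipped, so each engine only receives the checks it can actually answer.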

This unit is responsible for providing platform-checks and its configuration,
not for actually running the tests or providing a runner.

Ref T1704

Generate YAML configuration files

We need to aggregate a checks dictionary from several sources.
For now the only source is paas_docker.get_health_checks(), but
follow-up changes are planned to add more monitoring checks sources.

The convert.to_yaml_dictionary() method builds arbitrary dictionaries
from several sources, which makes this aggregation possible.

It uses the salt.serializers.yaml package, as it can represent
objects from the salt.utils.odict.OrderedDict class and ships with
an already configured dumper for pyyaml.
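
A rough sketch of that aggregation step, using only the standard library so it runs anywhere. It is not the convert.to_yaml_dictionary() implementation and does not use the Salt serializer: aggregate_checks and render_block are hypothetical names, and the renderer is a deliberately naive stand-in that only handles the two-level mapping of plain strings seen in this revision (no quoting or escaping).

```python
from collections import OrderedDict


def aggregate_checks(*sources):
    """Merge several {check_type: {service: url}} dictionaries into one."""
    merged = OrderedDict()
    for source in sources:
        for check_type, services in source.items():
            merged.setdefault(check_type, OrderedDict()).update(services)
    # Sort both levels so the generated file is stable between runs.
    return OrderedDict(
        (check_type, OrderedDict(sorted(merged[check_type].items())))
        for check_type in sorted(merged)
    )


def render_block(checks, root="checks"):
    """Render the merged checks in block style, one entry per line."""
    lines = [f"{root}:"]
    for check_type, services in checks.items():
        lines.append(f"  {check_type}:")
        for service, url in services.items():
            lines.append(f"    {service}: {url}")
    return "\n".join(lines) + "\n"
```

Sorting both levels keeps diffs of the deployed file small when a new checks source is added later.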

Test Plan

Run check_http_200

Diff Detail

Repository
rOPS Nasqueron Operations
Lint
Lint Passed
Severity  Location                    Code  Message
Advice    _modules/paas_docker.py:74  F821  flake8 F821
Advice    _modules/paas_docker.py:74  F821  flake8 F821
Advice    _modules/paas_docker.py:75  F821  flake8 F821
Unit
No Test Coverage
Branch
monitoring-http
Build Status
Buildable 4070
Build 4322: arc lint + arc unit

Event Timeline

dereckson created this revision.

We need a specialized YAML writer to:

  • aggregate the different checks collections
  • output a human-readable format

Dump correctly ordered dictionaries

-yaml, apply default_flow_style to doc

/etc/monitoring/checks.yml for docker-001
#   -------------------------------------------------------------
#   Configuration for Docker PaaS monitoring
#   - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#   Project:        Nasqueron
#   Source file:    roles/paas-docker/monitoring/files/checks.yml.jinja
#   -------------------------------------------------------------
#
#   <auto-generated>
#       This file is managed by our rOPS SaltStack repository.
#
#       Changes to this file may cause incorrect behavior
#       and will be lost if the state is redeployed.
#   </auto-generated>

#   -------------------------------------------------------------
#   Checks configuration
#   - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

checks:
  check_http_200:
    acme: http://localhost:41080/health
    api-datasources: http://localhost:19080/datasources
    cachet: http://localhost:39080/api/v1/ping
    hauk: http://localhost:43080/
    hound: http://localhost:44080/healthz
    jenkins_cd: http://localhost:38080/login
    jenkins_ci: http://localhost:42080/login
    pad: http://localhost:34080/stats
    registry: http://localhost:5000/
  check_http_200_alive:
    api-docker-registry: http://localhost:20080/status
    login: http://localhost:25080/status
    notifications: http://localhost:37080/status
    tommy_cd: http://localhost:24180/status
    tommy_ci: http://localhost:24080/status
  check_http_200_alive_proxy:
    devcentral: https://devcentral.nasqueron.org/status
    river_sector: https://river-sector.dereckson.be/status
    wolfplex_phab: https://phabricator.wolfplex.org/status
    zed_code: https://code.zed.dereckson.be/status
  check_http_200_proxy:
    openfire: https://xmpp.nasqueron.org/login.jsp
    pixelfed: https://photos.nasqueron.org/api/nodeinfo/2.0.json

Installed platform-checks==0.1.1 from test.pypi.org.

That allows running check_http_200:

Service acme NOT healthy - HTTP 405
Service api-datasources healthy
Service cachet healthy
Service hauk healthy
Service hound healthy
Service jenkins_cd healthy
Service jenkins_ci healthy
Service pad healthy
Service registry healthy
Service api-docker-registry healthy
Service login healthy
Service notifications healthy
Service tommy_cd NOT healthy - HTTPConnectionPool(host='localhost', port=24180): Max retries exceeded with url: /status (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3d1a9055c0>: Failed to establish a new connection: [Errno 111] Connection refused',))
Service tommy_ci healthy
Service devcentral healthy - Checked at PROXY level
Service river_sector NOT healthy - HTTP 302 - Checked at PROXY level
Service wolfplex_phab healthy - Checked at PROXY level
Service zed_code healthy - Checked at PROXY level
Service openfire healthy - Checked at PROXY level
Service pixelfed healthy - Checked at PROXY level

Individual checks:

$ check_http_200 acme
Service acme NOT healthy - HTTP 405
$ echo $?
2
$ check_http_200 tommy_ci
Service tommy_ci healthy
$ echo $?
0
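
The messages and exit statuses shown above could be produced by logic along these lines. This is a guess at the observable behaviour only, not the platform-checks source: the report function and its parameters are hypothetical, and the real tool may apply further criteria for the "alive" variants. Exit status 2 happens to be the value Nagios-style plugins use for CRITICAL; whether platform-checks follows that convention on purpose is an assumption.

```python
def report(service, status_code=None, error=None, proxy=False):
    """Return (message, exit_code) for one HTTP 200 check result.

    status_code: HTTP status received, or None if the request failed.
    error: text of the connection error, if any.
    proxy: True when the URL was checked through the front-end proxy.
    """
    healthy = error is None and status_code == 200
    if healthy:
        message = f"Service {service} healthy"
    else:
        reason = error if error is not None else f"HTTP {status_code}"
        message = f"Service {service} NOT healthy - {reason}"
    if proxy:
        message += " - Checked at PROXY level"
    # 0 for a healthy service, 2 otherwise, as seen in the shell session.
    return message, (0 if healthy else 2)
```

For instance, report("acme", 405) yields the same message and exit status as the acme session above.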
This revision is now accepted and ready to land.
Mar 19 2022, 17:18