
[Roadmap] Run periodically tests/prod-environment-behaves-correctly and report results
Closed, Resolved, Public

Description

We have some tests on rOPS in the folder tests/prod-environment-behaves-correctly to check that everything looks good.

The main idea of these tests is to run a test suite after a deployment to check nothing has been forgotten. It's useful after a full restart of Docker containers.

Currently, there is no reporting system like Icinga. Icinga 2 has introduced "monitoring as code", which could be as helpful as Jenkinsfile pipeline as code (i.e. yes, really helpful), but until that is implemented, we can leverage Jenkins and this test suite to get a tests-based monitoring system.

The drawback is that we only get a binary result, "prod ok" or "prod not ok", but this is better than having no notification at all until someone triggers the tests manually.

Roadmap

  1. D575/ccd4fed8 Create a Jenkins job to run tests
  2. T946 Split rOPS and a new rTESTSPRODENV repository
  3. Allow the three currently skipped tests to run:
    1. Create a dedicated container with privileged permissions (a right on the Docker engine) only to run these tests
    2. T960 Refactor Ysul Apache SuEXEC test so we can check a 200 code instead of manually checking version
      1. deploy some test.cgi script with an output of 200 ALIVE
      2. deploy some test.php script in chmod 644 (no execution bit), so we see whether the PHP patch has been included in the build; that's the most frequent issue to catch
      3. set up a qa user account, so we can detect if a build skips -D AP_USERDIR_SUFFIX="public_html"
    3. Run the updated DPHAB image when T947 is resolved to get container monitoring inside Phabricator
    4. Drop the privileged container and use a standard PHP node
  4. Report tests result on #nasqueron-ops
    1. Allow filtering, so that only the first failure and the first successful build after a failure (recovery) are reported
    2. T953 Add support for Jenkins to the notifications center or directly to the RabbitMQ queue
    3. Ask Jenkins to notify us with results:
      • Through the notifications center? Could use the Notifications plugin, which allows webhooks.
      • Directly to the broker? Could use the RabbitMQ build trigger plugin, which can also publish build results, but that would be in a standard format, not our notifications one.
    4. Consume such notifications
  5. Consider automating the Cachet service status update, or warning about inconsistencies between Cachet data and test data (Cachet = http://status.nasqueron.org)
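
The filter from step 4.1 can be sketched as a comparison between the previous build result and the current one. This is only an illustration of the rule, not the code of the notifications center: the function name and the "failure" / "recovery" labels are assumptions.

```python
from typing import Optional


def notification_for(previous_ok: Optional[bool], current_ok: bool) -> Optional[str]:
    """Return the event to report, or None when the build should stay silent.

    previous_ok is None when there is no earlier build to compare with.
    The event names are hypothetical labels chosen for this sketch.
    """
    if current_ok:
        # A success is only news when it directly follows a failure.
        return "recovery" if previous_ok is False else None
    # A failure is only news when the previous build was not already failing.
    return "failure" if previous_ok is not False else None


# Walk through a sequence of build results and collect what would be reported.
history = [True, False, False, True, True]
events = []
previous = None
for ok in history:
    events.append(notification_for(previous, ok))
    previous = ok
# events == [None, "failure", None, "recovery", None]
```

With this rule, a long outage produces one failure message and one recovery message instead of one message per build.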

Notes

The step "get notifications from Jenkins" seems a little heavier than "leverage existing", but it is already planned: we need to advertise when tests on the master branch fail.

In a full Docker engine restart scenario, the notifications won't work while the ci, notifications and white-rabbit containers are unavailable. This is a good argument for separating CI/CD from the rest of the infrastructure (but we don't currently have infinite servers).

Event Timeline

https://ci.nasqueron.org/job/test-prod-env/rssFailed shows three failures, each followed by a success on the very next build:

  • #749 — 504 on notifications.nasqueron.org
  • #494 — 504 on builds.nasqueron.org
  • #225 — HTTP request failed on builds.nasqueron.org

So when a request comes back with a failure, the job should immediately trigger a new build, to avoid flooding us with alerts about 504s and timeouts.

We know Dwellers is swapping heavily (currently 1.6 GB in swap), so these timeouts are expected until we get a 16 GB server dedicated to our Docker engine.

As far as monitoring is concerned, a report is only useful if the 504 isn't an isolated occurrence.
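
A minimal sketch of that "confirm before alerting" logic, in Python for illustration only. In practice this would be a Jenkins retrigger rather than a loop; run_suite is a hypothetical callable that returns True when the test suite passes.

```python
from typing import Callable


def confirmed_failure(run_suite: Callable[[], bool], retries: int = 1) -> bool:
    """Return True only if the suite fails on the first run and on every retry.

    run_suite is a hypothetical stand-in for "run tests/prod-environment-behaves-correctly".
    """
    for _ in range(1 + retries):
        if run_suite():
            # Any passing run means the earlier failure was isolated: stay silent.
            return False
    return True


# An isolated 504 followed by a clean run is not reported.
results = iter([False, True])
isolated = confirmed_failure(lambda: next(results))
# isolated == False
```

The default of one retry matches the three failures listed above, which all passed on the immediate next build.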

From a security point of view, giving a Jenkins slave node access to the Docker engine means giving the Trusted users group root access to all the containers.

Alternative for 3A: a cron job runs the command for us and generates a report, published somewhere Jenkins has access to.

That means it is probably easier to monitor Phabricator instances as a separate service: the script could fire a notification itself.

But then, that would move the problem: how do we monitor this script?
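
A rough sketch of what the cron-side script could look like, assuming the report is a small JSON document. The field names and both function names are made up for this example. Adding a generation timestamp is one possible answer to the question above: whoever reads the report (Jenkins, for instance) can treat a report that is too old as a failure of the script itself.

```python
import json
import subprocess
import time
from typing import List


def build_report(command: List[str]) -> dict:
    """Run the test command and summarise the outcome as a JSON-ready dict.

    The field names are assumptions for this sketch, not an existing format.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "generated_at": int(time.time()),
        "ok": result.returncode == 0,
        "exit_code": result.returncode,
        "output": result.stdout + result.stderr,
    }


def is_stale(report: dict, max_age_seconds: int, now: float) -> bool:
    """Tell the consumer whether the cron job seems to have stopped running."""
    return now - report["generated_at"] > max_age_seconds


def write_report(report: dict, path: str) -> None:
    """Publish the report where the consumer can read it (the path is up to the caller)."""
    with open(path, "w") as handle:
        json.dump(report, handle)
```

The cron job would call build_report with the actual test command and write_report with a location Jenkins can read; the Jenkins job then only needs to parse the file and check ok and is_stale, without any access to the Docker engine.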

Step 4 is done for B to D, but currently every failure is reported, and we don't yet use the artefact log to see what is failing.

  1. refactoring done.
  2. filtering partially done: only failures are reported; Jenkins notifies, we consume.
  3. automating the system status update without human assessment isn't currently considered valuable