We have some tests in rOPS, in the folder tests/prod-environment-behaves-correctly, to check that the production environment behaves as expected.
The main idea of these tests is to run a test suite after a deployment to check nothing has been forgotten. It's useful after a full restart of Docker containers.
Currently, there is no reporting system like Icinga. Icinga 2 has introduced "monitoring as code", which could be as helpful as Jenkinsfile pipeline as code (i.e. really helpful). Until that is implemented, we can leverage Jenkins and this test suite to get a tests-based monitoring system.
The drawback is that we only get a single result, "prod ok" or "prod not ok", but this is better than having no notification at all until someone triggers the tests manually.
Roadmap
- D575/ccd4fed8 Create a Jenkins job to run tests
- T946 Split rOPS and a new rTESTSPRODENV repository
- Allow running the three currently skipped tests:
- Create a dedicated container with privileged permissions (a right on the Docker engine) only to run these tests
- T960 Refactor Ysul Apache SuEXEC test so we can check a 200 code instead of manually checking version
- deploy some test.cgi script with an output of 200 ALIVE
- deploy some test.php script in chmod 644 (no execution bit), so we see if the PHP patch has been included in the build; that's the most frequent issue to catch
- set up a qa user account, so we can detect if a build skips -D AP_USERDIR_SUFFIX="public_html"
- Run the updated DPHAB image when T947 is resolved to get container monitoring inside Phabricator
- Drop the privileged container and use a standard PHP node
- Report test results on #nasqueron-ops
- Allow filtering, so only the first failure and the first successful build after a failure (recovery) are reported
- T953 Add support for Jenkins to the notifications center, or directly to the RabbitMQ queue
- Ask Jenkins to notify us with results:
- Through the notifications center? We could use the Notifications plugin, which allows webhooks.
- Directly to the broker? We could use the RabbitMQ build trigger plugin, which also has a feature to publish build results, but that would be in a standard format, not our notifications one.
- Consume such notifications
- Consider automating Cachet service status updates, or warning about inconsistencies between Cachet data and test data (Cachet = http://status.nasqueron.org)
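
The filtering idea from the roadmap (report only the first failure and the recovery) could look like the sketch below. This is only an illustration, not something that exists in rOPS: the function name builds_to_report and the representation of build history as a list of booleans are assumptions made for the example.

```python
from typing import Iterable, Iterator, Optional, Tuple


def builds_to_report(results: Iterable[bool]) -> Iterator[Tuple[int, str]]:
    """Yield (index, kind) for the builds worth announcing.

    results: chronological build outcomes, True for a passing build
             (hypothetical representation, not a Jenkins format).
    kind is "failure" for the first failing build of a streak and
    "recovery" for the first passing build after a failure.
    """
    previous: Optional[bool] = None  # no build seen yet
    for index, passed in enumerate(results):
        if not passed and previous is not False:
            # First failure of a streak (or very first build failing).
            yield index, "failure"
        elif passed and previous is False:
            # First successful build right after a failure.
            yield index, "recovery"
        previous = passed


if __name__ == "__main__":
    history = [True, True, False, False, True, True, False]
    for index, kind in builds_to_report(history):
        print(f"build #{index}: {kind}")
```

With such a filter, repeated failures stay silent after the first one, so the channel only sees state changes.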
Notes
The step "get notifications from Jenkins" seems a little heavier than "leverage existing", but it is already planned: we need to advertise when tests on the master branch fail.
In a full Docker engine restart scenario, the notifications won't work while the ci, notifications and white-rabbit containers aren't available. This is a good argument for separating CI/CD from the rest of the infrastructure (but we don't currently have infinite servers).