[Roadmap] Run periodically tests/prod-environment-behaves-correctly and report results
Closed, Resolved, Public

Assigned To

Authored By

	dereckson
	Jul 27 2016, 15:04

Description

rOPS contains tests in the folder tests/prod-environment-behaves-correctly that check whether the production environment looks healthy.

The main idea of these tests is to run a test suite after a deployment to check nothing has been forgotten. This is useful, for example, after a full restart of the Docker containers.

Currently, we have no reporting system such as Icinga. Icinga 2 has introduced "monitoring as code", which could be as helpful as the Jenkinsfile pipeline as code (i.e. really helpful). Until it is implemented, we can leverage Jenkins and this test suite to get a tests-based monitoring system.

The drawback is that we only get a "prod ok" / "prod not ok" result, but this is better than no notification at all until someone triggers the tests manually.

Roadmap

D575/ccd4fed8 Create a Jenkins job to run tests
T946 Split rOPS and a new rTESTSPRODENV repository
Allow the three currently skipped tests to run:
1. ~~Create a dedicated container with privileged permissions (a right on the Docker engine) only to run these tests~~
2. T960 Refactor Ysul Apache SuEXEC test so we can check a 200 code instead of manually checking version
  1. deploy some test.cgi script with an output of 200 ALIVE
  2. deploy some test.php script in chmod 644 (no execution bit), so we can see whether the PHP patch has been included in the build; that's the most frequent issue to catch
  3. set up a qa user account, so we can detect if a build skips -D AP_USERDIR_SUFFIX="public_html"
3. Run the updated DPHAB image when T947 is resolved to get container monitoring inside Phabricator
4. ~~Drop privileged container and~~ use a php standard node
Report test results on #nasqueron-ops
1. Allow filtering, so that only the first failure and the first successful build after a failure (recovery) are reported
2. T953 Add support for Jenkins to the notifications center ~~or directly to the RabbitMQ queue~~
3. Ask Jenkins to notify us with results:
  - Through the notifications center? We could use the Notifications plugin, which allows webhooks.
  - ~~Directly to the broker? Could use the RabbitMQ build trigger plugin which has a feature to publish build results too, but that would be a standard format, not our notifications one.~~
4. Consume such notifications
Consider automating the Cachet service status update, or warning about inconsistencies between Cachet data and test data (Cachet = http://status.nasqueron.org)
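
The filtering rule from the "Report tests result" step (report only the first failure and the first recovery) can be sketched as follows. This is a minimal illustration, not the notifications center code: the function name, the `(build number, passed)` input shape and the assumption that the state before the first build is healthy are all hypothetical.

```python
from typing import Iterable, List, Tuple


def builds_to_report(results: Iterable[Tuple[int, bool]]) -> List[Tuple[int, str]]:
    """Return (build_number, event) pairs for state transitions only.

    `results` is an ordered sequence of (build_number, passed) pairs.
    Repeated failures and repeated successes are not reported.
    """
    events = []
    previous_ok = True  # assumption: the environment is healthy before the first build
    for number, ok in results:
        if previous_ok and not ok:
            events.append((number, "failure"))   # first failure of a streak
        elif not previous_ok and ok:
            events.append((number, "recovery"))  # first success after a failure
        previous_ok = ok
    return events


if __name__ == "__main__":
    history = [(1, True), (2, False), (3, False), (4, True), (5, True)]
    print(builds_to_report(history))
```

With this rule, a long outage produces two messages on the channel (one failure, one recovery) instead of one per build.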

Notes

The step "get notifications from Jenkins" is a little more work than simply leveraging what exists, but it was already planned: we need to announce when tests on the master branch fail.

In a full Docker engine restart scenario, notifications won't work while the ci, notifications and white-rabbit containers are unavailable. This is a good argument for separating CI/CD from the rest of the infrastructure (but we don't currently have infinite servers).

Revisions and Commits

rTESTSPRODENV Test suite for operations: prod-environment-behaves-correctly
	D613	rTESTSPRODENVc78598d85029 Publish JUnit XML report
rOPSDATAN Docker: /data/notifications
	D631	rOPSDATAN4deed206618a Configuration for Jenkins

Related Objects

Status	Assigned	Task
Resolved	dereckson	T948 [Roadmap] Run periodically tests/prod-environment-behaves-correctly and report results
Resolved	dereckson	T953 Handle Jenkins notification plugin payloads
Open	None	T954 Get a mapping class from an instance
Resolved	dereckson	T946 Extract tests/prod-environment-behaves-correctly to a standalone repository
Resolved	dereckson	T956 Install Notifications plugin on Jenkins
Resolved	dereckson	T960 Create qa account on Ysul for public_html testing

Event Timeline

dereckson created this task. Jul 27 2016, 15:04

dereckson added a subtask: T953: Handle Jenkins notification plugin payloads. Jul 28 2016, 14:47

dereckson added a subtask: T946: Extract tests/prod-environment-behaves-correctly to a standalone repository.

dereckson mentioned this in T947: Create a Phabricator application to report monitoring results from its Docker container and allow upgrade. Jul 28 2016, 15:01

dereckson added a project: User-Dereckson. Jul 28 2016, 19:38

dereckson moved this task from Backlog to In progress on the User-Dereckson board.

dereckson mentioned this in T572: Prepare Jenkins slave agent containers. Jul 28 2016, 20:04

dereckson moved this task from Backlog to Jenkins on the Continous integration and delivery board. Jul 28 2016, 20:09

dereckson edited projects, added Jenkins; removed Continous integration and delivery. Jul 28 2016, 20:13

https://ci.nasqueron.org/job/test-prod-env/rssFailed shows three failures, each followed by a successful build immediately afterwards:

#749 — 504 on notifications.nasqueron.org
#494 — 504 on builds.nasqueron.org
#225 — HTTP request failed on builds.nasqueron.org

So when a request comes back with a failure, it should immediately trigger a new build, to avoid flooding us with alerts about 504s and timeouts.

We know Dwellers swaps heavily (currently 1.6 GB in swap), so these timeouts are expected until we get a 16 GB server dedicated to our Docker engine.

As far as monitoring is concerned, a report is only useful if the 504 isn't isolated.
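
The "retry immediately, alert only if the failure isn't isolated" idea can be sketched as below. This is an illustration under assumptions: `should_alert` and `run_suite` are hypothetical names, and `run_suite` stands for whatever launches the test suite and returns whether it passed. It is not how the Jenkins job is actually configured.

```python
from typing import Callable


def should_alert(run_suite: Callable[[], bool], retries: int = 1) -> bool:
    """Run the suite; on failure, rerun it immediately up to `retries` times.

    Returns True only when the first run and every retry failed,
    i.e. when the failure is not an isolated 504 or timeout.
    """
    if run_suite():
        return False  # healthy on the first run, nothing to report
    for _ in range(retries):
        if run_suite():
            return False  # the retry passed: the failure was isolated
    return True  # the failure persists, worth an alert
```

The three failures listed above were each followed by a successful build, so with `retries=1` none of them would have raised an alert.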

dereckson created subtask T956: Install Notifications plugin on Jenkins. Jul 28 2016, 21:28

From a security point of view, giving a Jenkins slave node access to the Docker engine means giving the Trusted users group root access to all the containers.

dereckson added a project: security. Jul 29 2016, 13:39

dereckson updated the task description. Jul 29 2016, 13:41

dereckson created subtask T960: Create qa account on Ysul for public_html testing. Jul 29 2016, 17:39

dereckson closed subtask T960: Create qa account on Ysul for public_html testing as Resolved. Jul 29 2016, 17:43

3A alternative: a cron job that runs the command for us and generates a report published somewhere Jenkins has access to.

That means it is probably easier to monitor Phabricator instances as a separate service: the script could fire a notification itself.

But then, that would only move the problem: how do we monitor this script?

dereckson moved this task from In progress to Needs Review / Blocked / Waiting on the User-Dereckson board. Aug 10 2016, 13:57

dereckson updated the task description. Aug 10 2016, 18:11

dereckson added a revision: D613: Publish JUnit XML report. Aug 15 2016, 21:47

dereckson added a commit: rTESTSPRODENVc78598d85029: Publish JUnit XML report. Aug 15 2016, 21:54

dereckson closed subtask T953: Handle Jenkins notification plugin payloads as Resolved. Aug 23 2016, 20:49

Step 4 is done for B to D, but currently every failure is reported, and we don't yet use the artefact log to identify what is failing.
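
Since D613 publishes a JUnit XML report, one way to find out what is failing would be to list the failed test cases from that artefact. The sketch below assumes the report follows the common JUnit layout (`testcase` elements with `failure` or `error` children). The sample report, the test names in it and the `failed_tests` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical report content; real class and test names will differ.
SAMPLE_REPORT = """<testsuites>
  <testsuite name="prod-environment" tests="2" failures="1">
    <testcase classname="NotificationsTest" name="testIsLive"/>
    <testcase classname="BuildsTest" name="testIsLive">
      <failure message="504 Gateway Timeout"/>
    </testcase>
  </testsuite>
</testsuites>"""


def failed_tests(junit_xml: str) -> List[str]:
    """Return "classname::name" for every test case with a failure or error."""
    root = ET.fromstring(junit_xml)
    failed = []
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.append(f"{case.get('classname', '')}::{case.get('name', '')}")
    return failed


if __name__ == "__main__":
    for test in failed_tests(SAMPLE_REPORT):
        print(test)
```

The resulting list could be appended to the notification, so the channel message says which check failed rather than only "prod not ok".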

dereckson mentioned this in D631: Configuration for Jenkins. Aug 23 2016, 20:59

dereckson mentioned this in rOPSDATAN4deed206618a: Configuration for Jenkins. Aug 23 2016, 21:03

Refactoring done.
Filtering partially done: only failures are reported; Jenkins notifies, we consume.
Automating system status updates without human assertion isn't currently considered valuable.

dereckson added a commit: rOPSDATAN4deed206618a: Configuration for Jenkins. Jan 23 2017, 12:55

dereckson added a revision: D631: Configuration for Jenkins.

dereckson closed subtask T956: Install Notifications plugin on Jenkins as Resolved.

dereckson closed this task as Resolved. Jan 23 2017, 12:58
