Yesterday evening, we got a new Kafka lag.
Detect if one of the sentry_post_process_forwarder_ container dies.
If so, check log for message and if applicable, apply automatically procedure described in the NOG: https://agora.nasqueron.org/Operations_grimoire/Sentry#Kafka
That would mean:
- Notify the operation start
- Stop both containers, as we can update offset as long as a client is connected to that consumer group
- Reset Kafka offset
- Start both containers
- Notify the operation is done
If it's another error, as we've a script on it, it would be nice if it creates a task on DevCentral with the log tail.
Notification can be done with notification-push
Docker Tide can be used to detect the need for healing
How to detect the issue
docker logs output critical error to stderr, so first redirect to stdout, we're only interested in the last line before the crash, previous ones would be previous occurrences:
docker logs sentry_post_process_forwarder_errors --tail=1 2>&1 | grep -qF arroyo.errors.OffsetOutOfRange
Java Kafka API reference
Procedure offers to use CLI client, and comments show we can check the lag column before and after proceeding
If we prefer a Java command to automate and check the operation, we can use: