
Automate Kafka cluster healing
Open, High, Public

Description

Yesterday evening, we got a new Kafka lag.

Detect if one of the sentry_post_process_forwarder_ containers dies.

If so, check the log for the error message and, if applicable, automatically apply the procedure described in the NOG: https://agora.nasqueron.org/Operations_grimoire/Sentry#Kafka

That would mean:

  1. Notify the operation start
  2. Stop both containers, as we can't update the offset as long as a client is connected to that consumer group
  3. Reset Kafka offset
  4. Start both containers
  5. Notify the operation is done
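
The five steps above could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not the NOG procedure itself: the second container name (sentry_post_process_forwarder_transactions), the --all-topics and --to-latest reset options, and the DRY_RUN switch are hypothetical choices made for illustration. The consumer group, bootstrap server and notification-push arguments are taken from the commands in this task.

```shell
#!/bin/sh
# Sketch of the five healing steps. Assumptions, not confirmed by this task:
# the second container name, the --all-topics/--to-latest reset strategy,
# and the DRY_RUN switch (hypothetical, prints commands instead of running them).

CONTAINERS="sentry_post_process_forwarder_errors sentry_post_process_forwarder_transactions"
GROUP="snuba-post-processor"
BOOTSTRAP="localhost:9092"

# Run a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: type suffix (start or done), $2: message text
notify() {
    run notification-push --project Nasqueron --group ops --service monitoring \
        --type "autoheal.kafka_offset.$1" --text "$2"
}

heal_kafka_offset() {
    # 1. Notify the operation start (a failed notification does not block healing)
    notify start "Starting automatic healing procedure." || true

    # 2. Stop both containers: offsets can only be reset while the group is inactive
    # $CONTAINERS is left unquoted on purpose, to split it into two arguments
    run docker stop $CONTAINERS || return 1

    # 3. Reset Kafka offset
    run kafka-consumer-groups --bootstrap-server "$BOOTSTRAP" --group "$GROUP" \
        --all-topics --reset-offsets --to-latest --execute
    reset_status=$?

    # 4. Start both containers, even when the reset failed
    run docker start $CONTAINERS || return 1
    [ "$reset_status" -eq 0 ] || return "$reset_status"

    # 5. Notify the operation is done
    notify done "Automatic healing done. Containers should be alive."
}
```

Setting DRY_RUN=1 before calling heal_kafka_offset prints the five commands in order without touching Docker or Kafka, which may help when reviewing the sequence before enabling it.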

If it's another error, since a script is already handling the failure, it would be nice for it to create a task on DevCentral with the log tail.

Notification can be done with notification-push
Docker Tide can be used to detect the need for healing

How to detect the issue
docker logs writes critical errors to stderr, so first redirect stderr to stdout. We're only interested in the last line before the crash; earlier lines would belong to previous occurrences:

docker logs sentry_post_process_forwarder_errors --tail=1 2>&1 | grep -qF arroyo.errors.OffsetOutOfRange
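
The check above could be wrapped so a healing script can tell the offset issue apart from other errors. This is a sketch only: the function names and the three result words (running, kafka_offset, unknown) are hypothetical, and the use of docker inspect to test whether the container is still running is an assumption about how "dies" would be detected.

```shell
#!/bin/sh
# Classify the last log line, read from stdin.
# Prints "kafka_offset" for the known offset error, "unknown" otherwise.
classify_log_tail() {
    if grep -qF arroyo.errors.OffsetOutOfRange; then
        echo kafka_offset
    else
        echo unknown
    fi
}

# $1: container name. Prints "running", "kafka_offset" or "unknown".
check_container() {
    # A missing container makes docker inspect fail, which is treated as not running
    if [ "$(docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null)" = "true" ]; then
        echo running
        return 0
    fi
    # Critical errors go to stderr, hence the 2>&1 before the pipe
    docker logs --tail=1 "$1" 2>&1 | classify_log_tail
}
```

For example, check_container sentry_post_process_forwarder_errors would print kafka_offset when the container is stopped and its last log line mentions arroyo.errors.OffsetOutOfRange.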

Java Kafka API reference
The procedure suggests using the CLI client, and its comments show we can check the LAG column before and after proceeding.

If we prefer a Java command to automate and check the operation, we can use:

Event Timeline

$ notification-push --project Nasqueron --group ops --service monitoring --type autoheal.kafka_offset.start --text "Containers sentry_post_process_forwarder_ have an issue. Identified as Kafka offset issue. Starting automatic healing procedure."
$ notification-push --project Nasqueron --group ops --service monitoring --type autoheal.kafka_offset.done --text "Containers sentry_post_process_forwarder_ automatic healing done. Containers should be alive."

$ notification-push --project Nasqueron --group ops --service monitoring --type autoheal.kafka.unknown --text "Containers sentry_post_process_forwarder_ have an issue. This is NOT a Kafka offset one." --link https://devcentral.nasqueron.org/T...

If one of the two topics lags, we'd have:

$ kafka-consumer-groups --bootstrap-server localhost:9092 --group snuba-post-processor -describe

GROUP                TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
snuba-post-processor transactions    0          879             880             1               rdkafka-59c2d3f9-53a0-47c3-aeec-3c3077e54e4c /172.18.3.23    rdkafka
snuba-post-processor events          0          52              52              0               rdkafka-2455812f-6810-440c-86ab-0c068604da4f /172.18.3.24    rdkafka

$ kafka-consumer-groups --bootstrap-server localhost:9092 --group snuba-post-processor -describe | awk '(NR>2) {print $6}' | grep -c -v -x 0
1

So I guess we can use the command line instead of the SDK, and validate with -describe that the LAG column contains only 0 using that command.
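
One possible shape for that validation, as a sketch: the function names are hypothetical, and the counting is done in awk on the whole LAG field so that a lag such as 10 is not mistaken for zero. A LAG shown as "-" (no committed offset) is counted as lagging here, which is a conservative assumption.

```shell
#!/bin/sh
# Count partitions whose LAG column (6th field) is not exactly 0.
# Reads the output of kafka-consumer-groups --describe on stdin.
# Blank lines and the header row (first field "GROUP") are skipped.
count_lagging_partitions() {
    awk '$1 != "GROUP" && NF >= 6 && $6 != "0" { n++ } END { print n + 0 }'
}

# Succeeds only when no partition of the consumer group reports lag.
kafka_lag_cleared() {
    lagging=$(kafka-consumer-groups --bootstrap-server localhost:9092 \
        --group snuba-post-processor --describe | count_lagging_partitions)
    [ "$lagging" -eq 0 ]
}
```

A healing script could call kafka_lag_cleared after restarting the containers and only send the "done" notification when it succeeds.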

[ Those tasks have been identified as suitable for the next operations sprint. ]