Automate Kafka cluster healing
Open, High, Public

Assigned To

None

Authored By

	dereckson
	Apr 2 2023, 10:31

Description

Yesterday evening, we got a new Kafka lag.

Detect if one of the sentry_post_process_forwarder_ containers dies.

If so, check the log for the error message and, if applicable, automatically apply the procedure described in the NOG: https://agora.nasqueron.org/Operations_grimoire/Sentry#Kafka

That would mean:

Notify the operation start
Stop both containers, as we can't update the offset as long as a client is connected to that consumer group
Reset Kafka offset
Start both containers
Notify the operation is done
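
The steps above could be sketched as a shell function. This is only a sketch under assumptions, not the NOG procedure itself: the second container name is a guess, the bootstrap server and consumer group are the ones used elsewhere in this task, resetting all topics of the group to the latest offset is assumed to be the wanted reset, and kafka-consumer-groups is assumed to be reachable from where the script runs. `run` is a hypothetical helper that only prints the commands when DRY_RUN=1.

```shell
#!/bin/sh
# Sketch of the healing sequence. Everything below is an assumption to
# adjust: container names, bootstrap server, consumer group, reset target.
CONTAINERS="${CONTAINERS:-sentry_post_process_forwarder_errors sentry_post_process_forwarder_transactions}"
BOOTSTRAP="${BOOTSTRAP:-localhost:9092}"
GROUP="${GROUP:-snuba-post-processor}"

# Hypothetical helper: print the command in dry-run mode, run it otherwise.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1 = notification type, $2 = notification text
notify() {
    run notification-push --project Nasqueron --group ops \
        --service monitoring --type "$1" --text "$2"
}

heal_kafka_offset() {
    notify autoheal.kafka_offset.start "Starting automatic healing procedure." || return 1
    # Offsets can only be reset while no consumer of the group is connected.
    # $CONTAINERS is left unquoted on purpose: one argument per container.
    run docker stop $CONTAINERS || return 1
    run kafka-consumer-groups --bootstrap-server "$BOOTSTRAP" \
        --group "$GROUP" --all-topics --reset-offsets --to-latest --execute || return 1
    run docker start $CONTAINERS || return 1
    notify autoheal.kafka_offset.done "Automatic healing done."
}
```

Running `DRY_RUN=1; heal_kafka_offset` prints the five commands in order without touching the containers, which is a cheap way to review the sequence before enabling it.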

If it's another error, as a script is already handling the failure, it would be nice if it creates a task on DevCentral with the log tail.

Notification can be done with notification-push
Docker Tide can be used to detect the need for healing

How to detect the issue
docker logs outputs critical errors to stderr, so first redirect stderr to stdout. We're only interested in the last line before the crash; earlier lines would belong to previous occurrences:

docker logs sentry_post_process_forwarder_errors --tail=1 2>&1 | grep -qF arroyo.errors.OffsetOutOfRange
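
To branch between the healing procedure and the "another error" case, the same check could be wrapped in a small function. This is a sketch: the function name `classify_crash` and the two labels it prints are hypothetical, only the grep pattern comes from the command above.

```shell
# Classify the last log line read from stdin.
# Prints "kafka_offset" when it mentions the Arroyo OffsetOutOfRange error,
# "unknown" otherwise (including when there is no input at all).
classify_crash() {
    if tail -n 1 | grep -qF 'arroyo.errors.OffsetOutOfRange'; then
        echo kafka_offset
    else
        echo unknown
    fi
}

# Usage, with the container name from the command above:
#   docker logs sentry_post_process_forwarder_errors --tail=1 2>&1 | classify_crash
```

Only the last line is inspected, so an OffsetOutOfRange error from an earlier crash does not trigger a reset for a different failure.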

Java Kafka API reference
The procedure suggests using the CLI client, and its comments show we can check the LAG column before and after proceeding.

If we prefer a Java command to automate and check the operation, we can use:

Related Objects

Status	Assigned	Task
Open	None	T1621 Prepare a more flexible containers platform
Open	dereckson	T1809 Propagate containers-related events
Open	inidal	T771 Allow to send notifications from the command line
Open	None	T1816 Automate Kafka cluster healing

Event Timeline

dereckson triaged this task as High priority. Apr 2 2023, 10:31

dereckson created this task.

dereckson added parent tasks: T771: Allow to send notifications from the command line, T1809: Propagate containers-related events.

$ notification-push --project Nasqueron --group ops --service monitoring --type autoheal.kafka_offset.start --text "Containers sentry_post_process_forwarder_ have an issue. Identified as Kafka offset issue. Starting automatic healing procedure."
$ notification-push --project Nasqueron --group ops --service monitoring --type autoheal.kafka_offset.done --text "Containers sentry_post_process_forwarder_ automatic healing done. Containers should be alive."

$ notification-push --project Nasqueron --group ops --service monitoring --type autoheal.kafka.unknown --text "Containers sentry_post_process_forwarder_ have an issue. This is NOT a Kafka offset one." --link https://devcentral.nasqueron.org/T...

If one of the two topics lags, we'd have:

$ kafka-consumer-groups --bootstrap-server localhost:9092 --group snuba-post-processor -describe

GROUP                TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID                                  HOST            CLIENT-ID
snuba-post-processor transactions    0          879             880             1               rdkafka-59c2d3f9-53a0-47c3-aeec-3c3077e54e4c /172.18.3.23    rdkafka
snuba-post-processor events          0          52              52              0               rdkafka-2455812f-6810-440c-86ab-0c068604da4f /172.18.3.24    rdkafka

$ kafka-consumer-groups --bootstrap-server localhost:9092 --group snuba-post-processor -describe | awk '(NR>2) {print $6}' | grep -c -v -x 0
1

So I guess we can use the command line instead of the SDK, and validate with -describe that the LAG column only contains 0 using that command.
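
A slightly more defensive variant of that check could locate the LAG column by its header instead of assuming it is the sixth field, and compare each value to exactly "0" so that lags such as 10 are still counted. This is a sketch; `count_lagging` is a hypothetical name, and a LAG shown as "-" (no committed offset) is counted as lagging here, which is an assumption.

```shell
# Count partitions with a non-zero lag in `kafka-consumer-groups` describe
# output read from stdin. The LAG column is found by its header name.
count_lagging() {
    awk '
        # Until the header is seen, look for the LAG column and skip the line.
        lag_col == 0 {
            for (i = 1; i <= NF; i++) if ($i == "LAG") lag_col = i
            next
        }
        # Data rows: count every row whose LAG field is not exactly "0".
        NF >= lag_col && $lag_col != "0" { n++ }
        END { print n + 0 }
    '
}

# Usage:
#   kafka-consumer-groups --bootstrap-server localhost:9092 \
#       --group snuba-post-processor -describe | count_lagging
```

After the healing procedure, the function printing 0 would mean every partition of the group has caught up.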

dereckson updated the task description. Apr 2 2023, 11:29

dereckson updated the task description. Apr 2 2023, 11:38

[ Those tasks have been identified as suitable for the next operations sprint. ]

dereckson moved this task from Backlog to Backlog - Docker on the Operations sprints (Ignite Alkane Propulsion) board. May 6 2023, 15:55
