Automate GRE tunnel failover on CARP master switch
Open, Normal, Public

Description

The goal of this task is to implement an automated failover mechanism that ensures GRE tunnels always point to the current CARP MASTER.

When a router transitions to the CARP MASTER state (the ACTIVE router), a devd-triggered script implemented in D4033 emits a Salt event. The Salt master receives this event and uses a reactor to trigger GRE tunnel reconfiguration on Ysul and Windriver.

The reconfiguration process will:

  1. Remove the existing GRE tunnel
  2. Recreate a new tunnel toward the new ACTIVE router
  3. Reload IPsec

This approach ensures that tunnel configuration dynamically follows CARP state changes, avoiding manual intervention and reducing downtime during failover events.
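
The three reconfiguration steps above can be sketched as a small command planner. This is only an illustration under stated assumptions: the hosts are assumed to use FreeBSD-style ifconfig syntax, and the interface name gre0, the documentation-range addresses and the IPsec reload command are placeholders, not values taken from the real setup. The function only builds the commands; it executes nothing.

```python
from typing import List


def plan_gre_failover(ifname: str, local_ip: str, remote_ip: str,
                      ipsec_reload: List[str]) -> List[List[str]]:
    """Return the commands for the three reconfiguration steps, in order.

    Assumption: FreeBSD-style ifconfig syntax. Nothing is executed here,
    so the plan can be reviewed or logged before being run.
    """
    return [
        # 1. Remove the existing GRE tunnel
        ["ifconfig", ifname, "destroy"],
        # 2. Recreate the tunnel toward the new CARP MASTER
        ["ifconfig", ifname, "create"],
        ["ifconfig", ifname, "tunnel", local_ip, remote_ip],
        # 3. Reload IPsec (the exact command is an assumption, supplied by the caller)
        list(ipsec_reload),
    ]


# Placeholder values: 192.0.2.10 and 198.51.100.3 are documentation addresses.
plan = plan_gre_failover("gre0", "192.0.2.10", "198.51.100.3",
                         ["service", "ipsec", "reload"])
for cmd in plan:
    print(" ".join(cmd))
```

The inner tunnel addresses and routes are left out on purpose, since the task does not describe them; a real script would have to restore them after recreating the interface.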


Steps:

  1. Send a test Salt event from a router to validate event emission:
     sudo salt-call event.send 'carp/master' '{"router": "router-003"}'
  2. Verify that the event is correctly received on the Salt master event bus:
     salt-run state.event pretty=True
  3. Integrate the event emission into the devd-triggered script (D4033)
  4. Configure a Salt reactor to listen for the CARP MASTER event and trigger an action
  5. Implement a script to handle GRE tunnel reconfiguration on Ysul
  6. Trigger the reconfiguration from the reactor upon event reception
  7. Test the full failover scenario (CARP switch) and validate tunnel recreation
  8. Verify connectivity and routing after failover
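
Steps 1 and 3 use the same salt-call event.send invocation, so the devd-triggered script could build it with a small helper. The sketch below is an assumption about how that might look, not the code from D4033; the function names are hypothetical. Only the argument list is built unless send_event is called, which needs root and a running salt-minion.

```python
import json
import subprocess
from typing import List


def build_event_command(tag: str, router: str) -> List[str]:
    """Build the salt-call argv that sends an event carrying the router name."""
    payload = json.dumps({"router": router})
    return ["salt-call", "event.send", tag, payload]


def send_event(tag: str, router: str) -> bool:
    """Run salt-call and report whether it exited with status 0."""
    result = subprocess.run(build_event_command(tag, router), check=False)
    return result.returncode == 0


# Same tag and payload as the manual test in step 1.
argv = build_event_command("carp/master", "router-003")
print(argv)
```

Passing the arguments as a list avoids shell quoting problems with the JSON payload.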

If everything works as expected, we can then test the setup with Windriver, as GRE and CARP interactions have previously caused issues. At this stage, it is still unclear which component is responsible for the problem.


References:

https://docs.saltproject.io/en/3007/ref/modules/all/salt.modules.event.html
https://mpolinowski.github.io/docs/DevOps/Salt/2020-06-20--salt-reactor-events/2020-06-20/

Event Timeline

Duranzed triaged this task as Normal priority. Wed, Apr 22, 12:26
Duranzed created this task.
yousra updated the task description.

Test to validate Salt event emission and reception on the master:

[yousra@complector /opt/salt/nasqueron-operations]$ salt-run state.event pretty=True

salt/auth	{
    "_stamp": "2026-04-26T14:27:48.132991",
    "act": "accept",
    "id": "router-003",
    "pub": "-----BEGIN PUBLIC KEY-----\nMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAmiiNW666p/EZ3lVD1cCi\nCYsrF+V4v4eiZG+5fKXVJAoknyrcQ7ED6LhAagR1aqWI9KkEVALKM8TMB7qT827A\n/c/+YnQauVADmdjOkEs7DIaJa5RQZgx4qEmqHPryJUGCkS8AfxF8H6YoM8CUyTuR\na2JrLrMerXqHyFC/ZNt9yIujXhZowIRTR7WQCd6OGShnUH/uKi8TSc8yjwfqGUdM\nQJ0HJjjVNaZ8XfNox4ml6Q1acMCZtTOH5/LHstqrIp3Y87g8bLtnk0jJbp8Hbfbj\nwDhhXZYt2W9eEdCdEMhgPr4campuuE9DrMV3UfRieXPSrRb3gqCpyLy8EXQDZtX7\nJQIDAQAB\n-----END PUBLIC KEY-----",
    "result": true
}
minion/refresh/router-003	{
    "Minion data cache refresh": "router-003",
    "_stamp": "2026-04-26T14:27:49.020788"
}
test/carp	{
    "_stamp": "2026-04-26T14:27:49.150709",
    "cmd": "_minion_event",
    "data": {
        "__pub_fun": "event.send",
        "__pub_jid": "20260426142749137759",
        "__pub_pid": 6144,
        "__pub_tgt": "salt-call",
        "router": "router-003"
    },
    "id": "router-003",
    "tag": "test/carp",
    "ts": 1777213669
}
salt/job/20260426142749162569/ret/router-003	{
    "_stamp": "2026-04-26T14:27:49.164093",
    "arg": [
        "test/carp",
        "{\"router\": \"router-003\"}"
    ],
    "cmd": "_return",
    "fun": "event.send",
    "fun_args": [
        "test/carp",
        "{\"router\": \"router-003\"}"
    ],
    "id": "router-003",
    "jid": "20260426142749162569",
    "retcode": 0,
    "return": true,
    "tgt": "router-003",
    "tgt_type": "glob",
    "ts": 1777213669
}

[yousra@router-003 ~]$ sudo salt-call event.send 'test/carp' '{"router": "router-003"}'
local:

True

I created and tested a Salt reactor that listens for the carp/master event sent by the routers. For now, the reactor only runs a test command on Ysul and Windriver to confirm that the event is correctly received and that the master can trigger actions on those hosts.

The next step will be to replace this test command with the real GRE tunnel reconfiguration script.
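
One piece the real script will need is a way to turn the router name carried by the event into the tunnel's remote endpoint. A minimal sketch follows, assuming a static lookup table; the table, the function name and the addresses (documentation ranges) are hypothetical and would have to be replaced with the real data source, for example pillar data.

```python
from typing import Dict

# Hypothetical mapping: documentation-range addresses, not the real ones.
ROUTER_ENDPOINTS: Dict[str, str] = {
    "router-002": "198.51.100.2",
    "router-003": "198.51.100.3",
}


def remote_endpoint(event_data: dict,
                    endpoints: Dict[str, str] = ROUTER_ENDPOINTS) -> str:
    """Return the tunnel remote address for the router named in the event.

    Raises ValueError on a missing or unknown router, so that a malformed
    event never results in a tunnel pointing at an arbitrary address.
    """
    router = event_data.get("router")
    if router not in endpoints:
        raise ValueError(f"unknown router in event: {router!r}")
    return endpoints[router]


print(remote_endpoint({"router": "router-003"}))
```

Rejecting unknown names matters here because any minion can send an event with this tag.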

[yousra@router-003 /usr/local/libexec/carp]$ sudo /usr/local/libexec/carp/carp-ovh-failover 2@vmx1 MASTER

Detected MAC on vmx1: 00:50:56:09:98:fc
Checking current state...
Checking IPs for MAC 00:50:56:09:98:fc
OVH returned: ['51.68.252.230', '178.32.70.111']
VIP is already on correct MAC -> nothing to do
Sending Salt event: carp/master {"router": "router-003"}
local:
    True

In /usr/local/etc/salt/master.d/carp-master-reactor.conf:

reactor:
  - 'carp/master':
    - /srv/reactor/gre.sls

In /srv/reactor/gre.sls:

test_reactor:
  local.cmd.run:
    - tgt: 'ysul,windriver'
    - tgt_type: list
    - arg:
      - 'echo OK_CARP_MASTER_{{ data["data"]["router"] }} >> /tmp/test-reactor.log'
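
To illustrate what the Jinja expression data["data"]["router"] resolves to, the following Python snippet applies the same lookup to an event shaped like the one captured on the master bus earlier in this task (trimmed to the relevant keys). It is only an illustration of the nesting, not part of the reactor.

```python
# Event as seen on the master bus, trimmed from the test output in this task.
event = {
    "cmd": "_minion_event",
    "data": {
        "__pub_fun": "event.send",
        "router": "router-003",
    },
    "id": "router-003",
    "tag": "test/carp",
}

# Same lookup as the Jinja expression in the reactor file:
# the outer "data" key holds the payload passed to event.send.
router = event["data"]["router"]
command = f"echo OK_CARP_MASTER_{router} >> /tmp/test-reactor.log"
print(command)
```

The top-level "id" key holds the ID of the minion that sent the event, which is separate from the "router" value in the payload.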

Result: the action was executed successfully.

[yousra@windriver /tmp]$ cat test-reactor.log 
OK_CARP_MASTER_router-003
OK_CARP_MASTER_router-003
OK_CARP_MASTER_router-003
[yousra@ysul /tmp]$ cat test-reactor.log 
OK_CARP_MASTER_router-003
OK_CARP_MASTER_router-003
OK_CARP_MASTER_router-003

@dereckson I am not sure about the paths. Could you check them, please?

yousra updated the task description.

When a GRE tunnel is created with the alias IP of Ysul as its endpoint, the tunnel does not respond to ping. When the tunnel is created with the public IP of Ysul instead, it responds to ping normally.
I suspect the problem comes from using an alias IP as the GRE endpoint, which suggests an encapsulation/decapsulation issue.

	inet 163.172.49.16 netmask 0xffffff00 broadcast 163.172.49.255
	inet 212.83.187.132 netmask 0xffffffff broadcast 212.83.187.132
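
When comparing the two addresses, it can help to convert the hexadecimal netmasks into prefix lengths. The sketch below parses ifconfig-style inet lines like the two above; the function name is hypothetical. Reading the /32 entry as the alias is an assumption based on how aliases are commonly configured, not something the output itself states.

```python
import ipaddress
import re
from typing import List, Tuple

# Matches lines such as: inet 163.172.49.16 netmask 0xffffff00 broadcast ...
INET_RE = re.compile(r"inet (\S+) netmask (0x[0-9a-fA-F]{8})")


def parse_inet_lines(text: str) -> List[Tuple[str, int]]:
    """Return (address, prefix length) for each inet line with a hex netmask."""
    result = []
    for addr, mask_hex in INET_RE.findall(text):
        # Convert the hex netmask to dotted form, then to a prefix length.
        mask = ipaddress.IPv4Address(int(mask_hex, 16))
        prefix = ipaddress.IPv4Network(f"0.0.0.0/{mask}").prefixlen
        result.append((addr, prefix))
    return result


sample = (
    "\tinet 163.172.49.16 netmask 0xffffff00 broadcast 163.172.49.255\n"
    "\tinet 212.83.187.132 netmask 0xffffffff broadcast 212.83.187.132\n"
)
for addr, prefix in parse_inet_lines(sample):
    print(f"{addr}/{prefix}")
```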

We noticed that Windriver cannot ping the public IP addresses of router-002 and router-003. GRE tunnel creation nevertheless succeeds, and connectivity through the tunnel works with router-002, even though its public IP still does not answer ping.

yousra updated the task description.