
Server outage: Investigate Ysul.nasqueron.org network access and unresponsive Salt minion
Open, High, Public

Description

Ysul should be tracked separately as its own issue.

Current status:

  • Ysul has been inaccessible for more than one week
  • its Salt minion does not respond
  • IPsec had already been deployed with Salt on ysul, as on router-002, router-003 and windriver, before ysul started having issues
  • the host currently cannot be used for further validation or troubleshooting

Impact:

  • prevents further validation of the current GRE/IPsec/CARP setup on all intended nodes
  • blocks further troubleshooting involving ysul
  • leaves the node outside the current operational Salt workflow

Requested investigation:

  • check network reachability to ysul
  • check Salt minion status and logs
  • check current routing/path to ysul through the new router setup

Event Timeline

Duranzed renamed this task from Server outage: Investigate Ysul network access and unresponsive Salt minion to Server outage: Investigate Ysul.nasqueron.org network access and unresponsive Salt minion. (Wed, May 20, 15:16)

It's possible to call Salt from Complector against the relevant server with salt <server name>, or the other way around, from the server itself, with salt-call.

As Salt debug logging is enabled on Ysul, we have a verbose log:

Ysul
$ salt-call test.ping
[DEBUG   ] Connecting to master. Attempt 1 of 1
[DEBUG   ] "complector.nasqueron.drake" Not an IP address? Assuming it is a hostname.
[DEBUG   ] Master URI: tcp://172.27.27.7:4506
[DEBUG   ] Initializing new AsyncAuth for ('/usr/local/etc/salt/pki/minion', 'ysul', 'tcp://172.27.27.7:4506', '1776438689.139444')
[DEBUG   ] Generated random reconnect delay between '1000ms' and '11000ms' (1832)
[DEBUG   ] Setting zmq_reconnect_ivl to '1832ms'
[DEBUG   ] Setting zmq_reconnect_ivl_max to '11000ms'
[DEBUG   ] salt.crypt.get_rsa_key: Loading private key
[DEBUG   ] salt.crypt._get_key_with_evict: Loading private key
[DEBUG   ] SaltEvent PUB socket URI: /var/run/salt/minion/minion_event_6f41902e5c_pub.ipc
[DEBUG   ] SaltEvent PULL socket URI: /var/run/salt/minion/minion_event_6f41902e5c_pull.ipc
[DEBUG   ] Closing AsyncReqChannel instance
[ERROR   ] Master is unavailable (Connection Cancelled).
Unable to sign_in to master: Attempt to authenticate with the salt master failed with timeout error
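
The log shows the minion timing out against tcp://172.27.27.7:4506. One way to tell a network problem apart from a Salt problem is a plain TCP connection attempt to the master's ports, run from Ysul. A minimal sketch, assuming Python 3 is available on the host; the helper name can_connect is hypothetical, 4506 comes from the log above, and 4505 is Salt's default publish port:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def check_master(host: str = "172.27.27.7", ports=(4505, 4506)) -> dict:
    """Probe the Salt master ports; the address is the one resolved in the debug log."""
    return {port: can_connect(host, port) for port in ports}
```

If check_master() reports False for both ports when run from Ysul, that would suggest the path to Complector is broken, rather than the minion keys or configuration.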

There are two GRE tunnels on Ysul, one to router-001 and one to the primary router:

  • gre0 / drake_via_router-001 / 163.172.49.16 --> 51.255.124.8 / 172.27.27.33 --> 172.27.27.252
  • gre1 / ysul_to_primary_router / 163.172.49.16 --> 51.68.252.230 / 172.27.27.31 --> 172.27.27.241

We had that kind of issue before where:

  • the outgoing packet uses one tunnel
  • the reply comes back through another tunnel
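
That asymmetric pattern can be written down as a small check. This is only a sketch: it reads the tunnel list above as "outer local --> outer remote / inner local --> inner remote", which is an assumption, and the names Tunnel, tunnel_owning and is_asymmetric are hypothetical. The reply interface would have to be observed separately, for instance with a packet capture on each gre interface.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tunnel:
    name: str
    local_inner: str   # tunnel address on ysul
    remote_inner: str  # tunnel address on the peer


# Inner addresses copied from the tunnel list in this task.
YSUL_TUNNELS = [
    Tunnel("gre0", "172.27.27.33", "172.27.27.252"),
    Tunnel("gre1", "172.27.27.31", "172.27.27.241"),
]


def tunnel_owning(address: str, tunnels: Sequence[Tunnel]) -> Optional[Tunnel]:
    """Return the tunnel whose local inner address equals `address`, or None."""
    wanted = ipaddress.ip_address(address)
    for tunnel in tunnels:
        if ipaddress.ip_address(tunnel.local_inner) == wanted:
            return tunnel
    return None


def is_asymmetric(source_address: str, reply_interface: str,
                  tunnels: Sequence[Tunnel]) -> bool:
    """True when the reply arrives on a tunnel other than the one owning the source address."""
    owner = tunnel_owning(source_address, tunnels)
    return owner is not None and owner.name != reply_interface
```

For example, a request sent from 172.27.27.31 whose reply shows up on gre0 would be flagged, since that source address belongs to gre1.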

Here, from Complector, it's not possible to ping either of them.

Routes:

Complector
$ netstat -r
default            172.27.27.11       UG1            vmx0

router-002
$ netstat -r
172.27.27.31       link#7             UH             gre2

And there is no route back to 172.27.27.33.

The gre2 tunnel doesn't seem to work (can't ping 172.27.27.31).
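
The route check above amounts to a longest-prefix lookup over the netstat -r output. The sketch below uses only the two excerpt lines quoted in this task, not the full routing tables, and assumes numeric output (as with netstat -rn); the names Route, parse_routes and lookup are hypothetical.

```python
import ipaddress
from typing import List, NamedTuple, Optional


class Route(NamedTuple):
    network: ipaddress.IPv4Network
    gateway: str
    flags: str
    netif: str


def parse_routes(text: str) -> List[Route]:
    """Parse 'Destination Gateway Flags Netif' lines, skipping anything else."""
    routes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        dest, gateway, flags, netif = fields[:4]
        try:
            network = ipaddress.ip_network(
                "0.0.0.0/0" if dest == "default" else dest, strict=False
            )
        except ValueError:
            continue  # header lines, hostnames, unparsable destinations
        if network.version == 4:
            routes.append(Route(network, gateway, flags, netif))
    return routes


def lookup(routes: List[Route], address: str) -> Optional[Route]:
    """Return the most specific route matching `address`, or None."""
    addr = ipaddress.ip_address(address)
    matches = [r for r in routes if addr in r.network]
    return max(matches, key=lambda r: r.network.prefixlen, default=None)


# Excerpts quoted in this task (not the complete tables).
COMPLECTOR = parse_routes("default            172.27.27.11       UG1            vmx0")
ROUTER_002 = parse_routes("172.27.27.31       link#7             UH             gre2")
```

With these excerpts, lookup(ROUTER_002, "172.27.27.31") resolves to the gre2 host route, while lookup(ROUTER_002, "172.27.27.33") finds nothing, which matches the missing return route noted above. Feeding in the full tables from each host would be needed to confirm this.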