
Ysul doesn't answer on the network
Closed, Resolved (Public)

Description

Last actions done

  • Increase UDP buffer size, as transmission complained:
    • kern.ipc.maxsockbuf to 4194304 (4M)
    • net.inet.udp.recvspace to 4194304 as well (the buffer size transmission wanted)
  • Check whether TCP traffic on port 3120 still exists
    • net.inet.tcp.log_in_vain to 1

Suggestion to fix

  • At the console, restore decent settings
  • If it fails, reboot
  • Change failover IP address if there are still issues
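
To make "restore decent settings" easier at the console, the sysctl changes listed under "Last actions done" could be applied through a small wrapper that records each previous value first. This is a minimal sketch, assuming a POSIX shell on the FreeBSD guest; `set_sysctl` and `SYSCTL_UNDO_LOG` are hypothetical names introduced here, not something used in this task.

```shell
#!/bin/sh
# Sketch: change a sysctl while logging the command that reverts it.
# set_sysctl and SYSCTL_UNDO_LOG are hypothetical names for illustration.

SYSCTL_UNDO_LOG="${SYSCTL_UNDO_LOG:-/tmp/sysctl-undo.log}"

set_sysctl() {
    name="$1"
    value="$2"
    # "sysctl -n" prints only the current value, without the variable name.
    old="$(sysctl -n "$name")" || return 1
    # Record the exact command that restores the previous value.
    echo "sysctl $name=$old" >> "$SYSCTL_UNDO_LOG"
    sysctl "$name=$value"
}

# Values taken from the task description:
# set_sysctl kern.ipc.maxsockbuf 4194304
# set_sysctl net.inet.udp.recvspace 4194304
# set_sysctl net.inet.tcp.log_in_vain 1
```

Running `sh "$SYSCTL_UNDO_LOG"` afterwards would replay the recorded commands and put the old values back.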

Event Timeline

dereckson claimed this task.
dereckson raised the priority of this task to Unbreak Now!.
dereckson updated the task description.
dereckson added a project: Servers.
dereckson added a subscriber: dereckson.

Server bandwidth

graph-daily_bandwidth-125620 (1).png (340×606 px, 28 KB)

  • On the left, a squid ACL error, leading to an open proxy situation
  • In the center, normal operations bandwidth
  • On the right, transmission bandwidth

Can't connect to the hypervisor through vSphere, although it answers immediately over SSH.

Update: SSH doesn't seem to be a real solution either. Over SSH:

~ # esxcli vm process list
Connect to localhost failed: Connection failure

The datastore partition on the hypervisor was at 100%.

I deleted content from /vmfs/volumes/datastore1/ISO to free 10 GB.

I tried a services.sh restart on the hypervisor.

Works. Ysul answers again.

To be confirmed by the logs, but the most probable hypothesis is that Ysul was automatically paused when the datastore became full and the VM claimed more storage space.
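
A full datastore like this one can be spotted earlier by filtering `df` output for nearly full filesystems. The sketch below assumes `df -P` style columns (usage percentage in the fifth column, mount point in the sixth) and mount points without spaces; whether the hypervisor's own `df` accepts `-P` is an assumption to verify there. `full_mounts` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print mount points whose usage is at or above a threshold.
# Reads df-style output on stdin; full_mounts is a hypothetical helper name.
full_mounts() {
    threshold="${1:-90}"
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # strip the trailing percent sign
        if (use + 0 >= t) print $6, use "%"
    }'
}

# Usage: df -P | full_mounts 95
```

The threshold defaults to 90 when no argument is given.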

On Ysul: # sysctl net.inet.tcp.log_in_vain=0

The pause was perhaps justified:

Ysul log before pause

Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 00 04 d7 75 a2 00 01 00 00
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): CAM status: SCSI Status Error
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): SCSI status: Busy
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): Retrying command
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 00 04 d7 76 a2 00 00 40 00
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): CAM status: SCSI Status Error
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): SCSI status: Busy
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): Retrying command
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 00 04 d7 7b a2 00 01 00 00
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): CAM status: SCSI Status Error
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): SCSI status: Busy
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): Retrying command
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 00 04 d7 7c a2 00 00 c0 00
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): CAM status: SCSI Status Error
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): SCSI status: Busy
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): Retrying command
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 00 04 d7 84 62 00 01 00 00
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): CAM status: SCSI Status Error
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): SCSI status: Busy
Aug 2 07:48:33 ysul kernel: (da0:
Aug 2 07:48:33 ysul kernel: mpt0:0:
Aug 2 07:48:33 ysul kernel: 0:
Aug 2 07:48:33 ysul kernel: 0):
Aug 2 07:48:33 ysul kernel: Retrying command
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 00 04 d7 85 62 00 01 00 00
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): CAM status: SCSI Status Error
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): SCSI status: Busy
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): Retrying command
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): WRITE(10). CDB: 2a 00 04 d7 8f a2 00 01 00 00
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): CAM status: SCSI Status Error
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): SCSI status: Busy
Aug 2 07:48:33 ysul kernel: (da0:mpt0:0:0:0): Retrying command

I tried to restart netif, but the result wasn't really promising (see P107), so I rebooted the machine.

That should free memory, network connection buffers and tables.

For information, uptime was 12 days.

  • tmux works
  • At start time, MySQL failed with ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2), but it works after a restart of the MySQL server
  • I restored the blackhole mitigation configuration net.inet.tcp.blackhole: 0 -> 2
  • Outgoing network works for http://tools.nasqueron.org/wikimedia/write/sourcetemplatesgenerator/

root@ysul:~ # ntpdate fr.pool.ntp.org
2 Aug 09:14:28 ntpdate[1739]: step time server 212.83.131.33 offset 2906.905153 sec
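
For scale, the offset reported by ntpdate above can be converted to hours, minutes and seconds. A lag of this size would be consistent with the guest clock having stopped while the VM was paused, though that link is an assumption and not confirmed by the logs. `offset_to_hms` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: convert an ntpdate offset in seconds to h/m/s.
# offset_to_hms is a hypothetical helper name.
offset_to_hms() {
    # Drop the fractional part; POSIX shell arithmetic is integer-only.
    secs="${1%.*}"
    printf '%dh %dm %ds\n' $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

offset_to_hms 2906.905153   # prints "0h 48m 26s"
```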

dereckson lowered the priority of this task from Unbreak Now! to Normal. (Aug 2 2015, 09:15)

11:13 -!- Daeghrefn [surfboard@wikimedia/bot/Daeghrefn] has joined #wikipedia-fr

/home/dereckson/dev/nasqueron/operations/tests/prod-environment-behaves-correctly ] make test
[...]
OK (2 tests, 7 assertions)

Okay, we can decrease the incident severity. All systems look good to me.

I'm resolving the ticket, but see T521 to prevent this from happening again.