Server outage: ysul.nasqueron.org (zfs disk free space issue)
Closed, Resolved, Public

Description

URL of the service

ysul.nasqueron.org

Issue

  • A zfs destroy operation was ongoing on the letsencrypt jail. VMware froze the virtual machine because no space remained for the ZFS disk (storage quota reached).
  • The hypervisor and the other servers on it respond well (low ping, no packet loss, web services comfortable to use).

Event Timeline

dereckson updated the task description.

I'll attach a vSphere client to the server to see what is happening. ETA 22:30.

dereckson renamed this task from Server outage: ysul.nasqueron.org to Server outage: ysul.nasqueron.org (zfs disk free space issue). May 8 2017, 23:12
dereckson claimed this task.
dereckson updated the task description.

Ysul is running on a snapshot, which explains why the disk usage keeps growing.

$ cd /vmfs/volumes/datastore1/Ysul
$ ls -lah | grep G
-rw-------    1 root     root      111.4G May  8 21:46 Ysul-000001-delta.vmdk
-rw-------    1 root     root        3.5G May  8 05:11 Ysul-e0da8ef8.vswp
-rw-------    1 root     root      400.0G Oct 19  2016 Ysul-flat.vmdk
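The `-delta.vmdk` file in the listing above is what gives the snapshot away. As a minimal sketch, assuming a POSIX shell, such files could be spotted in a VM directory with a small helper. The function name `list_snapshot_deltas` is hypothetical, not an ESXi tool:

```shell
# Hypothetical helper: print the names of snapshot delta disks found in a
# VM directory (files matching *-delta.vmdk). Prints nothing if none exist.
list_snapshot_deltas() {
    dir="$1"
    for f in "$dir"/*-delta.vmdk; do
        # When nothing matches, the glob stays literal; skip that case.
        [ -e "$f" ] || continue
        # Strip the directory part, keep the file name only.
        printf '%s\n' "${f##*/}"
    done
}

# Example call (path taken from the listing above):
#   list_snapshot_deltas /vmfs/volumes/datastore1/Ysul
```

An empty result would suggest the VM is not currently running on a snapshot, at least as far as this naming pattern goes.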

The solution is rather easy: we don't currently use the Docker 200G disk previously attached to Dwellers. We can drop that one.

dereckson lowered the priority of this task from Unbreak Now! to High. May 8 2017, 23:45

Hypervisor happy, server running again.

Let's get out of the snapshot situation.

Snapshot consolidation started; it will take a long time. During this operation, I/O performance will be degraded.

To give an idea of how slow this process is:

  • operation started at 23:59:09
  • 15 minutes later (now), we're at 1% done
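For a rough sense of scale, the two figures above can be extrapolated linearly. This is only a sketch: it assumes the progress rate stays constant, which consolidation does not guarantee, and the helper name `estimate_total_minutes` is made up for illustration:

```shell
# Hypothetical helper: naive linear estimate of the total duration, in
# minutes, from the elapsed minutes and the percentage done so far.
# Uses integer arithmetic, so the result is rounded down.
estimate_total_minutes() {
    elapsed_min="$1"
    percent_done="$2"
    # Refuse to divide by zero or a negative percentage.
    [ "$percent_done" -gt 0 ] || return 1
    echo $(( elapsed_min * 100 / percent_done ))
}

# 15 minutes for 1% gives 1500 minutes, i.e. 25 hours at a constant rate.
estimate_total_minutes 15 1
```

The figure is best read as a pessimistic bound taken from an early sample rather than as a prediction.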

Server consolidation done.

2017-05-09T02:49:56.807Z| vcpu-2| I120: HBACommon: First write on scsi0:0.fileName='/vmfs/volumes/53b61491-b0561938-eec8-0cc47a055d1e/Ysul/Ysul.vmdk'
2017-05-09T02:51:53.022Z| vcpu-1| I120: HBACommon: First write on scsi0:1.fileName='/vmfs/volumes/53b61491-b0561938-eec8-0cc47a055d1e/Ysul/Ysul-zfspool.vmdk'

A side issue: ntpd wasn't running, and the clock had drifted. That was resolved at 12:28:49.