
Server outage: ysul.nasqueron.org (zfs disk free space issue)
Closed, Resolved. Public.

Description

URL of the service

ysul.nasqueron.org

Issue

  • A zfs destroy operation was ongoing on the letsencrypt jail. VMware freezes the virtual machine because there is no space left for the ZFS disk (storage quota reached).
  • The hypervisor and the other servers on the same hypervisor respond well (low ping, no packet loss, web services comfortable to use).

Event Timeline

dereckson created this task. May 8 2017, 21:56
dereckson updated the task description.

I'll attach a vSphere client to the server to see what happens. ETA 22:30.

dereckson renamed this task from Server outage: ysul.nasqueron.org to Server outage: ysul.nasqueron.org (zfs disk free space issue). May 8 2017, 23:12
dereckson updated the task description.
dereckson claimed this task.
dereckson added a comment. May 8 2017, 23:17

Ysul is running on a snapshot, which explains the growth: the snapshot delta file keeps growing as the VM writes to disk.

$ cd /vmfs/volumes/datastore1/Ysul
$ ls -lah | grep G
-rw-------    1 root     root      111.4G May  8 21:46 Ysul-000001-delta.vmdk
-rw-------    1 root     root        3.5G May  8 05:11 Ysul-e0da8ef8.vswp
-rw-------    1 root     root      400.0G Oct 19  2016 Ysul-flat.vmdk
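
To follow how much space the snapshot is eating, the delta files can be summed up. This is only a minimal sketch: snapshot_delta_bytes is a hypothetical helper name, and it assumes delta files are named *-delta.vmdk as in the listing above and sit directly in the VM directory.

```shell
# Hypothetical helper: sum the sizes (in bytes) of the snapshot delta files
# found directly inside a VM directory.
snapshot_delta_bytes() {
    dir="$1"
    # ls -ln prints the size in bytes in field 5; numeric owner/group IDs
    # keep the columns stable. If no delta file matches, ls fails quietly
    # and awk prints 0.
    ls -ln "$dir"/*-delta.vmdk 2>/dev/null |
        awk '{ total += $5 } END { printf "%.0f\n", total }'
}

# Example (path taken from the listing above):
# snapshot_delta_bytes /vmfs/volumes/datastore1/Ysul
```

Running it at intervals and comparing the result with the free space left on the datastore gives an early warning before the quota is reached again.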

The solution is rather easy: we don't currently use the 200G Docker disk previously attached to Dwellers, so we can drop that one.

dereckson lowered the priority of this task from Unbreak Now! to High. May 8 2017, 23:45

Hypervisor happy, server running again.

Let's get out of the snapshot situation.

Snapshot consolidation has started; it will take a long time. During this operation, I/O performance will be degraded.

dereckson moved this task from Backlog to Working on on the Servers board. May 9 2017, 00:00
dereckson added a project: User-Dereckson.
dereckson added a comment. May 9 2017, 00:14

To give an idea of how slow this process is:

  • operation started at 23:59:09
  • 15 minutes later (now), we're at 1% done
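
For a rough feel of what those two numbers imply, here is a back-of-the-envelope extrapolation. It is a sketch only: eta_remaining_minutes is a hypothetical helper, and it assumes progress stays linear, which a consolidation is not guaranteed to do.

```shell
# Hypothetical helper: estimate the remaining minutes of a long operation,
# assuming progress stays linear (an assumption, not a VMware guarantee).
eta_remaining_minutes() {
    elapsed_min="$1"   # minutes elapsed so far
    percent_done="$2"  # integer percentage completed so far
    if [ "$percent_done" -le 0 ]; then
        # No progress reported yet: nothing to extrapolate from.
        echo "unknown"
        return 1
    fi
    echo $(( elapsed_min * (100 - percent_done) / percent_done ))
}

eta_remaining_minutes 15 1   # prints 1485, i.e. about 24 hours 45 minutes
```

The figure is a pessimistic upper bound at best, since a 1% reading is too coarse to extrapolate from reliably.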
dereckson closed this task as Resolved. May 9 2017, 12:40

Server consolidation done.

2017-05-09T02:49:56.807Z| vcpu-2| I120: HBACommon: First write on scsi0:0.fileName='/vmfs/volumes/53b61491-b0561938-eec8-0cc47a055d1e/Ysul/Ysul.vmdk'
2017-05-09T02:51:53.022Z| vcpu-1| I120: HBACommon: First write on scsi0:1.fileName='/vmfs/volumes/53b61491-b0561938-eec8-0cc47a055d1e/Ysul/Ysul-zfspool.vmdk'

dereckson added a comment. May 9 2017, 12:49

A side issue: ntpd wasn't running, so the clock had drifted. That was resolved at 12:28:49.