
Investigate the Stormshear issue
Closed, Invalid (Public)

Description

Yesterday evening, our main hypervisor stopped answering ping requests, first after some hours, then after mere minutes.

During the night, I provisioned a new hypervisor per the T1215 plans and migrated Dwellers (successfully) and Ysul (see below) to it.

Ysul isn't running there yet: when I boot the Ysul VM (or a new FreeBSD VM, for that matter), the BIOS can't see the virtual disks.

Our ISP, Online, is investigating the issue.

Event Timeline

dereckson updated the task description.

@ledesillusionniste is checking the SMART status and running a dd test to measure I/O.

Stormshear
$  esxcli storage core device smart get -d t10.ATA_____HGST_HTS721010A9E630__________________________JR10006PH7JXKE
Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A  
Media Wearout Indicator       N/A    N/A        N/A  
Write Error Count             N/A    N/A        N/A  
Read Error Count              100    62         100  
Power-on Hours                32     0          32   
Power Cycle Count             100    0          100  
Reallocated Sector Count      100    5          100  
Raw Read Error Rate           100    62         100  
Drive Temperature             206    0          206  
Driver Rated Max Temperature  N/A    N/A        N/A  
Write Sectors TOT Count       200    0          200  
Read Sectors TOT Count        N/A    N/A        N/A  
Initial Bad Block Count       N/A    N/A        N/A  

$ grep smartd /var/log/syslog.log
2017-10-10T21:22:18Z smartd: [warn] t10.ATA_____HGST_HTS721010A9E630__________________________JR10006PH7JXKE: above TEMPERATURE threshold (206 > 0)
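
The smartd warning compares the normalized temperature value (206) with a threshold of 0, so it says little about the actual drive temperature. As a rough sketch, assuming the common SMART convention that a parameter is failing only when its normalized value is at or below a nonzero threshold, the esxcli table above can be re-checked as follows. The helper names `parse_smart_table` and `failing_parameters` are hypothetical and not part of ESXi.

```python
import re

# Table copied from the `esxcli storage core device smart get` output above.
ESXCLI_OUTPUT = """\
Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A
Media Wearout Indicator       N/A    N/A        N/A
Write Error Count             N/A    N/A        N/A
Read Error Count              100    62         100
Power-on Hours                32     0          32
Power Cycle Count             100    0          100
Reallocated Sector Count      100    5          100
Raw Read Error Rate           100    62         100
Drive Temperature             206    0          206
Driver Rated Max Temperature  N/A    N/A        N/A
Write Sectors TOT Count       200    0          200
Read Sectors TOT Count        N/A    N/A        N/A
Initial Bad Block Count       N/A    N/A        N/A
"""


def parse_smart_table(text):
    """Parse the esxcli SMART table into {parameter: {value, threshold, worst}}."""
    rows = {}
    for line in text.splitlines():
        # Columns are separated by two or more spaces; names contain single spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) != 4:
            continue
        if fields[0] == "Parameter" or set(fields[0]) == {"-"}:
            continue  # skip the header and the dashed separator
        name, value, threshold, worst = fields
        rows[name] = {"value": value, "threshold": threshold, "worst": worst}
    return rows


def failing_parameters(rows):
    """Return parameters whose normalized value is at or below a nonzero threshold.

    Assumption: a threshold of 0 (or N/A) means the parameter is informational only.
    """
    failing = []
    for name, row in rows.items():
        if not (row["value"].isdigit() and row["threshold"].isdigit()):
            continue
        value, threshold = int(row["value"]), int(row["threshold"])
        if threshold > 0 and value <= threshold:
            failing.append(name)
    return failing


if __name__ == "__main__":
    print(failing_parameters(parse_smart_table(ESXCLI_OUTPUT)))
```

On this table the check reports no failing parameter, which is consistent with the `Health Status OK` line.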

Meanwhile, at 10:30 UTC, Ysul froze; the hypervisor performance graphs simply showed all activity stopping.

The hard drive is fine according to @ledesillusionniste:

22:11:07 < philectro> the HDD looks flawless, apparently
22:11:11 < philectro> so it's a software problem
22:11:41 < philectro> it does have 30k hours on it, but nothing to report

$ esxcli software acceptance set --level=CommunitySupported
Host acceptance level changed to 'CommunitySupported'.
$ cd /tmp
$ wget http://www.virten.net/files/smartctl-6.6-4321.x86_64.vib
$ esxcli software vib install -v /tmp/smartctl-6.6-4321.x86_64.vib
Installation Result
   Message: Operation finished successfully.
   Reboot Required: false
   VIBs Installed: smartmontools_bootbank_smartctl_6.6-4321
   VIBs Removed:
   VIBs Skipped:
$ /opt/smartmontools/smartctl -d sat --all /dev/disks/t10.ATA_____HGST_HTS721010A9E630__________________________JR10006PH7JXKE
smartctl 6.6 2016-05-10 r4321 [x86_64-linux-5.5.0] (daily-20160510)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: HGST Travelstar 7K1000
Device Model: HGST HTS721010A9E630
Serial Number: JR10006PH7JXKE
LU WWN Device Id: 5 000cca 6acd1859f
Firmware Version: JB0OA3J0
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS T13/1699-D revision 6
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Oct 13 22:06:34 2017 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:        ( 0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                  ( 45) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:          ( 2) minutes.
Extended self-test routine
recommended polling time:        ( 167) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.


SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   062    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   040    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0007   100   100   033    Pre-fail  Always       -       2
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       8
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   040    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0012   032   032   000    Old_age   Always       -       29844
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
191 G-Sense_Error_Rate      0x000a   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       26
194 Temperature_Celsius     0x0002   206   206   000    Old_age   Always       -       29 (Min/Max 20/37)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0
223 Load_Retry_Count        0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1060         -
# 2  Short offline       Completed without error       00%        31         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
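
In the attribute table above, the raw values are often easier to interpret than the normalized ones. A minimal Python sketch, assuming the standard ten-column `smartctl` attribute layout and using hypothetical helper names, pulls the power-on hours and the raw temperature out of a few rows copied from that table:

```python
# A few rows copied from the smartctl attribute table above.
SMARTCTL_ATTRIBUTES = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   032   032   000    Old_age   Always       -       29844
194 Temperature_Celsius     0x0002   206   206   000    Old_age   Always       -       29 (Min/Max 20/37)
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
"""

HOURS_PER_YEAR = 24 * 365.25


def parse_attributes(text):
    """Parse smartctl attribute rows into a dict keyed by attribute name."""
    attributes = {}
    for line in text.splitlines():
        # Nine whitespace-separated columns, then the raw value (which may contain spaces).
        parts = line.split(None, 9)
        if len(parts) != 10 or not parts[0].isdigit():
            continue
        attributes[parts[1]] = {
            "id": int(parts[0]),
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "raw": parts[9].strip(),
        }
    return attributes


def leading_int(raw):
    """Return the first integer of a raw value such as '29 (Min/Max 20/37)'."""
    return int(raw.split()[0])


if __name__ == "__main__":
    attrs = parse_attributes(SMARTCTL_ATTRIBUTES)
    hours = leading_int(attrs["Power_On_Hours"]["raw"])
    print(f"power-on time: {hours} h, about {hours / HOURS_PER_YEAR:.1f} years")
    print(f"temperature: {leading_int(attrs['Temperature_Celsius']['raw'])} C")
```

Read this way, the drive has about 3.4 years of power-on time, no reallocated or pending sectors, and a raw temperature of 29 °C; the 206 reported by esxcli is the normalized value of the same attribute.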

00:05:29 < philectro> [philectro@ysul ~]$ dd bs=1m count=200 if=/dev/zero of=test
00:05:29 < philectro> 200+0 records in
00:05:29 < philectro> 200+0 records out
00:05:29 < philectro> 209715200 bytes transferred in 13.754970 secs (15246503 bytes/sec)
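
To put the dd result in more familiar units, the summary line can be converted to MiB/s. This is only a sketch: the function name `dd_throughput_mib` is hypothetical, the pattern assumes the BSD-style summary line shown above, and a single write from /dev/zero inside a VM is at best a rough indicator of disk throughput.

```python
import re

# Summary line copied from the dd run above.
DD_SUMMARY = "209715200 bytes transferred in 13.754970 secs (15246503 bytes/sec)"


def dd_throughput_mib(summary):
    """Return the throughput in MiB/s from a BSD dd summary line."""
    match = re.search(r"(\d+) bytes transferred in ([\d.]+) secs", summary)
    if match is None:
        raise ValueError("not a BSD dd summary line")
    size_bytes = int(match.group(1))
    seconds = float(match.group(2))
    return size_bytes / seconds / (1024 * 1024)


if __name__ == "__main__":
    print(f"{dd_throughput_mib(DD_SUMMARY):.2f} MiB/s")
```

For this run, 200 MiB written in about 13.75 seconds works out to roughly 14.5 MiB/s.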

Server dead 2017-10-25.