[smartmontools-support] Defective SSD?
Alex
mysqlstudent at gmail.com
Thu Apr 9 16:00:09 CEST 2020
Hi,
A few weeks ago I reported an extremely intermittent problem I was
having with a WD30EFRX and was advised to change all the SATA cables
(not just the one connected to this WD). I've now done that and
haven't had any problems since, but as it's very intermittent, I can't
yet be sure.
This morning the server was catatonic (but still routing packets),
apparently because of an SSD problem that occurred a few days ago
but only now caused the server to fail:
Apr 05 01:18:14 kernel: ata1.00: exception Emask 0x0 SAct 0x7ff00001 SErr 0x0 action 0x0
Apr 05 01:18:14 kernel: ata1.00: irq_stat 0x40000008
Apr 05 01:18:14 kernel: ata1.00: failed command: READ FPDMA QUEUED
Apr 05 01:18:14 kernel: ata1.00: cmd 60/80:a8:00:2f:ca/05:00:18:00:00/40 tag 21 ncq dma 720896 in
Apr 05 01:18:14 kernel: ata1.00: status: { DRDY ERR }
Apr 05 01:18:14 kernel: ata1.00: error: { ABRT }
Apr 05 01:18:14 kernel: ata1.00: supports DRM functions and may not be fully accessible
Apr 05 01:18:14 kernel: ata1.00: supports DRM functions and may not be fully accessible
Apr 05 01:18:14 kernel: ata1.00: configured for UDMA/133
Apr 05 01:18:14 kernel: ata1: EH complete
It looks like this error message appeared once before, on March 20th,
but the drive has otherwise been running fine.
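The "cmd" line in the kernel log above packs the failed command's registers into one string. As a rough sketch of how it can be read, the Python below decodes it for READ FPDMA QUEUED (0x60). The field order is my understanding of how the kernel's libata error handler prints the taskfile, and the function name decode_ncq_cmd and the 512-byte sector size are assumptions for illustration, not anything taken from smartmontools.

```python
import re

# Assumed field order of the libata "cmd" line:
# command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
_HEX5 = r"(?:[0-9a-f]{2}:){4}[0-9a-f]{2}"
CMD_RE = re.compile(rf"cmd\s+([0-9a-f]{{2}})/({_HEX5})/({_HEX5})/([0-9a-f]{{2}})")


def decode_ncq_cmd(line, sector_size=512):
    """Decode tag, LBA and transfer length from a libata 'cmd' log line.

    Hypothetical helper: only handles READ FPDMA QUEUED (0x60).
    """
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("no libata cmd field found")
    command = int(m.group(1), 16)
    if command != 0x60:
        raise ValueError("this sketch only handles READ FPDMA QUEUED (0x60)")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in m.group(3).split(":")
    )
    # For NCQ commands the sector count lives in the feature registers,
    # and the queue tag sits in the upper five bits of the count register.
    sectors = (hob_feature << 8) | feature
    if sectors == 0:
        sectors = 65536  # a count of zero means 65536 sectors
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "tag": nsect >> 3,
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * sector_size,
    }


if __name__ == "__main__":
    line = "ata1.00: cmd 60/80:a8:00:2f:ca/05:00:18:00:00/40 tag 21 ncq dma 720896 in"
    print(decode_ncq_cmd(line))
```

Read this way, the line describes tag 21 and 1408 sectors (720896 bytes), which agrees with the "tag 21 ncq dma 720896 in" text the kernel printed. The decoded LBA is 0x18ca2f00, and its low 24 bits (0xca2f00) match the LBA smartctl reports for Error 12.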
# hdparm -i /dev/sda
/dev/sda:
Model=Samsung SSD 750 EVO 250GB, FwRev=MAT01B6Q, SerialNo=S2SHNWAXXX
Config={ Fixed }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
BuffType=unknown, BuffSize=unknown, MaxMultSect=1, MultSect=1
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=488397168
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
AdvancedPM=no WriteCache=enabled
Drive conforms to: unknown: ATA/ATAPI-2,3,4,5,6,7
smartctl says it has about 34k hours on the drive, but the numbers are
otherwise very low. "smartctl --all /dev/sda" shows 16 errors similar
to this one:
Error 16 occurred at disk power-on lifetime: 34559 hours (1439 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 00 2a ca e0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
47 00 01 30 06 00 e0 07 00:18:18.041 READ LOG DMA EXT
47 00 01 30 00 00 e0 07 00:18:18.041 READ LOG DMA EXT
47 00 01 00 00 00 e0 07 00:18:18.041 READ LOG DMA EXT
47 00 01 30 08 00 e0 07 00:18:18.041 READ LOG DMA EXT
47 00 01 13 00 00 e0 07 00:18:18.041 READ LOG DMA EXT
I've attached the full output of the above commands. I'm hoping
someone can help me determine whether the SSD should be replaced.
Thanks,
Alex
-------------- next part --------------
# hdparm -i /dev/sda
/dev/sda:
Model=Samsung SSD 750 EVO 250GB, FwRev=MAT01B6Q, SerialNo=S2SHNWAXXX
Config={ Fixed }
RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
BuffType=unknown, BuffSize=unknown, MaxMultSect=1, MultSect=1
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=488397168
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
AdvancedPM=no WriteCache=enabled
Drive conforms to: unknown: ATA/ATAPI-2,3,4,5,6,7
* signifies the current active mode
Apr 05 01:18:14 kernel: ata1.00: exception Emask 0x0 SAct 0x7ff00001 SErr 0x0 action 0x0
Apr 05 01:18:14 kernel: ata1.00: irq_stat 0x40000008
Apr 05 01:18:14 kernel: ata1.00: failed command: READ FPDMA QUEUED
Apr 05 01:18:14 kernel: ata1.00: cmd 60/80:a8:00:2f:ca/05:00:18:00:00/40 tag 21 ncq dma 720896 in
Apr 05 01:18:14 kernel: ata1.00: status: { DRDY ERR }
Apr 05 01:18:14 kernel: ata1.00: error: { ABRT }
Apr 05 01:18:14 kernel: ata1.00: supports DRM functions and may not be fully accessible
Apr 05 01:18:14 kernel: ata1.00: supports DRM functions and may not be fully accessible
Apr 05 01:18:14 kernel: ata1.00: configured for UDMA/133
Apr 05 01:18:14 kernel: ata1: EH complete
# smartctl --all /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.12-200.fc30.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 750 EVO 250GB
Serial Number: S2SHNWAGC08387E
LU WWN Device Id: 5 002538 d70070bb6
Firmware Version: MAT01B6Q
User Capacity: 250,059,350,016 bytes [250 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Apr 9 09:46:43 2020 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x53) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 133) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always   -           8
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always   -           34663
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always   -           106
177 Wear_Leveling_Count     0x0013   001   001   000    Pre-fail  Always   -           1147
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   099   099   010    Pre-fail  Always   -           8
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always   -           0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always   -           0
183 Runtime_Bad_Block       0x0013   099   099   010    Pre-fail  Always   -           8
187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always   -           16
190 Airflow_Temperature_Cel 0x0032   075   051   000    Old_age   Always   -           25
195 ECC_Error_Rate          0x001a   199   199   000    Old_age   Always   -           16
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always   -           0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always   -           58
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always   -           132274304434
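For anyone who wants to work with the attribute table above programmatically, here is a minimal Python sketch that splits each row into its columns and derives a few figures from the raw values. The function names are hypothetical, and treating Total_LBAs_Written as a count of 512-byte logical sectors is an assumption (the sector size comes from the "Sector Size" line of this output; the unit of the attribute itself is vendor-defined).

```python
def parse_attributes(text):
    """Parse 'smartctl -A' style rows into a dict keyed by attribute name."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # An attribute row starts with a numeric ID and has at least 10 columns;
        # the header row ("ID# ...") is skipped because "ID#" is not a number.
        if len(parts) >= 10 and parts[0].isdigit():
            attrs[parts[1]] = {
                "id": int(parts[0]),
                "value": int(parts[3]),
                "worst": int(parts[4]),
                "thresh": int(parts[5]),
                "type": parts[6],
                "raw": " ".join(parts[9:]),  # raw values may contain spaces
            }
    return attrs


def power_on_days(attrs):
    """Return Power_On_Hours as (days, hours)."""
    return divmod(int(attrs["Power_On_Hours"]["raw"]), 24)


def bytes_written(attrs, sector_size=512):
    """Estimate host writes, ASSUMING the raw value counts logical sectors."""
    return int(attrs["Total_LBAs_Written"]["raw"]) * sector_size


def at_or_below_threshold(attrs):
    """List attributes whose normalized value has reached a non-zero threshold."""
    return [
        name for name, a in attrs.items()
        if a["thresh"] > 0 and a["value"] <= a["thresh"]
    ]
```

With the table above, power_on_days gives 1444 days and 7 hours, bytes_written gives roughly 67.7 TB under the stated assumption, and at_or_below_threshold returns an empty list.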
SMART Error Log Version: 1
ATA Error Count: 16 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
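The 49.710-day wrap mentioned above is what one would expect from a 32-bit millisecond counter. The short Python sketch below checks that arithmetic and formats a millisecond count in the layout the legend describes. That the counter is 32 bits wide is an inference from the wrap figure, and format_powered_up is a hypothetical helper, not smartctl's own code.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000

# 2**32 milliseconds expressed in days; close to the 49.710 in the legend.
WRAP_DAYS = 2**32 / MS_PER_DAY


def format_powered_up(ms):
    """Format milliseconds since power-on as [DDd+]hh:mm:SS.sss."""
    ms %= 2**32  # assumed 32-bit counter, so larger values wrap around
    days, rem = divmod(ms, MS_PER_DAY)
    hours, rem = divmod(rem, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    clock = f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"
    return f"{days}d+{clock}" if days else clock


if __name__ == "__main__":
    print(f"{WRAP_DAYS:.3f} days")
    print(format_powered_up(1_098_041))
```

A value of 1,098,041 ms formats as 00:18:18.041, the timestamp shown on every command in the error entries of this log.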
Error 16 occurred at disk power-on lifetime: 34559 hours (1439 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 00 2a ca e0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
47 00 01 30 06 00 e0 07 00:18:18.041 READ LOG DMA EXT
47 00 01 30 00 00 e0 07 00:18:18.041 READ LOG DMA EXT
47 00 01 00 00 00 e0 07 00:18:18.041 READ LOG DMA EXT
47 00 01 30 08 00 e0 07 00:18:18.041 READ LOG DMA EXT
47 00 01 13 00 00 e0 07 00:18:18.041 READ LOG DMA EXT
Error 15 occurred at disk power-on lifetime: 34559 hours (1439 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 51 38 80 4e ca 40 Error: at LBA = 0x00ca4e80 = 13258368
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 30 00 4e ca 40 06 00:18:18.041 READ FPDMA QUEUED
60 80 28 80 4d ca 40 05 00:18:18.041 READ FPDMA QUEUED
60 80 20 00 4d ca 40 04 00:18:18.041 READ FPDMA QUEUED
60 80 18 80 4c ca 40 03 00:18:18.041 READ FPDMA QUEUED
60 80 10 00 4c ca 40 02 00:18:18.041 READ FPDMA QUEUED
Error 14 occurred at disk power-on lifetime: 34559 hours (1439 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 51 88 80 45 ca 40 Error: at LBA = 0x00ca4580 = 13256064
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 80 00 45 ca 40 10 00:18:18.041 READ FPDMA QUEUED
60 80 78 80 44 ca 40 0f 00:18:18.041 READ FPDMA QUEUED
60 80 c0 00 44 ca 40 18 00:18:18.041 READ FPDMA QUEUED
60 80 b8 80 43 ca 40 17 00:18:18.041 READ FPDMA QUEUED
60 80 70 80 3a ca 40 0e 00:18:18.041 READ FPDMA QUEUED
Error 13 occurred at disk power-on lifetime: 34559 hours (1439 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 51 30 00 3d ca 40 Error: at LBA = 0x00ca3d00 = 13253888
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 80 28 80 3c ca 40 05 00:18:18.041 READ FPDMA QUEUED
60 80 20 00 3c ca 40 04 00:18:18.041 READ FPDMA QUEUED
60 80 18 80 3b ca 40 03 00:18:18.041 READ FPDMA QUEUED
60 80 08 00 3b ca 40 01 00:18:18.041 READ FPDMA QUEUED
60 80 00 80 3a ca 40 00 00:18:18.041 READ FPDMA QUEUED
Error 12 occurred at disk power-on lifetime: 34559 hours (1439 days + 23 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
00 51 a8 00 2f ca 40 Error: at LBA = 0x00ca2f00 = 13250304
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 00 a0 00 2a ca 40 14 00:18:18.041 READ FPDMA QUEUED
60 00 98 00 25 ca 40 13 00:18:18.041 READ FPDMA QUEUED
60 80 90 80 1f ca 40 12 00:18:18.041 READ FPDMA QUEUED
60 80 88 00 1f ca 40 11 00:18:18.041 READ FPDMA QUEUED
60 80 80 80 1e ca 40 10 00:18:18.041 READ FPDMA QUEUED
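The "Error: at LBA" values in the entries above can be reproduced from the register columns. The Python sketch below assumes the classic 28-bit register layout, with the LBA spread across SN, CL, CH and the low four bits of DH, and uses the standard ATA status bit names. The function names are hypothetical. For 48-bit commands such as READ FPDMA QUEUED, this log may only hold the low bits of the real address.

```python
# Standard ATA status register bits (bit 4 is left out as obsolete).
STATUS_BITS = (
    (0x80, "BSY"),
    (0x40, "DRDY"),
    (0x20, "DF"),
    (0x08, "DRQ"),
    (0x01, "ERR"),
)


def lba28_from_registers(sn, cl, ch, dh):
    """Assemble a 28-bit LBA from the SN, CL, CH and DH register values."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def parse_register_row(row):
    """Parse an 'ER ST SC SN CL CH DH' row from the SMART error log."""
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in row.split()[:7])
    return {
        "error": er,
        "status": st,
        "status_flags": [name for bit, name in STATUS_BITS if st & bit],
        "count": sc,
        "lba": lba28_from_registers(sn, cl, ch, dh),
    }


if __name__ == "__main__":
    print(parse_register_row("00 51 a8 00 2f ca 40"))
```

For the Error 12 row this yields LBA 13250304 (0x00ca2f00), the value smartctl printed, and a status of DRDY plus ERR, which is the same pair the kernel logged as "status: { DRDY ERR }".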
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
255 0 65535 Read_scanning was never started
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.