[smartmontools-support] Defective SSD?

Alex mysqlstudent at gmail.com
Thu Apr 9 16:00:09 CEST 2020


Hi,

A few weeks ago I reported an extremely intermittent problem I was
having with a WD30EFRX and was advised to change all the SATA cables
(not just the one connected to this WD). I've now done that and
haven't had any problems since, but as it's very intermittent, I can't
yet be sure.

This morning the server was catatonic (but still routing packets),
apparently because of an SSD problem that first occurred a few days
ago and has only now caused the server to fail:

Apr 05 01:18:14 kernel: ata1.00: exception Emask 0x0 SAct 0x7ff00001 SErr 0x0 action 0x0
Apr 05 01:18:14 kernel: ata1.00: irq_stat 0x40000008
Apr 05 01:18:14 kernel: ata1.00: failed command: READ FPDMA QUEUED
Apr 05 01:18:14 kernel: ata1.00: cmd 60/80:a8:00:2f:ca/05:00:18:00:00/40 tag 21 ncq dma 720896 in
Apr 05 01:18:14 kernel: ata1.00: status: { DRDY ERR }
Apr 05 01:18:14 kernel: ata1.00: error: { ABRT }
Apr 05 01:18:14 kernel: ata1.00: supports DRM functions and may not be fully accessible
Apr 05 01:18:14 kernel: ata1.00: supports DRM functions and may not be fully accessible
Apr 05 01:18:14 kernel: ata1.00: configured for UDMA/133
Apr 05 01:18:14 kernel: ata1: EH complete
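
For anyone wanting to cross-check the "cmd" line and the SAct mask above,
here is a minimal Python sketch. It assumes the field order that libata
uses when it prints a failed command
(command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device)
and the NCQ convention that the sector count travels in the feature
registers and the tag in the upper bits of nsect. The function names
decode_ncq_cmd and active_tags are made up for this illustration, and the
512-byte sector size is taken from the smartctl output in the attachment.

```python
def decode_ncq_cmd(cmd: str) -> dict:
    """Decode a libata 'cmd' string for an NCQ command (assumed field order)."""
    command, low, high, device = cmd.split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    # 48-bit LBA: the "hob" (high order byte) registers carry bits 24..47.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    # For NCQ commands the sector count is in the feature registers.
    sectors = hob_feature << 8 | feature
    return {
        "opcode": int(command, 16),
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * 512,   # assumes 512-byte logical sectors
        "tag": nsect >> 3,        # NCQ tag lives in bits 3..7 of nsect
    }


def active_tags(sact: int) -> list:
    """List the NCQ tags whose bits are set in the SAct mask."""
    return [tag for tag in range(32) if sact >> tag & 1]


if __name__ == "__main__":
    info = decode_ncq_cmd("60/80:a8:00:2f:ca/05:00:18:00:00/40")
    print(info)
    print(active_tags(0x7FF00001))
```

Decoded this way, the failed read covers 1408 sectors starting at LBA
415903488 (0x18ca2f00); 1408 * 512 = 720896 agrees with the "dma 720896"
in the log, and the tag comes out as 21. The SAct mask has tags 0 and
20 through 30 set, i.e. twelve commands were outstanding at the time.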

It looks like this error message appeared once before, on March 20th,
but the drive has otherwise been running fine.

# hdparm -i /dev/sda

/dev/sda:

 Model=Samsung SSD 750 EVO 250GB, FwRev=MAT01B6Q, SerialNo=S2SHNWAXXX
 Config={ Fixed }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
 BuffType=unknown, BuffSize=unknown, MaxMultSect=1, MultSect=1
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=488397168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: unknown:  ATA/ATAPI-2,3,4,5,6,7

smartctl reports about 34k power-on hours for the drive, but the other
numbers are very low. "smartctl --all /dev/sda" shows 16 errors similar
to this one:

Error 16 occurred at disk power-on lifetime: 34559 hours (1439 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 00 2a ca e0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  47 00 01 30 06 00 e0 07      00:18:18.041  READ LOG DMA EXT
  47 00 01 30 00 00 e0 07      00:18:18.041  READ LOG DMA EXT
  47 00 01 00 00 00 e0 07      00:18:18.041  READ LOG DMA EXT
  47 00 01 30 08 00 e0 07      00:18:18.041  READ LOG DMA EXT
  47 00 01 13 00 00 e0 07      00:18:18.041  READ LOG DMA EXT
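
The ER/ST/SN/CL/CH/DH bytes in that entry can be unpacked by hand. The
sketch below is Python and makes a few assumptions: the bit names follow
the classic ATA register layout (some bits have been renamed in later
revisions of the standard), the LBA is rebuilt the 28-bit way from
SN/CL/CH plus the low nibble of DH, and the helper names decode_bits and
lba28 are invented for this example.

```python
# Classic ATA bit names; treat these tables as an assumption, since
# newer standard revisions rename or retire some of the bits.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
              0x08: "MCR", 0x04: "ABRT", 0x02: "TK0NF", 0x01: "AMNF"}
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}


def decode_bits(value: int, table: dict) -> list:
    """Return the names of all bits set in value, highest bit first."""
    return [name for bit, name in table.items() if value & bit]


def lba28(sn: int, cl: int, ch: int, dh: int) -> int:
    """Rebuild a 28-bit LBA from the SN, CL, CH and DH register bytes."""
    return (dh & 0x0F) << 24 | ch << 16 | cl << 8 | sn


if __name__ == "__main__":
    # Error 16 registers: ER=40 ST=51 SN=00 CL=2a CH=ca DH=e0
    print(decode_bits(0x40, ERROR_BITS))    # ['UNC']
    print(decode_bits(0x51, STATUS_BITS))   # ['DRDY', 'DSC', 'ERR']
    print(lba28(0x00, 0x2A, 0xCA, 0xE0))    # 13249024
    # Power-on lifetime: 34559 hours as days + hours
    print(divmod(34559, 24))                # (1439, 23)
```

As a sanity check, feeding in the Error 15 registers from the attached
log (SN=80 CL=4e CH=ca DH=40) gives 13258368, the same LBA smartctl
prints for that entry. Since this log format only holds 28 bits of
address, the value may be truncated relative to the 48-bit LBA of the
command the kernel issued.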

I've attached the full output from the above commands here. I'm hoping
someone can help me determine whether the drive should be replaced.

Thanks,
Alex
-------------- next part --------------
# hdparm -i /dev/sda

/dev/sda:

 Model=Samsung SSD 750 EVO 250GB, FwRev=MAT01B6Q, SerialNo=S2SHNWAXXX
 Config={ Fixed }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=0
 BuffType=unknown, BuffSize=unknown, MaxMultSect=1, MultSect=1
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=488397168
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4 
 DMA modes:  mdma0 mdma1 mdma2 
 UDMA modes: udma0 udma1 udma2 udma3 udma4 udma5 *udma6 
 AdvancedPM=no WriteCache=enabled
 Drive conforms to: unknown:  ATA/ATAPI-2,3,4,5,6,7

 * signifies the current active mode

Apr 05 01:18:14 kernel: ata1.00: exception Emask 0x0 SAct 0x7ff00001 SErr 0x0 action 0x0
Apr 05 01:18:14 kernel: ata1.00: irq_stat 0x40000008
Apr 05 01:18:14 kernel: ata1.00: failed command: READ FPDMA QUEUED
Apr 05 01:18:14 kernel: ata1.00: cmd 60/80:a8:00:2f:ca/05:00:18:00:00/40 tag 21 ncq dma 720896 in
Apr 05 01:18:14 kernel: ata1.00: status: { DRDY ERR }
Apr 05 01:18:14 kernel: ata1.00: error: { ABRT }
Apr 05 01:18:14 kernel: ata1.00: supports DRM functions and may not be fully accessible
Apr 05 01:18:14 kernel: ata1.00: supports DRM functions and may not be fully accessible
Apr 05 01:18:14 kernel: ata1.00: configured for UDMA/133
Apr 05 01:18:14 kernel: ata1: EH complete

# smartctl --all /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.3.12-200.fc30.x86_64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 750 EVO 250GB
Serial Number:    S2SHNWAGC08387E
LU WWN Device Id: 5 002538 d70070bb6
Firmware Version: MAT01B6Q
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Apr  9 09:46:43 2020 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x53) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 133) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       34663
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       106
177 Wear_Leveling_Count     0x0013   001   001   000    Pre-fail  Always       -       1147
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   099   099   010    Pre-fail  Always       -       8
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   099   099   010    Pre-fail  Always       -       8
187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always       -       16
190 Airflow_Temperature_Cel 0x0032   075   051   000    Old_age   Always       -       25
195 ECC_Error_Rate          0x001a   199   199   000    Old_age   Always       -       16
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       58
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       132274304434

SMART Error Log Version: 1
ATA Error Count: 16 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 16 occurred at disk power-on lifetime: 34559 hours (1439 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 00 2a ca e0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  47 00 01 30 06 00 e0 07      00:18:18.041  READ LOG DMA EXT
  47 00 01 30 00 00 e0 07      00:18:18.041  READ LOG DMA EXT
  47 00 01 00 00 00 e0 07      00:18:18.041  READ LOG DMA EXT
  47 00 01 30 08 00 e0 07      00:18:18.041  READ LOG DMA EXT
  47 00 01 13 00 00 e0 07      00:18:18.041  READ LOG DMA EXT

Error 15 occurred at disk power-on lifetime: 34559 hours (1439 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 51 38 80 4e ca 40  Error:  at LBA = 0x00ca4e80 = 13258368

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 80 30 00 4e ca 40 06      00:18:18.041  READ FPDMA QUEUED
  60 80 28 80 4d ca 40 05      00:18:18.041  READ FPDMA QUEUED
  60 80 20 00 4d ca 40 04      00:18:18.041  READ FPDMA QUEUED
  60 80 18 80 4c ca 40 03      00:18:18.041  READ FPDMA QUEUED
  60 80 10 00 4c ca 40 02      00:18:18.041  READ FPDMA QUEUED

Error 14 occurred at disk power-on lifetime: 34559 hours (1439 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 51 88 80 45 ca 40  Error:  at LBA = 0x00ca4580 = 13256064

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 80 80 00 45 ca 40 10      00:18:18.041  READ FPDMA QUEUED
  60 80 78 80 44 ca 40 0f      00:18:18.041  READ FPDMA QUEUED
  60 80 c0 00 44 ca 40 18      00:18:18.041  READ FPDMA QUEUED
  60 80 b8 80 43 ca 40 17      00:18:18.041  READ FPDMA QUEUED
  60 80 70 80 3a ca 40 0e      00:18:18.041  READ FPDMA QUEUED

Error 13 occurred at disk power-on lifetime: 34559 hours (1439 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 51 30 00 3d ca 40  Error:  at LBA = 0x00ca3d00 = 13253888

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 80 28 80 3c ca 40 05      00:18:18.041  READ FPDMA QUEUED
  60 80 20 00 3c ca 40 04      00:18:18.041  READ FPDMA QUEUED
  60 80 18 80 3b ca 40 03      00:18:18.041  READ FPDMA QUEUED
  60 80 08 00 3b ca 40 01      00:18:18.041  READ FPDMA QUEUED
  60 80 00 80 3a ca 40 00      00:18:18.041  READ FPDMA QUEUED

Error 12 occurred at disk power-on lifetime: 34559 hours (1439 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  00 51 a8 00 2f ca 40  Error:  at LBA = 0x00ca2f00 = 13250304

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 a0 00 2a ca 40 14      00:18:18.041  READ FPDMA QUEUED
  60 00 98 00 25 ca 40 13      00:18:18.041  READ FPDMA QUEUED
  60 80 90 80 1f ca 40 12      00:18:18.041  READ FPDMA QUEUED
  60 80 88 00 1f ca 40 11      00:18:18.041  READ FPDMA QUEUED
  60 80 80 80 1e ca 40 10      00:18:18.041  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
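
To make the attribute table in the smartctl output above easier to scan,
here is a rough Python sketch that parses a few of its rows and reports
how far each normalized VALUE sits above its THRESH, plus an estimate of
the total data written. It assumes the ten-column layout shown above,
assumes that Total_LBAs_Written counts 512-byte sectors (the logical
sector size reported by smartctl), and uses a made-up helper name,
parse_attributes. The sample rows are copied from the output.

```python
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   099   099   010    Pre-fail  Always       -       8
177 Wear_Leveling_Count     0x0013   001   001   000    Pre-fail  Always       -       1147
187 Uncorrectable_Error_Cnt 0x0032   099   099   000    Old_age   Always       -       16
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       132274304434
"""


def parse_attributes(text: str) -> dict:
    """Parse 'smartctl -A' style rows into a dict keyed by attribute name."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not an attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            "raw": int(fields[9]),  # assumes a plain integer raw value
        }
    return rows


if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    for name, a in attrs.items():
        margin = a["value"] - a["thresh"]
        print(f"{name:<24} value={a['value']:3d} thresh={a['thresh']:3d} "
              f"margin={margin:3d} raw={a['raw']}")
    # Assumption: one LBA = 512 bytes.
    tb_written = attrs["Total_LBAs_Written"]["raw"] * 512 / 1e12
    print(f"approx. data written: {tb_written:.2f} TB")
```

Under the 512-byte assumption the raw Total_LBAs_Written value works out
to roughly 67.7 TB. The row that stands out in this listing is
Wear_Leveling_Count, whose normalized value is 1 against a threshold of
0, while most other attributes still sit at 99 or 100.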


