[smartmontools-support] Should I worry?

Sat Nov 23 20:36:59 CET 2019

On Sat, Nov 23, 2019 at 13:36:46 +0100, Jørn Dahl-Stamnes wrote:
> Hello,
> 
> I've got a disk that has unreadable sectors. From logwatch I get this message:
> 
> Currently unreadable (pending) sectors detected:
>         /dev/sda [SAT] - 96 Time(s)
>         2 unreadable sectors detected
> 
> Is this a sign that this disk is about to die?

Well, the precise meaning of this warning is that the drive found 2
sectors that it was unable to read successfully.

This is certainly a sign that something "not good" has happened on that
drive, and so some people/sites have a policy of simply replacing the
drive as soon as any such errors happen, figuring "better safe than
sorry".

However, it's also possible that the sectors are unreadable because of
a one-time problem, and that the drive will continue to work fine for a
long time once you resolve these particular errors.

So basically you have to weigh the hassle/cost of replacing the drive
now against the danger of it failing suddenly if you don't, based both
on how the drive is being used and on what the SMART data is telling you.

Unfortunately you can't predict with any certainty just from the current
situation what will happen to the drive in the future, so given the
existence of the errors it's definitely wise to make sure your backups
are happening regularly, and probably a good idea to have a replacement
disk on hand in case this one starts to fail more drastically.

I've seen many disks with this sort of error work fine for years after
fixing the errors... but also a few which just had more and more bad
sectors over a period of a few weeks after the first errors showed up
(at which point we proceeded to replace them).

> 
> $ smartctl -a /dev/sda:
> 
[...]
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
[...]
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
[...]

In this case the Reallocated_Sector_Ct is still zero and there are only
two Current_Pending_Sector sectors, so it still seems plausible that
the errors are "self-contained" rather than a sign of a broader failure.

If you see Reallocated_Sector_Ct climbing over time, or are unable to
clear the Current_Pending_Sector sector count back to zero, then I'd
start to be convinced there was a more general failure situation.
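
As a rough sketch of how one might track those two attributes between
runs (this is not part of smartmontools; the function names are made up
for illustration, and it assumes the 10-column attribute table layout
shown in the quoted output above):

```python
# Sketch: pull the raw values of attributes 5 and 197 out of
# "smartctl -A"-style text so they can be logged and compared over time.
# Assumes the 10-column attribute table layout quoted in this thread.
WATCHED = {5: "Reallocated_Sector_Ct", 197: "Current_Pending_Sector"}


def raw_values(report: str) -> dict:
    """Return {attribute_name: raw_value} for the watched attribute IDs."""
    found = {}
    for line in report.splitlines():
        # Tolerate email quoting ("> ") in front of each row.
        fields = line.lstrip("> ").split()
        # Columns: ID, name, flag, value, worst, thresh, type, updated,
        # when_failed, raw value.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id in WATCHED and fields[9].isdigit():
            found[fields[1]] = int(fields[9])
    return found


def looks_worse(previous: dict, current: dict) -> bool:
    """True if reallocations grew or pending sectors are still non-zero."""
    grew = (current.get("Reallocated_Sector_Ct", 0)
            > previous.get("Reallocated_Sector_Ct", 0))
    still_pending = current.get("Current_Pending_Sector", 0) > 0
    return grew or still_pending


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

print(raw_values(SAMPLE))
```

The idea would be to save the result of raw_values() after each check
and call looks_worse() on the previous and current readings once you
have attempted to rewrite the affected sectors.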

> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed: read failure       90%     24838         1857598668
> # 2  Short offline       Completed: read failure       90%     24836         1857598666
> # 3  Short offline       Completed: read failure       90%     24812         1857598664
> # 4  Short offline       Completed: read failure       90%     24789         1857598664
> # 5  Short offline       Completed: read failure       90%     24765         1857598664
> # 6  Short offline       Completed: read failure       90%     24741         1857598666
> # 7  Short offline       Completed: read failure       90%     24717         1857598666
> # 8  Short offline       Completed: read failure       90%     24693         1857598664
> # 9  Extended offline    Completed: read failure       90%     24671         1857598665
> #10  Short offline       Completed: read failure       90%     24669         1857598665
> #11  Short offline       Completed: read failure       90%     24645         1857598664
> #12  Short offline       Completed without error       00%     24621         -

The interesting thing here is that you are getting consistent read
failures within a range of a few sectors. If you can identify what data
is stored on those sectors, it should be fairly straightforward to
rewrite those particular ones. (Keep in mind that this drive has 4 KiB
physical sectors, so you have to rewrite 8 logical sectors at once in
order to rewrite one physical one.) Rewriting will either clear the
Current_Pending_Sector count or flush out further errors.
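
To illustrate the 8-logical-sector arithmetic, here is a small sketch
that maps each failing LBA from the self-test log to the block of 8
logical sectors that would need rewriting. It assumes 512-byte logical
sectors and that physical sectors start on multiples of 8 logical
sectors (typical, but worth confirming for the particular drive); the
names are my own.

```python
# Sketch: find the 8-logical-sector span covering each failing LBA.
# Assumes 512-byte logical sectors on 4 KiB physical sectors, with
# physical sectors aligned to multiples of 8 logical sectors.
LOGICAL_PER_PHYSICAL = 8


def physical_span(lba: int) -> range:
    """Return the range of logical LBAs sharing a physical sector with lba."""
    first = lba - (lba % LOGICAL_PER_PHYSICAL)
    return range(first, first + LOGICAL_PER_PHYSICAL)


# Distinct LBA_of_first_error values from the self-test log above.
FAILED_LBAS = [1857598668, 1857598666, 1857598664, 1857598665]

# Collapse the failing LBAs into a sorted list of distinct (first, last) spans.
spans = sorted({(s.start, s.stop - 1) for s in map(physical_span, FAILED_LBAS)})
for first, last in spans:
    print(f"rewrite LBAs {first}..{last}")
```

Under those assumptions, all four reported LBAs land in the same span,
1857598664 through 1857598671.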

How difficult it will be to identify the data in question will depend
on how your disk is organized, but one place to start would be:
  https://www.smartmontools.org/wiki/BadBlockHowto
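
For a rough idea of the kind of calculation involved, here is a sketch
that converts a failing LBA into a filesystem block number within a
partition. The partition start and filesystem block size below are
hypothetical placeholders, not values from your system; you would take
the real ones from your partition table and filesystem.

```python
# Sketch: convert a disk LBA to a filesystem block number inside a partition.
# PART_START and FS_BLOCK_SIZE are HYPOTHETICAL example values; substitute
# the real partition start sector and filesystem block size.
def fs_block(lba: int, part_start: int, fs_block_size: int,
             sector_size: int = 512) -> int:
    """Return the filesystem block that contains the given disk LBA."""
    if lba < part_start:
        raise ValueError("LBA lies before the start of this partition")
    # Byte offset into the partition, then integer-divide by the block size.
    return (lba - part_start) * sector_size // fs_block_size


PART_START = 2048      # hypothetical partition start, in 512-byte sectors
FS_BLOCK_SIZE = 4096   # hypothetical filesystem block size, in bytes

print(fs_block(1857598664, PART_START, FS_BLOCK_SIZE))
```

The resulting block number is what you would then hand to the
filesystem's own tools to find out which file, if any, owns it.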

								Nathan

----------------------------------------------------------------------------
Nathan Stratton Treadway  -  nathanst at ontko.com  -  Mid-Atlantic region
Ray Ontko & Co.  -  Software consulting services  -   http://www.ontko.com/
 GPG Key: http://www.ontko.com/~nathanst/gpg_key.txt   ID: 1023D/ECFB6239
 Key fingerprint = 6AD8 485E 20B9 5C71 231C  0C32 15F3 ADCD ECFB 6239