[smartmontools-support] Reasonable config of smartd to determine when to discard a disk (SATA, NVMe)

Claudio Kuenzler napsty at gmail.com
Thu Nov 23 12:52:51 CET 2023


Hi!

On Thu, Nov 23, 2023 at 12:34 PM Erik Starbäck <erik.starback at uppmax.uu.se>
wrote:

>
> For all our nodes with HW-raid we think it is rather clear... when the
> raid kicks out a disk. We discard it. We let the HW-RAID make the decision
> for us.
>

Basically, a HW RAID controller or server firmware kicks a drive out of
the array when certain SMART attributes reach a certain threshold. It could
also just check the drive's overall SMART health status and kick the disk
out once it is considered dead. This is proprietary software, so we don't
know exactly what triggers it.
By looking at (and not ignoring) the SMART values and their warnings, you
can be warned well before a disk actually dies. This gives you time to
prepare, for example by ordering replacements and scheduling maintenance.
I've documented such an example here:
https://www.claudiokuenzler.com/blog/469/multiple-several-ways-monitor-physical-hard-drive-disk


> For example: I got mails about "Device: /dev/sda [SAT], 8 Currently
> unreadable (pending) sectors". Is it insane to discard that disk?
>

This means that you (most likely) have a sector that went bad. From
experience I can tell you that there's never just a single bad/defective
sector; it's only a matter of time until the counter increases.
I use the following personal rule of thumb when such warnings show up:
Production server: order a replacement or prepare one from inventory, and
exchange the drive as soon as possible (no urgency, but the quicker the
better).
Non-production server: no hurry. Watch the counter increase for a while
and then replace the drive.
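
To make that rule of thumb a bit more concrete, here is a minimal Python
sketch. It pulls the device and the counter out of a warning like the one
quoted above and maps it to one of the two cases. The function names and
the "production" flag are made up for illustration, and the regular
expression only covers this one message type; it is not something smartd
ships.

```python
import re

# Matches the warning quoted in this thread, e.g.
# "Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors"
PENDING_RE = re.compile(
    r"Device: (?P<device>\S+)(?: \[[^\]]*\])?, "
    r"(?P<count>\d+) Currently unreadable \(pending\) sectors"
)


def parse_pending(message):
    """Return (device, count) from a pending-sector warning, or None."""
    match = PENDING_RE.search(message)
    if match is None:
        return None
    return match.group("device"), int(match.group("count"))


def suggested_action(count, production):
    """Personal rule of thumb from this mail, not an official policy."""
    if count == 0:
        return "no action"
    if production:
        return "prepare a replacement and exchange the drive as soon as possible"
    return "watch the counter for a while, then replace the drive"


if __name__ == "__main__":
    msg = "Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors"
    device, count = parse_pending(msg)
    print(device, count, suggested_action(count, production=True))
```

Whether a host counts as production is of course your call; the sketch
only shows that the decision needs two inputs, the counter and the role
of the server.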


> With this line I realize smartd can mail with 152 different subjects. How
> could I decide what action to make?
>

Yes, there are quite a few SMART attributes, and not all of them are worth
jumping out of your seat for. I find the SMART article on Wikipedia (
https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology)
quite helpful. Scroll down to "Known ATA SMART Attributes" and look at the
critical column. This is a helpful indicator.
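
If you want to check only a handful of attributes yourself, something
along these lines could work. It is a rough Python sketch that reads the
JSON output of "smartctl --json -A" (available in smartmontools 7.0 and
later) and reports watched attributes with a non-zero raw value. The set
of IDs is my own hand-picked subset (5, 187, 197, 198) and purely an
assumption; adjust it to what you consider critical. It only covers
ATA/SATA drives, since NVMe reports a different health log.

```python
import json
import subprocess

# Hand-picked subset of attribute IDs (an assumption, adjust to taste).
WATCHED_IDS = {5, 187, 197, 198}


def read_report(device):
    """Run `smartctl --json -A` and return the parsed report (needs root)."""
    # smartctl uses its exit status as a bit mask, so a non-zero status is
    # not treated as a failure here.
    result = subprocess.run(
        ["smartctl", "--json", "-A", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(result.stdout)


def nonzero_watched(report, watched=WATCHED_IDS):
    """Return (id, name, raw value) for watched attributes with raw > 0."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return [
        (attr["id"], attr["name"], attr["raw"]["value"])
        for attr in table
        if attr["id"] in watched and attr["raw"]["value"] > 0
    ]


if __name__ == "__main__":
    # Made-up sample shaped like smartctl's JSON output, not real drive data.
    sample = {"ata_smart_attributes": {"table": [
        {"id": 5, "name": "Reallocated_Sector_Ct", "raw": {"value": 0}},
        {"id": 194, "name": "Temperature_Celsius", "raw": {"value": 31}},
        {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 8}},
    ]}}
    for attr_id, name, raw in nonzero_watched(sample):
        print(f"{attr_id} {name}: raw value {raw}")
```

On a real host you would call nonzero_watched(read_report("/dev/sda"))
instead of using the sample. Raw values are vendor specific for some
attributes, so treat the result as a hint and not as a verdict.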

cheers,
ck