[smartmontools-support] Automatically alert the user on S.M.A.R.T. health warnings

Mon Nov 22 20:22:44 CET 2021

Hi all:

I have a number of PCs to maintain. Every now and then, I log onto each one of them and check whether everything is all right. The most important checks are
whether the disks are getting full, whether there are S.M.A.R.T. warnings, and whether the operating system is up to date.

After some time, I decided to learn a powerful, popular monitoring system that could automatically alert me about these kinds of problems. There
would be a learning curve due to time-series databases etc., but it would be worth it. My very simple and common PC health checks would surely be
built in and ready to use. So I chose Prometheus to start with.

Of course, I should have known better. Nothing is easy or built-in.

But let's focus on S.M.A.R.T. The Prometheus Node Exporter provides metrics for Unix/Linux and is packaged in Ubuntu/Debian. I believe that it uses
smartmontools behind the scenes. See package "prometheus-node-exporter-collectors". The script inside is here:

http://devel.dob.sk/collectd-scripts/

I heard that the Windows Exporter for Microsoft Windows can use smartmontools too, but I haven't confirmed it yet.

The S.M.A.R.T. metrics provided look like this:

smartmon_reallocated_sector_ct_raw_value{disk="/dev/sda",smart_id="5",type="sat"} 0
smartmon_reallocated_sector_ct_threshold{disk="/dev/sda",smart_id="5",type="sat"} 36
smartmon_reallocated_sector_ct_value{disk="/dev/sda",smart_id="5",type="sat"} 100
smartmon_reallocated_sector_ct_worst{disk="/dev/sda",smart_id="5",type="sat"} 100

I can use Prometheus to generate alarms on such metrics. Can someone help me do that?

The advice I am looking for here is not actually specific to Prometheus. The main problem is that I am not familiar with the S.M.A.R.T. values. What 
is the difference between "raw value" and "value"? Which one do I need to compare against "threshold"? Should that be <, <=, > or >= ?
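
To make the question concrete, here is a minimal Python sketch of the check I have in mind. It rests on assumptions that I would like confirmed: that the normalized "value" (not the raw value) is what gets compared, that an attribute counts as failing when value <= threshold, and that a threshold of 0 means the attribute never fails. The function name attribute_failing is just something I made up for illustration.

```python
# Sketch only. Assumption: an attribute is "failing" when its normalized
# value has dropped to or below a non-zero threshold. The raw value is
# deliberately not used here.
def attribute_failing(value: int, threshold: int) -> bool:
    """Return True if the normalized value is at or below the threshold."""
    if threshold == 0:
        # Assumption: a zero threshold means the attribute can never fail.
        return False
    return value <= threshold


# Figures taken from the smartmon_reallocated_sector_ct_* metrics above.
print(attribute_failing(value=100, threshold=36))
# Hypothetical worn disk whose normalized value has sunk below the threshold.
print(attribute_failing(value=30, threshold=36))
```

If that reading is right, the same comparison could presumably be written as a Prometheus alert expression on the "_value" and "_threshold" metrics.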

I am guessing that it is not easy to derive meaningful, comprehensible values (like a temperature in degrees Celsius) from such metrics. But it
does not matter: as soon as an alert comes in, I can go and look at it with some nice S.M.A.R.T. GUI.

If dealing with those metrics is complicated, perhaps there is another way. Modern versions of Windows interpret such metrics themselves
and generate a single boolean health indicator; see:

(Get-WmiObject -Namespace root\wmi -Class MSStorageDriver_FailurePredictStatus).PredictFailure

Is there a way to generate such a simple indicator with smartmontools? I could then run a script at regular intervals, and feed the result to Prometheus.
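
The kind of script I have in mind could look like the Python sketch below. It assumes that "smartctl -H" reports an overall health verdict and that, with "--json", the verdict appears as smart_status.passed in the output; I have not verified this on all my machines. The metric name smartmon_health_passed and both function names are hypothetical.

```python
import json
import subprocess


def health_metric(smartctl_json: str, disk: str) -> str:
    """Turn `smartctl -H --json` output into one Prometheus text-format line."""
    report = json.loads(smartctl_json)
    # Assumption: the JSON carries {"smart_status": {"passed": true|false}}.
    passed = report.get("smart_status", {}).get("passed")
    # 1 = healthy, 0 = failure predicted, -1 = status not reported.
    state = -1 if passed is None else int(bool(passed))
    return f'smartmon_health_passed{{disk="{disk}"}} {state}'


def check_disk(disk: str) -> str:
    """Run smartctl against one disk (usually needs root) and format the result."""
    # smartctl's exit status is a bit mask, so a non-zero code is not treated
    # as an error here; the JSON body is parsed instead.
    result = subprocess.run(
        ["smartctl", "-H", "--json", disk],
        capture_output=True, text=True, check=False,
    )
    return health_metric(result.stdout, disk)


# Canned sample so that the sketch runs without smartctl or root access.
sample = '{"smart_status": {"passed": true}}'
print(health_metric(sample, "/dev/sda"))
```

The resulting line could then be written to a file that the Node Exporter's textfile collector picks up, so that a single alert on a value of 0 would be enough.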

For reference, I have posted a related question to the Prometheus mailing list:

   How to generate alerts for S.M.A.R.T. warnings and errors
   https://groups.google.com/g/prometheus-users/c/fcDG_2Ny7F4

A related GitHub Issue is:

   Feature Request: SMART-values
   https://github.com/prometheus-community/windows_exporter/issues/78

Thanks in advance,
   rdiez