[smartmontools-support] [PATCH v7] scsi: Add hwmon support for SMART temperature sensors

Martin K. Petersen martin.petersen at oracle.com
Sat Nov 24 00:26:28 CET 2018


Hi Linus,

>> The problem with all this is that the storage topology is largely
>> undiscoverable for monitoring purposes. We can use heuristics, but in
>> many cases there is no reliable way to find out that there is an ATA
>> device behind member #3 of a USB-attached RAID controller's virtual disk
>> #5.
>
> OK I guess they just opt out of it?

There is no way to discover the actual topology in a vendor-agnostic
way. See all the -d options to smartctl and how discovery is left as an
exercise to the user in many cases :(

>> You could then argue that the kernel should only provide sensors for a
>> trivial subset of configurations such as direct-attached ATA/SAS/USB
>> devices that provide sufficient heuristics to ensure we don't
>> accidentally send commands down that may wedge the device.
>
> This is what the current patch does ... it's an opt-in per-subsystem.

Yep, but I'm afraid that may not be a good enough granularity. There are
tons of USB devices out there that wedge when you send them a command
they didn't expect (temperature sensor on a USB stick?). So there needs
to be some sort of heuristic in place. In this case the right way to
signal "I'm an ATA device" would be to provide the ATA Information VPD
page. Unfortunately, almost no USB/UAS/FireWire devices fill that out.

To solve other, similar headaches I have been toying with the idea of
teaching usb-storage/uas to present an intermediary libata-like SAT for
the primary (non-I/O) commands instead of relying on the device
implementation doing the right thing. This would clean up some of the
extensive blacklist/whitelist hacks we currently carry in SCSI due to
misbehaving devices.

> I just opted in libata PATA devices (pretty obvious this will work)
> and USB, and tested a bit with different devices there, nothing seems
> to break, the ATA disks behind USB transport works fine and
> report temperature just fine.

libata should be reasonably safe, USB & UAS definitely carry some risk.

>> (It's also worth noting that HDD temperature sensors are notoriously
>> unreliable).
>
> I am sorry if you think that D-Link does bad engineering, what I
> am trying to achieve is upstream support for this device, without
> any out-of-tree patches. The D-Link DIR-685 uses the harddisk
> sensor for this, whether we like it or not.

Sad, but not surprising.

Meanwhile elsewhere we have had drive vendors begging us to ignore the
reported temperature. We have measured in excess of 25 degC difference
between the drive PCB temp sensor and the sensor in the drive bay. We
have also had cases where the drives reported "lp0 on fire" despite the
bay temperature being nominal.

So I think it is imperative that no action is taken in response to the
drive-reported values without explicit opt-in from the user (or in your
NAS case maybe some device tree platform enablement). There is a reason
that all this S.M.A.R.T. stuff is disabled by default and left for the
admin to configure. S.M.A.R.T. and its properly standardized successors
have improved, but so far we have had way too many false positives to
entertain turning it on by default. I would absolutely love for things
to Just Work but unfortunately it hasn't been feasible to go there.

Taking a step back: You are doing this so that you can feed drive
temperature sensor data to hwmon from inside the kernel. Thought about
feeding it temperature data from smartctl in userland instead?  Just
wondering if the path of least resistance is leveraging the 70Klines of
existing temperature sensor reading heuristics that is smartmontools
(admittedly some of that is non-Linux support).

> I hope this is possible without having to buy and implement the same
> mechanism also for SCSI drives. I don't have any SCSI devices...

# modprobe scsi_debug
# sg_logs /dev/sda
    Linux     scsi_debug        0184
Supported log pages  (spc-2) [0x0]:
    0x00        Supported log pages
    0x0d        Temperature
    0x2f        Informational exceptions (SMART)

> Am I right in that the modepages for libata is the stuff inside
> drivers/ata/libata-scsi.c, like the stuff on the very top with the
> cache_mpage[] and def_control_mpage[]?

Essentially, yes. Except the temperature is in a log page and not a mode
(device configuration) page. Pretty much the same thing.

> These are all generated in response to the ata_scsiop_mode_sense()
> callback from ata_scsi_simulate() in response to MODE_SENSE and
> MODE_SENSE_10 commands.

Yeah, so you'd have to mirror this to implement LOG SENSE.

> - Add a case for LOG_SENSE (0x4d) in ata_scsi_simulate()
>
> - Prepare a callback and provide a mode page 0x0d from there.
>
> - Provide a modepage 0x0d in response to that command from SCSI.

Log page, yes.

> - Implement some code to request and deal with that modepage in
> drivers/scsi to register the hwmon sensor

Correct.

> Architecturally I see the upside of this, but I also see a problem:
> the modepage simulation would be useful not only for libata but also
> (as is proved by testing with USB cradles) from USB storage as
> well. But I guess I can figure that out, it's essentially just a piece
> of libata that USB need to share.

Yep, that's why I suggested making it a "libsmart" so that the code
could be leveraged by usb-storage/uas.

> The thing/command I pass in now is ATA_16 (0x85) 16-byte pass-thru, I
> take it that a ATA_16 pass thru is NOT a proper command or modepage
> but something like an uglyhack?...

ATA_16 is a pass-thru command defined in the T10 SAT (SCSI-ATA
Translation) specification. It acts as a conduit for sending an
encapsulated ATA command through a SCSI device.

You also need to add the ability to stop sending these commands if they
fail. Sending unsupported commands to a device can be quite disruptive
and impact performance for other I/O. So if the passthrough fails, you
should permanently disable the feature for that device. That's how we
usually deal with these things since there often isn't a way to
determine up front if something is going to work or not.

-- 
Martin K. Petersen	Oracle Linux Engineering



More information about the Smartmontools-support mailing list