[smartmontools-support] Differences in test report results: xselftest vs selftest

Sat Apr 27 18:37:54 CEST 2024

Carlos E. R. wrote:
> On 2024-04-26 13:59, Daniel Fishman wrote:
>>
>> On 4/25/24 18:58, Christian Franke wrote:
>>
>>>
>>> Are any read errors reported in the SMART error logs?
>>>
>>> If not, run a read test with badblocks or GNU ddrescue. If read
>>> errors occur, the last ones should appear in the error logs. Errors
>>> detected only by self-tests usually do not.
>>>
>>
>> Indeed, there were some errors in the "SMART Extended Comprehensive
>> Error Log" (the xerror log), but those reported blocks were readable.
>> I ran badblocks (in read-only mode), and it did find some bad blocks,
>> which appeared in the log above as well - and those ones were
>> unreadable (with hdparm).
>>
>> Strange thing though: when I test a reported block (one of those that
>> appeared after running badblocks) with
>> 'hdparm --read-sector <num> /dev/sdc', I get an error, and
>> 'Device Error Count' in the xerror log increases. But when I run what
>> I think should be an equivalent dd command:
>>
>> 'dd if=/dev/sdc of=/dev/null bs=512 count=1 seek=<num>'
>>
>> I neither get an error, nor does the device error count increase. The
>> same is true when I convert the block number to a file system block
>> (using the (int)((L-S)*512/B) formula from the wiki) and adjust the
>> command as follows (B is 4096 in my case):
>>
>> 'dd if=/dev/sdc1 of=/dev/null bs=4096 count=1 seek=<adjusted num>'
>>

Always use 'iflag=direct' to set the O_DIRECT open() flag, which should
bypass OS buffering. Otherwise the OS may return the buffered result of
an earlier successful read.
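
A minimal sketch of such direct reads, assuming GNU dd and a disk with
512-byte logical sectors. Note that dd's 'skip=' operand sets the offset
in the input, whereas 'seek=' sets the offset in the output, so 'skip='
is the operand that selects the sector to read. The function names are
made up for illustration; the conversion is the (int)((L-S)*512/B)
formula quoted above:

```shell
#!/bin/sh
# Hypothetical helpers, not part of smartmontools or coreutils.

# Read one 512-byte sector at absolute LBA $2 from whole-disk device $1,
# bypassing the OS cache. Assumes GNU dd and 512-byte logical sectors.
read_sector() {
    dd if="$1" of=/dev/null bs=512 count=1 skip="$2" iflag=direct
}

# Convert absolute LBA $1 to a file system block number, given the
# partition start sector $2 and the file system block size $3 in bytes:
# (int)((L-S)*512/B)
lba_to_fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# Read file system block $2 (of size $3 bytes) from partition device $1.
read_fs_block() {
    dd if="$1" of=/dev/null bs="$3" count=1 skip="$2" iflag=direct
}
```

Called as 'read_sector /dev/sdc <num>' (as root), a failing read should
then show up as an I/O error from dd and in the kernel log.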

In the past, I have occasionally seen disks with weak sectors where only
a small fraction of reads fail. This may be mostly hidden by the OS due
to its internal read retries, but should be visible in the kernel log
and the disk error log. The time limit for retries done by the disk
firmware itself (ERC, Error Recovery Control) can usually be configured
with 'smartctl -l scterc,...', see man page. The problem with such weak
sectors is that the firmware may not reallocate the sector upon the next
write command.
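
As a rough sketch of the ERC setting, assuming the drive supports SCT
Error Recovery Control at all (not every drive does, and the setting may
not persist across a power cycle). The wrapper names are hypothetical;
only the 'smartctl -l scterc' option itself comes from smartmontools:

```shell
#!/bin/sh
# Hypothetical wrappers around 'smartctl -l scterc'.

# Show the current SCT Error Recovery Control settings of device $1.
show_erc() {
    smartctl -l scterc "$1"
}

# Set read ($2) and write ($3) ERC time limits, given in seconds, on $1.
# smartctl expects the values in units of 100 ms, hence the factor 10.
set_erc() {
    smartctl -l scterc,"$(( $2 * 10 ))","$(( $3 * 10 ))" "$1"
}
```

For example, 'set_erc /dev/sdc 7 7' would request a 7 second limit for
both reads and writes.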

Another source of read errors could be seek or head problems, which may
result in occasional read errors at 'random' LBAs spread over the whole
disk. An interesting case from a few weeks ago: a 4TB Seagate
ST4000NM0265 with a massive number of such read problems could be
completely recovered (except for one single sector) by running (the
Cygwin version of) GNU ddrescue for about two weeks 24x7. This required
changing the ddrescue parameters several times and reducing the disk's
ERC time.

>> Is there a reason for that? I suppose there is a good reason why the
>> wiki instructs to use 'dd', and hdparm only if that doesn't work
>> (probably because it adds confidence that there was no error in
>> converting the LBA to a file system block) - and therefore I would
>> also prefer to use dd both for actually overwriting the block and for
>> reading it (and getting the error), to make sure that I don't fix
>> the wrong thing.
>
> I wonder if this might work:
>
> Read the supposed bad sector with dd into a temporary file, then write
> that temporary file back into the same sector. That way it doesn't
> matter if the sector is not actually bad.

Alternatively, find out which file is affected by the bad sector with
'ifind' and 'ffind' from sleuthkit. Copy the file if possible and then
overwrite the original in place with 'shred'. The latter possibly gets
the bad sector remapped and does not require root rights. See this real
world use case:
https://www.smartmontools.org/wiki/BadBlockHowto#RecoveringamostlyunreadablesectorofaNotebookHDD
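
The lookup could be sketched roughly as follows, assuming sleuthkit is
installed and the file system block number (not the raw LBA) of the bad
sector is already known. The function names are hypothetical, and the
single shred pass ('-n 1') is an assumption rather than a requirement:

```shell
#!/bin/sh
# Hypothetical helpers built on sleuthkit's ifind/ffind and on shred.

# Print the name of the file that owns file system block $2 on the
# partition (or image) $1. Reading a raw device usually needs root.
file_for_block() {
    # ifind -d maps a data unit (block) address to an inode number.
    fb_inode=$(ifind -d "$2" "$1") || return 1
    # ffind maps the inode number back to a file name.
    ffind "$1" "$fb_inode"
}

# Overwrite file $1 in place with a single shred pass, without removing it.
overwrite_in_place() {
    shred -v -n 1 "$1"
}
```

Only the lookup on the raw device needs elevated rights; the overwrite
itself runs as the owner of the file.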