The Death and Diagnosis of a Dying Hard Drive - Is S.M.A.R.T. useful?

SUMMARY:

I pose the question in the subject because, over the years of running smartd and having disks fail, I have always first been alerted to bad sectors and the like through dmesg or logcheck. Even with the bad disk I currently have, smartd does not pick up any errors except those that the kernel writes to syslog.


LKML INFO:

I've cc'd the LKML to show the kernel messages logged when a disk is failing. I have received similar stat errors before, but those were due to buffer or other disk issues.


[4485617.826000] ata2: status=0x51 { DriveReady SeekComplete Error }
[4485619.292000] ata2: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
[4485619.292000] ata2: status=0x51 { DriveReady SeekComplete Error }
[4485620.749000] ata2: translated ATA stat/err 0x51/40 to SCSI SK/ASC/ASCQ 0x3/11/04
[4485620.749000] ata2: status=0x51 { DriveReady SeekComplete Error }
[4494582.951000] ata2: command 0x25 timeout, stat 0x50 host_stat 0x22
[4494831.267000] ata2: command 0x25 timeout, stat 0x50 host_stat 0x22
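
The status bytes in these messages can be decoded by hand. Below is a minimal sketch, assuming the classic ATA status register bit layout. The names Busy, DriveReady, SeekComplete and Error appear in the logs in this post; the remaining names and the function name decode_ata_status are only labels picked for illustration.

```shell
#!/bin/sh
# Decode an ATA status register byte into flag names.
# Assumption: classic ATA status bit layout (0x80 = busy ... 0x01 = error).
decode_ata_status() {
    status=$(( $1 ))
    out=""
    [ $(( status & 0x80 )) -ne 0 ] && out="$out Busy"
    [ $(( status & 0x40 )) -ne 0 ] && out="$out DriveReady"
    [ $(( status & 0x20 )) -ne 0 ] && out="$out DeviceFault"
    [ $(( status & 0x10 )) -ne 0 ] && out="$out SeekComplete"
    [ $(( status & 0x08 )) -ne 0 ] && out="$out DataRequest"
    [ $(( status & 0x04 )) -ne 0 ] && out="$out CorrectedError"
    [ $(( status & 0x02 )) -ne 0 ] && out="$out Index"
    [ $(( status & 0x01 )) -ne 0 ] && out="$out Error"
    # Strip the leading space left by the first appended name.
    echo "${out# }"
}

decode_ata_status 0x51   # prints: DriveReady SeekComplete Error
decode_ata_status 0x50   # prints: DriveReady SeekComplete
```

So stat 0x51 is a ready drive reporting an error, while the stat 0x50 on the timeout lines has no error bit set at all.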

--------------

Now for the problem and analysis:

1] SMARTMONTOOLS: I pose the following question: Is running the smartd daemon
with short and long S.M.A.R.T. tests enough?

2] FAILED HARD DRIVE: A Maxtor of course! (1.38 years old)
------------------------- snip -------------------------------------------------
Model Family:     Maxtor DiamondMax 10 family
Device Model:     Maxtor 6B250S0
Serial Number:    ******** (out of warranty on 02/19/2006)
Firmware Version: BANC1B70
User Capacity:    251,000,193,024 bytes
------------------------- snip -------------------------------------------------

3] DMESG DATA DUMP: Occurred while [reading] a file from the HDD.
------------------------- snip -------------------------------------------------
ATA: abnormal status 0x80 on port 0xC807
ATA: abnormal status 0x80 on port 0xC807
ATA: abnormal status 0x80 on port 0xC807
ata2: command 0x25 timeout, stat 0x80 host_stat 0x21
ata2: translated ATA stat/err 0x80/00 to SCSI SK/ASC/ASCQ 0xb/47/00
ata2: status=0x80 { Busy }
sd 2:0:0:0: SCSI error: return code = 0x8000002
sdc: Current: sense key=0xb
    ASC=0x47 ASCQ=0x0
end_request: I/O error, dev sdc, sector 130483823
ATA: abnormal status 0x80 on port 0xC807
ATA: abnormal status 0x80 on port 0xC807
ATA: abnormal status 0x80 on port 0xC807
ata2: command 0x25 timeout, stat 0x50 host_stat 0x21
------------------------- snip -------------------------------------------------
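
The end_request line names one failing sector, so it can be re-read on its own rather than scanning the whole disk. A rough sketch, assuming 512-byte sectors and that the sector number is relative to the whole device (the message says dev sdc, not a partition). The function names are made up for illustration, and iflag=direct is a GNU dd option that may not exist in other dd implementations.

```shell
#!/bin/sh
SECTOR_SIZE=512   # assumption: 512-byte logical sectors

# Byte offset of a sector from the start of the device.
sector_to_byte_offset() {
    echo $(( $1 * SECTOR_SIZE ))
}

# Print (do not run) a dd command that reads only the given sector,
# bypassing the page cache so the drive itself is asked for the data.
reread_sector_cmd() {
    dev=$1
    sector=$2
    echo "dd if=$dev of=/dev/null bs=$SECTOR_SIZE skip=$sector count=1 iflag=direct"
}

sector_to_byte_offset 130483823        # prints: 66807717376
reread_sector_cmd /dev/sdc 130483823
```

If the printed dd command fails with an I/O error when run as root, the sector is still unreadable. If it succeeds, the earlier error may have been a transient timeout.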

4] SMARTCTL-SHORT TEST: The short test shows nothing wrong with the drive.
------------------------- snip -------------------------------------------------
# smartctl -d ata -t short /dev/sdc
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     12097         -
------------------------- snip -------------------------------------------------

5] SMARTCTL-LONG TEST:
------------------------- snip -------------------------------------------------
# smartctl -d ata -t long /dev/sdc
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     12099         -
------------------------- snip -------------------------------------------------
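
For reference, smartctl -t only starts a self-test. The table shown in these steps is the self-test log, which smartctl -l selftest prints once the test has finished. A small sketch for pulling the newest entry out of that log follows. The helper name latest_selftest is hypothetical, and it assumes the columns are separated by two or more spaces, as in the output above.

```shell
#!/bin/sh
# Read self-test log text on stdin and print the type and status of the
# most recent entry, i.e. the line numbered "# 1".
# Assumption: columns are separated by runs of two or more spaces.
latest_selftest() {
    awk -F '  +' '$1 == "# 1" { print $2 ": " $3; exit }'
}

# With a real device (needs root and smartmontools):
#   smartctl -d ata -l selftest /dev/sdc | latest_selftest
printf '%s\n' \
  '# 1  Extended offline    Completed without error       00%     12099         -' \
  | latest_selftest
# prints: Extended offline: Completed without error
```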

6] TRY ANOTHER METHOD: USE DD.
------------------------- snip -------------------------------------------------
# /usr/bin/time dd if=/dev/sdc bs=4096 | pipebench > /x6/failed_hdd.img
# This also checked out, but there were some interesting messages in dmesg:
ata3: no sense translation for status: 0x51
ata3: translated ATA stat/err 0x51/00 to SCSI SK/ASC/ASCQ 0x3/11/04
ata3: status=0x51 { DriveReady SeekComplete Error }
------------------------- snip -------------------------------------------------
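
The SK/ASC/ASCQ triples in these kernel messages are standard SCSI sense codes. Here is a tiny lookup covering only the two triples that appear in this post, using the descriptions from the SCSI sense code tables as I understand them. The function name describe_sense is made up for illustration.

```shell
#!/bin/sh
# Map the SK/ASC/ASCQ triples seen in this post to their SCSI meaning.
# Only the two combinations from the logs above are covered.
describe_sense() {
    case "$1" in
        0x3/11/04) echo "Medium Error: unrecovered read error, auto reallocate failed" ;;
        0xb/47/00) echo "Aborted Command: SCSI parity error" ;;
        *)         echo "not in this table" ;;
    esac
}

describe_sense 0x3/11/04
describe_sense 0xb/47/00
```

Read that way, 0x3/11/04 points at the media itself, while 0xb/47/00 is the translation the kernel chose for the timeout with the drive still busy.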

7] CHECK WITH BADBLOCKS(READ-ONLY)...?
------------------------- snip -------------------------------------------------
# /usr/bin/time badblocks -b 512 -s -v /dev/sdc
Checking blocks 0 to 490234752
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found.
5.56user 439.85system 1:31:29elapsed 8%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+230minor)pagefaults 0swaps
# mount -a
------------------------- snip -------------------------------------------------
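
As a sanity check on the read-only pass, the block count times the block size should equal the drive capacity reported by smartctl, and the elapsed time gives a rough average read rate. A quick sketch using only the numbers printed above (the function names are illustrative):

```shell
#!/bin/sh
BLOCKS=490234752    # from "Checking blocks 0 to 490234752"
BLOCK_SIZE=512      # from "-b 512"

# Total bytes covered by the scan.
scan_bytes() {
    echo $(( BLOCKS * BLOCK_SIZE ))
}

# Convert h m s to seconds.
elapsed_seconds() {
    echo $(( $1 * 3600 + $2 * 60 + $3 ))
}

bytes=$(scan_bytes)
secs=$(elapsed_seconds 1 31 29)     # 1:31:29 elapsed
echo "bytes scanned: $bytes"        # prints: bytes scanned: 251000193024
echo "average MB/s:  $(( bytes / secs / 1000000 ))"   # integer division, truncated
```

The byte count matches the 251,000,193,024 bytes of User Capacity, so the read-only pass did cover the whole disk and still reported nothing.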


8] CHECK WITH BADBLOCKS(READ+WRITE)...?
------------------------- snip -------------------------------------------------

# /usr/bin/time badblocks -b 512 -s -v -w /dev/sdc
Checking for bad blocks in read-write mode
From block 0 to 490234752
Testing with pattern 0xaa:       369800128/      490234752
------------------------- snip -------------------------------------------------
After 12 hours of testing, FINALLY, it says I have a bad disk, see below.
233537658
233537659
233537660
233537661
233537662
233537663
done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 26368 bad blocks found.
1496.54user 3582.18system 12:14:45elapsed 11%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (2major+282minor)pagefaults 0swaps

-- Also in dmesg:

System Events
=-=-=-=-=-=-=
Jun  9 23:14:51 p34 smartd[32213]: Device: /dev/sdc, 1 Currently unreadable (pending) sectors
Jun  9 23:44:52 p34 smartd[32213]: Device: /dev/sdc, 1 Currently unreadable (pending) sectors

------------------------- snip -------------------------------------------------
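
The smartd message about pending sectors comes from the SMART attribute table rather than from the self-tests, so the same counter can be checked directly. A minimal sketch, assuming the usual smartctl -A layout in which the attribute ID is the first column and the raw value is the tenth, and assuming attribute 197 is Current_Pending_Sector on this drive. Both function names are hypothetical.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print the raw value of one attribute.
# Assumption: ID is column 1 and RAW_VALUE is column 10.
smart_raw_value() {
    awk -v id="$1" '$1 == id { print $10; exit }'
}

# Succeeds (exit status 0) only when attribute 197 reports zero pending
# sectors, or when the attribute is not listed at all.
pending_sectors_ok() {
    raw=$(smart_raw_value 197)
    [ "${raw:-0}" -eq 0 ]
}

# With a real device (needs root and smartmontools):
#   smartctl -d ata -A /dev/sdc | pending_sectors_ok || echo "pending sectors reported"
```

A check like this, run from cron or logcheck, would flag the disk on the attribute alone, regardless of what the short and long self-tests say.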

9] Now review the SMART log again!
------------------------- snip -------------------------------------------------
Error 252 occurred at disk power-on lifetime: 11354 hours (473 days + 2 hours)

When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  78 00 08 b0 19 eb e0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  00 00 08 b0 19 eb e0 00      01:39:12.107  NOP [Abort queued commands]
  00 00 08 b0 19 eb e0 00      01:39:10.649  NOP [Abort queued commands]
  00 00 08 b0 19 eb e0 00      01:39:09.191  NOP [Abort queued commands]
  00 00 08 b0 19 eb e0 00      01:39:07.716  NOP [Abort queued commands]
  00 00 08 b0 19 eb e0 00      01:39:06.258  NOP [Abort queued commands]

Error 251 occurred at disk power-on lifetime: 11354 hours (473 days + 2 hours)

When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  78 00 08 b0 19 eb e0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  00 00 08 b0 19 eb e0 00      01:39:10.649  NOP [Abort queued commands]
  00 00 08 b0 19 eb e0 00      01:39:09.191  NOP [Abort queued commands]
  00 00 08 b0 19 eb e0 00      01:39:07.716  NOP [Abort queued commands]
  00 00 08 b0 19 eb e0 00      01:39:06.258  NOP [Abort queued commands]
  00 00 08 b0 19 eb e0 00      01:39:04.791  NOP [Abort queued commands]
------------------------- snip -------------------------------------------------
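
The power-on lifetime stamps make it possible to place these logged errors relative to the self-tests. A short sketch of the arithmetic, assuming the error log and the self-test log use the same power-on-hours counter (the function name is illustrative):

```shell
#!/bin/sh
# Convert a power-on hour count into days plus leftover hours.
hours_to_days() {
    echo "$(( $1 / 24 )) days + $(( $1 % 24 )) hours"
}

hours_to_days 11354                 # prints: 473 days + 2 hours
# Gap between these errors and the long test at 12117 hours in step 11:
hours_to_days $(( 12117 - 11354 ))  # prints: 31 days + 19 hours
```

Under that assumption, errors 251 and 252 were logged about a month of power-on time before this round of testing, not during it.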

10] What about those self-tests, do they find anything now? Nope.
------------------------- snip -------------------------------------------------
# smartctl -d ata -t short /dev/sdc
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     12116         -
------------------------- snip -------------------------------------------------

11] What about the long test? Does not find anything.
------------------------- snip -------------------------------------------------
# smartctl -d ata -t long /dev/sdc
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     12117         -
------------------------- snip -------------------------------------------------

After all of this testing, I must pose the question to all of those who run
smartd: is it worth running scheduled short/long tests if they do not find the
errors that badblocks did?


Please advise.

Thanks,

Justin.

