Fedora Users — Re: understanding smart logs

Suvayu Ali wrote:
> Hi everyone,
>
> Some background:
> Recently my RAM went bad, and I realised it too late. Over the last
> few days my desktop had crashed more than once. Yesterday I received
> the replacement RAM from the RMA. On installing it and turning on my
> machine I noticed errors like these,
>
> Device: /dev/sdb [SAT], 172 Currently unreadable (pending) sectors
>
> The errors started at about the time my desktop began crashing,
> before I found the faulty RAM.
>
> The problem:
> On subsequent boots the system failed to start, with fsck complaining
> about disk read errors during a forced disk check. I was dropped to a
> read-only shell to troubleshoot every time, so I ran fsck on all my
> partitions and found errors on my /home. The error messages said "inode
> has deleted or empty entries clear", "unlinked inode entries" and so on.
> Since I was on a read-only partition I couldn't save them to a file (I
> guess paper would have worked :-p). When prompted by fsck to fix the
> errors, I answered yes.
>
> On a reboot, my system booted properly but I had lost some very 
> important data. All the missing directories were the ones which fsck had 
> complained about. I restored whatever I could from some backups.
>
> To confirm that this was a one-off incident and that my disk hasn't
> gone bad, I ran a SMART test (the drive is only a few months old):
> # smartctl -t long /dev/sdb
>
> But after the test I can't understand the log output:
>
>   
>> # smartctl -a /dev/sdb
>> smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
>> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>>
>> === START OF INFORMATION SECTION ===
>> Model Family:     Western Digital Caviar Black family
>> Device Model:     WDC WD1001FALS-00E8B0
>> Serial Number:    WD-WMATV5966482
>> Firmware Version: 05.00K05
>> User Capacity:    1,000,204,886,016 bytes
>> Device is:        In smartctl database [for details use: -P show]
>> ATA Version is:   8
>> ATA Standard is:  Exact ATA specification draft version not indicated
>> Local Time is:    Sat Aug 14 19:37:26 2010 PDT
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>>
>> === START OF READ SMART DATA SECTION ===
>> SMART overall-health self-assessment test result: PASSED
>>
>> General SMART Values:
>> Offline data collection status:  (0x84)	Offline data collection activity
>> 					was suspended by an interrupting command from host.
>> 					Auto Offline Data Collection: Enabled.
>> Self-test execution status:      ( 121)	The previous self-test completed having
>> 					the read element of the test failed.
>> Total time to complete Offline
>> data collection: 		 (18000) seconds.
>> Offline data collection
>> capabilities: 			 (0x7b) SMART execute Offline immediate.
>> 					Auto Offline data collection on/off support.
>> 					Suspend Offline collection upon new
>> 					command.
>> 					Offline surface scan supported.
>> 					Self-test supported.
>> 					Conveyance Self-test supported.
>> 					Selective Self-test supported.
>> SMART capabilities:            (0x0003)	Saves SMART data before entering
>> 					power-saving mode.
>> 					Supports SMART auto save timer.
>> Error logging capability:        (0x01)	Error logging supported.
>> 					General Purpose Logging supported.
>> Short self-test routine
>> recommended polling time: 	 (   2) minutes.
>> Extended self-test routine
>> recommended polling time: 	 ( 208) minutes.
>> Conveyance self-test routine
>> recommended polling time: 	 (   5) minutes.
>> SCT capabilities: 	       (0x3037)	SCT Status supported.
>> 					SCT Feature Control supported.
>> 					SCT Data Table supported.
>>
>> SMART Attributes Data Structure revision number: 16
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       1354
>>   3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       1158
>>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       40
>>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
>>   9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1403
>>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       38
>> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       21
>> 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       18
>> 194 Temperature_Celsius     0x0022   112   107   000    Old_age   Always       -       38
>> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
>> 197 Current_Pending_Sector  0x0032   199   199   000    Old_age   Always       -       172
>> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
>> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
>> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>>
>> SMART Error Log Version: 1
>> No Errors Logged
>>
>> SMART Self-test log structure revision number 1
>> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
>> # 1  Extended offline    Completed: read failure       90%      1393         1106820646
>>
>> SMART Selective self-test log data structure revision number 1
>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>     1        0        0  Not_testing
>>     2        0        0  Not_testing
>>     3        0        0  Not_testing
>>     4        0        0  Not_testing
>>     5        0        0  Not_testing
>> Selective self-test flags (0x0):
>>   After scanning selected spans, do NOT read-scan remainder of disk.
>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>     
>
> All the values in the table above seem larger than the threshold. But
> the report says PASSED. I'm not clear how to interpret this. Could
> someone help? Thanks a lot in advance.
>
>   
Do you have a good backup of this drive?  It looks like it needs to be
retested in a different machine and, if it fails, replaced.

I had a drive that exhibited the same behavior, and it eventually failed.
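
On reading the table itself: VALUE, WORST and THRESH are normalized
numbers where higher is better, so a VALUE above THRESH is the healthy
case. An attribute only counts as failing once VALUE drops to THRESH or
below, which is broadly why the overall result can still say PASSED. The
column to watch in your output is RAW_VALUE: 172 on Current_Pending_Sector,
together with the extended self-test stopping on a read failure, is the
reason to retest. Below is a rough Python sketch of that reading. It is
not part of smartmontools; the function names, the regex, and the choice
of IDs 5, 197 and 198 as the raw counters worth watching are my own
assumptions.

```python
import re

# One row of the attribute table from `smartctl -A`, tolerating leading
# mail-quote markers such as ">> ". Columns: ID, name, flag, VALUE, WORST,
# THRESH, type, updated, when_failed, RAW_VALUE.
ROW = re.compile(
    r"^[>\s]*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

# Assumption: these sector-health counters are worth flagging whenever
# their raw count is non-zero (reallocated, pending, offline uncorrectable).
WATCH_RAW = {5, 197, 198}

SAMPLE = """\
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       1354
>>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>> 197 Current_Pending_Sector  0x0032   199   199   000    Old_age   Always       -       172
>> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""


def parse_attributes(text):
    """Return one dict per attribute row found in the given text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "value": int(m.group(3)),
                "worst": int(m.group(4)),
                "thresh": int(m.group(5)),
                "raw": int(m.group(9)),
            })
    return rows


def find_problems(rows):
    """Return human-readable notes for rows that look unhealthy."""
    notes = []
    for r in rows:
        # A threshold of 0 means the attribute has no failure threshold.
        if r["thresh"] > 0 and r["value"] <= r["thresh"]:
            notes.append(
                f"{r['name']}: VALUE {r['value']} is at or below THRESH {r['thresh']}"
            )
        if r["id"] in WATCH_RAW and r["raw"] > 0:
            notes.append(f"{r['name']}: raw count is {r['raw']}")
    return notes


for note in find_problems(parse_attributes(SAMPLE)):
    print(note)
```

Fed the four rows above from your output, it flags only
Current_Pending_Sector. To run it against a live drive you would pass it
the text of `smartctl -A /dev/sdb` instead of SAMPLE.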

James McKenzie
