Suvayu Ali wrote:
> Hi everyone,
>
> Some background:
> Recently my RAM went bad, and I realised it too late. Over the last
> few days my desktop had crashed more than once. Yesterday I received
> the replacement RAMs from RMA. On installing them and turning on my
> machine I noticed errors like these,
>
> Device: /dev/sdb [SAT], 172 Currently unreadable (pending) sectors
>
> And I see that the errors started around the time my desktop
> started crashing, before I found the faulty RAMs.
>
> The problem:
> On subsequent boots it failed to boot, fsck complaining about disk read
> errors during a forced disk check. I was dropped to a read-only shell to
> troubleshoot every time, so I ran fsck on all my partitions and found
> errors on my /home. The error messages said "inode has deleted or empty
> entries clear", "unlinked inode entries" and so on. Since I was on a
> read-only partition I couldn't save them to a file (I guess paper would
> have worked :-p). When prompted by fsck to fix the errors, I answered yes.
>
> On a reboot, my system booted properly but I had lost some very
> important data. All the missing directories were the ones which fsck had
> complained about. I restored whatever I could from some backups.
>
> To confirm this was a one-off incident and that my disk hasn't gone bad,
> I ran SMART tests (this is a drive only a few months old):
> # smartctl -t long /dev/sdb
>
> But after the test I can't understand the output of the logs,
>
>
>> # smartctl -a /dev/sdb
>> smartctl 5.39.1 2010-01-28 r3054 [x86_64-redhat-linux-gnu] (local build)
>> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>>
>> === START OF INFORMATION SECTION ===
>> Model Family:     Western Digital Caviar Black family
>> Device Model:     WDC WD1001FALS-00E8B0
>> Serial Number:    WD-WMATV5966482
>> Firmware Version: 05.00K05
>> User Capacity:    1,000,204,886,016 bytes
>> Device is:        In smartctl database [for details use: -P show]
>> ATA Version is:   8
>> ATA Standard is:  Exact ATA specification draft version not indicated
>> Local Time is:    Sat Aug 14 19:37:26 2010 PDT
>> SMART support is: Available - device has SMART capability.
>> SMART support is: Enabled
>>
>> === START OF READ SMART DATA SECTION ===
>> SMART overall-health self-assessment test result: PASSED
>>
>> General SMART Values:
>> Offline data collection status:  (0x84) Offline data collection activity
>>                                         was suspended by an interrupting command from host.
>>                                         Auto Offline Data Collection: Enabled.
>> Self-test execution status:      ( 121) The previous self-test completed having
>>                                         the read element of the test failed.
>> Total time to complete Offline
>> data collection:                 (18000) seconds.
>> Offline data collection
>> capabilities:                    (0x7b) SMART execute Offline immediate.
>>                                         Auto Offline data collection on/off support.
>>                                         Suspend Offline collection upon new
>>                                         command.
>>                                         Offline surface scan supported.
>>                                         Self-test supported.
>>                                         Conveyance Self-test supported.
>>                                         Selective Self-test supported.
>> SMART capabilities:            (0x0003) Saves SMART data before entering
>>                                         power-saving mode.
>>                                         Supports SMART auto save timer.
>> Error logging capability:        (0x01) Error logging supported.
>>                                         General Purpose Logging supported.
>> Short self-test routine
>> recommended polling time:        (   2) minutes.
>> Extended self-test routine
>> recommended polling time:        ( 208) minutes.
>> Conveyance self-test routine
>> recommended polling time:        (   5) minutes.
>> SCT capabilities:              (0x3037) SCT Status supported.
>>                                         SCT Feature Control supported.
>>                                         SCT Data Table supported.
>>
>> SMART Attributes Data Structure revision number: 16
>> Vendor Specific SMART Attributes with Thresholds:
>> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>>   1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       1354
>>   3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       1158
>>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       40
>>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
>>   9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1403
>>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       38
>> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       21
>> 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       18
>> 194 Temperature_Celsius     0x0022   112   107   000    Old_age   Always       -       38
>> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
>> 197 Current_Pending_Sector  0x0032   199   199   000    Old_age   Always       -       172
>> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
>> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
>> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
>>
>> SMART Error Log Version: 1
>> No Errors Logged
>>
>> SMART Self-test log structure revision number 1
>> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
>> # 1  Extended offline    Completed: read failure       90%      1393         1106820646
>>
>> SMART Selective self-test log data structure revision number 1
>>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>>     1        0        0  Not_testing
>>     2        0        0  Not_testing
>>     3        0        0  Not_testing
>>     4        0        0  Not_testing
>>     5        0        0  Not_testing
>> Selective self-test flags (0x0):
>>   After scanning selected spans, do NOT read-scan remainder of disk.
>> If Selective self-test is pending on power-up, resume after 0 minute delay.
>>
>
> All the values in the table above seem larger than the threshold. But
> the report says PASSED. I'm not clear how to interpret this. Could
> someone help? Thanks a lot in advance.
>
>

Got a good backup of this drive? Looks like it needs to be retested in a
different machine and, if it fails, replaced.

I had a drive that exhibited the same behavior, and eventually it failed.

James McKenzie
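
On the "values larger than the threshold" question: as far as I understand
the smartctl documentation, the normalized VALUE column counts down as the
drive degrades, and an attribute is only reported as failed once VALUE drops
to THRESH or below. So every VALUE being above its THRESH is what PASSED
means here. That verdict says nothing about the raw count of 172 pending
sectors or the read failure in the self-test log, which are the parts that
justify retesting. A rough Python sketch of that reading follows; the rows are
copied from the output quoted above, the function names are made up for
illustration, and it assumes the raw values are plain integers (which is not
true of every drive).

```python
# A few rows copied from the `smartctl -a` attribute table quoted above.
SMART_ROWS = """\
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       1354
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   199   199   000    Old_age   Always       -       172
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""


def parse_rows(text):
    """Parse attribute rows into dicts (hypothetical helper, not part of smartmontools)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a 10-column attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),   # normalized value, higher is better
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": int(fields[9]),     # assumption: raw value is a plain integer
        })
    return rows


def failed_attributes(rows):
    """Names of attributes whose normalized VALUE is at or below THRESH."""
    return [r["name"] for r in rows if r["value"] <= r["thresh"]]


def pending_sectors(rows):
    """Raw count for attribute 197 (Current_Pending_Sector), or None if absent."""
    for r in rows:
        if r["id"] == 197:
            return r["raw"]
    return None


if __name__ == "__main__":
    rows = parse_rows(SMART_ROWS)
    print("attributes at or below threshold:", failed_attributes(rows))
    print("pending sectors (raw):", pending_sectors(rows))
```

For these rows nothing is at or below its threshold, which matches the PASSED
line, while the pending-sector count is still 172.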