Fedora Users — Re: kernel crash

On Tue, 2010-08-17 at 12:48 -0400, Steve Blackwell wrote:
> On Tue, 17 Aug 2010 12:05:44 -0400
> Steve Blackwell <zephod@xxxxxxxxxx> wrote:
> 
> > On Tue, 17 Aug 2010 18:07:18 +0300
> > Gilboa Davara <gilboad@xxxxxxxxx> wrote:
> > 
> > > On Tue, 2010-08-17 at 09:44 -0400, Steve Blackwell wrote:
> > > > I leave my computer on 24/7 so that my backups can run at night.
> > > > Lately, it has been crashing during the night usually leaving no
> > > > trace of what happened. Last night it crashed but left this
> > > > in /var/log/messages:
> > > > 
> > > > > Aug 17 01:04:56 steve kernel: INFO: task kjournald:1960 blocked for more than 120 seconds.
> > > > > Aug 17 01:04:56 steve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > 
> > > > > Could a hard drive get shut down because it was getting too hot?
> > > > > What would be a normal temp for a hard drive that has just
> > > > > completed a backup? 124C seems really hot. The HD cooling fan
> > > > > > had been broken so I replaced it this past weekend but it
> > > > > doesn't seem to have helped. Too late? Permanent HD damage
> > > > > already done?
> > > > Any other comments or suggestions?
> > > 
> > > Hello Steve,
> > > 
> > > This is not a crash.
> > > > The kjournald kernel process (which handles various file-system
> > > > tasks) was blocked for more than 120 seconds. Your assumption that
> > > > the HD went into some type of sleep/suspend mode during a write
> > > > sounds reasonable to me.
> > > 
> > > 124C seems -very- hot. Even during heavy I/O.
> > > > Two things spring to mind:
> > > > A. Is it a normal desktop SATA drive or a high-speed SCSI/SAS drive?
> > > B. Please post the SMART log of the drive. (smartctl -a /dev/sdX). 
> > > 
> > > - Gilboa
> > > 
> > 
> > Hello Gilboa,
> > 
> > Yes, I realize that it was not a crash. When I first saw the kernel
> > messages I thought it was and started writing the e-mail. I neglected
> > to correct the subject line after I actually read the messages. Sorry
> > about that.
> > 
> > I had already run the command:
> > smartctl -t long /dev/sdb
> > before I got your reply. The results should be ready soon.
> > 
> > I've been looking at my logs some more. I don't understand these
> > messages:
> > 
> > Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, cpu
> > clock throttled (total events = 455) 
> > Aug 17 10:30:50 steve kernel: CPU1: Temperature above threshold, cpu
> > clock throttled (total events = 455) 
> > Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal 
> > Aug 17 10:30:50 steve kernel: CPU0: Temperature/speed normal
> > 
> > These messages are repeated every hour or so. It seems unlikely that
> > every time the threshold is exceeded, it immediately (within one
> > second) drops back again. What is going on here?
> > 
> > The drive is an old IDE drive: WDC WD1600JB-00F
> > 
> > Thanks,
> > Steve
> 
> Well, the long self test passed.
> Here is the result of 
> # smartctl -a /dev/sdb
> smartctl 5.39.1 2010-01-28 r3054 [i386-redhat-linux-gnu] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Western Digital Caviar SE family
> Device Model:     WDC WD1600JB-00FUA0
> Serial Number:    WD-WCAES1024695
> Firmware Version: 15.05R15
> User Capacity:    160,041,885,696 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   6
> ATA Standard is:  Exact ATA specification draft version not indicated
> Local Time is:    Tue Aug 17 12:36:35 2010 EDT
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x85)	Offline data collection activity
> 					was aborted by an interrupting command from host.
> 					Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0)	The previous self-test routine completed
> 					without error or no self-test has ever 
> 					been run.
> Total time to complete Offline 
> data collection: 		 (5073) seconds.
> Offline data collection
> capabilities: 			 (0x79) SMART execute Offline immediate.
> 					No Auto Offline data collection support.
> 					Suspend Offline collection upon new
> 					command.
> 					Offline surface scan supported.
> 					Self-test supported.
> 					Conveyance Self-test supported.
> 					Selective Self-test supported.
> SMART capabilities:            (0x0003)	Saves SMART data before entering
> 					power-saving mode.
> 					Supports SMART auto save timer.
> Error logging capability:        (0x01)	Error logging supported.
> 					No General Purpose Logging support.
> Short self-test routine 
> recommended polling time: 	 (   2) minutes.
> Extended self-test routine
> recommended polling time: 	 (  67) minutes.
> Conveyance self-test routine
> recommended polling time: 	 (   5) minutes.
> 
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0007   146   142   021    Pre-fail  Always       -       3233
>   4 Start_Stop_Count        0x0032   099   099   040    Old_age   Always       -       1681
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
>   9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22478
>  10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1654
> 194 Temperature_Celsius     0x0022   116   253   000    Old_age   Always       -       34
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
> 199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       1
> 200 Multi_Zone_Error_Rate   0x0009   200   155   051    Pre-fail  Offline      -       0
> 
> SMART Error Log Version: 1
> No Errors Logged
> 
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed without error       00%       632         -
> # 2  Short offline       Completed without error       00%       696         -
> 
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute
> delay.
> 
> This doesn't make much sense to me. If the overall health status is
> PASSED, then why are all the vendor-specific threshold values exceeded?
> Am I reading that wrong?
> 
> Thanks,
> Steve

The drive seems OK.
I'd look at the machine's cooling (see my suggestion concerning
lm_sensors).
(Even the UDMA CRC error might be attributed to high temperature.)

- Gilboa
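
On the question of how to read the attribute table: the sketch below is
one way to check it by script, not anything posted in the thread. It
assumes the convention described in the smartctl documentation, where
VALUE is a normalized figure (higher is healthier) and an attribute only
counts as failed when VALUE drops to THRESH or below. The names
parse_attributes and failing are made up for illustration, and the sample
rows are copied from the output quoted above.

```python
from typing import List, NamedTuple

# A few rows copied from the "smartctl -a /dev/sdb" output quoted above.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22478
194 Temperature_Celsius     0x0022   116   253   000    Old_age   Always       -       34
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       1
"""


class Attribute(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized current value
    worst: int   # lowest normalized value recorded
    thresh: int  # failure threshold for the normalized value
    raw: str     # vendor-specific raw value, kept as text


def parse_attributes(text: str) -> List[Attribute]:
    """Parse rows of the smartctl attribute table (hypothetical helper)."""
    attrs = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs.append(Attribute(
            attr_id=int(fields[0]),
            name=fields[1],
            value=int(fields[3]),
            worst=int(fields[4]),
            thresh=int(fields[5]),
            raw=" ".join(fields[9:]),
        ))
    return attrs


def failing(attrs: List[Attribute]) -> List[Attribute]:
    """Assumed rule: an attribute has failed when VALUE <= THRESH."""
    return [a for a in attrs if a.value <= a.thresh]


if __name__ == "__main__":
    for a in parse_attributes(SAMPLE):
        status = "FAILING" if a.value <= a.thresh else "ok"
        print(f"{a.attr_id:3d} {a.name:<24} value={a.value:3d} "
              f"thresh={a.thresh:3d} raw={a.raw} {status}")
```

Read this way, every VALUE in the table sits above its THRESH, which is
consistent with the PASSED result. For attribute 194 the RAW_VALUE column
(34) is usually the one that holds degrees Celsius, though that
interpretation is vendor-specific.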

