Fedora Users — Re: kernel crash

On Tue, 17 Aug 2010 12:05:44 -0400
Steve Blackwell <zephod@xxxxxxxxxx> wrote:

> On Tue, 17 Aug 2010 18:07:18 +0300
> Gilboa Davara <gilboad@xxxxxxxxx> wrote:
> 
> > On Tue, 2010-08-17 at 09:44 -0400, Steve Blackwell wrote:
> > > I leave my computer on 24/7 so that my backups can run at night.
> > > Lately, it has been crashing during the night usually leaving no
> > > trace of what happened. Last night it crashed but left this
> > > in /var/log/messages:
> > > 
> > > Aug 17 01:04:56 steve kernel: INFO: task kjournald:1960 blocked for more than 120 seconds.
> > > Aug 17 01:04:56 steve kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > 
> > > Could a hard drive get shut down because it was getting too hot?
> > > What would be a normal temp for a hard drive that has just
> > > completed a backup? 124C seems really hot. The HD cooling fan
> > > had been broken so I replaced it this past weekend but it
> > > doesn't seem to have helped. Too late? Permanent HD damage
> > > already done?
> > > Any other comments or suggestions?
> > 
> > Hello Steve,
> > 
> > This is not a crash.
> > > The kjournald kernel process (which handles various file-system
> > > tasks) was blocked for more than 120 seconds. Your assumption that
> > > the HD went into some type of sleep/suspend mode during a write
> > > sounds reasonable to me.
> > 
> > > 124C seems -very- hot, even during heavy I/O.
> > > Two things spring to mind:
> > A. Is it a normal desktop SATA drive or high-speed SCSI/SAS drive?
> > B. Please post the SMART log of the drive. (smartctl -a /dev/sdX). 
> > 
> > - Gilboa
> > 
> 
> Hello Gilboa,
> 
> Yes I realize that it was not a crash. When I first saw the kernel
> messages I thought it was and started writing the e-mail. I neglected
> to correct the subject line after I actually read the messages. Sorry
> about that.
> 
> I had already run the command:
> smartctl -t long /dev/sdb
> before I got your reply. The results should be ready soon.
> 
> I've been looking at my logs some more. I don't understand these
> messages:
> 
> Aug 17 10:30:50 steve kernel: CPU0: Temperature above threshold, cpu
> clock throttled (total events = 455) 
> Aug 17 10:30:50 steve kernel: CPU1: Temperature above threshold, cpu
> clock throttled (total events = 455) 
> Aug 17 10:30:50 steve kernel: CPU1: Temperature/speed normal 
> Aug 17 10:30:50 steve kernel: CPU0: Temperature/speed normal
> 
> These messages are repeated every hour or so. It seems unlikely that
> every time the threshold is exceeded, it immediately (within one
> second) drops back again. What is going on here?
> 
> The drive is an old IDE drive: WDC WD1600JB-00F
> 
> Thanks,
> Steve

Well, the long self test passed.
Here is the result of 
# smartctl -a /dev/sdb
smartctl 5.39.1 2010-01-28 r3054 [i386-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar SE family
Device Model:     WDC WD1600JB-00FUA0
Serial Number:    WD-WCAES1024695
Firmware Version: 15.05R15
User Capacity:    160,041,885,696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Aug 17 12:36:35 2010 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85)	Offline data collection activity
					was aborted by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 (5073) seconds.
Offline data collection
capabilities: 			 (0x79) SMART execute Offline immediate.
					No Auto Offline data collection support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					No General Purpose Logging support.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 (  67) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   146   142   021    Pre-fail  Always       -       3233
  4 Start_Stop_Count        0x0032   099   099   040    Old_age   Always       -       1681
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22478
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1654
194 Temperature_Celsius     0x0022   116   253   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0012   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0009   200   155   051    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       632         -
# 2  Short offline       Completed without error       00%       696         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute
delay.

This doesn't make much sense to me. If the overall health status is PASSED, then why are all the vendor-specific threshold values exceeded? Am I reading that wrong?
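
For reference when reading the table: as far as the smartctl man page describes it, VALUE is a normalized figure where higher is better, and an attribute counts as failed only when VALUE drops to or below THRESH. A "-" in the WHEN_FAILED column means the attribute has not failed. The sketch below applies that rule to a few rows copied from the output above. The names parse_attributes and failing are made up for illustration and are not part of smartmontools.

```python
import re

# A subset of the attribute rows from the smartctl -a output in this message.
SMART_TABLE = """\
  1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       22478
194 Temperature_Celsius     0x0022   116   253   000    Old_age   Always       -       34
199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       1
"""

# Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)


def parse_attributes(text):
    """Hypothetical helper: turn pasted attribute rows into dicts."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m["id"]),
                "name": m["name"],
                "value": int(m["value"]),
                "worst": int(m["worst"]),
                "thresh": int(m["thresh"]),
                "raw": int(m["raw"]),
            })
    return rows


def failing(rows):
    """Attributes whose normalized VALUE is at or below THRESH."""
    return [r for r in rows if r["value"] <= r["thresh"]]


if __name__ == "__main__":
    rows = parse_attributes(SMART_TABLE)
    for r in rows:
        margin = r["value"] - r["thresh"]
        print(f'{r["id"]:3d} {r["name"]:<24} margin={margin:4d} raw={r["raw"]}')
    print("failing:", [r["name"] for r in failing(rows)])
```

Under that reading, a VALUE above THRESH is the healthy case, which would be consistent with the PASSED overall result. The drive temperature in degrees Celsius is in the RAW_VALUE column of attribute 194, not in VALUE.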

Thanks,
Steve