Fedora Users — Re: Need help with Reboot cause

On Tue, Apr 07, 2009 at 10:41:43AM -0700, Peter J. Stieber wrote:
> PS = Pete Stieber
> PS>> I have a dual opteron system that has been acting as
> PS>> the worldly node for a small cluster of computers
> PS>> since September, 2004.  The machine is running the
> PS>> latest x86_64 Fedora 10 kernel that I recently loaded
> PS>> (April 2).  The machine reboots without warning.  I
> PS>> can't find the cause in log files (maybe I'm not
> PS>> looking in the correct log).
> PS>>
> PS>> I'm currently running memtest.  If all of the tests
> PS>> pass, could the community suggest other diagnostic
> PS>> tasks or information I could post to help diagnose the
> PS>> problem?
>
> m> Have you tried going back to the previous kernel?
>
> The machine is still running memtest (no errors so far), but I already  
> removed the prior kernel.  I did notice reboots with the prior kernel.  
> BTW my current kernel is 2.6.27.21-170.2.56.fc10.x86_64.
>
If it reboots with prior kernels, then I would do a thorough check of the hardware first. You may also want to look for issues reported against your particular hardware setup, since this may be a known problem.

> Reboots indicated by information in /var/log/messages...
>
> Sunday    March 29   4:08
> Tuesday   March 31   7:02
> Thursday  April  2  18:27 Intentional reboot due to new kernel
> Friday    April  3   1:36
> Sunday    April  5   1:37
> Sunday    April  5   2:48
> Sunday    April  5   9:43
> Sunday    April  5  13:20 as I was typing this email
>
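
It may help to look at how long the machine stayed up between those reboots. Here is a rough Python sketch that does the arithmetic on the times you listed. The year 2009 is taken from the date of this thread, and the function name is just something I made up for illustration.

```python
from datetime import datetime

# Reboot times copied from the list above (year assumed to be 2009).
REBOOTS = [
    "2009-03-29 04:08",
    "2009-03-31 07:02",
    "2009-04-02 18:27",  # intentional reboot for the new kernel
    "2009-04-03 01:36",
    "2009-04-05 01:37",
    "2009-04-05 02:48",
    "2009-04-05 09:43",
    "2009-04-05 13:20",
]


def uptimes_in_minutes(stamps):
    """Return the whole minutes elapsed between consecutive reboots."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M") for s in stamps]
    return [int((b - a).total_seconds() // 60) for a, b in zip(times, times[1:])]


# Print each reboot with the uptime that preceded it.
for stamp, minutes in zip(REBOOTS[1:], uptimes_in_minutes(REBOOTS)):
    print("%s  after %dh %02dm" % (stamp, minutes // 60, minutes % 60))
```

If the gaps keep shrinking, as the four reboots on April 5 suggest, that would be one more hint toward hardware rather than the kernel.
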
> m> Did you check dmesg and /var/log/messages?
>
> Yes.  I can see reboots, but not the cause.
>
> m> Does it boot normally and then just fail at some random
> m> interval or is it consistently failing at the same point?
>
> I have had top running during a few of the reboots.  I have forced a  
> couple of them by starting my nightly build process.  The linker/loader  
> has been running during some of the reboots...

>
> top - 13:19:53 up  3:36,  6 users,  load average: 1.27, 2.70, 2.32
> Tasks: 138 total,   6 running, 132 sleeping,   0 stopped,   0 zombie
> Cpu(s): 40.8%us, 13.8%sy,  0.0%ni, 42.5%id,  2.7%wa,  0.0%hi,  0.3%si,  0.0%st
> Mem:   2060232k total,  1683996k used,   376236k free,   164484k buffers
> Swap:  2031608k total,       56k used,  2031552k free,  1230796k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  8878 pstieber  20   0 34552  25m 1096 R  7.6  1.3   0:00.23 ld
>  8884 pstieber  20   0 48284  27m 1080 R  5.0  1.4   0:00.15 ld
>     7 root      15  -5     0    0    0 S  0.3  0.0   0:00.17 ksoftirqd/1
> 22427 pstieber  20   0 14880 1208  872 R  0.3  0.1   0:03.49 top
>     1 root      20   0  4096  876  616 S  0.0  0.0   0:00.71 init
>
> Another instance
>
> top - 06:55:13 up 17:34,  2 users,  load average: 2.83, 2.59, 1.86
> Tasks: 127 total,   2 running, 125 sleeping,   0 stopped,   0 zombie
> Cpu(s): 45.1%us,  4.7%sy,  0.0%ni, 49.8%id,  0.5%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:   2060232k total,  1763404k used,   296828k free,   177052k buffers
> Swap:  2031608k total,       56k used,  2031552k free,  1271964k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5757 pstieber  20   0 79788  69m 1080 R 12.3  3.5   0:00.37 ld
>     1 root      20   0  4096  876  616 S  0.0  0.0   0:00.68 init
>     2 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 kthreadd
>
> I'm not sure this is always the case.
>
Might be worth finding out...
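
One way to find out, instead of watching top by hand, is to write a small process snapshot to disk every few seconds and force it out with fsync, so the last entry before a reset shows what was running. This is only a sketch: the function names, the log path in the usage note, and the 5 second interval are my own choices, and fsync cannot promise that the final write survives a hard reset.

```python
import os
import subprocess
import time


def append_durably(path, text):
    """Append text to path and ask the kernel to push it to disk."""
    with open(path, "a") as fh:
        fh.write(text)
        fh.flush()
        os.fsync(fh.fileno())


def snapshot():
    """Return a timestamped listing of the ten busiest processes."""
    ps = subprocess.run(
        ["ps", "-eo", "pid,user,pcpu,pmem,comm", "--sort=-pcpu"],
        capture_output=True, text=True, check=False,
    )
    lines = ps.stdout.splitlines()[:11]  # header line plus ten processes
    with open("/proc/loadavg") as fh:
        load = fh.read().strip()
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return "=== %s load %s\n%s\n" % (stamp, load, "\n".join(lines))


def watch(log_path, take_snapshot=snapshot, interval=5, iterations=None):
    """Append a snapshot every `interval` seconds; run forever if iterations is None."""
    done = 0
    while iterations is None or done < iterations:
        append_durably(log_path, take_snapshot())
        time.sleep(interval)
        done += 1
```

Calling something like watch("/var/tmp/reboot-watch.log") before starting the nightly build (the path is hypothetical) and reading the tail of that file after the next reboot should show whether ld is always running at that moment.
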

> m> Other things you may consider:
> m> CPU type?
>
> Motherboard: Tyan Thunder K8W (S2885ANRF)
> CPUs: Dual Opteron 244 (1.8 GHz) processors
> Memory: 2 GB   4-512MB  CT6472Y40B  DDR PC3200 from Crucial
>
> m> temperature?
>
> Is there a command to monitor this while running the OS?

There is a GNOME widget for this (or there was), and it required some configuration. I am not sure how to go about it from the CLI, but the BIOS usually shows the temperature, and that will be good enough to start with.
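
If a hardware-monitoring driver is loaded for your board, the kernel exposes the readings under /sys/class/hwmon as temp*_input files in millidegrees Celsius, and the lm_sensors package reads the same data with its sensors command. Whether a suitable driver exists for the Tyan board and the Opteron 244s is an assumption I have not checked. A minimal Python sketch that reads whatever is there (the function name is mine):

```python
import glob
import os


def read_temps(root="/sys/class/hwmon"):
    """Return {sensor file path: degrees Celsius} for every temp*_input found."""
    temps = {}
    # Depending on the kernel version, the files sit directly in hwmonN/
    # or one level down in hwmonN/device/, so look in both places.
    patterns = [
        os.path.join(root, "hwmon*", "temp*_input"),
        os.path.join(root, "hwmon*", "device", "temp*_input"),
    ]
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as fh:
                    millidegrees = int(fh.read().strip())
            except (OSError, ValueError):
                continue  # sensor not readable right now; skip it
            temps[path] = millidegrees / 1000.0
    return temps
```

An empty result from read_temps() just means no sensor driver is loaded, in which case the BIOS reading is still the place to start.
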
>
> m> potential hard drive issue?
>
> I have 3 SATA drives running.  It's been so long since I have done this,  
> but how does one manually do a disk check?
>

I think you would do better with a dedicated hard drive test like the one Hitachi makes available, and I was forgetting about smartctl! Still, two sets of independent results are better than one, so maybe do both if you have the time. I usually start with the Hitachi tool (it works with non-Hitachi drives), and if that passes I move on to try other things, but fsck first.

man fsck
man smartctl
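
For the smartctl part, the exit status is a bit mask, so a script can tell you which of the three drives looks suspicious without you reading all the output. The sketch below is in Python. The bit meanings are paraphrased from the smartctl man page, so check them against your local copy, and the device names in the usage note are assumptions about how your SATA drives are named.

```python
import subprocess

# Meaning of bits 0..7 of smartctl's exit status (paraphrased from the man page).
EXIT_BITS = [
    "command line did not parse",
    "device open failed",
    "a SMART command to the disk failed",
    "SMART status check returned DISK FAILING",
    "prefail attributes at or below threshold",
    "attributes were at or below threshold in the past",
    "device error log contains errors",
    "self-test log contains errors",
]


def decode_exit_status(status):
    """Return the messages for every bit set in a smartctl exit status."""
    return [msg for bit, msg in enumerate(EXIT_BITS) if status & (1 << bit)]


def check_drive(device):
    """Run `smartctl -a DEVICE` (needs root) and decode its exit status."""
    proc = subprocess.run(
        ["smartctl", "-a", device],
        capture_output=True, text=True, check=False,
    )
    return decode_exit_status(proc.returncode)
```

Running check_drive() on each of /dev/sda, /dev/sdb and /dev/sdc (assumed names) returns an empty list for a drive that reports no problems.
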
> m> any new hardware attached or installed recently?
>
> No
>
> m> Notice any power surges or brownouts?
>
> The machine is on a UPS that deals with this.
>
> m> any other nodes having issues?
>
> No and they are not on UPSs.  They also do not have as large of a work load.
>
> The machine in question is used for nightly builds and regression tests.  
>  I use distcc with the compute nodes to perform the builds.
>
> The machine also runs samba to provide a network share to Windows users  
> and provides authentication using Windows domain accounts.
>
> m> Recent power surge zapped a board, DSL modem,
> m> and the surge protector.
>
> I doubt this is the problem.
>
> Memtest made it through the first pass of all tests successfully.

Be sure to let it run for as long as you can; 12 to 24 hours would be ideal. Some errors don't show up right away, or only appear with continuous use.
>
> Thanks for the suggestions, especially considering my vague information.
>

>
> -- 
> fedora-list mailing list
> fedora-list@xxxxxxxxxx
> To unsubscribe: https://www.redhat.com/mailman/listinfo/fedora-list
> Guidelines: http://fedoraproject.org/wiki/Communicate/MailingListGuidelines

-- 
"Any fool can know. The point is to understand" --Albert Einstein

Bored??
http://fiction.wikia.com/wiki/Fuqwit1.0

http://fiction.wikia.com/wiki/Coding_the_Magic_into_the_Eight_Ball
