On Tue, Apr 07, 2009 at 10:41:43AM -0700, Peter J. Stieber wrote:
> PS = Pete Stieber
> PS>> I have a dual opteron system that has been acting as
> PS>> the worldly node for a small cluster of computers
> PS>> since September, 2004. The machine is running the
> PS>> latest x86_64 Fedora 10 kernel that I recently loaded
> PS>> (April 2). The machine reboots without warning. I
> PS>> can't find the cause in log files (maybe I'm not
> PS>> looking in the correct log).
> PS>>
> PS>> I'm currently running memtest. If all of the tests
> PS>> pass, could the community suggest other diagnostic
> PS>> tasks or information I could post to help diagnose the
> PS>> problem?
>
> m> Have you tried going back to the previous kernel?
>
> The machine is still running memtest (no errors so far), but I already
> removed the prior kernel. I did notice reboots with the prior kernel.
> BTW my current kernel is 2.6.27.21-170.2.56.fc10.x86_64.
>

If it also reboots with prior kernels, then I would do a thorough check
of the hardware first. You may also want to look for issues reported
against your particular hardware setup, since this may be a known
problem.

> Reboots indicated by information in /var/log/messages...
>
> Sunday March 29 4:08
> Tuesday March 31 7:02
> Thursday April 2 18:27 Intentional reboot due to new kernel
> Friday April 3 1:36
> Sunday April 5 1:37
> Sunday April 5 2:48
> Sunday April 5 9:43
> Sunday April 5 13:20 as I was typing this email
>
> m> Did you check dmesg and /var/log/messages?
>
> Yes. I can see reboots, but not the cause.
>
> m> Does it boot normally and then just fail at some random
> m> interval or is it consistently failing at the same point?
>
> I have had top running during a few of the reboots. I have forced a
> couple of them by starting my nightly build process. The linker/loader
> has been running during some of the reboots...
>
> top - 13:19:53 up  3:36,  6 users,  load average: 1.27, 2.70, 2.32
> Tasks: 138 total,   6 running, 132 sleeping,   0 stopped,   0 zombie
> Cpu(s): 40.8%us, 13.8%sy,  0.0%ni, 42.5%id,  2.7%wa,  0.0%hi,  0.3%si,  0.0%st
> Mem:   2060232k total,  1683996k used,   376236k free,   164484k buffers
> Swap:  2031608k total,       56k used,  2031552k free,  1230796k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  8878 pstieber  20   0 34552  25m 1096 R  7.6  1.3   0:00.23 ld
>  8884 pstieber  20   0 48284  27m 1080 R  5.0  1.4   0:00.15 ld
>     7 root      15  -5     0    0    0 S  0.3  0.0   0:00.17 ksoftirqd/1
> 22427 pstieber  20   0 14880 1208  872 R  0.3  0.1   0:03.49 top
>     1 root      20   0  4096  876  616 S  0.0  0.0   0:00.71 init
>
> Another instance
>
> top - 06:55:13 up 17:34,  2 users,  load average: 2.83, 2.59, 1.86
> Tasks: 127 total,   2 running, 125 sleeping,   0 stopped,   0 zombie
> Cpu(s): 45.1%us,  4.7%sy,  0.0%ni, 49.8%id,  0.5%wa,  0.0%hi,  0.0%si,  0.0%st
> Mem:   2060232k total,  1763404k used,   296828k free,   177052k buffers
> Swap:  2031608k total,       56k used,  2031552k free,  1271964k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  5757 pstieber  20   0 79788  69m 1080 R 12.3  3.5   0:00.37 ld
>     1 root      20   0  4096  876  616 S  0.0  0.0   0:00.68 init
>     2 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 kthreadd
>
> I'm not sure this is always the case.
>

Might be worth finding out...

> m> Other things you may consider:
> m> CPU type?
>
> Motherboard: Tyan Thunder K8W (S2885ANRF)
> CPUs: Dual Opteron 244 (1.8 GHz) processors
> Memory: 2 GB 4-512MB CT6472Y40B DDR PC3200 from Crucial
>
> m> temperature?
>
> Is there a command to monitor this while running the OS?

There is (or was) a GNOME widget for this, and it required some
configuration. From the CLI I am not sure how to go about it, but the
BIOS usually shows the temperature, and that will be good enough to
start with.

>
> m> potential hard drive issue?
>
> I have 3 SATA drives running. It's been so long since I have done this,
> but how does one manually do a disk check?
>

I think you would do better with a dedicated hard drive test like the
one Hitachi makes available, but I am forgetting about smartctl! Still,
two sets of independent results are better than one, so maybe do both if
you have the time. I usually start with the Hitachi tool (it works with
non-Hitachi drives) and, if that passes, I move on to other things,
starting with fsck.

man fsck
man smartctl

> m> any new hardware attached or installed recently?
>
> No
>
> m> Notice any power surges or brownouts?
>
> The machine is on a UPS that deals with this.
>
> m> any other nodes having issues?
>
> No and they are not on UPSs. They also do not have as large of a work load.
>
> The machine in question is used for nightly builds and regression tests.
> I use distcc with the compute nodes to perform the builds.
>
> The machine also runs samba to provide a network share to Windows users
> and provides authentication using Windows domain accounts.
>
> m> Recent power surge zapped a board, DSL modem,
> m> and the surge protector.
>
> I doubt this is the problem.
>
> Memtest made it through the first pass of all tests successfully.

Be sure to let it run for as long as you can; 12 to 24 hours would be
ideal. Some errors don't show up right away, or show up only with
continuous use.

>
> Thanks for the suggestions, especially considering my vague information.
>
>
> --
> fedora-list mailing list
> fedora-list@xxxxxxxxxx
> To unsubscribe: https://www.redhat.com/mailman/listinfo/fedora-list
> Guidelines: http://fedoraproject.org/wiki/Communicate/MailingListGuidelines

--
"Any fool can know. The point is to understand" --Albert Einstein

Bored??
http://fiction.wikia.com/wiki/Fuqwit1.0
http://fiction.wikia.com/wiki/Coding_the_Magic_into_the_Eight_Ball
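
P.S. On reading temperatures while the OS is running: here is a minimal sketch, not something tested on this hardware. It assumes the kernel exposes sensors through the hwmon sysfs interface, where files named temp*_input hold millidegrees Celsius, and that a sensor driver for the board is actually loaded. Whether one exists for the Tyan S2885 is not something I have checked. The function name read_hwmon_temps is made up for illustration.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {sensor file path: degrees Celsius} for each temp*_input file.

    Assumption: sensors are exposed via the hwmon sysfs interface and
    report millidegrees Celsius. Nothing is returned if no driver is loaded.
    """
    temps = {}
    base = Path(root)
    # Depending on the kernel version, the files sit directly under
    # hwmonN/ or one level down under hwmonN/device/, so try both.
    for pattern in ("hwmon*/temp*_input", "hwmon*/device/temp*_input"):
        for sensor in sorted(base.glob(pattern)):
            try:
                raw = sensor.read_text().strip()
                temps[str(sensor)] = int(raw) / 1000.0  # millidegrees -> degrees
            except (OSError, ValueError):
                # Sensor unreadable or reporting a non-numeric value; skip it.
                continue
    return temps


if __name__ == "__main__":
    readings = read_hwmon_temps()
    if not readings:
        print("no hwmon temperature sensors found (is a sensor driver loaded?)")
    for path, celsius in readings.items():
        print(f"{path}: {celsius:.1f} C")
```

Running it in a loop during a nightly build would show whether the temperature climbs before a reboot. If it prints nothing, the BIOS reading is still the fallback.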