Tod Merley wrote:
I do not know that I make the cut of "highly technical" but when I
hear "fail under heavy load" I first think of heat. It might be
interesting to make a little cron job which simply logs CPU and MB
temps every minute or so. Maybe check the heat sink (well seated??).
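
A minimal sketch of the per-minute temperature log suggested above,
assuming the kernel exposes sensors through the hwmon sysfs files
(/sys/class/hwmon/hwmon*/temp*_input, in millidegrees Celsius). On some
kernels those files sit one level deeper, under hwmon*/device/, so the
glob may need adjusting. The function names and the script and log paths
are made up for illustration.

```python
import glob
import os
import time


def read_temps(sys_root="/sys/class/hwmon"):
    """Return {"hwmonN/tempM": degrees_C} for every readable sensor file."""
    temps = {}
    pattern = os.path.join(sys_root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue  # sensor missing or returned garbage; skip it
        chip = os.path.basename(os.path.dirname(path))
        sensor = os.path.basename(path)[:-len("_input")]
        temps[chip + "/" + sensor] = millideg / 1000.0
    return temps


def format_line(temps, now=None):
    """Build one log line: a timestamp followed by name=value pairs."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    readings = " ".join(
        "%s=%.1fC" % (name, value) for name, value in sorted(temps.items())
    )
    return (stamp + " " + readings).rstrip()


if __name__ == "__main__":
    # Print one line; let cron handle the scheduling and the log file.
    print(format_line(read_temps()))
```

A crontab entry along these lines would run it once a minute (both paths
are hypothetical):

* * * * * /usr/bin/python3 /usr/local/bin/templog.py >> /var/log/temps.log 2>&1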
These new servers run really cool compared to the older Xeons.
Also, the machine is in a Sungard data center, where temps are very cold.
The machine did have an IPMI card in it, so I could monitor temps, and
that reported cold. I also set the fans to full speed just in case.
Apparently, the new FB-DIMM memory gets very hot. However, I know 100%
that this is not memory related, because the previous week I had to
replace 160G of FB-DIMM memory in 10 servers because of memory problems,
which was approved by Intel (on the Intel web site, only 5 companies are
approved).
(I won't name the company, because they admitted the problem was their
RAM.) Supermicro then got me 160G of RAM and tested it all before
shipping it to me, and that fixed the RAM errors.
I am in the process of making a little script to snap a picture of
what is going on in /var/log (and many parts of the machine as well).
I will post here when done if it still seems appropriate.
I look forward to it.
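
While waiting for the real script, here is a rough sketch of what a
/var/log snapshot could look like. It is only a guess at the idea, not
Tod's script: it prints the last few lines of every regular file in one
directory. The function names, the 20-line default and the 8 KB read
limit are arbitrary choices for illustration.

```python
import os
import time


def tail(path, n=20, max_bytes=8192):
    """Return the last n lines of a file, reading at most max_bytes."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - max_bytes))
        data = f.read()
    # Binary or rotated files will decode to junk, but will not crash.
    return data.decode("utf-8", errors="replace").splitlines()[-n:]


def snapshot(log_dir="/var/log", n=20):
    """Return one text report with the tail of each file in log_dir."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    parts = ["=== snapshot of %s at %s ===" % (log_dir, stamp)]
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories, sockets and so on
        try:
            size = os.path.getsize(path)
            lines = tail(path, n)
        except OSError as err:
            parts.append("--- %s: unreadable (%s) ---" % (name, err))
            continue
        parts.append("--- %s (%d bytes) ---" % (name, size))
        parts.extend(lines)
    return "\n".join(parts)


if __name__ == "__main__":
    print(snapshot())
```

Run as root from cron or by hand just after a failure, with the output
redirected to a file on another disk or machine, so the snapshot survives
if the box goes down again.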
For the truly dedicated to knowing "WHY?", the Linux crash dump kit:
Hopefully I won't have to go this far :(
# One of many!
http://lkcd.sourceforge.net/
Good Hunting!
Tod
Thanks
Albert.