On Thu, Feb 09, 2006 at 11:15:37AM -0800, Peter J. Stieber wrote:
PS = Peter Stieber
PS>> I have three different machines that have been having
PS>> problems with the last two FC4 kernels. The first is a
PS>> 733 MHz Pentium III with 256 MB RAM
PS>> and a 230 GB IDE disk. It acts as an Apache web
PS>> server and a Subversion server. The machine gets
PS>> into a mode where it prints the following on
PS>> the console (the numbers may be a little different):
PS>>
PS>> Normal per-cpu:
PS>> cpu 0 hot: low 62, high 186, batch 31, used 99
PS>> cpu 0 cold: low 0, high 62, batch 31, used 57
PS>> HighMem per-cpu: empty
PS>> Free pages: 2972kB (0kB HighMem)
PS>> Active:0 inactive:0 dirty:0 writeback:0 stable:0 free:743 slab:16038
PS>> mapped:22857 pagetables:22109
PS>> DMA: free:1080kB min:124kB low:152kB high:184kB active:120kB
PS>> inactive:264kB present:16384kB pages_scanned:1840433
PS>> all_unreclaimable? yes
PS>> lowmem_reserve[]: 0 239 239
PS>> Normal free:1892kB min:1916kB low:2392kB high:2872kB active:13492kB
PS>> inactive:13320kB present:245680kB pages_scanned:787375
PS>> all_unreclaimable? yes
PS>> lowmem_reserve[]: 0 0 0
PS>> HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB
PS>> present:0kB pages_scanned:0 all_unreclaimable? no
PS>> lowmem_reserve[]: 0 0 0
PS>> DMA: 0*4kB 1*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB
PS>> 1*1024kB 0*2048kB 0*4096kB = 1080kB
PS>> Normal: 1*4kB 0*8kB 2*16kB 2*32kB 0*64kB 4*128kB 1*256kB 0*512kB
PS>> 1*1024kB 0*2048kB 0*4096kB = 1892kB
PS>> HighMem: empty
PS>> Swap cache: add 222217, delete 221977, find 1020061/1033074, race 0+2
PS>> Free swap = 0kB
PS>> Total swap = 524280kB
PS>>
PS>> Notice the machine is out of swap space so there seems to be a
PS>> process that is running amuck, but I can't tell what it is. I have
PS>> to shut the machine off to make progress. This also occurred with
PS>> 1831.
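As a sanity check on the dump quoted above, here is a minimal Python sketch that re-adds the per-zone free lists and compares them with the "Free pages" line. The helper name buddy_total_kb and the 4 kB page size are assumptions of the sketch, not something the kernel output states directly, although 743 free pages at 4 kB each does match the reported 2972kB.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages (743 pages * 4 kB = 2972 kB in the dump)

def buddy_total_kb(line):
    """Sum the 'count*sizekB' terms of one per-zone free-list line."""
    # IGNORECASE so that both '1024kB' and '1024KB' are accepted.
    terms = re.findall(r"(\d+)\*(\d+)kB", line, flags=re.IGNORECASE)
    return sum(int(count) * int(size) for count, size in terms)

# Free-list lines copied from the console dump.
dma = ("0*4kB 1*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB "
       "1*1024KB 0*2048kB 0*4096kB")
normal = ("1*4kB 0*8kB 2*16kB 2*32kB 0*64kB 4*128kB 1*256kB 0*512kB "
          "1*1024KB 0*2048kB 0*4096kB")

dma_kb = buddy_total_kb(dma)
normal_kb = buddy_total_kb(normal)

print("DMA free list total (kB):   ", dma_kb)
print("Normal free list total (kB):", normal_kb)
print("Sum of both zones (kB):     ", dma_kb + normal_kb)
print("free:743 pages in kB:       ", 743 * PAGE_KB)
```

The totals agree with the "= 1080kB" and "= 1892kB" figures in the dump, so the numbers are at least internally consistent. Note also that the Normal zone's free:1892kB sits just below its min:1916kB.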
DJ = Dave Jones
DJ> When this gets printed, there should also be something in the logs
DJ> like "kernel: killed process foo", which should be the app that was
DJ> consuming your memory.
I'm looking in /var/log/messages. I think these are the kind of messages
you are asking about...
Feb 9 09:39:42 homer kernel: Out of Memory: Killed process 1686 (httpd).
There are a bunch related to the apache daemon, but there are others.
Feb 9 10:32:24 homer kernel: Out of Memory: Killed process 12852 (AutoCleanAll.sh).
This is a script of mine.
Feb 9 10:32:26 homer kernel: Out of Memory: Killed process 23035 (unicode_start).
There are a few of these unicode_start messages.
I also see the following just prior to the problem starting...
Feb 9 04:03:49 homer init: Trying to re-exec init
Feb 9 08:37:44 homer login(pam_unix)[1768]: session opened for user pstieber by (uid=0)
Feb 9 08:37:45 homer ainit: Memory: Failed to release semaphore
Feb 9 08:37:45 homer ainit: Error: No such file or directory
Feb 9 08:37:45 homer ainit: Memory: Failed to release SHM segment
Feb 9 08:37:45 homer ainit: Error: No such file or directory
Feb 9 08:37:45 homer ainit: Memory: Failed to release SHM segment
Feb 9 08:37:45 homer ainit: Error: No such file or directory
Feb 9 08:37:45 homer ainit: No such file or directory
Feb 9 08:37:45 homer ainit: Memory: Failed to release semaphore
Feb 9 08:37:45 homer ainit: Error: No such file or directory
Feb 9 08:37:45 homer ainit: Memory: Failed to release SHM segment
Feb 9 08:37:45 homer ainit: Error: No such file or directory
Feb 9 08:37:45 homer ainit: Memory: Failed to release SHM segment
Feb 9 08:37:45 homer ainit: Error: No such file or directory
Feb 9 08:37:45 homer ainit: No such file or directory
Feb 9 08:37:45 homer -- pstieber[1768]: LOGIN ON tty2 BY pstieber
Feb 9 08:43:13 homer sshd(pam_unix)[12680]: session opened for user pstieber by (uid=0)
Feb 9 08:43:46 homer sshd(pam_unix)[12680]: session closed for user pstieber
Feb 9 08:43:46 homer sshd(pam_unix)[12705]: session opened for user pstieber by (uid=0)
Feb 9 08:45:40 homer sshd(pam_unix)[12705]: session closed for user pstieber
Feb 9 09:39:33 homer kernel: oom-killer: gfp_mask=0x201d2, order=0
Feb 9 09:39:33 homer kernel: Mem-info:
Feb 9 09:39:33 homer kernel: DMA per-cpu:
The ainit messages above are ALSA related.
Here are all of the "Killed" messages from the dual Opteron machine...
Feb 8 17:27:40 maggie kernel: Out of Memory: Killed process 2572 (httpd).
Feb 8 17:31:55 maggie kernel: Out of Memory: Killed process 2573 (httpd).
Feb 8 17:31:58 maggie kernel: Out of Memory: Killed process 2574 (httpd).
Feb 8 17:32:02 maggie kernel: Out of Memory: Killed process 2575 (httpd).
Feb 8 17:39:25 maggie kernel: Out of Memory: Killed process 2576 (httpd).
Feb 8 17:41:00 maggie kernel: Out of Memory: Killed process 2577 (httpd).
Feb 8 17:41:04 maggie kernel: Out of Memory: Killed process 2578 (httpd).
Feb 8 17:41:10 maggie kernel: Out of Memory: Killed process 2579 (httpd).
Feb 8 17:41:13 maggie kernel: Out of Memory: Killed process 3232 (bash).
Feb 8 17:50:17 maggie kernel: Out of Memory: Killed process 10658 (BuildSlamemClus).
Feb 8 17:52:59 maggie kernel: Out of Memory: Killed process 24696 (unicode_start).
Feb 8 17:53:28 maggie kernel: Out of Memory: Killed process 26200 (unicode_start).
Feb 8 17:53:32 maggie kernel: Out of Memory: Killed process 3968 (unicode_start).
Feb 8 17:54:02 maggie kernel: Out of Memory: Killed process 3945 (unicode_start).
It looks like the kernel kills whatever processes happen to be running
when the system runs out of resources, so the listed processes may not
be the actual culprits. This makes sense because the machines never
recover once they are in this state. If the kernel had killed the
problem process, you would expect the machine to recover.
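To see at a glance which process names the OOM killer hit most often, a small Python sketch along these lines could tally the "Killed process" lines. The function name tally_oom_kills is hypothetical, and the regular expression assumes the exact message format shown in the excerpts above; other kernel versions may word it differently.

```python
import re
from collections import Counter

# Matches e.g. "Out of Memory: Killed process 2572 (httpd)."
KILLED = re.compile(r"Out of Memory: Killed process (\d+)\s+\(([^)]+)\)")

def tally_oom_kills(lines):
    """Count OOM kills per process name in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = KILLED.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts

# A few lines copied from the maggie excerpt above.
sample = [
    "Feb 8 17:27:40 maggie kernel: Out of Memory: Killed process 2572 (httpd).",
    "Feb 8 17:31:55 maggie kernel: Out of Memory: Killed process 2573 (httpd).",
    "Feb 8 17:41:13 maggie kernel: Out of Memory: Killed process 3232 (bash).",
    "Feb 8 17:52:59 maggie kernel: Out of Memory: Killed process 24696 (unicode_start).",
]

counts = tally_oom_kills(sample)
for name, n in counts.most_common():
    print(name, n)

# To run against the real log (typically needs root):
#   with open("/var/log/messages") as f:
#       print(tally_oom_kills(f).most_common())
```

A tally like this only shows which processes were chosen as victims. It does not by itself identify the process that consumed the memory.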
The third machine actually recovered from its problem. It was running
2.6.14-1.1656_FC4. It looks like CVS was causing the problem (this
machine is a CVS server), but that is just hard to believe.
Feb 6 07:09:20 marge kernel: Out of Memory: Killed process 28480 (cvs).
Feb 6 10:50:27 marge kernel: Out of Memory: Killed process 30501 (cvs).
PS>> The third machine is a Tyan dual Opteron machine.
PS>> It experienced the problem with 1831 when trying to
PS>> compile a large code.
PS>>
PS>> Has anyone experienced similar problems?
DJ> There were some leaks in the older kernels that
DJ> could explain it, but they usually manifested
DJ> themselves slightly differently. 1831 has them
DJ> fixed though (though there may still be some
DJ> undiscovered, and there's one known problem
DJ> that I was made aware of a day or so ago
DJ> where SELinux leaks memory. I'll get a test kernel
DJ> with that fix built after 2.6.15.4 is released).
Thanks. Is there anything else I can do to help debug?
Pete