Fedora Users — Re: Problems with FC4 kernels 1658 and 1831

On Thu, Feb 09, 2006 at 11:15:37AM -0800, Peter J. Stieber wrote:

PS = Peter Stieber
PS>> I have three different machines that have been having
PS>> problems with the last two FC4 kernels. The first is a
PS>> 733 MHz Pentium III with 256 MB RAM
PS>> and a 230 GB IDE disk. It acts as an Apache web
PS>> server and a subversion server. The machine gets
PS>> into a mode where it prints the following on
PS>> the console (the numbers may be a little different):
PS>>
PS>> Normal per-cpu:
PS>> cpu 0 hot: low 62, high 186, batch 31, used 99
PS>> cpu 0 cold: low 0, high 62, batch 31, used 57
PS>> HighMem per-cpu: empty
PS>> Free pages:           2972kB (0kB HighMem)
PS>> Active:0 inactive:0 dirty:0 writeback:0 stable:0 free:743 slab:16038
PS>> mapped:22857 pagetables:22109
PS>> DMA: free:1080kB  min:124kB low:152kB high:184kB active:120kB
PS>> inactive:264kB present:16384kB pages_scanned:1840433
PS>> all_unreclaimable? yes
PS>> lowmem_reserve[]: 0 239 239
PS>> Normal free:1892kB min:1916kB low:2392kB high:2872kB active:13492kB
PS>> inactive:13320kB present:245680kB pages_scanned:787375
PS>> all_unreclaimable? yes
PS>> lowmem_reserve[]: 0 0 0
PS>> HighMem free:0kB min:128kB low:160kB high:192kB active:0kB inactive:0kB
PS>> present:0kB pages_scanned:0 all_unreclaimable? no
PS>> lowmem_reserve[]: 0 0 0
PS>> DMA: 0*4kB 1*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024KB
PS>> 0*2048kB 0*4096kB = 1080kB
PS>> Normal: 1*4kB 0*8kB 2*16kB 2*32kB 0*64kB 4*128kB 1*256kB 0*512kB
PS>> 1*1024KB 0*2048kB 0*4096kB = 1892kB
PS>> HighMem: empty
PS>> Swap cache: add 222217, delete 221977, find 1020061/1033074, race 0+2
PS>> Free swap = 0kB
PS>> Total swap = 524280kB
PS>>

PS>> Notice the machine is out of swap space so there seems to be a process
PS>> that is running amuck, but I can't tell what it is. I have to shut the
PS>> machine off to make progress. This also occurred with 1831.
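[Editorial sketch, not part of the original message.] The per-zone lines in the dump above list free blocks by size, and their sum should match both the per-zone "free:" figure and the "Free pages:" total. A minimal Python check of that arithmetic follows. The helper name buddy_free_kb is hypothetical, the two zone strings are the quoted lines rejoined onto one line each, and the 4 kB page size is an assumption (the usual value on 32-bit x86).

```python
import re

# One "count*size" term, e.g. "1*1024KB"; the dump mixes "kB" and "KB".
TERM = re.compile(r"(\d+)\*(\d+)[kK]B")


def buddy_free_kb(line):
    """Sum count*size over the free-block terms of one zone line."""
    return sum(int(count) * int(size) for count, size in TERM.findall(line))


# Zone lines copied from the quoted console output, rejoined onto one line.
dma = ("DMA: 0*4kB 1*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 0*512kB "
       "1*1024KB 0*2048kB 0*4096kB = 1080kB")
normal = ("Normal: 1*4kB 0*8kB 2*16kB 2*32kB 0*64kB 4*128kB 1*256kB 0*512kB "
          "1*1024KB 0*2048kB 0*4096kB = 1892kB")

PAGE_KB = 4        # assumed page size
FREE_PAGES = 743   # "free:743" from the Active/inactive line

dma_kb = buddy_free_kb(dma)
normal_kb = buddy_free_kb(normal)
total_kb = dma_kb + normal_kb

print(dma_kb, normal_kb, total_kb, FREE_PAGES * PAGE_KB)
```

Under those assumptions the figures agree with each other: the two zones together hold just under 3 MB of free memory on a 256 MB machine, and the Normal zone is already below its "min" watermark (1892kB free against min:1916kB).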

DJ = Dave Jones

DJ> When this gets printed, there should also be something in the logs like
DJ> "kernel: killed process foo", which should be the app that was consuming
DJ> your memory.

I'm looking in /var/log/messages. I think this is the type of message you are asking about...

Feb  9 09:39:42 homer kernel: Out of Memory: Killed process 1686 (httpd).

There are a bunch of these for the Apache daemon, but there are others.

Feb  9 10:32:24 homer kernel: Out of Memory: Killed process 12852 (AutoCleanAll.sh).

This is a script of mine.

Feb  9 10:32:26 homer kernel: Out of Memory: Killed process 23035 (unicode_start).

There are a few of these unicode_start messages.

I also see the following just prior to the problem starting...

Feb  9 04:03:49 homer init: Trying to re-exec init

Feb  9 08:37:44 homer login(pam_unix)[1768]: session opened for user pstieber by (uid=0)

Feb  9 08:37:45 homer ainit: Memory: Failed to release semaphore
Feb  9 08:37:45 homer ainit: Error: No such file or directory
Feb  9 08:37:45 homer ainit: Memory: Failed to release SHM segment
Feb  9 08:37:45 homer ainit: Error: No such file or directory
Feb  9 08:37:45 homer ainit: Memory: Failed to release SHM segment
Feb  9 08:37:45 homer ainit: Error: No such file or directory
Feb  9 08:37:45 homer ainit: No such file or directory
Feb  9 08:37:45 homer ainit: Memory: Failed to release semaphore
Feb  9 08:37:45 homer ainit: Error: No such file or directory
Feb  9 08:37:45 homer ainit: Memory: Failed to release SHM segment
Feb  9 08:37:45 homer ainit: Error: No such file or directory
Feb  9 08:37:45 homer ainit: Memory: Failed to release SHM segment
Feb  9 08:37:45 homer ainit: Error: No such file or directory
Feb  9 08:37:45 homer ainit: No such file or directory
Feb  9 08:37:45 homer  -- pstieber[1768]: LOGIN ON tty2 BY pstieber

Feb  9 08:43:13 homer sshd(pam_unix)[12680]: session opened for user pstieber by (uid=0)
Feb  9 08:43:46 homer sshd(pam_unix)[12680]: session closed for user pstieber
Feb  9 08:43:46 homer sshd(pam_unix)[12705]: session opened for user pstieber by (uid=0)
Feb  9 08:45:40 homer sshd(pam_unix)[12705]: session closed for user pstieber

Feb  9 09:39:33 homer kernel: oom-killer: gfp_mask=0x201d2, order=0
Feb  9 09:39:33 homer kernel: Mem-info:
Feb  9 09:39:33 homer kernel: DMA per-cpu:

That's Alsa related.

Here are all of the "Killed" messages from the dual Opteron machine...

Feb  8 17:27:40 maggie kernel: Out of Memory: Killed process 2572 (httpd).
Feb  8 17:31:55 maggie kernel: Out of Memory: Killed process 2573 (httpd).
Feb  8 17:31:58 maggie kernel: Out of Memory: Killed process 2574 (httpd).
Feb  8 17:32:02 maggie kernel: Out of Memory: Killed process 2575 (httpd).
Feb  8 17:39:25 maggie kernel: Out of Memory: Killed process 2576 (httpd).
Feb  8 17:41:00 maggie kernel: Out of Memory: Killed process 2577 (httpd).
Feb  8 17:41:04 maggie kernel: Out of Memory: Killed process 2578 (httpd).
Feb  8 17:41:10 maggie kernel: Out of Memory: Killed process 2579 (httpd).
Feb  8 17:41:13 maggie kernel: Out of Memory: Killed process 3232 (bash).
Feb  8 17:50:17 maggie kernel: Out of Memory: Killed process 10658 (BuildSlamemClus).
Feb  8 17:52:59 maggie kernel: Out of Memory: Killed process 24696 (unicode_start).
Feb  8 17:53:28 maggie kernel: Out of Memory: Killed process 26200 (unicode_start).
Feb  8 17:53:32 maggie kernel: Out of Memory: Killed process 3968 (unicode_start).
Feb  8 17:54:02 maggie kernel: Out of Memory: Killed process 3945 (unicode_start).

It looks like the kernel tries to kill any process that is running when the system runs out of resources, so the culprit may not be one of the listed processes. This makes sense because the machines never recover once they are in this state. If it had killed the problem process, you would think the machine would recover.
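[Editorial sketch, not part of the original message.] To see at a glance which process names the OOM killer hit most often, the "Killed process" lines can be tallied with a few lines of Python. This is only a sketch: the function name tally_killed is hypothetical, and the sample lines are a subset of the messages quoted in this post. The pattern accepts the PID and name with or without a space between them, since both spellings appear here.

```python
import re
from collections import Counter

# Matches "Killed process 1686(httpd)" as well as "Killed process 28480 (cvs)".
KILLED = re.compile(r"Out of Memory: Killed process (\d+)\s*\(([^)]+)\)")


def tally_killed(lines):
    """Count OOM-killer victims by process name across syslog lines."""
    counts = Counter()
    for line in lines:
        # findall also copes with several messages run together on one line.
        for _pid, name in KILLED.findall(line):
            counts[name] += 1
    return counts


# A subset of the messages quoted in this post.
sample = [
    "Feb  8 17:27:40 maggie kernel: Out of Memory: Killed process 2572(httpd).",
    "Feb  8 17:31:55 maggie kernel: Out of Memory: Killed process 2573(httpd).",
    "Feb  8 17:41:13 maggie kernel: Out of Memory: Killed process 3232(bash).",
    "Feb  8 17:52:59 maggie kernel: Out of Memory: Killed process 24696(unicode_start).",
    "Feb  6 07:09:20 marge kernel: Out of Memory: Killed process 28480 (cvs).",
]

for name, count in tally_killed(sample).most_common():
    print(f"{count:3d}  {name}")
```

The same function accepts an open file, for example the handle returned by open("/var/log/messages"), since it only iterates over lines. A tally like this shows which processes were victims, which is not necessarily the process that consumed the memory.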

The third machine actually recovered from its problem. It was running 2.6.14-1.1656_FC4. It looks like CVS was causing the problem (this machine is a CVS server), but that is just hard to believe.

Feb  6 07:09:20 marge kernel: Out of Memory: Killed process 28480 (cvs).
Feb  6 10:50:27 marge kernel: Out of Memory: Killed process 30501 (cvs).

PS>> The third machine is a Tyan dual Opteron machine.
PS>> It experienced the problem with 1831 when trying to
PS>> compile a large code.
PS>>
PS>> Has anyone experienced similar problems?

DJ> There were some leaks in the older kernels that
DJ> could explain it, but they usually manifested
DJ> themselves slightly differently. 1831 has them
DJ> fixed though (though there may still be some
DJ> undiscovered, and there's one known problem
DJ> that I was made aware of a day or so ago
DJ> where SELinux leaks memory. I'll get a test kernel
DJ> with that fix built after 2.6.15.4 is released).

Thanks. Is there anything else I can do to help debug?

Pete