Re: OOM killer gripe (was Re: What still uses the block layer?)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Oct 15, 2007 at 11:37:44PM +1000, Nick Piggin wrote:
> I hate to go completely offtopic here, but disks are so incredibly
> slow when compared to RAM that there is really nothing the kernel
> can do about this. Presumably the job will finish, given infinite
> time.

About 6 weeks ago, on a 2.6.23-rc kernel, I accidentally typed "make
-j", and left off the 4 before I hit the return key.  About 2-3
minutes later, the box locked pretty tight.  I managed to switch to a
VT console before I lost total control of X (took many, many minutes
to do the switch), but after many minutes, managed to get logged into
the console, but I wasn't able to get a ps command to complete so I
could start killing processes.  (I probably should have just done a
"killall make" right away, but hindsight is 20/20.)

The console was showing that the OOM killer was attempting to kill
processes, but apparently not fast enough to stem the tide of all of
the new processes getting generated by the make -j.  (I'm guessing
that it was killing the gcc processes and not the make processes.)

> Would an oom-kill-someone-now sysrq be of help, I wonder?

I tried sysrq-f (oom_kill), but no dice.  Given that the oom killer
was active and apparently triggering on its own, this wasn't all that
surprising.

The interesting thing is I tried to do an sysrq-e (send SIGTERM to all
processes except), waited 5 minutes or so, then tried an alt-sysrq-i
(send SIGKILL to all processes except init), and the system was still
thrashing itself to death, even after giving it plenty of time to try
to recover.

I finally gave up and held down the power button.  This was on a box
with 4 gigs memory (but only 3 gigs visible thanks a cheap
BIOS/chipset) and 4 gigs swap (mainly intended for suspend/resume).

I chalked it up to me being stupid (I should have noticed and
Ctrl-C'ed the make -j much more quickly, or if I were a sysadmin on a
time-sharing system with users I didn't trust, configured RLIMIT_NPROC
and/or per-user container resource limits) and the OOM killer not
being aggressive enough in such a situation.  But having better things
to do, I didn't go whining on LKML about it, although I have to say
that the kernel behavior isn't exactly ideal.  One of these days when
I have time, I'll try investigating it with a few memlocked processes
running at real-time priorities and Systemtap and figure out what the
heck was going on....

I suppose I should just configure suspending to a file instead of a
swap partition, but I've just historically trusted suspend/resume to a
swap partition much more than to a file.  Or maybe I should hack in a
sysctl to prevent any swapping even though the swap partition is
configured (so only suspend/resume will use it).

					- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux