[rfc][patch] fixes for several oom killer problems




We have reports of the OOM killer panicking the system even when there are
tasks currently exiting and/or plenty of memory able to be freed.

The main problem is that the cpuset_excl_nodes_overlap test causes an
immediate panic if current is exiting. I haven't yet had confirmation of
whether a minimal patch for that is effective.

The minimal patch basically involved ripping out the test completely.
I'd rather have something more comprehensive in mainline, and I spotted
several other issues as well.


Fix several OOM killer problems.

Big ones:
- cpuset_excl_nodes_overlap always returns 0 if current is exiting. This
  caused customer systems to panic in the OOM killer when processes were
  having trouble getting memory for the final put_user in mm_release, even
  though there were lots of processes to kill. Fix this by having a failed
  cpuset_excl_nodes_overlap test reduce a task's badness rather than exclude
  the task outright (it may still be pinning memory on this node, or memory
  that this task may use).

- If current *is* exiting, it should actually be allowed to access reserved
  memory rather than OOM kill something else. Can't do this via a straight
  check in page_alloc.c because that would allow multiple tasks to use up
  reserves. Instead cause current to wind up marking itself as TIF_MEMDIE.

- In cpuset_excl_nodes_overlap, return 1 for PF_EXITING tasks. This retains
  parity with !CONFIG_CPUSETS case.
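
The selection behaviour described in the points above can be modelled in
user space. What follows is only a rough sketch, not the kernel code: the
sk_* names, the flag values and the -2 return code (standing in for
ERR_PTR(-1UL)) are all hypothetical, badness is a precomputed field, and the
pid 1 and oomkilladj handling is left out.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's task flags; values are arbitrary. */
enum {
	SK_EXITING = 1 << 0,	/* models PF_EXITING */
	SK_DEAD    = 1 << 1,	/* models PF_DEAD */
	SK_MEMDIE  = 1 << 2,	/* models TIF_MEMDIE */
};

struct sk_task {
	const char *comm;
	int has_mm;		/* 0 models a kernel thread (p->mm == NULL) */
	unsigned int flags;
	unsigned long points;	/* precomputed stand-in for badness() */
};

/*
 * Returns the index of the chosen victim, -1 if no task is eligible, or
 * -2 when some other task is already releasing memory, in which case the
 * caller should wait instead of killing anything.
 */
static int sk_select(const struct sk_task *t, size_t n, size_t current_idx,
		     unsigned long *ppoints)
{
	int chosen = -1;
	size_t i;

	assert(t != NULL && ppoints != NULL);
	*ppoints = 0;

	for (i = 0; i < n; i++) {
		const struct sk_task *p = &t[i];

		/* skip kernel threads */
		if (!p->has_mm)
			continue;

		if (p->flags & (SK_EXITING | SK_MEMDIE)) {
			/* dead tasks have already released their mm */
			if (p->flags & SK_DEAD)
				continue;
			/* an exiting current "kills" itself to reach the reserves */
			if (i == current_idx) {
				chosen = (int)i;
				*ppoints = ULONG_MAX;
				continue;
			}
			return -2;
		}

		if (p->points > *ppoints || chosen < 0) {
			chosen = (int)i;
			*ppoints = p->points;
		}
	}
	return chosen;
}
```

In this model an exiting current is selected with the maximum score, so no
later task can displace it, while any other task that is still releasing
memory makes the whole selection back off.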

Little ones:
- PF_SWAPOFF processes cause select_bad_process to return straight away.
  Instead, give them high priority and ensure no parallel OOM kills are
  happening at the same time.

- Killing a task that fails cpuset_excl_nodes_overlap may still free up some
  memory we're allowed to use: kernel-allocated memory, or memory touched
  first by other processes or when we were in a different group. So have a
  failed test merely reduce the badness of a process.

- Skip kernel threads, rather than having them return 0 from badness.
  Theoretically, badness might truncate all results to 0, thus a kernel
  thread might be picked first, causing an infinite loop.

- Skip PF_DEAD tasks, for similar reasons.

- Print the name of the task that invoked the OOM killer. Could make
  debugging easier.

Signed-off-by: Nick Piggin <[email protected]>

Index: linux-2.6/mm/oom_kill.c
--- linux-2.6.orig/mm/oom_kill.c
+++ linux-2.6/mm/oom_kill.c
@@ -57,6 +57,12 @@ unsigned long badness(struct task_struct
+	 * swapoff can easily use up all memory, so kill those first.
+	 */
+	if (p->flags & PF_SWAPOFF)
+		return ULONG_MAX;
+	/*
 	 * The memory size of the process is the basis for the badness.
 	points = mm->total_vm;
@@ -125,6 +131,15 @@ unsigned long badness(struct task_struct
 	if (cap_t(p->cap_effective) & CAP_TO_MASK(CAP_SYS_RAWIO))
 		points /= 4;
+	/*
+	 * If p's nodes don't overlap ours, it may still help to kill p
+	 * because p may have allocated or otherwise mapped memory on
+	 * this node before. However it will be less likely.
+	 */
+	if (!cpuset_excl_nodes_overlap(p))
+		points /= 4;
 	 * Adjust the score by oomkilladj.
@@ -190,25 +205,35 @@ static struct task_struct *select_bad_pr
 		unsigned long points;
 		int releasing;
+		/* skip kernel threads */
+		if (!p->mm)
+			continue;
 		/* skip the init task with pid == 1 */
 		if (p->pid == 1)
-		if (p->oomkilladj == OOM_DISABLE)
-			continue;
-		/* If p's nodes don't overlap ours, it won't help to kill p. */
-		if (!cpuset_excl_nodes_overlap(p))
-			continue;
 		 * This is in the process of releasing memory so for wait it
 		 * to finish before killing some other task by mistake.
+		 *
+		 * However, if p is the current task, we allow the 'kill' to
+		 * go ahead if it is exiting: this will simply set TIF_MEMDIE,
+		 * which will allow it to gain access to memory reserves in
+		 * the process of exiting and releasing its resources.
 		releasing = test_tsk_thread_flag(p, TIF_MEMDIE) ||
 						p->flags & PF_EXITING;
-		if (releasing && !(p->flags & PF_DEAD))
+		if (releasing) {
+			/* PF_DEAD tasks have already released their mm */
+			if (p->flags & PF_DEAD)
+				continue;
+			if (p == current) {
+				chosen = p;
+				*ppoints = ULONG_MAX;
+				continue;
+			}
 			return ERR_PTR(-1UL);
-		if (p->flags & PF_SWAPOFF)
-			return p;
+		}
 		points = badness(p, uptime.tv_sec);
 		if (points > *ppoints || !chosen) {
@@ -216,6 +241,7 @@ static struct task_struct *select_bad_pr
 			*ppoints = points;
 	} while_each_thread(g, p);
 	return chosen;
@@ -319,8 +345,8 @@ void out_of_memory(struct zonelist *zone
 	unsigned long points = 0;
 	if (printk_ratelimit()) {
-		printk("oom-killer: gfp_mask=0x%x, order=%d\n",
-			gfp_mask, order);
+		printk(KERN_WARNING "%s invoked oom-killer: "
+			"gfp_mask=0x%x, order=%d, oomkilladj=%d\n", current->comm, gfp_mask, order, current->oomkilladj);
Index: linux-2.6/kernel/cpuset.c
--- linux-2.6.orig/kernel/cpuset.c
+++ linux-2.6/kernel/cpuset.c
@@ -2362,7 +2362,7 @@ EXPORT_SYMBOL_GPL(cpuset_mem_spread_node
 int cpuset_excl_nodes_overlap(const struct task_struct *p)
 	const struct cpuset *cs1, *cs2;	/* my and p's cpuset ancestors */
-	int overlap = 0;		/* do cpusets overlap? */
+	int overlap = 1;		/* do cpusets overlap? */
 	if (current->flags & PF_EXITING) {