Re: PROBLEM: kernel BUG at mm/rmap.c:486 - kernel 2.6.15-r1

On Wed, 1 Feb 2006, Ken MacFerrin wrote:
> 
> Well, unfortunately I'm back again.  I applied your patch last night but had
> another crash again today.  My X session crashed and dropped me into the
> console, which then froze, requiring a hard reboot.  I was only able to
> capture the output below because of remote logging.  This time I did not get
> the specific "BUG at mm/rmap.c" message I had in my previous report

Yes, that's replaced by "Bad rmap..." by my patch.

> but do have some "Bad rmap..." messages as you can see below.

Which in many cases allow the system to continue undisturbed;
but unfortunately not in your case, which is in a nastier state.
And only one "Bad rmap...", so not a lot I could glean from it.

> Feb  1 17:01:01 mm-home1 cron[31322]: (root) CMD (/usr/bin/updatedb)

Okay, so plenty of disk and cache activity then.
Were you doing anything interesting at the graphics end?

> Feb  1 17:04:13 mm-home1 __find_get_block_slow() failed. block=1410,
> b_blocknr=71213169107797378

Or in hex, block=0x582 b_blocknr=0x00fd000000000582: something has
corrupted the upper short of the bufheader's block number with 0xfd.
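
The conversion above can be checked with a few lines of Python. This is
only a sketch: the two numbers are copied from the quoted log line, and the
function name describe is invented here for illustration.

```python
# Values copied from the quoted __find_get_block_slow() log line.
requested_block = 1410
b_blocknr = 71213169107797378

def describe(b_blocknr, requested_block):
    """Show a logged b_blocknr in hex and isolate the bits that differ."""
    # XOR leaves only the bits that differ from the requested block.
    stray = b_blocknr ^ requested_block
    # Top 16 bits of the value when treated as a 64-bit quantity.
    upper_short = (b_blocknr >> 48) & 0xFFFF
    return {
        "block": hex(requested_block),
        "b_blocknr": f"0x{b_blocknr:016x}",
        "stray_bits": f"0x{stray:016x}",
        "upper_short": hex(upper_short),
    }

result = describe(b_blocknr, requested_block)
for name, value in result.items():
    print(f"{name:12} {value}")
```

The low bits still match the requested block, so the XOR isolates the
stray value in the upper short.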

> Feb  1 17:04:13 mm-home1 Unable to handle kernel NULL pointer dereference at
> virtual address 00000000
> Feb  1 17:04:13 mm-home1 EIP is at flush_commit_list+0x229/0x3ef
> Feb  1 17:04:13 mm-home1 Process pdflush (pid: 164, threadinfo=f7e10000
> task=f7c15030)

And ReiserFS is justifiably surprised that no bufheader could be
found for one of its journal pages.

> Feb  1 17:04:13 mm-home1 Badness in do_exit at kernel/exit.c:796

Concomitant fallout from the above fault.

> Feb  1 17:04:14 mm-home1 kdm[10322]: X server for display :0 terminated
> unexpectedly

Nothing to say why that was, but we already know the system is bad.

> Feb  1 17:04:14 mm-home1 Bad rmap: page c1ee7ee0 flags c0000014 count 1
> mapcount 0
> Feb  1 17:04:14 mm-home1 firefox-bin addr b5313000 ptpfn 69515 vm_flags 100077
> Feb  1 17:04:14 mm-home1 page mapping 00000000 95d4 vma mapping 00000000 b5313

A page is being unmapped which was not recorded as being mapped.  Could be
page table corruption.  I'd been hoping for a sequence of these, and would
then have looked for some commonality, but it's an isolated occurrence.
Probably related to the bufheader corruption.

> Feb  1 17:04:19 mm-home1 Unable to handle kernel paging request at virtual
> address 00180000
> Feb  1 17:04:19 mm-home1 Process kswapd0 (pid: 165, threadinfo=f7d8e000 
> Feb  1 17:04:19 mm-home1 EIP is at find_get_pages+0x33/0x54

The radix-tree lookup found 0x00180000 where it should have found a struct
page pointer or NULL: something has corrupted the upper short with 0x18.

> Feb  1 17:04:19 mm-home1 <6>note: kswapd0[165] exited with preempt_count 1

Concomitant fallout from the above fault.

> Feb  1 17:04:19 mm-home1 (krm-11020): GConf server is not in use, shutting
> down.
> Feb  1 17:04:21 mm-home1 kdm: :0[15897]: Abnormal termination of greeter for
> display :0, code 127, signal 0

Things are getting worse.

> Feb  1 17:04:29 mm-home1 login(pam_unix)[10286]: session opened for user root
> by LOGIN(uid=0)
> Feb  1 17:06:45 mm-home1 __find_get_block_slow() failed. block=1681,
> b_blocknr=23362423066986129

Or in hex, block=0x691 b_blocknr=0x0053000000000691: something has
corrupted the upper short of the bufheader's block number with 0x53.

Well, you're getting plenty of memory corruption, and there's some pattern
to it (bits 8-11 each time), but I can't guess where it's coming from,
I'm afraid.  The "Bad rmap", my speciality, looks merely incidental
to the more general memory corruption.
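
To put the three corrupted values side by side, here is a rough sketch that
splits each into 32-bit words and then into 16-bit halves.  Viewing the
64-bit block numbers as two 32-bit words is an assumption made here (the
addresses in the log suggest a 32-bit machine); the helper names words32
and shorts are invented for illustration, and the values are copied from
the quoted log lines.

```python
def words32(value, width_bits):
    """Return the 32-bit words of value, most significant first."""
    count = width_bits // 32
    return [(value >> (32 * i)) & 0xFFFFFFFF for i in reversed(range(count))]

def shorts(word):
    """Return (upper 16 bits, lower 16 bits) of a 32-bit word."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF

# (label, logged value, width in bits)
observations = [
    ("b_blocknr 17:04:13", 71213169107797378, 64),
    ("radix-tree entry 17:04:19", 0x00180000, 32),
    ("b_blocknr 17:06:45", 23362423066986129, 64),
]

suspect = []
for label, value, width in observations:
    for word in words32(value, width):
        upper, lower = shorts(word)
        # Keep only the words whose upper short is non-zero.
        if upper != 0:
            suspect.append((label, f"0x{word:08x}", hex(upper), hex(lower)))

for label, word, upper, lower in suspect:
    print(f"{label:26} word={word} upper={upper} lower={lower}")
```

Each case yields exactly one word with a non-zero upper short and a zero
lower short, which makes the three faults easy to compare.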

I know you already said you really need to use the nVidia driver for
xinerama, but it has to be suspect #1.  Any chance of doing without it
just for a day, to see what happens then?  Or would that force you into
such a different work pattern that it would prove nothing?

After that, the next thing to try is going back to 2.6.12: I think you
said this bad behaviour started with 2.6.13 (but I may be quite wrong
to assume that you were running 2.6.12 before).  Perhaps the problem
lies with your hardware, and only started to manifest around the time
you moved to 2.6.13: we do need to rule that out.

Hugh