We've seen the below on a 2.6.5 kernel (SuSE SLES9) at SGI.
Does it look like your crash?
The panic is by kswapd0:
<1>Unable to handle kernel NULL pointer dereference (address
0000000000000078)
<4>kswapd0[122]: Oops 8813272891392 [1]
whose stack shows:
[<a0000001001cecf0>] clear_inode+0x1b0/0x2c0
[<a0000001001d03d0>] generic_drop_inode+0x3b0/0x400
[<a0000001001ccf30>] iput+0x130/0x1c0
[<a00000020b6f0cd0>] nfs_dentry_iput+0x170/0x1c0 [nfs]
[<a0000001001ca050>] prune_dcache+0x510/0x540
[<a0000001001ca0c0>] shrink_dcache_memory+0x40/0x80
[<a00000010014c360>] shrink_slab+0x2e0/0x440
Both generic_shutdown_super() (through shrink_dcache_parent() or
shrink_dcache_anon()) and kswapd0 (through shrink_dcache_memory())
end up calling prune_dcache().
I suspect a race condition inside prune_dcache().
The prune_dcache() function:
  lock dcache_lock
  scan the dentry_unused list for a given number ("count") of dentries
  to free:
    if a dentry is to be freed, call prune_one_dentry()
      dentry_iput()
        unlock dcache_lock
        iput() any associated inode
      d_free() the dentry
      lock dcache_lock
  unlock dcache_lock
Two processors entering prune_dcache() near the same time will both scan
the dentry_unused list and could try to iput() the same inode twice. That is
because the dcache_lock is released while running iput().
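To make the suspected window concrete, here is a small single-threaded C
model that replays the interleaving by hand. It is only a sketch of the
hypothesis, not kernel code: the model_* structs, pick_victim(),
put_inode(), finish_prune() and simulate_double_iput() are all made-up
names, and it assumes that the dentry can still be found on the unused
list while dcache_lock is dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; only the fields
 * needed to show the suspected double iput() are modelled. */
struct model_inode {
    int i_count;            /* reference count */
};

struct model_dentry {
    struct model_inode *d_inode;
    int on_unused_list;     /* 1 while still findable on dentry_unused */
};

/* Done with dcache_lock held: choose the dentry to prune. */
static struct model_dentry *pick_victim(struct model_dentry *d)
{
    return d->on_unused_list ? d : NULL;
}

/* Done with dcache_lock dropped: release the inode reference. */
static void put_inode(struct model_dentry *d)
{
    if (d->d_inode)
        d->d_inode->i_count--;
}

/* Done after retaking dcache_lock: the dentry is gone from the list. */
static void finish_prune(struct model_dentry *d)
{
    d->on_unused_list = 0;
}

/* Replays the suspected interleaving of two CPUs and returns the final
 * i_count of an inode that started with a single reference. */
static int simulate_double_iput(void)
{
    struct model_inode inode = { .i_count = 1 };
    struct model_dentry dentry = { .d_inode = &inode, .on_unused_list = 1 };

    struct model_dentry *a = pick_victim(&dentry);  /* CPU A, lock held */
    /* CPU A drops the lock to call iput(); CPU B takes it and scans. */
    struct model_dentry *b = pick_victim(&dentry);  /* CPU B, lock held */

    assert(a != NULL && a == b);    /* both CPUs chose the same dentry */

    put_inode(a);                   /* CPU A, lock dropped */
    put_inode(b);                   /* CPU B, lock dropped */

    finish_prune(a);
    finish_prune(b);

    return inode.i_count;
}
```

In this model simulate_double_iput() returns -1: the one reference the
inode had was dropped twice.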
I suppose the dcache_lock must be released here because the iput() may take
a long time, and the dcache_lock is used in many places in the system
to protect the dentry cache's lists.
It would seem to me that a straightforward fix would be to add another
lock to protect just the scan of the dentry_unused list, only here in
prune_dcache().
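A user-space sketch of that idea, using pthreads rather than kernel
spinlocks, might look like the following. Everything in it is
hypothetical (model_entry, list_lock standing in for dcache_lock,
prune_lock for the proposed extra lock, prune_all(), run_two_pruners());
it only illustrates the locking pattern and is not a patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_ENTRIES 64

/* Hypothetical model: one slot per dentry on the unused list. */
struct model_entry {
    int on_list;    /* 1 while still on the unused list */
    int puts;       /* how many times its inode was released */
};

static struct model_entry entries[NR_ENTRIES];

/* Stands in for dcache_lock: protects on_list, dropped around the put. */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
/* The proposed extra lock: held for the whole scan, so only one
 * pruner at a time walks the list. */
static pthread_mutex_t prune_lock = PTHREAD_MUTEX_INITIALIZER;

static void *prune_all(void *unused)
{
    (void)unused;
    pthread_mutex_lock(&prune_lock);
    pthread_mutex_lock(&list_lock);
    for (int i = 0; i < NR_ENTRIES; i++) {
        if (!entries[i].on_list)
            continue;
        pthread_mutex_unlock(&list_lock);   /* like dropping dcache_lock */
        /* With prune_lock held, no other pruner can have got here first. */
        assert(entries[i].puts == 0);
        entries[i].puts++;                  /* like iput(); may be slow */
        pthread_mutex_lock(&list_lock);
        entries[i].on_list = 0;
    }
    pthread_mutex_unlock(&list_lock);
    pthread_mutex_unlock(&prune_lock);
    return NULL;
}

/* Runs two pruners concurrently. Returns the number of entries whose
 * inode was released anything other than exactly once, or -1 if no
 * thread could be started. */
static int run_two_pruners(void)
{
    pthread_t t[2];
    int started = 0;
    int bad = 0;

    for (int i = 0; i < NR_ENTRIES; i++) {
        entries[i].on_list = 1;
        entries[i].puts = 0;
    }
    for (; started < 2; started++)
        if (pthread_create(&t[started], NULL, prune_all, NULL) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(t[i], NULL);
    if (started == 0)
        return -1;

    for (int i = 0; i < NR_ENTRIES; i++)
        if (entries[i].puts != 1)
            bad++;
    return bad;
}
```

One cost of this arrangement is that pruners are fully serialised, so a
slow release in one of them holds up the other for the whole scan.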
-Cliff Wickman
On Mon, Dec 19, 2005 at 04:38:55PM -0500, John F Flynn III wrote:
> Good evening, folks...
>
> We have been experiencing a very rare (on average once every two to
> three months) crash on some of our servers.
>
> uname -a:
> Linux cheetah 2.6.9-22.0.1.ELsmp #1 SMP Thu Oct 27 13:14:25 CDT 2005
> i686 i686 i386 GNU/Linux
>
> (This is a CentOS provided kernel)
>
> Here is a photo of the bottom of the panic. Unfortunately the kernel has
> no chance to log this anywhere else:
>
> http://www.cs.fiu.edu/~flynnj/cheetah-crash.jpg
>
>
> The crash appears to be in prune_dcache, and has happened on several
> distinct machines, so we do not believe it is a hardware problem.
>
> If anyone has pointers on what bug could be causing this crash, or if
> it's been fixed in newer kernels we could try, it would be greatly
> appreciated. This only seems to happen on loaded production machines,
> and it happens so rarely that more detailed debugging is nearly impossible.
>
> Thanks in advance,
> -John Flynn
>
> --
> John Flynn [email protected]
> =========================================================
> Systems and Network Administration /\_/\
> School of Computer Science ( O.O )
> Florida International University > <
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Cliff Wickman
Silicon Graphics, Inc.
[email protected]
(651) 683-3824