panic: "attempting to free lock on active lock list"

Hello,

I'm running a RedHat/Centos-modified 2.6.9 (-34.EL) in an extremely busy web-service-on-NFS environment. (Tons of small files, user homepages, and other things I try not to consider.) I know I'm not running the latest 2.6.16 kernel on these boxes, so if the immediate response is to go back and do that, I will do so. However, I am really hoping that one of the folks here will see this message and say "I remember that bug!" and be able to point me at a patch. I've done a thorough mailing list search and have tried some of the suggestions that I found, so please read below.

My systems are all SMP and I can reproduce this on 2x 800 MHz Compaq boxes and 2x 1.3 GHz IBM boxes. I haven't tried it on any UP boxes, nor have I tried it with a UP kernel on an SMP machine.

The boxes were hanging hard with an "attempting to free lock on active lock list" panic message, with no further debugging information. From digging around the mailing lists, my best guess is a poor interaction between NFS and the FS layer, but it could also be just collateral damage from some other problem. Per a suggestion from Chris Wright on LKML back in Jan 2005, I changed the panic to do BUG_ONs, to print more diagnostics. That gives me this crash dump:


Apr 11 14:15:01 bos-tri-members36 kernel: Attempting to free lock on active lock list------------[ cut here ]------------
Apr 11 14:15:01 bos-tri-members36 kernel: kernel BUG at fs/locks.c:173!
Apr 11 14:15:01 bos-tri-members36 kernel: invalid operand: 0000 [#1]
Apr 11 14:15:01 bos-tri-members36 kernel: SMP
Apr 11 14:15:01 bos-tri-members36 kernel: Modules linked in: iptable_filter ip_tables md5 ipv6 parport_pc lp parport autofs4 i2c_dev i2c_core nfs lockd sunrpc dm_mirror dm_mod button battery ac ohci_hcd tg3 floppy sg ext3 jbd mptscsih mptbase sd_mod scsi_mod
Apr 11 14:15:01 bos-tri-members36 kernel: CPU:    1
Apr 11 14:15:01 bos-tri-members36 kernel: EIP:    0060:[<c016c0ba>]    Not tainted VLI
Apr 11 14:15:01 bos-tri-members36 kernel: EFLAGS: 00010216 (2.6.9-22.0.3.EL.lycossmp)
Apr 11 14:15:01 bos-tri-members36 kernel: EIP is at __posix_lock_file+0x56a/0x5b6
Apr 11 14:15:01 bos-tri-members36 kernel: eax: 0000002b   ebx: c02e6108   ecx: f5b19ee0   edx: c02e6108
Apr 11 14:15:01 bos-tri-members36 kernel: esi: 00000000   edi: d2535c0c   ebp: 00000000   esp: f5b19ee0
Apr 11 14:15:01 bos-tri-members36 kernel: ds: 007b   es: 007b   ss: 0068
Apr 11 14:15:01 bos-tri-members36 kernel: Process sendmail (pid: 23465, threadinfo=f5b19000 task=f4a32030)
Apr 11 14:15:01 bos-tri-members36 kernel: Stack: f7f55f80 e24cbd5c 00000000 00000000 00000000 00000000 00000000 00000000
Apr 11 14:15:01 bos-tri-members36 kernel: f5b80c68 00000000 00000000 d2535c0c 00000000 d2535c0c d2535c0c f5b19f78
Apr 11 14:15:01 bos-tri-members36 kernel: 00000000 ce085c80 c016cfb1 00000000 f5b80bc0 00000007 443bf225 00000000
Apr 11 14:15:01 bos-tri-members36 kernel: Call Trace:
Apr 11 14:15:01 bos-tri-members36 kernel:  [<c016cfb1>] fcntl_setlk+0x169/0x2b2
Apr 11 14:15:01 bos-tri-members36 kernel:  [<c0161f9b>] sys_fstat64+0x1e/0x23
Apr 11 14:15:01 bos-tri-members36 kernel:  [<c0169307>] do_fcntl+0x10c/0x155
Apr 11 14:15:01 bos-tri-members36 kernel:  [<c0169416>] sys_fcntl64+0x6c/0x7d
Apr 11 14:15:02 bos-tri-members36 kernel:  [<c02d14d7>] syscall_call+0x7/0xb
Apr 11 14:15:02 bos-tri-members36 kernel:  [<c02d007b>] _spin_lock+0x2e/0x34
Apr 11 14:15:02 bos-tri-members36 kernel: Code: 2e c0 e8 ac 61 fb ff 5f 0f 0b a8 00 ce 60 2e c0 8b 44 24 2c 8b 7c 24 2c 83 c0 04 39 47 04 74 13 68 08 61 2e c0 e8 89 61 fb ff 5b <0f> 0b ad 00 ce 60 2e c0 8b 54 24 2c 8b 42 4c 85 c0 74 18 8b 50
Apr 11 14:15:02 bos-tri-members36 kernel: <0>Fatal exception: panic in 5 seconds
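As a cross-check on the dump above, the Code: bytes can be decoded by hand. What follows is a minimal sketch, assuming the 2.6-era i386 BUG() layout with CONFIG_DEBUG_BUGVERBOSE (ud2, i.e. 0f 0b, followed by a 16-bit line number and a 32-bit pointer to the file name string). The function name decode_i386_bug_sites is made up for illustration.

```python
import struct

# The "Code:" bytes from the crash dump; <..> marks the faulting byte.
OOPS_CODE = (
    "2e c0 e8 ac 61 fb ff 5f 0f 0b a8 00 ce 60 2e c0 8b 44 24 2c "
    "8b 7c 24 2c 83 c0 04 39 47 04 74 13 68 08 61 2e c0 e8 89 61 "
    "fb ff 5b <0f> 0b ad 00 ce 60 2e c0 8b 54 24 2c 8b 42 4c 85 "
    "c0 74 18 8b 50"
)


def decode_i386_bug_sites(code_line):
    """Find ud2-based BUG() sites in an oops "Code:" line.

    Assumption (not taken from the post): each site is laid out as
    0f 0b, a little-endian 16-bit line number, then a little-endian
    32-bit pointer to the file name.

    Returns a list of (line_number, file_ptr, is_faulting_instruction).
    """
    raw = []
    fault_index = None
    for tok in code_line.split():
        if tok.lower() == "code:":
            continue  # tolerate a pasted "Code:" prefix
        if tok.startswith("<") and tok.endswith(">"):
            fault_index = len(raw)  # byte the EIP pointed at
            tok = tok[1:-1]
        raw.append(int(tok, 16))
    data = bytes(raw)

    sites = []
    i = 0
    while i + 8 <= len(data):
        if data[i] == 0x0F and data[i + 1] == 0x0B:
            line, file_ptr = struct.unpack_from("<HI", data, i + 2)
            sites.append((line, file_ptr, i == fault_index))
            i += 8  # skip ud2 plus its 6 bytes of metadata
        else:
            i += 1
    return sites


for line, file_ptr, faulting in decode_i386_bug_sites(OOPS_CODE):
    print("BUG site: line %d, file string at 0x%08x%s"
          % (line, file_ptr, " (faulting)" if faulting else ""))
```

On this dump the sketch reports two BUG sites that share one file-name pointer, at lines 168 and 173, and the second is the faulting instruction. That agrees with the "kernel BUG at fs/locks.c:173!" line.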

Later in that same thread, Trond Myklebust provided a patch to change posix_lock_file() to posix_lock_file_wait() and that solved that specific user's problems. However, that small patch is already back-ported to this 2.6.9 kernel. (You know how those distributions are...) I've searched the mailing lists for other applicable posts, but have come up empty.

I am running a very large number of servers running this same version of Centos and only the machines with the highest NFS loads appear to trigger this problem. However, I'm not positive that it is load related... reducing the load by 30% (by adding more servers) didn't do the trick and it's difficult for me to tell if the problem just happens less now. (I have an ad-hoc watchdog system that flips the power on the boxes several minutes after they stop responding to pings. Desperate times...) I also have many other servers that do similar workloads that are unaffected, so it could be a specific condition in our software which is causing the panic(), but I can't trace it.

Thinking it was NFS, we have tried a variety of combinations including using "nolock", dialing down to NFSv2, using UDP-only, etc. None of those change the situation. I haven't taken the machine down to one processor because they tend to melt under the load, but I can try. I've also found a mailing list post suggesting an older RedHat/Centos kernel, which worked for someone else, but that didn't work for me, either. (The NFS server is a NetApp, but I don't have any other units I can test against with the same load characteristics. There are many TB of data involved.)
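One way to test ideas without waiting for production load would be a small lock stress test. Below is a minimal sketch, assuming the trigger is heavy concurrent fcntl() byte-range locking on a file on the NFS mount (the trace shows sendmail in fcntl_setlk). That assumption is unverified, and the function names, worker count, and iteration count are hypothetical.

```python
import fcntl
import os


def lock_unlock_loop(path, iterations):
    """Take and release an exclusive POSIX lock on `path` repeatedly.

    fcntl.lockf() without LOCK_NB issues fcntl(F_SETLKW) underneath.
    Returns the number of completed lock/unlock cycles.
    """
    done = 0
    with open(path, "r+b") as f:
        for _ in range(iterations):
            fcntl.lockf(f, fcntl.LOCK_EX)  # blocks until granted
            fcntl.lockf(f, fcntl.LOCK_UN)
            done += 1
    return done


def hammer(path, workers=4, iterations=1000):
    """Fork `workers` processes that all contend for the same lock.

    Returns the number of workers that did not exit cleanly.
    """
    pids = []
    for _ in range(workers):
        pid = os.fork()
        if pid == 0:
            # Child: run the loop, then exit without returning to the caller.
            try:
                lock_unlock_loop(path, iterations)
                os._exit(0)
            except BaseException:
                os._exit(1)
        pids.append(pid)

    failures = 0
    for pid in pids:
        _, status = os.waitpid(pid, 0)
        if status != 0:  # non-zero exit code or killed by a signal
            failures += 1
    return failures
```

Pointing hammer() at an existing scratch file on the affected mount, for example hammer("/path/on/nfs/scratch.lock", workers=8, iterations=100000), exercises the same fcntl path; the path here is a placeholder. This sketch uses separate processes, so if the real trigger involves threads sharing a file descriptor it would not cover that case.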

Does anyone have any suggestions? I don't mind getting my hands dirty testing ideas.


Thanks very much for your assistance,

Joe Pranevich
