----- Original Message -----
From: "David Chinner" <[email protected]>
To: "Haar János" <[email protected]>
Cc: "David Chinner" <[email protected]>; <[email protected]>;
<[email protected]>
Sent: Monday, December 18, 2006 7:24 AM
Subject: Re: xfslogd-spinlock bug?
> On Mon, Dec 18, 2006 at 12:56:41AM +0100, Haar János wrote:
> > > On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar János wrote:
> > > > I dont know there is a context between 2 messages, but i can see,
the
> > > > spinlock bug comes always on cpu #3.
> > > >
> > > > Somebody have any idea?
> > >
> > > Your disk interrupts are directed to CPU 3, and so log I/O completion
> > > occurs on that CPU.
> >
> > CPU0 CPU1 CPU2 CPU3
> > 0: 100 0 0 4583704 IO-APIC-edge
timer
> > 1: 0 0 0 2 IO-APIC-edge
i8042
> > 4: 0 0 0 3878668 IO-APIC-edge
serial
> .....
> > 14: 3072118 0 0 181 IO-APIC-edge
ide0
> .....
> > 52: 0 0 0 213052723 IO-APIC-fasteoi
eth1
> > 53: 0 0 0 91913759 IO-APIC-fasteoi
eth2
> > 100: 0 0 0 16776910 IO-APIC-fasteoi
eth0
> ....
> >
> > Maybe....
> > I have 3 XFS on this system, with 3 source.
> >
> > 1. 200G one ide hdd.
> > 2. 2x200G mirror on 1 ide + 1 sata hdd.
> > 3. 4x3.3TB strip on NBD.
> >
> > The NBD serves through eth1, and it is on the CPU3, but the ide0 is on
the
> > CPU0.
>
> I'd say your NBD based XFS filesystem is having trouble.
>
> > > Are you using XFS on a NBD?
> >
> > Yes, on the 3. source.
>
> Ok, I've never heard of a problem like this before and you are doing
> something that very few ppl are doing (i.e. XFS on NBD). I'd start
> Hence I'd start by suspecting a bug in the NBD driver.
Ok, if you have right, this also can be in context with the following issue:
http://download.netcenter.hu/bughunt/20061217/messages.txt (10KB)
>
> > > > Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0 EFLAGS: 00010002
> > > > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b
RCX:
> > > ^^^^^^^^^^^^^^^^
> > > Anyone recognise that pattern?
> >
> > I think i have one idea.
> > This issue can stops sometimes the 5sec automatic restart on crash, and
this
> > shows possible memory corruption, and if the bug occurs in the IRQ
> > handling.... :-)
> > I have a lot of logs about this issue, and the RAX, RBX always the same.
>
> And is this the only place where you see the problem? Or are there
> other stack traces that you see this in as well?
I have used the 2.6.16.18 for a long time, and it was stable, except this
issue. (~20 dump with xfslogd)
And i try the new releases, and now i have more. :-)
What do you think exactly?
I can see in the logs, but search for what?
The RAX, RBX thing, or the xfslogd-spinlock problem or the old nbd-deadlock
+ mem corruption?
[root@NetCenter netlog]# grep "0000000000000033" messages*
messages.1:Dec 11 22:47:21 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 12 18:16:35 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 13 11:40:05 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 14 22:25:32 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 15 06:24:44 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.11:Oct 3 19:49:44 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.11:Oct 7 01:11:17 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.13:Sep 21 15:35:31 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.15:Sep 3 16:13:35 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.15:Sep 5 21:00:38 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec 9 00:10:47 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec 9 14:07:01 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec 10 04:44:48 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.3:Nov 30 10:59:21 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.3:Dec 2 00:54:23 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 13 10:44:49 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 14 03:14:14 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 14 03:37:07 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 15 01:39:54 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 6 14:48:54 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 7 04:36:13 dy-base RAX: 0000000000000033 RBX:
ffff8100057d2080 RCX: ffff810050d638f8
messages.6:Nov 7 04:36:13 dy-base RDX: 0000000000000008 RSI:
0000000000012cff RDI: 0000000000000033
messages.6:Nov 7 11:12:06 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 8 03:20:38 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 8 15:02:16 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 8 15:27:12 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 10 15:29:43 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 11 20:44:14 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.9:Oct 18 15:31:02 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.9:Oct 19 13:53:24 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
>
> > > This implies a spinlock inside a wait_queue_head_t is corrupt.
> > >
> > > What are you type of system do you have, and what sort of
> > > workload are you running?
> >
> > OS: Fedora 5, 64bit.
> > HW: dual xeon, with HT, ram 4GB.
> > (the min_free_kbytes limit is set to 128000, because sometimes the e1000
> > driver run out the reserved memory during irq handling.)
>
> That does not sound good. What happens when it does run out of memory?
This is an old problem, on 2.6.16.18 .
The default min_free_kbytes is 38xx , and the GIGE controller easily can be
overflow this little place.
If this happens, the system freez, and i can only use the serial console +
sysreq to dump stack:
download.netcenter.hu/bughunt/20060530/261618-good.txt
download.netcenter.hu/bughunt/20060530/dmesg.txt
download.netcenter.hu/bughunt/20060530/dump.txt
This problem is already fixed with set the min_free_kbytes to 128M.
> Is that when you start to see the above corruptions?
I think no, but i am not 100% sure.
Cheers,
Janos
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Stuff]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
[Linux Resources]