Re: kernel + gcc 4.1 = several problems

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On Tue, 2 Jan 2007, Alistair John Strachan wrote:
>
> eax: 00000008   ebx: 00000000   ecx: 00000008   edx: 00000000
> esi: f70f3e9c   edi: f7017c00   ebp: f70f3c1c   esp: f70f3c0c
>
> Code: 58 01 00 00 0f 4f c2 09 c1 89 c8 83 c8 08 85 db 0f 44 c8 8b 5d f4 89 c8 
> 8b 75 f8 8b 7d fc 89 ec 5d c3 89 ca 8b 46 6c 83 ca 10 3b <87> 68 01 00 00 0f 
> 45 ca eb b6 8d b6 00 00 00 00 55 b8 01 00 00
> EIP: [<c0156f60>] pipe_poll+0xa0/0xb0 SS:ESP 0068:f70f3c0c
> 
> Chuck observed that the kernel tries to reenter pipe_poll half way through an 
> instruction (c0156f5f->c0156f60); it's not a single-bit error but an 
> off-by-one.

It's not an off-by-one either (eg say we're taking an exception and 
screiwing up %eip by one somehow).

The code sequence in question is

	mov    %ecx,%edx
	mov    0x6c(%esi),%eax
	or     $0x10,%edx
	cmp    0x168(%edi),%eax		<--
	cmovne %edx,%ecx
	jmp    ...

and it's in the second byte of the "cmp".

And yes, it definitely entered there, because trying other random 
entry-points will have either invalid instructions or instructions that 
would fault due to NULL pointers. HOWEVER, it's also not as simple as 
"took an interrupt, and returned with %eip incremented by one", becasue 
your %edx is zero, so it won't have done that "or $10,%edx" and then some 
interrupt happened and screwed up just %eip.

So it's literally a random %eip, but since you say it's consistently in 
that function, it's not truly "random". There's something that triggers it 
just _there_.

However, that's a damn simple function. There's _nothing_ there. The 
particular code that is involved right there is literally

	if (!pipe->writers && filp->f_version != pipe->w_counter)
		mask |= POLLHUP;

and that's it.  There's not even anything half-way interesting around it, 
except for the "poll_wait()" call, but even that is about as common as
you can humanly get..

Looking at the register set and the stack, I see:

	Stack:	00000000
		00000000  <- saved %ebx (dunno, seems dead in caller)
		f70f3e9c  <- saved %esi (== pollfd in do_pollfd)
		f6e111c0  <- saved %edi	(== filp)
		f70f3fa4  <- outer EBP (looks reasonable) 
		c015d7f3  <- return address (do_sys_poll+0x253/0x480)

and the strange thing is that when the oops happens, it really looks like 
%esi _still_ contains the value it had originally (and that is saved on 
the stack). But afaik, from your disassembly, it should have been 
overwritten by the initial %eax, which should have had the same value as 
%edi on entry...

IOW, none of it really makes any sense. The stack frames look fine, so we 
_did_ enter at the beginning of the function (and it wasn't the *poll fn 
pointer that was corrupt.

> The suggestions I've had so far which I have not yet tried:
> 
> -	Select a different x86 CPU in the config.
> 		-	Unfortunately the C3-2 flags seem to simply tell GCC
> 			to schedule for ppro (like i686) and enabled MMX and SSE
> 		-	Probably useless

Actually, try this one. Try using something that doesn't like "cmov". 
Maybe the C3-2 simply has some internal cmov bugginess. 

		Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux