On Mon, 6 Mar 2006, Jesper Juhl wrote:
>
> Slab corruption: start=f72948a0, len=64
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<c02934eb>](sr_do_ioctl+0x11b/0x270)
> 000: 70 00 05 00 00 00 00 0a 00 00 00 00 24 00 00 00
> 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Prev obj: start=f7294854, len=64
> Redzone: 0x170fc2a5/0x170fc2a5.
> Last user: [<c0173923>](real_lookup+0x93/0xe0)
> 000: 6c 69 62 6b 64 65 69 6e 69 74 5f 6b 69 6f 5f 68
> 010: 74 74 70 5f 63 61 63 68 65 5f 63 6c 65 61 6e 65
> Next obj: start=f72948ec, len=64
> Redzone: 0x170fc2a5/0x170fc2a5.
> Last user: [<c01c7c80>](ext3_init_block_alloc_info+0x20/0x70)
> 000: 10 cf ce f7 00 00 00 00 00 00 00 00 00 00 00 00
> 010: 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Ok, this is interesting because the previous and the next objects look
fine, which implies that it's not something else that overwrote it.
Perhaps more importantly, the actual corrupted object _should_ contain the
POISON_FREE bytes, but doesn't. In fact, what it _does_ contain is
apparently perfectly valid SCSI sense request data (if I read it right,
it's ASC/ASCQ of 24/00 means "Invalid field in cdb").
So that "last user: sr_do_ioctl" thing actually matches what the contents
are.
Your other smal corruption:
> Slab corruption: start=f70aeab4, len=64
> Redzone: 0x5a2cf071/0x5a2cf071.
> Last user: [<c02934eb>](sr_do_ioctl+0x11b/0x270)
> 000: 70 00 02 00 00 00 00 0a 00 00 00 00 3a 01 00 00
> 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Prev obj: start=f70aea68, len=64
> Redzone: 0x170fc2a5/0x170fc2a5.
> Last user: [<c023d6df>](init_dev+0x5cf/0x630)
> 000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> Next obj: start=f70aeb00, len=64
> Redzone: 0x170fc2a5/0x170fc2a5.
> Last user: [<c0173923>](real_lookup+0x93/0xe0)
> 000: 6c 69 62 62 6f 6f 73 74 5f 70 72 67 5f 65 78 65
> 010: 63 5f 6d 6f 6e 69 74 6f 72 2d 67 63 63 2d 31 5f
is exactly the same thing. The objects around it look ok. The object that
should be poisoned again looks like it contains normal request-sense data
(this time 3a/01: "Medium not present - tray closed", if I read it right).
So it really looks like in this case, the sr_do_ioctl() function free'd
the sense key object, but that the sense data was actually written _after_
the object was free'd.
That in turn would imply that scsi_execute() returned before the SCSI
command had actually completed. Now, scsi_execute() just calls to the
generic block device layer, which in turn just submits it, and then waits
for its completion, so if it returns too early, that means that it got
_completed_ too early by the driver.
Alternatively, it got re-tried. The reason I mention that is that we have
commit 17e01f216b611fc46956dcd9063aec4de75991e3, which changes
scsi_execute() to add retries.
I wonder if something does a "complete(rq->waiting)" while the thing is
still retrying? In general, I do not believe that we should retry special
commands that have been initiated by a user, we should return the error.
But I haven't thought this through.
Anyway, Jesper, I see two potential reasons for this bug:
- total and utter slab confusion (the slab layer returned the same slab
allocation twice to two different callers). I consider this pretty
unlikely, because it's such a _major_ failure of the slab code, and the
slab code hasn't changed that much, but I mention it just in case.
- SCSI layer breakage. It might well be the low-level driver completing a
request too early, or it migth be the re-trying. If it's the re-trying,
you could try just reverting that commit I pointed to (ie if you're a
git user, just do "git revert 17e01f21", otherwise you'd need to look
it up from gitweb and un-apply the patch)
Regardless, Jesper, it would be great to hear _what_ strange CDROM device
you have that would implied in sr_ioctl.c - is it USB, SATA or something
else?
James, Mike, can you double-check the retries? In particular, it's _wrong_
to retry after you've already marked a command completed with
"complete(rq->waiting)", so if that happens somewhere, things are really
broken.
Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Stuff]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
[Linux Resources]