Re: [OOPS] amrestore dies in kmem_cache_free 2.6.16.18 - cannot restore backups!

Chuck Ebbert wrote:
> In-Reply-To: <[email protected]>
> 
> On Tue, 23 May 2006 18:24:14 -0700, James Lamanna wrote:
> 
>> So I was able to recreate this problem on a vanilla 2.6.16.18 kernel with
>> the following oops.
>> I'd say this is a serious regression, since I cannot restore backups
>> anymore (I could with 2.6.14.x, but that kernel series had other
>> issues...).
> 
>> Unable to handle kernel paging request at ffff82bc81000030 RIP: <ffffffff801657d9>{kmem_cache_free+82}
>> PGD 0
>> Oops: 0000 [1] SMP
>> CPU 1
>> Modules linked in:
>> Pid: 5814, comm: amrestore Not tainted 2.6.16.18 #2
>> RIP: 0010:[<ffffffff801657d9>] <ffffffff801657d9>{kmem_cache_free+82}
>> RSP: 0018:ffff81007d4afcd8  EFLAGS: 00010086
>> RAX: ffff82bc81000000 RBX: ffff81004119d800 RCX: 000000000000001e
>> RDX: ffff81000000c000 RSI: 0000000000000000 RDI: 00000007f0000000
>> RBP: ffff81007ff0c800 R08: 0000000000000000 R09: 0000000000000400
>> R10: 0000000000000000 R11: ffffffff8014b3d6 R12: ffff810041311480
>> R13: 0000000000000400 R14: 0000000000000400 R15: ffff81007e676748
>> FS:  00002b7f39708020(0000) GS:ffff810041173bc0(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> CR2: ffff82bc81000030 CR3: 000000007de09000 CR4: 00000000000006e0
>> Process amrestore (pid: 5814, threadinfo ffff81007d4ae000, task ffff81007e2f8ae0)
>> Stack: 0000000000000000 0000000000000246 ffff8100413c9bc0 ffff81007ff0c800
>>        ffff8100413c9bc0 ffffffff8016dfdc ffff8100413c9bc0 ffff81007fe25408
>>        00000000ffffffea ffffffff803187e7
>> Call Trace: <ffffffff8016dfdc>{bio_free+48} <ffffffff803187e7>{scsi_execute_async+640}
>>        <ffffffff8035d8d2>{st_do_scsi+422} <ffffffff8035d6e2>{st_sleep_done+0}
>>        <ffffffff80362950>{st_read+855} <ffffffff8013e1ca>{autoremove_wake_function+0}
>>        <ffffffff80169d7c>{vfs_read+171} <ffffffff8016a0af>{sys_read+69}
>>        <ffffffff8010a93e>{system_call+126}
>>
>> Code: 48 8b 48 30 0f b7 51 28 65 8b 04 25 30 00 00 00 39 c2 0f 84
>> RIP <ffffffff801657d9>{kmem_cache_free+82} RSP <ffff81007d4afcd8>
>> CR2: ffff82bc81000030
> 
> First of all, to really see what is happening, you need to recompile your
> kernel with some debug options added:
> 
> Kernel Hacking --->
>    [*] Kernel debugging
>    [*]   Debug memory allocations
>    [*]   Compile the kernel with frame pointers
> 
> (Frame pointers won't give an exact trace but they'll prevent the tail merging
> that makes it so hard to follow.)
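> 
> (If you edit .config directly instead of going through menuconfig, those
> entries most likely correspond to:
> 
>     CONFIG_DEBUG_KERNEL=y
>     CONFIG_DEBUG_SLAB=y       # "Debug memory allocations"
>     CONFIG_FRAME_POINTER=y
> 
> then rebuild, reinstall, and reboot before retesting.)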
> 
> Then reproduce the error and send the oops and any new error messages you see.
> Don't send the whole boot log and .config again -- we have them already.
> 
> The bug is happening here, in __cache_free, in code that's only included
> on NUMA machines:
> 
> static inline void __cache_free(struct kmem_cache *cachep, void *objp)
> {
>         struct array_cache *ac = cpu_cache_get(cachep);
> 
>         check_irq_off();
>         objp = cache_free_debugcheck(cachep, objp, __builtin_return_address(0));
> 
>         /* Make sure we are not freeing a object from another
>          * node to the array cache on this cpu.
>          */
> #ifdef CONFIG_NUMA
>         {
>                 struct slab *slabp;
>                 slabp = virt_to_slab(objp);                      <==== OOPS
>                 if (unlikely(slabp->nodeid != numa_node_id())) {
>                         struct array_cache *alien = NULL;
>                         int nodeid = slabp->nodeid;
> 
> 
> Tracing through the nested inline functions, we have:
> 
> static inline struct slab *virt_to_slab(const void *obj)
> {
>         struct page *page = virt_to_page(obj);
>         return page_get_slab(page);                              <==== OOPS
> }
> 
> static inline struct slab *page_get_slab(struct page *page)
> {
>         return (struct slab *)page->lru.prev;                    <==== OOPS
> }
> 
> 
> virt_to_page() returned a struct page * that pointed to unmapped memory.
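> 
> (virt_to_page() just does pfn arithmetic/lookup on the address and never
> validates the result, so a garbage object pointer silently becomes a
> garbage struct page pointer; the fault only fires on the dereference.
> As a minimal userspace sketch -- the layout below is my assumed mirror of
> the 2.6.16 x86_64 struct page, with hypothetical stand-in names -- note
> that lru.prev sits 0x30 bytes into struct page, and CR2 (ffff82bc81000030)
> is exactly RAX, the bad struct page pointer (ffff82bc81000000), plus 0x30:
> 
> #include <stdio.h>
> #include <stddef.h>
> 
> struct list_head { struct list_head *next, *prev; };
> 
> struct page_mock {              /* assumed mirror of the 2.6.16 layout */
>         unsigned long flags;    /* 0x00 */
>         int _count;             /* 0x08 (atomic_t) */
>         int _mapcount;          /* 0x0c (atomic_t) */
>         unsigned long private;  /* 0x10 */
>         void *mapping;          /* 0x18 */
>         unsigned long index;    /* 0x20 */
>         struct list_head lru;   /* next at 0x28, prev at 0x30 */
> };
> 
> int main(void)
> {
>         /* prints 0x30 on LP64, matching CR2 - RAX in the oops above */
>         printf("lru.prev offset = %#zx\n",
>                offsetof(struct page_mock, lru.prev));
>         return 0;
> }
> 
> So the read of page->lru.prev is precisely where the bad pointer trips.)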
> 
> 
> This all came from scsi_execute_async, possibly through this path:
> 
> scsi_execute_async
>     scsi_req_map_sg: some kind of error occurred?
>         bio_endio
>             bio->bi_end_io ==> scsi_bi_end_io
>                 bio_put
>                     bio->bi_destructor ==> bio_fs_destructor
>                         bio_free
>                             mempool_free
>                                 kmem_cache_free
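> 
> (To make that control flow concrete, here is a rough userspace mock of
> the destructor chain -- every name and type below is a simplified
> stand-in, not the real 2.6.16 structure:
> 
> #include <stdio.h>
> #include <stdlib.h>
> 
> struct bio_mock {
>         int bi_cnt;                                /* reference count */
>         void (*bi_end_io)(struct bio_mock *);      /* completion hook */
>         void (*bi_destructor)(struct bio_mock *);  /* last-put hook   */
> };
> 
> static void bio_free_mock(struct bio_mock *bio)
> {
>         /* kernel: bio_free -> mempool_free -> kmem_cache_free,
>          * which is where this oops fires */
>         free(bio);
> }
> 
> static void bio_put_mock(struct bio_mock *bio)
> {
>         if (--bio->bi_cnt == 0)
>                 bio->bi_destructor(bio);   /* e.g. bio_fs_destructor */
> }
> 
> static void scsi_bi_end_io_mock(struct bio_mock *bio)
> {
>         bio_put_mock(bio);
> }
> 
> int main(void)
> {
>         struct bio_mock *bio = calloc(1, sizeof(*bio));
>         if (!bio)
>                 return 1;
>         bio->bi_cnt = 1;
>         bio->bi_end_io = scsi_bi_end_io_mock;
>         bio->bi_destructor = bio_free_mock;
>         bio->bi_end_io(bio);   /* the bio_endio() step on the error leg */
>         puts("bio freed through the destructor chain");
>         return 0;
> }
> 
> The point: a single error return inside scsi_req_map_sg can cascade all
> the way down to kmem_cache_free on whatever the bio was carrying.)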
> 
> scsi_execute_async and scsi_req_map_sg were rewritten last December, so
> they may have new bugs.
> 
> 

Sorry for the late reply. I have been traveling.

Maybe I messed up the bounce code usage. Are you using st's direct I/O
feature?