Re: 2.6.23.1: mdadm/raid5 hung/d-state

On Thu, 8 Nov 2007, BERTRAND Joël wrote:

BERTRAND Joël wrote:
Chuck Ebbert wrote:
On 11/05/2007 03:36 AM, BERTRAND Joël wrote:
Neil Brown wrote:
On Sunday November 4, [email protected] wrote:
# ps auxww | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       273  0.0  0.0      0     0 ?        D    Oct21  14:40
[pdflush]
root       274  0.0  0.0      0     0 ?        D    Oct21  13:00
[pdflush]

After several days/weeks, this is the second time this has happened:
while doing regular file I/O (decompressing a file), everything on
the device went into D-state.
At a guess (I haven't looked closely) I'd say it is the bug that was
meant to be fixed by

commit 4ae3f847e49e3787eca91bced31f8fd328d50496

except that the patch applied badly and needed to be fixed with
the following patch (not in git yet).
These have been sent to stable@ and should be in the queue for 2.6.23.2.
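
A quick way to confirm that the fix is actually in a given tree (a generic sketch, assuming a git clone of mainline and the usual kernel source layout; a plain stable tarball has no git history):

git log --oneline -1 4ae3f847e49e3787eca91bced31f8fd328d50496
grep -n "clean-up completed biofill" drivers/md/raid5.c

The git command only checks for the original commit (the follow-up hunk was not in git at the time); the grep works on the 2.6.23 tarball as well and shows whether, and where, the clean-up block landed in the file actually being built.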
    My linux-2.6.23/drivers/md/raid5.c has contained your patch for a long
time:

...
        spin_lock(&sh->lock);
        clear_bit(STRIPE_HANDLE, &sh->state);
        clear_bit(STRIPE_DELAYED, &sh->state);

        s.syncing = test_bit(STRIPE_SYNCING, &sh->state);
        s.expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
        s.expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
        /* Now to look around and see what can be done */

        /* clean-up completed biofill operations */
        if (test_bit(STRIPE_OP_BIOFILL, &sh->ops.complete)) {
                clear_bit(STRIPE_OP_BIOFILL, &sh->ops.pending);
                clear_bit(STRIPE_OP_BIOFILL, &sh->ops.ack);
                clear_bit(STRIPE_OP_BIOFILL, &sh->ops.complete);
        }

        rcu_read_lock();
        for (i=disks; i--; ) {
                mdk_rdev_t *rdev;
                struct r5dev *dev = &sh->dev[i];
...

but it doesn't fix this bug.


Did that chunk starting with "clean-up completed biofill operations" end
up where it belongs? The patch with the big context moves it to a different
place from where the original one puts it when applied to 2.6.23...

Lately I've seen several problems where the context isn't enough to make
a patch apply properly when some offsets have changed. In some cases a
patch won't apply at all because two nearly-identical areas are being
changed and the first chunk gets applied where the second one should,
leaving nowhere for the second chunk to apply.
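
One way to help catch a silently misplaced hunk, assuming GNU patch and an illustrative patch file name, is to dry-run with fuzz disabled and then look at where the chunk actually sits:

patch -p1 --dry-run --fuzz=0 < raid5-biofill-fix.patch
grep -n -A6 "clean-up completed biofill" drivers/md/raid5.c

--fuzz=0 makes patch reject a hunk whose context does not match exactly instead of sliding it onto the nearest look-alike, and the grep (with line numbers and a few lines of trailing context) shows whether the clean-up block ended up where the patch author intended.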

I always apply this kind of patch by hand, not with the patch command. The last patch sent here seems to fix this bug:

gershwin:[/usr/scripts] > cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]
md7 : active raid1 sdi1[2] md_d0p1[0]
      1464725632 blocks [2/1] [U_]
      [=====>...............] recovery = 27.1% (396992504/1464725632) finish=1040.3min speed=17104K/sec

	Resync done. The patch fixes this bug.

	Regards,

	JKB


Excellent!

I cannot easily reproduce the bug on my system, so I will wait for the next stable patch set to include it and will let everyone know if it happens again. Thanks.