Re: 3ware glitches cause softraid rebuilds

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Jim,

The 3ware driver reset code is doing msleep() polling with local interrupts
enabled... You shouldn't be getting soft lockups.  I'll do some investigation
into this.

BTW, this is a linux-scsi issue, not linux-kernel.

-Adam

On 8/28/06, Jim Klimov <[email protected]> wrote:
Hello linux-kernel,

  In July I wrote about problems with our Xeon/3ware fileservers which
  occasionally dump a stacktrace, usually of smbd somewhere in iptables
  routines, and reboot. Possibly this can be because one of PCI-X buses
  is shared by the 3Ware controller and onboard Intel Gigabit chips,
  and some interrupt events have a conflict? Regardless, it's not good
  anyway :)

  Then I went on vacation and didn't reply in a timely fashion to a
  suggestion to build a kernel with more debugging info. Now I did it
  with a newer kernel and wait for those bugs to repeat.

  Meanwhile I found a new strange behavior, not seen before. Sometimes
  a 3Ware card times out, and usually begins to verify itself. That's
  "normal", or rather usual and not considered critical. But now in
  exactly 10 seconds a soft lockup bug happens and the raid1 arrays
  (made by the linux md driver from 2 system ide drives) are rebuilt.

  So far this has hit us twice:

Kernel: 2.6.17.11 vanilla

[  423.569222] 3w-9xxx: scsi0: AEN: INFO (0x04:0x0029): Verify started:unit=1.
[  429.688317] 3w-9xxx: scsi1: AEN: INFO (0x04:0x0029): Verify started:unit=0.
[ 7042.478001] 3w-9xxx: scsi0: AEN: INFO (0x04:0x002B): Verify completed:unit=1.
[ 8387.242115] 3w-9xxx: scsi1: AEN: INFO (0x04:0x003D): Verify paused:unit=0.
[ 8402.400749] 3w-9xxx: scsi1: AEN: INFO (0x04:0x0029): Verify started:unit=0.
[17184.832241] sd 1:0:0:0: WARNING: (0x06:0x002C): Command (0x88) timed out,
resetting card.
[17194.818537] BUG: soft lockup detected on CPU#0!
[17194.823249]  <c010396f> show_trace+0xd/0xf  <c0103a5a>
dump_stack+0x17/0x1b
[17194.825809]  <c0136ec8> softlockup_tick+0xa9/0xc2  <c0124be3>
run_local_timers+0x12/0x14
[17194.825834]  <c01249e6> update_process_times+0x34/0x78  <c010f5e7>
smp_apic_timer_interrupt+0x58/0x60
[17194.825858]  <c010359c> apic_timer_interrupt+0x1c/0x24  <c02c8d51>
twa_reset_sequence+0x37/0x1e1
[17194.825884]  <c02c8c90> twa_reset_device_extension+0xe9/0x173  <c02c9003>
twa_scsi_eh_reset+0x6e/0xe4
[17194.825908]  <c02bcd91> scsi_try_host_reset+0x43/0xb0  <c02bcfc5>
scsi_eh_host_reset+0x2f/0xf4
[17194.825932]  <c02bd40d> scsi_eh_ready_devs+0x56/0x65  <c02bd5ba>
scsi_unjam_host+0xa3/0x1d0
[17194.825954]  <c02bd76a> scsi_error_handler+0x83/0x112  <c012dd24>
kthread+0x9f/0xa4
[17194.825977]  <c0101009> kernel_thread_helper+0x5/0xb
[17240.675713] 3w-9xxx: scsi1: AEN: INFO (0x04:0x0029): Verify
started:unit=0.
[19019.998212] md: syncing RAID array md0
[19019.998231] md: minimum _guaranteed_ reconstruction speed: 1000
KB/sec/disc.
[19019.998243] md: using maximum available idle IO bandwidth (but not more
than 200000 KB/sec) for reconstruction.
[19019.998268] md: using 128k window, over a total of 500352 blocks.
[19020.022029] md: delaying resync of md1 until md0 has finished resync
(they share one or more physical units)
[19020.079784] md: delaying resync of md10 until md1 has finished resync
(they share one or more physical units)
[19020.079813] md: delaying resync of md1 until md0 has finished resync
(they share one or more physical units)
[19020.265076] md: delaying resync of md11 until md10 has finished resync
(they share one or more physical units)
[19020.265106] md: delaying resync of md1 until md0 has finished resync
(they share one or more physical units)
[19020.265130] md: delaying resync of md10 until md1 has finished resync
(they share one or more physical units)
[19020.392908] md: delaying resync of md12 until md1 has finished resync
(they share one or more physical units)
[19020.392930] md: delaying resync of md10 until md12 has finished resync
(they share one or more physical units)
[19020.392953] md: delaying resync of md1 until md0 has finished resync
(they share one or more physical units)
[19020.392972] md: delaying resync of md11 until md12 has finished resync
(they share one or more physical units)
[19020.460745] md: delaying resync of md3 until md0 has finished resync
(they share one or more physical units)
[19020.468102] md: delaying resync of md5 until md3 has finished resync
(they share one or more physical units)
[19020.468122] md: delaying resync of md11 until md12 has finished resync
(they share one or more physical units)
[19020.468145] md: delaying resync of md1 until md5 has finished resync
(they share one or more physical units)
[19020.468162] md: delaying resync of md10 until md12 has finished resync
(they share one or more physical units)
[19020.468190] md: delaying resync of md3 until md0 has finished resync
(they share one or more physical units)
[19020.468210] md: delaying resync of md12 until md5 has finished resync
(they share one or more physical units)
[19020.476311] md: delaying resync of md6 until md12 has finished resync
(they share one or more physical units)
[19020.476334] md: delaying resync of md10 until md12 has finished resync
(they share one or more physical units)
[19020.476362] md: delaying resync of md1 until md5 has finished resync
(they share one or more physical units)
[19020.476382] md: delaying resync of md11 until md12 has finished resync
(they share one or more physical units)
[19020.476404] md: delaying resync of md12 until md5 has finished resync
(they share one or more physical units)
[19020.476434] md: delaying resync of md3 until md0 has finished resync
(they share one or more physical units)
[19020.476455] md: delaying resync of md5 until md3 has finished resync
(they share one or more physical units)
[19020.486455] md: delaying resync of md7 until md12 has finished resync
(they share one or more physical units)
[19020.486478] md: delaying resync of md11 until md12 has finished resync
(they share one or more physical units)
[19020.486503] md: delaying resync of md1 until md5 has finished resync
(they share one or more physical units)
[19020.486519] md: delaying resync of md10 until md12 has finished resync
(they share one or more physical units)
[19020.486542] md: delaying resync of md5 until md3 has finished resync
(they share one or more physical units)
[19020.486571] md: delaying resync of md3 until md0 has finished resync
(they share one or more physical units)
[19020.486593] md: delaying resync of md12 until md5 has finished resync
(they share one or more physical units)
[19020.486612] md: delaying resync of md6 until md12 has finished resync
(they share one or more physical units)
[19020.498362] md: delaying resync of md8 until md12 has finished resync
(they share one or more physical units)
[19020.498394] md: delaying resync of md12 until md5 has finished resync
(they share one or more physical units)
[19020.498414] md: delaying resync of md10 until md12 has finished resync
(they share one or more physical units)
[19020.498438] md: delaying resync of md1 until md5 has finished resync
(they share one or more physical units)
[19020.498457] md: delaying resync of md11 until md12 has finished resync
(they share one or more physical units)
[19020.498475] md: delaying resync of md6 until md12 has finished resync
(they share one or more physical units)
[19020.498509] md: delaying resync of md3 until md0 has finished resync
(they share one or more physical units)
[19020.498534] md: delaying resync of md5 until md3 has finished resync
(they share one or more physical units)
[19020.498553] md: delaying resync of md7 until md12 has finished resync
(they share one or more physical units)
[19020.612240] md: delaying resync of md9 until md12 has finished resync
(they share one or more physical units)
[19020.612309] md: delaying resync of md7 until md12 has finished resync
(they share one or more physical units)
[19020.612338] md: delaying resync of md5 until md3 has finished resync
(they share one or more physical units)
[19020.612371] md: delaying resync of md3 until md0 has finished resync
(they share one or more physical units)
[19020.612392] md: delaying resync of md6 until md12 has finished resync
(they share one or more physical units)
[19020.612412] md: delaying resync of md11 until md12 has finished resync
(they share one or more physical units)
[19020.612435] md: delaying resync of md1 until md5 has finished resync
(they share one or more physical units)
[19020.612456] md: delaying resync of md10 until md12 has finished resync
(they share one or more physical units)
[19020.612481] md: delaying resync of md12 until md5 has finished resync
(they share one or more physical units)
[19020.612501] md: delaying resync of md8 until md12 has finished resync
(they share one or more physical units)
[19047.955571] md: md0: sync done.
[19047.996082] md: delaying resync of md8 until md12 has finished resync
(they share one or more physical units)
[19047.996105] md: delaying resync of md11 until md12 has finished resync
(they share one or more physical units)
[19047.996129] md: delaying resync of md6 until md12 has finished resync
(they share one or more physical units)
[19047.996143] md: delaying resync of md12 until md5 has finished resync
(they share one or more physical units)
[19047.996154] RAID1 conf printout:
[19047.996171] md: syncing RAID array md3
[19047.996179]  --- wd:2 rd:2
[19047.996191]  disk 0, wo:0, o:1, dev:hdc1
[19047.996199] md: minimum _guaranteed_ reconstruction speed: 1000
KB/sec/disc.
[19047.996211]  disk 1, wo:0, o:1, dev:hda1
[19047.996220] md: using maximum available idle IO bandwidth (but not more
than 200000 KB/sec) for reconstruction.
[19047.996233] md: using 128k window, over a total of 5859904 blocks.
[19047.998803] md: delaying resync of md5 until md3 has finished resync
(they share one or more physical units)
[19047.998820] md: delaying resync of md7 until md12 has finished resync
(they share one or more physical units)
[19047.998833] md: delaying resync of md9 until md12 has finished resync
(they share one or more physical units)
[19048.005192] md: delaying resync of md1 until md5 has finished resync
(they share one or more physical units)
[19048.005209] md: delaying resync of md10 until md12 has finished resync
(they share one or more physical units)
[19363.392599] md: md3: sync done.
[19363.453882] md: delaying resync of md10 until md12 has finished resync
(they share one or more physical units)
[19363.453894] md: delaying resync of md9 until md12 has finished resync
(they share one or more physical units)
[19363.453926] md: delaying resync of md1 until md5 has finished resync
(they share one or more physical units)
[19363.453934] md: syncing RAID array md5
[19363.453942] md: minimum _guaranteed_ reconstruction speed: 1000
KB/sec/disc.
[19363.453950] md: using maximum available idle IO bandwidth (but not more
than 200000 KB/sec) for reconstruction.
[19363.453963] md: using 128k window, over a total of 3906368 blocks.
[19363.453970] md: delaying resync of md7 until md12 has finished resync
(they share one or more physical units)
[19363.453985] md: delaying resync of md6 until md12 has finished resync
(they share one or more physical units)
[19363.454003] md: delaying resync of md11 until md12 has finished resync
(they share one or more physical units)
[19363.454122] RAID1 conf printout:
[19363.455970] md: delaying resync of md12 until md5 has finished resync
(they share one or more physical units)
[19363.455985] md: delaying resync of md8 until md12 has finished resync
(they share one or more physical units)
[19363.457941]  --- wd:2 rd:2
[19363.461140]  disk 0, wo:0, o:1, dev:hdc3
[19363.465654]  disk 1, wo:0, o:1, dev:hda3
[19514.855587] md: md5: sync done.
[19514.894392] md: delaying resync of md8 until md12 has finished resync
(they share one or more physical units)
[19514.894431] md: delaying resync of md12 until md1 has finished resync
(they share one or more physical units)
[19514.894467] md: syncing RAID array md1
[19514.894476] md: delaying resync of md11 until md12 has finished resync
(they share one or more physical units)
[19514.894489] md: minimum _guaranteed_ reconstruction speed: 1000
KB/sec/disc.
[19514.894497] md: delaying resync of md6 until md12 has finished resync
(they share one or more physical units)
[19514.894508] md: using maximum available idle IO bandwidth (but not more
than 200000 KB/sec) for reconstruction.
[19514.894518] md: delaying resync of md7 until md12 has finished resync
(they share one or more physical units)
[19514.894531] md: delaying resync of md10 until md12 has finished resync
(they share one or more physical units)
[19514.894540] md: using 128k window, over a total of 2000256 blocks.
[19514.894628] RAID1 conf printout:
[19514.896747] md: delaying resync of md9 until md12 has finished resync
(they share one or more physical units)
[19514.898392]  --- wd:2 rd:2
[19514.901370]  disk 0, wo:0, o:1, dev:hdc5
[19514.905901]  disk 1, wo:0, o:1, dev:hda5
[19590.739467] md: md1: sync done.
[19590.797061] md: delaying resync of md10 until md12 has finished resync
(they share one or more physical units)
[19590.797080] md: delaying resync of md9 until md12 has finished resync
(they share one or more physical units)
[19590.797120] md: delaying resync of md7 until md12 has finished resync
(they share one or more physical units)
[19590.797131] md: syncing RAID array md12
[19590.797142] md: delaying resync of md6 until md12 has finished resync
(they share one or more physical units)
[19590.797162] md: minimum _guaranteed_ reconstruction speed: 1000
KB/sec/disc.
[19590.797174] md: delaying resync of md11 until md12 has finished resync
(they share one or more physical units)
[19590.797192] md: using maximum available idle IO bandwidth (but not more
than 200000 KB/sec) for reconstruction.
[19590.797209] md: using 128k window, over a total of 5244992 blocks.
[19590.797232] RAID1 conf printout:
[19590.799289] md: delaying resync of md8 until md12 has finished resync
(they share one or more physical units)
[19590.800846]  --- wd:2 rd:2
[19590.803920]  disk 0, wo:0, o:1, dev:hdc2
[19590.808568]  disk 1, wo:0, o:1, dev:hda2
[19942.989053] md: md12: sync done.
[19943.085278] md: delaying resync of md8 until md7 has finished resync
(they share one or more physical units)
[19943.085291] md: delaying resync of md6 until md11 has finished resync
(they share one or more physical units)
[19943.085315] md: delaying resync of md9 until md8 has finished resync
(they share one or more physical units)
[19943.085340] md: delaying resync of md11 until md10 has finished resync
(they share one or more physical units)
[19943.085377] RAID1 conf printout:
[19943.085386] md: syncing RAID array md7
[19943.085396] md: minimum _guaranteed_ reconstruction speed: 1000
KB/sec/disc.
[19943.085404] md: using maximum available idle IO bandwidth (but not more
than 200000 KB/sec) for reconstruction.
[19943.085417] md: using 128k window, over a total of 1953408 blocks.
[19943.086535] md: delaying resync of md10 until md9 has finished resync
(they share one or more physical units)
[19943.088928]  --- wd:2 rd:2
[19943.091916]  disk 0, wo:0, o:1, dev:hdc12
[19943.096249]  disk 1, wo:0, o:1, dev:hda12
[20033.019114] md: md7: sync done.
[20033.090440] md: delaying resync of md6 until md11 has finished resync
(they share one or more physical units)
[20033.090460] md: syncing RAID array md8
[20033.090470] md: minimum _guaranteed_ reconstruction speed: 1000
KB/sec/disc.
[20033.090477] md: using maximum available idle IO bandwidth (but not more
than 200000 KB/sec) for reconstruction.
[20033.090489] md: using 128k window, over a total of 5859904 blocks.
[20033.090571] md: delaying resync of md10 until md9 has finished resync
(they share one or more physical units)
[20033.090589] md: delaying resync of md9 until md8 has finished resync
(they share one or more physical units)
[20033.090602] md: delaying resync of md11 until md10 has finished resync
(they share one or more physical units)
[20033.090646] RAID1 conf printout:
[20033.094083]  --- wd:2 rd:2
[20033.097113]  disk 0, wo:0, o:1, dev:hdc7
[20033.101318]  disk 1, wo:0, o:1, dev:hda7
[20283.332772] md: md8: sync done.
[20283.392384] md: delaying resync of md11 until md10 has finished resync
(they share one or more physical units)
[20283.392411] md: delaying resync of md6 until md11 has finished resync
(they share one or more physical units)
[20283.392432] RAID1 conf printout:
[20283.392479] md: syncing RAID array md9
[20283.392495] md: minimum _guaranteed_ reconstruction speed: 1000
KB/sec/disc.
[20283.392521] md: using maximum available idle IO bandwidth (but not more
than 200000 KB/sec) for reconstruction.
[20283.392557] md: using 128k window, over a total of 3906368 blocks.
[20283.395898] md: delaying resync of md10 until md9 has finished resync
(they share one or more physical units)
[20283.396045]  --- wd:2 rd:2
[20283.399015]  disk 0, wo:0, o:1, dev:hdc8
[20283.403257]  disk 1, wo:0, o:1, dev:hda8
[20452.540323] md: md9: sync done.
[20452.610616] md: delaying resync of md6 until md11 has finished resync
(they share one or more physical units)
[20452.610633] md: syncing RAID array md10
[20452.610661] md: minimum _guaranteed_ reconstruction speed: 1000
KB/sec/disc.
[20452.610669] RAID1 conf printout:
[20452.610679]  --- wd:2 rd:2
[20452.610692]  disk 0, wo:0, o:1, dev:hdc9
[20452.610706]  disk 1, wo:0, o:1, dev:hda9
[20452.611783] md: delaying resync of md11 until md10 has finished resync
(they share one or more physical units)
[20452.625535] md: using maximum available idle IO bandwidth (but not more
than 200000 KB/sec) for reconstruction.
[20452.625576] md: using 128k window, over a total of 3906368 blocks.
[20608.982515] md: md10: sync done.
[20609.027444] md: delaying resync of md6 until md11 has finished resync
(they share one or more physical units)
[20609.027460] md: syncing RAID array md11
[20609.027473] md: minimum _guaranteed_ reconstruction speed: 1000
KB/sec/disc.
[20609.027485] md: using maximum available idle IO bandwidth (but not more
than 200000 KB/sec) for reconstruction.
[20609.027503] md: using 128k window, over a total of 65430144 blocks.
[20609.027558] RAID1 conf printout:
[20609.031138]  --- wd:2 rd:2
[20609.034093]  disk 0, wo:0, o:1, dev:hdc10
[20609.038435]  disk 1, wo:0, o:1, dev:hda10
[22231.702114] sd 1:0:0:0: WARNING: (0x06:0x002C): Command (0x88) timed out,
resetting card.
[22241.688436] BUG: soft lockup detected on CPU#1!
[22241.693331]  <c010396f> show_trace+0xd/0xf  <c0103a5a>
dump_stack+0x17/0x1b
[22241.695899]  <c0136ec8> softlockup_tick+0xa9/0xc2  <c0124be3>
run_local_timers+0x12/0x14
[22241.695929]  <c01249e6> update_process_times+0x34/0x78  <c010f5e7>
smp_apic_timer_interrupt+0x58/0x60
[22241.695961]  <c010359c> apic_timer_interrupt+0x1c/0x24  <c02c8d51>
twa_reset_sequence+0x37/0x1e1
[22241.695994]  <c02c8c90> twa_reset_device_extension+0xe9/0x173  <c02c9003>
twa_scsi_eh_reset+0x6e/0xe4
[22241.696027]  <c02bcd91> scsi_try_host_reset+0x43/0xb0  <c02bcfc5>
scsi_eh_host_reset+0x2f/0xf4
[22241.696059]  <c02bd40d> scsi_eh_ready_devs+0x56/0x65  <c02bd5ba>
scsi_unjam_host+0xa3/0x1d0
[22241.696086]  <c02bd76a> scsi_error_handler+0x83/0x112  <c012dd24>
kthread+0x9f/0xa4
[22241.696119]  <c0101009> kernel_thread_helper+0x5/0xb
[22335.486623] 3w-9xxx: scsi1: AEN: INFO (0x04:0x0029): Verify
started:unit=0.
[22887.014006] 3w-9xxx: scsi1: AEN: INFO (0x04:0x002B): Verify
completed:unit=0.
[23994.650143] md: md11: sync done.
[23994.804031] md: syncing RAID array md6
[23994.804051] RAID1 conf printout:
[23994.804065] md: minimum _guaranteed_ reconstruction speed: 1000
KB/sec/disc.
[23994.804075] md: using maximum available idle IO bandwidth (but not more
than 200000 KB/sec) for reconstruction.
[23994.804094] md: using 128k window, over a total of 1953408 blocks.
[23994.808154]  --- wd:2 rd:2
[23994.811253]  disk 0, wo:0, o:1, dev:hdc11
[23994.815967]  disk 1, wo:0, o:1, dev:hda11
[24225.817451] md: md6: sync done.
[24225.939633] RAID1 conf printout:
[24225.943358]  --- wd:2 rd:2
[24225.946614]  disk 0, wo:0, o:1, dev:hdc6
[24225.950917]  disk 1, wo:0, o:1, dev:hda6


--
Best regards,
 Jim Klimov                          mailto:[email protected]

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux