Openmpi/xhpl kernel crash 2.6.17-rc3 with Pathscale htx

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hello,

Running hpl with openmpi over Infiniband gets me a crash.

Using hpl, openmpi 1.0.2, openib, and the 2.6.17-rc3 kernel.

I don't see the crash under ip over ib (ran for over an hour),
the crash occurs immediately upon attempting to start xhpl.

Here is the crash captured via the serial port:

[ 144.713555] ----------- [cut here ] --------- [please bite here ] ---------
[  144.720550] Kernel BUG at drivers/infiniband/hw/ipath/ipath_layer.c:757
[  144.727205] invalid opcode: 0000 [1] SMP
[  144.731334] CPU 0
[ 144.733419] Modules linked in: ipv6 autofs4 adm1026 hwmon_vid i2c_piix4 nfs lockd nfs_acl sunrpc dm_mirror dm_multipath dm_mod button battery ac ohci_hcd ehci_hcd i2c_nforce2 i2c_core shpchp snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc ib_ipoib ib_ipath ipath_core ib_uverbs ib_umad ib_ucm ib_sa ib_cm ib_mad ib_core tg3 floppy sata_svw ext3 jbd sata_nv libata sd_mod scsi_mod
[  144.774643] Pid: 4771, comm: xhpl Not tainted 2.6.17-rc3 #1
[ 144.780244] RIP: 0010:[<ffffffff880f6984>] <ffffffff880f6984>{:ipath_core:ipath_verbs_send+362}
[  144.788858] RSP: 0018:ffffffff8051be38  EFLAGS: 00010246
[ 144.794409] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8100df4a0150 [ 144.801574] RDX: ffffc200003b1078 RSI: 0000000000000000 RDI: ffff8100df4a0150 [ 144.808742] RBP: 0000000000000000 R08: ffff8100df4a0158 R09: 0000000000000018 [ 144.815910] R10: 0000000000000018 R11: 0000000000000246 R12: ffffc2000026f020 [ 144.823071] R13: 0000000000000000 R14: 0000000000000018 R15: 0000000000000000 [ 144.830230] FS: 00002b750d6fcca0(0000) GS:ffffffff805ad000(0000) knlGS:0000000000000000
[  144.838398] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 144.844190] CR2: 000000000047f050 CR3: 000000000cdb1000 CR4: 00000000000006e0 [ 144.851370] Process xhpl (pid: 4771, threadinfo ffff81000ceba000, task ffff81000cc9e880) [ 144.859504] Stack: ffffffff8059d900 ffff8100df4a0150 00000018dfef1000 ffff8100df4a0120 [ 144.867549] ffff8100df4a0000 ffffffff805f7d88 ffff8100df4a0098 0000000000000038
[  144.875829]        0000000000000400 ffffffff8811869e
[ 144.881079] Call Trace: <IRQ> <ffffffff8811869e>{:ib_ipath:ipath_do_rc_send+348} [ 144.888727] <ffffffff80232548>{do_timer+58} <ffffffff8020d0bb>{main_timer_handler+493} [ 144.897498] <ffffffff8022efc6>{tasklet_hi_action+105} <ffffffff8022ebc4>{__do_softirq+80} [ 144.906525] <ffffffff8020aa5a>{call_softirq+30} <ffffffff8020bc0a>{do_softirq+47} [ 144.914854] <ffffffff8020bbd1>{do_IRQ+62} <ffffffff80209b96>{ret_from_intr+0} <EOI> [ 144.923395] <ffffffff8026aafe>{kfree+417} <ffffffff880e14ff>{:ib_uverbs:ib_uverbs_poll_cq+409} [ 144.932867] <ffffffff880dfa27>{:ib_uverbs:ib_uverbs_write+196} <ffffffff8026f746>{vfs_write+212} [ 144.942509] <ffffffff8026f897>{sys_write+69} <ffffffff80209612>{system_call+126}
[  144.950997]
[ 144.950998] Code: 0f 0b 68 84 21 10 88 c2 f5 02 eb 07 44 39 f3 41 0f 47 de 48 [ 144.960709] RIP <ffffffff880f6984>{:ipath_core:ipath_verbs_send+362} RSP <ffffffff8051be38> [ 144.969212] <3>BUG: sleeping function called from invalid context at include/linux/rwsem.h:43
[  144.977952] in_atomic():1, irqs_disabled():0
[  144.982261]
[  144.982262] Call Trace: <IRQ> <ffffffff80221daa>{__might_sleep+190}
[ 144.990056] <ffffffff80216103>{flat_send_IPI_mask+0} <ffffffff80236073>{blocking_notifier_call_chain+31} [ 145.000411] <ffffffff8022c2ee>{do_exit+34} <ffffffff80423c6f>{_spin_unlock_irqrestore+11} [ 145.009454] <ffffffff8020b027>{do_divide_error+0} <ffffffff8020b22e>{do_invalid_op+145}
[  145.018334]        <ffffffff880f6984>{:ipath_core:ipath_verbs_send+362}
[ 145.025102] <ffffffff803f7d02>{tcp_v4_do_rcv+43} <ffffffff88092128>{:tg3:tg3_interrupt_tagged+51} [ 145.034840] <ffffffff8020a551>{error_exit+0} <ffffffff880f6984>{:ipath_core:ipath_verbs_send+362}
[  145.044606]        <ffffffff880f6b40>{:ipath_core:ipath_verbs_send+806}
[ 145.051390] <ffffffff8811869e>{:ib_ipath:ipath_do_rc_send+348} <ffffffff80232548>{do_timer+58} [ 145.060897] <ffffffff8020d0bb>{main_timer_handler+493} <ffffffff8022efc6>{tasklet_hi_action+105} [ 145.070569] <ffffffff8022ebc4>{__do_softirq+80} <ffffffff8020aa5a>{call_softirq+30} [ 145.079121] <ffffffff8020bc0a>{do_softirq+47} <ffffffff8020bbd1>{do_IRQ+62} [ 145.086947] <ffffffff80209b96>{ret_from_intr+0} <EOI> <ffffffff8026aafe>{kfree+417}
[  145.095523]        <ffffffff880e14ff>{:ib_uverbs:ib_uverbs_poll_cq+409}
[ 145.102291] <ffffffff880dfa27>{:ib_uverbs:ib_uverbs_write+196} <ffffffff8026f746>{vfs_write+212} [ 145.111972] <ffffffff8026f897>{sys_write+69} <ffffffff80209612>{system_call+126}
[  145.120482] Kernel panic - not syncing: Aiee, killing interrupt handler!
[  145.127265]

/proc/interrupts looks like this:
          CPU0       CPU1       CPU2       CPU3
  0:     107714     110040     109206     113504    IO-APIC-edge  timer
  1:        417       1287        405       1627    IO-APIC-edge  i8042
  8:          0          0          0          0    IO-APIC-edge  rtc
  9:          0          0          0          0   IO-APIC-level  acpi
 15:         50          0          0         23    IO-APIC-edge  ide1
50: 0 0 0 0 IO-APIC-level libata, ohci_hcd:usb2
 58:          0          0          0          0   IO-APIC-level  libata
 66:          0          0          0          0   IO-APIC-level  libata
 74:      15625          0          0         11   IO-APIC-level  eth0
90: 551 0 0 0 IO-APIC-level ipath_core 98: 0 0 0 0 IO-APIC-level NVidia CK804 233: 249 904 1161 4180 IO-APIC-level libata, ehci_hcd:usb1
NMI:        107        124        406        483
LOC:     440388     440365     440341     440317
ERR:          0
MIS:          0


Any ideas?

                                 Roger

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux