Re: PROBLEM: bug in e1000 module causes very high CPU load

<snip>
> Yes, let the server act as usual, it just starts happening out of the blue.
> No new hardware has been added or removed, no new programs has been
> installed.

"Me too"

[2.] Full description of the problem/report:
Last week I had the same issue twice. In short: the load goes through the
roof, from 2 to 950 within five minutes. I have never been able to use the
console (no response). These servers were part of an HA NFS cluster, so
we had to pull the plug (we don't have a mon plugin for extreme load
values yet). Once triggered, it seems, there is no way back. But once
fail-over has occurred the other server does not break, so my simplistic
conclusion is that this behaviour isn't caused by the usage patterns of
our NFS clients beating the server.

# lspci | grep -i giga
02:03.0 Ethernet controller: Intel Corporation 82546EB Gigabit
Ethernet Controller (Copper) (rev 01)
02:03.1 Ethernet controller: Intel Corporation 82546EB Gigabit
Ethernet Controller (Copper) (rev 01)

# more /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:   26571295   26559117   40034487   40909166    IO-APIC-edge  timer
  1:          9          0          0          0    IO-APIC-edge  i8042
  2:          0          0          0          0          XT-PIC  cascade
  8:          1          0          0          0    IO-APIC-edge  rtc
 12:         59          0          0          0    IO-APIC-edge  i8042
 15:         20          0          0          0    IO-APIC-edge  ide1
153:          0          0          0          0   IO-APIC-level  uhci_hcd
161:          0          0          0          0   IO-APIC-level  uhci_hcd
169:          0          0          0          0   IO-APIC-level  uhci_hcd
177:     176067     319337     329774     239612   IO-APIC-level  aic79xx
185:     534649    3958315    6395749    9602510   IO-APIC-level  aic79xx
193:   88511141          0          0          0   IO-APIC-level  eth0
201:         60          0   24342441          0   IO-APIC-level  eth1
NMI:          0          0          0          0
LOC:  134085019  134085069  134084971  134085090
ERR:          0
MIS:          0
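
To see which interrupt lines are actually growing, two snapshots of this
table can be compared. What follows is only a rough Python sketch of that
idea; the names parse_interrupts and irq_deltas are made up for
illustration, and the sample rows are copied from the output above.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq_label: [count per CPU]}."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        # The first ncpu fields are counters; rows such as ERR have fewer.
        counts = [int(f) for f in fields[:ncpu] if f.isdigit()]
        table[label.strip()] = counts
    return table


def irq_deltas(before, after):
    """Increase in the total count of each IRQ between two snapshots."""
    return {irq: sum(counts) - sum(before.get(irq, []))
            for irq, counts in after.items()}


# Rows taken from the report; eth0 (IRQ 193) is serviced by CPU0 only.
SAMPLE = """\
           CPU0       CPU1       CPU2       CPU3
193:   88511141          0          0          0   IO-APIC-level  eth0
201:         60          0   24342441          0   IO-APIC-level  eth1
ERR:          0
"""

if __name__ == "__main__":
    for irq, counts in parse_interrupts(SAMPLE).items():
        print(irq, counts)
```

Feeding it two reads of /proc/interrupts taken a few seconds apart would
show whether eth0 or eth1 is being flooded when the load starts to climb.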

[7.1.] Software (add the output of the ver_linux script here)
Linux somewhere 2.6.9-11.ELsmp #1 SMP Fri May 20 18:26:27 EDT 2005
i686 i686 i386 GNU/Linux

Gnu C                  3.4.3
Gnu make               3.80
binutils               2.15.92.0.2
util-linux             2.12a
mount                  2.12a
module-init-tools      3.1-pre5
e2fsprogs              1.35
reiserfsprogs          line
reiser4progs           line
pcmcia-cs              3.2.7
quota-tools            3.12.
PPP                    2.4.2
isdn4k-utils           3.3
nfs-utils              1.0.6
Linux C Library        2.3.4
Dynamic linker (ldd)   2.3.4
Procps                 3.2.3
Net-tools              1.60
Kbd                    1.12
Sh-utils               5.2.1
Modules Loaded         nfsd exportfs drbd nfs lockd sunrpc md5 ipv6
parport_pc lp parport autofs4 w83781d adm1021 i2c_sensor i2c_i801
i2c_core dm_mod uhci_hcd hw_random e1000 floppy sg ext3 jbd aic79xx
sd_mod scsi_mod

[7.2.] Processor information (from /proc/cpuinfo):
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 2.80GHz
stepping        : 5
cpu MHz         : 2799.959
cache size      : 512 KB
physical id     : 0
siblings        : 2
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 5521.40
and so on for the remaining logical CPUs (two Xeons in HT mode)

[7.5.] PCI information ('lspci -vvv' as root) ((Just the ethernet))
02:03.0 Ethernet controller: Intel Corporation 82546EB Gigabit
Ethernet Controller (Copper) (rev 01)
        Subsystem: Intel Corporation PRO/1000 MT Dual Port Server Adapter
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
        Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium
>TAbort- <TAbort- <MAbort- >SERR- <PERR-
        Latency: 64 (63750ns min), Cache Line Size 08
        Interrupt: pin A routed to IRQ 193
        Region 0: Memory at fc200000 (64-bit, non-prefetchable) [size=128K]
        Region 4: I/O ports at 3000 [size=64]
        Capabilities: [dc] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [e4] PCI-X non-bridge device.
                Command: DPERE- ERO+ RBC=0 OST=0
                Status: Bus=2 Dev=3 Func=0 64bit+ 133MHz+ SCD- USC-,
DC=simple, DMMRBC=2, DMOST=0, DMCRS=1, RSCEM-
        Capabilities: [f0] Message Signalled Interrupts: 64bit+
Queue=0/0 Enable-
                Address: 0000000000000000  Data: 0000

Is there a method that can give hints about what was going on while
the load was rising so sharply? My guess is that even monitoring/sampling
isn't doable anymore once the bad situation occurs. Tips on obtaining
information about subjects like:
- who was using which tcp/udp connection with what bandwidth
- who was generating how many reads/writes on which filesystem, incl. location
- etc.
are more than welcome too. Does booting with profile=2 and storing
readprofile output to files every 5 seconds give enough information to
tackle the source of this problem?
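
For the periodic sampling idea, something along these lines might work. It
is a minimal Python sketch under my own assumptions, not a tested tool: the
function names, the log layout and the choice of /proc/loadavg and
/proc/interrupts as sources are all made up for illustration. The one
deliberate choice is the fsync after every sample, so that the last
snapshots are on disk before the box stops responding. readprofile output
could be captured on the same schedule, but that is not shown here.

```python
import os
import time

# Assumed sources; any readable text file under /proc could be listed.
DEFAULT_SOURCES = ("/proc/loadavg", "/proc/interrupts")


def read_source(path):
    """Return the file contents, or a marker line if it cannot be read."""
    try:
        with open(path) as f:
            return f.read()
    except OSError as exc:
        return "<unreadable: %s>\n" % exc


def append_snapshot(log, sources=DEFAULT_SOURCES, now=None):
    """Write one timestamped snapshot and force it out to disk."""
    stamp = time.time() if now is None else now
    log.write("=== %.3f ===\n" % stamp)
    for path in sources:
        log.write("--- %s ---\n" % path)
        log.write(read_source(path))
    log.flush()
    os.fsync(log.fileno())  # do not leave the sample in the page cache


def run(log_path, interval=5.0, count=None, sources=DEFAULT_SOURCES):
    """Take `count` snapshots (forever if None), `interval` seconds apart."""
    taken = 0
    with open(log_path, "a") as log:
        while count is None or taken < count:
            append_snapshot(log, sources)
            taken += 1
            if count is None or taken < count:
                time.sleep(interval)
    return taken
```

Calling run("/var/tmp/loadwatch.log") (a hypothetical path) from a root
shell would keep sampling every 5 seconds. Whether the sampler itself still
gets scheduled at a load of 950 is exactly the open question.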

Please CC, thanks.

Best regards,
Leroy
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
