Re: [newbie:] Bonnie++2 hangs recent 2.6 kernels? Bash keeps looping in waitpid(), eating 100% CPU

On Friday 14 September 2007 00:46, Frantisek Rysanek wrote:
> Dear everyone,
>
> apologies in advance for a silly question...
>
> I'm using a homebrew stripped-down mini-distro based on Fedora 5,
> with various newer kernels, on a live CD, to test hardware with.
> The live CD is composed by means of scripted binary copy of the key
> necessary components (libc, init, bash, /dev/, /etc/, you know the
> rest...) - it's almost like rolling your own MS-DOS boot floppy.
> A minimum system is about 4-10 MB, a neat firewall takes up
> about 22 MB.
>
> Recently I've stumbled over what seems like a lasting bug
> in the Linux kernel. Excuse me for that accusation, which is
> admittedly based on rather vague data, dated versions
> of the user-space software (libc, bash...), and a homebrew
> hacky distro.
>
> First impression:
> looped execution of Bonnie++2 makes bash go berserk.
> There are two possible flawed behaviors:
>
> 1) the bash process that's waiting for Bonnie++2 to return,
>    starts looping inside the last waitpid() call I believe,
>    eating 100% CPU.
>    At least that's what 'top' + 'strace -p <bash PID>' would
>    suggest. The top and strace have to be running beforehand,
>    as the same happens to the bash process on any other virtual
>    console, if you try to run any further command. (The further
>    command doesn't seem to get executed anymore.)
>
> 2) the bash processes don't start eating 100% CPU, but any further
>    command that you try to execute returns immediately with
>    a segfault.
>
> I boot the CD with just bare bash on all 6 virtual consoles.
> I mount a previously created EXT3 FS (several hundred GB
> to over 1 TB) on a mountpoint, `cd` into the mountpoint
> on one or two consoles, and run
>
>  while true; do bonnie++2 -u root -s 4096; done
>
> Then I run 'iostat 2', 'top' and 'strace -p <bash PID>' on the
> remaining consoles. I try running some other command now and then, to
> make the paging and block IO subsystems load some more blocks from
> the CD.
>
> I believe the `top` output suggests that the Bonnie processes don't
> eat all that much RAM, but the kernel-space buffering eats almost all
> of it. Only about 50 Megs remain truly "free", most of the RAM gets
> "cached". The system stabilizes at this balance, and a few minutes
> later it hangs in the aforementioned way.
>
> This happens without swap. If I mkswap+swapon some free hard drive,
> the symptoms seem harder to reproduce, but they still occur
> after a longer period of time.
>
> The symptoms are fairly easily reproduced on 2.6.16.18 through
> 2.6.16.48, as well as 2.6.18.8. On 2.6.22.6 it seems to take a bit
> more time to reproduce the problem.
>
> I've reproduced the problem on three different dual Xeon
> boxes, all of them SuperMicro of different sizes/generations,
> all of them upgraded to the latest BIOS (now showing no more
> IRQ routing mischiefs).
> The hardware setups are along the lines of
> - Intel 7501 chipset, dual Xeon Northwood, 1 GB RAM,
>    Adaptec 79xx HBA, external RAID (~80 MBps),
>    internal Adaptec 2120 RAID (~50 MBps)
> - Intel 7520 chipset, dual Xeon Irwindale, 2 GB RAM,
>    several internal U320 SCSI drives via Adaptec 79xx HBA,
>    an external RAID (~80 MBps) via LSI 20320 HBA (Fusion MPT)
> - Intel 7520 chipset, dual Xeon Nocona, 1 GB RAM,
>    internal LSI MegaRAID SATA150-6 with 6 disk drives.
>
> I've never seen this before I started using bonnie++2 as a load
> generator :-) Both my hardware systems and my Linux CD are otherwise
> perfectly stable, under sequential IO, cpuburn, older versions of
> Bonnie on Linux 2.4 / FreeBSD etc.
> I know what it looks like when there's a hardware problem and I know
> how to prove/deny a hardware problem by selective A/B-style hardware
> replacements; I'm fairly good at shielding away hardware instability.
>
> Should I start from compiling a fresh libc + bash + whatever else?
> Any ideas are welcome :-)

Can you see if it is looping in userspace or kernel? Can you kill -9
the process?
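
One rough way to tell the two apart, as a sketch rather than a definitive
procedure: sample the utime and stime counters (fields 14 and 15 of
/proc/<pid>/stat, per proc(5)) twice and compare the deltas. The function
names cpu_ticks and sample_split are made up for this illustration, and
the 2-second default interval is an arbitrary assumption.

```shell
# Sketch only: cpu_ticks and sample_split are hypothetical helper names.

# Extract "utime stime" (clock ticks) from one line in /proc/<pid>/stat format.
cpu_ticks() {
    # Drop everything up to the last ") " so that a command name
    # containing spaces or parentheses cannot shift the field positions.
    local rest=${1##*) }
    # Re-split the remainder: the state field becomes $1,
    # so utime (field 14) is ${12} and stime (field 15) is ${13}.
    set -- $rest
    echo "${12} ${13}"
}

# Sample a PID twice and print how many ticks were spent in each mode.
sample_split() {
    local pid=$1 interval=${2:-2}   # interval in seconds (assumed default)
    local u1 s1 u2 s2
    read -r u1 s1 <<< "$(cpu_ticks "$(cat "/proc/$pid/stat")")"
    sleep "$interval"
    read -r u2 s2 <<< "$(cpu_ticks "$(cat "/proc/$pid/stat")")"
    echo "user ticks: $((u2 - u1))  kernel ticks: $((s2 - s1))"
}
```

Run it as `sample_split <bash PID>` in a loop. A delta that is almost all
kernel ticks points at spinning inside the kernel; mostly user ticks points
at userspace. Since new commands reportedly stop executing once the hang
sets in, it would have to be started beforehand, like top and strace.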

Are you able to test with the latest 2.6.23-rc kernel? If not (or if it
still has the same problem), then can you get the output of sysrq+T
and three sysrq+P calls, please? (This might help work out where in
the kernel it is spinning.)