Re: serial port issues on IBM xseries with FC4 and High Availability heartbeat

On Fri, 2006-05-26 at 13:56 -0400, Bob Chiodini wrote:
> On Fri, 2006-05-26 at 10:22 -0700, Rick Stevens wrote:
> > On Fri, 2006-05-26 at 09:27 -0400, Randy Grimshaw wrote:
> > > 
> > > I am trying to run a Linux high-availability cluster (failover pair)
> > > using serial as one of the heartbeats.
> > > 
> > > Due to numerous serial over-runs the systems are actually crashing
> > > periodically.
> > > 
> > > This is a very frustrating development for a system intended to provide
> > > HA. (certainly not ha ha ha).
> > > 
> > > I have updated to the latest BIOS.
> > > I have checked RTS, DTS, XON, XOFF, etc.
> > > This is happening with the stock and custom kernels.
> > > This is happening on three pairs of servers.
> > > The serial ports are detected as:
> > >        Serial: 8250/16550 driver $Revision: 1.90 $ 32 ports, IRQ sharing enabled
> > >        serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
> > > 
> > > 
> > > Any advice would be greatly appreciated.
> > 
> > The most common problem with overruns is running too high a baud rate.
> > Remember, 16550s only have a 16-byte buffer in them.  At 38,400 baud,
> > you'll fill that buffer in about 260 microseconds.  9600 baud will fill
> > the buffer in a tiny bit over 1 millisecond.  Flow control tries to
> > prevent overflows.
> > 
> > Without flow control and if the machine is busy, the interrupt from the
> > chip may not be serviced in time and you'll miss data because you've
> > filled the buffer.  Dropping the baud rate down should help, and make
> > sure you use hardware (RTS/CTS) flow control.  Remember that software
> > (XON/XOFF) flow control requires the CPU to watch the buffer and send an
> > XOFF when it gets full.  You're already overrunning the buffer...
> > software flow control won't help.
> > 
> > Heartbeat stuff between nodes in a cluster is NOT a place to try to
> > scrimp and save money!  NICs are relatively cheap, after all.  They have
> > much bigger buffers in them and they use DMA to transfer data to the
> > processor instead of one-byte-at-a-time over the I/O ports.  Frankly,
> > NICs are far more reliable--especially for something this critical.
> > 
> 
> At 8N1 and 38400 bits/second, that would be 3840 bytes/second or 240
> "FIFO fills" per second or 4.17 ms to fill the entire FIFO.

Oops!  Yup, 260 microseconds is the time per character.  I forgot to
multiply by 16.  Doh! :-(  Hey!  It's Friday and it's been a l-o-n-g week.
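
For anyone who wants to check the arithmetic, here is a minimal sketch
in Python.  It assumes 8N1 framing (10 bits on the wire per character)
and a 16-byte FIFO, as discussed above; the function names are just for
illustration.  It also shows that the "1 millisecond" figure quoted for
9600 baud is a per-character time as well.

```python
# Time to fill a 16550A receive FIFO at a given line rate.
# Assumes 8N1 framing: 1 start + 8 data + 1 stop = 10 bits per character.

FIFO_DEPTH = 16        # bytes in the 16550A receive FIFO
BITS_PER_CHAR = 10     # 8N1 framing (assumption, as in the thread)


def char_time_us(baud):
    """Microseconds needed to receive one character."""
    return BITS_PER_CHAR / baud * 1_000_000


def fifo_fill_ms(baud, depth=FIFO_DEPTH):
    """Milliseconds of back-to-back data needed to fill the FIFO."""
    return depth * char_time_us(baud) / 1000


for baud in (9600, 19200, 38400, 115200):
    print(f"{baud:>6} baud: {char_time_us(baud):7.1f} us/char, "
          f"{fifo_fill_ms(baud):6.2f} ms to fill {FIFO_DEPTH} bytes")
```

At 38400 baud that works out to roughly 260 us per character and
4.17 ms per FIFO; at 9600 baud, roughly 1.04 ms per character and
16.7 ms per FIFO.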

> It sounds like something more is broken here.  My old 486 running Linux
> seemed to do better than that.

I don't think so.  Think how choppy things can get on a terminal
emulation when the machine gets busy.  Besides, the OP mentioned
overruns and I think that's just what he's seeing--the FIFO is getting
swamped before the CPU processes the interrupt.

> The HA serial data rate is pretty low and should not be a problem.

It can be if it's bursty.  If a burst is over 16 bytes and the machine
doesn't service the interrupt in time...bad things will happen.
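
To put a rough number on "in time", here is a simplified model in
Python.  It assumes 8N1 framing, a handler that drains the FIFO as soon
as it runs, and it ignores the receiver shift register and the
character-timeout interrupt.  The 16550A can raise its receive interrupt
at 1, 4, 8 or 14 bytes; which level the driver picks on these boxes is
not stated in this thread, so 8 and 14 below are only examples, and the
function names are made up for illustration.

```python
FIFO_DEPTH = 16      # 16550A receive FIFO size in bytes
BITS_PER_CHAR = 10   # 8N1 framing (assumption)


def latency_budget_us(baud, trigger_level):
    """Microseconds between the receive interrupt firing and a full FIFO."""
    char_time = BITS_PER_CHAR / baud * 1_000_000
    return (FIFO_DEPTH - trigger_level) * char_time


def burst_overruns(burst_len, baud, trigger_level, service_delay_us):
    """True if a back-to-back burst overflows before the handler runs.

    Approximate model: the handler empties the FIFO instantly once it
    finally gets the CPU, service_delay_us after the interrupt fired.
    """
    if burst_len <= FIFO_DEPTH:
        return False   # the whole burst fits in the FIFO
    return service_delay_us > latency_budget_us(baud, trigger_level)


for trigger in (8, 14):
    budget = latency_budget_us(38400, trigger)
    print(f"trigger at {trigger:2d} bytes: {budget:7.1f} us of slack at 38400 baud")
```

Under these assumptions a 14-byte trigger at 38400 baud leaves only about
half a millisecond before data is lost, and dropping the baud rate
stretches that budget proportionally.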

> Can minicom transfer a file between the two servers via the serial
> ports?

Better, "can it do it while you're really flogging the CPU somehow?"
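
For a manual test like that, the port settings suggested earlier in the
thread (lower baud rate, RTS/CTS on, XON/XOFF off) could be applied with
something like the sketch below.  It uses Python's standard termios
module on Linux.  The helper names, the 9600 default and /dev/ttyS0 as
the default path are assumptions for illustration; Heartbeat sets up the
port itself when it runs, so this only matters for testing by hand.

```python
import os
import termios

# Index positions in the list returned by termios.tcgetattr()
IFLAG, OFLAG, CFLAG, LFLAG, ISPEED, OSPEED, CC = range(7)


def heartbeat_attrs(attrs, speed=termios.B9600):
    """Return a copy of attrs with hardware flow control and a new speed."""
    new = list(attrs)
    new[CC] = list(attrs[CC])
    # Software (XON/XOFF) flow control off in both directions
    new[IFLAG] &= ~(termios.IXON | termios.IXOFF)
    # Hardware (RTS/CTS) flow control on
    new[CFLAG] |= termios.CRTSCTS
    new[ISPEED] = speed
    new[OSPEED] = speed
    return new


def open_configured_port(path="/dev/ttyS0", speed=termios.B9600):
    """Open the serial port, apply the settings, and return the fd.

    The caller is responsible for closing the returned descriptor.
    """
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY)
    try:
        current = termios.tcgetattr(fd)
        termios.tcsetattr(fd, termios.TCSANOW, heartbeat_attrs(current, speed))
    except Exception:
        os.close(fd)
        raise
    return fd
```

Both ends need matching settings and a cable that actually wires
RTS and CTS through, otherwise hardware flow control has nothing to
act on.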
----------------------------------------------------------------------
- Rick Stevens, Senior Systems Engineer     rstevens@xxxxxxxxxxxxxxx -
- VitalStream, Inc.                       http://www.vitalstream.com -
-                                                                    -
-           This message printed using recycled bandwidth            -
----------------------------------------------------------------------

