On Fri, 2006-05-26 at 13:56 -0400, Bob Chiodini wrote:
> On Fri, 2006-05-26 at 10:22 -0700, Rick Stevens wrote:
> > On Fri, 2006-05-26 at 09:27 -0400, Randy Grimshaw wrote:
> > >
> > > I am trying to run a linux high availability cluster (failover pair)
> > > using serial as one of the heartbeats.
> > >
> > > Due to numerous serial over-runs the systems are actually crashing
> > > periodically.
> > >
> > > This is a very frustrating development for a system intended to provide
> > > HA. (certainly not ha ha ha).
> > >
> > > I have updated to the latest bios.
> > > I have checked RTS DTS XON XOFF etc.
> > > This is happening with the stock and custom kernels.
> > > This is happening on three pairs of servers.
> > > The serial ports are detected as:
> > > Serial: 8250/16550 driver $Revision: 1.90 $ 32 ports, IRQ
> > > sharing enabled
> > > serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
> > >
> > >
> > > Any advice would be greatly appreciated.
> >
> > The most common problem with overruns is running too high a baud rate.
> > Remember, 16550s only have a 16-byte buffer in them. At 38,400 baud,
> > you'll fill that buffer in about 260 microseconds. 9600 baud will fill
> > the buffer in a tiny bit over 1 millisecond. Flow control tries to
> > prevent overflows.
> >
> > Without flow control and if the machine is busy, the interrupt from the
> > chip may not be serviced in time and you'll miss data because you've
> > filled the buffer. Dropping the baud rate down should help, and make
> > sure you use hardware (RTS/CTS) flow control. Remember that software
> > (XON/XOFF) flow control requires the CPU to watch the buffer and send an
> > XOFF when it gets full. You're already overrunning the buffer...
> > software flow control won't help.
> >
> > Heartbeat stuff between nodes in a cluster is NOT a place to try to
> > scrimp and save money! NICs are relatively cheap after all, they have
> > much bigger buffers in them and they use DMA to transfer data to the
> > processor instead of one-byte-at-a-time over the I/O ports. Frankly,
> > NICs are far more reliable--especially for something this critical.
> >
>
> At 8N1 and 38400 bits/second, that would be 3840 bytes/second or 240
> "FIFO fills" per second or 4.17 ms to fill the entire FIFO.

Oops! Yup, 260 microseconds/character. Forgot to multiply by 16.
Doh! :-( Hey! It's Friday and it's been a l-o-n-g week.

> It sounds like something more is broken here. My old 486 running Linux
> seemed to do better than that.

I don't think so. Think how choppy things can get on a terminal
emulation when the machine gets busy. Besides, the OP mentioned
overruns and I think that's just what he's seeing--the FIFO is getting
swamped before the CPU processes the interrupt.

> The HA serial data rate is pretty low and should not be a problem.

Not if it's bursty. If it's over 16 bytes and the machine doesn't
service the interrupt...bad things will happen.

> Can minicom transfer a file between the two servers via the serial
> ports?

Better, "can it do it while you're really flogging the CPU somehow?"

----------------------------------------------------------------------
- Rick Stevens, Senior Systems Engineer     rstevens@xxxxxxxxxxxxxxx -
- VitalStream, Inc.                       http://www.vitalstream.com -
-                                                                    -
-          This message printed using recycled bandwidth             -
----------------------------------------------------------------------
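
For anyone who wants to check the FIFO arithmetic in this thread, here is a minimal sketch. It assumes 8N1 framing (10 bits on the wire per character: 1 start, 8 data, 1 stop) and the 16-byte FIFO mentioned above. The function names are made up for illustration, and the result is only a rough upper bound on how long the CPU has to service the interrupt.

```python
def char_time_s(baud, bits_per_char=10):
    """Seconds to receive one character (8N1 framing is 10 bits per character)."""
    return bits_per_char / baud


def fifo_fill_time_s(baud, fifo_bytes=16, bits_per_char=10):
    """Seconds for a continuous burst to fill an empty FIFO of fifo_bytes."""
    return fifo_bytes * char_time_s(baud, bits_per_char)


if __name__ == "__main__":
    # Baud rates chosen for illustration; 9600 and 38400 are the ones discussed.
    for baud in (9600, 38400, 115200):
        per_char_us = char_time_s(baud) * 1e6
        fill_ms = fifo_fill_time_s(baud) * 1e3
        print(f"{baud:>6} baud: {per_char_us:7.1f} us/char, "
              f"{fill_ms:6.2f} ms to fill a 16-byte FIFO")
```

At 38400 baud this gives roughly 260 microseconds per character and about 4.17 ms for the whole FIFO, which matches the corrected figures above.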