On Fri, 2006-05-26 at 10:22 -0700, Rick Stevens wrote:
> On Fri, 2006-05-26 at 09:27 -0400, Randy Grimshaw wrote:
> >
> > I am trying to run a linux high availability cluster (failover pair)
> > using serial as one of the heartbeats.
> >
> > Due to numerous serial over-runs the systems are actually crashing
> > periodically.
> >
> > This is a very frustrating development for a system intended to provide
> > HA. (certainly not ha ha ha).
> >
> > I have updated to the latest bios.
> > I have checked RTS DTS XON XOFF etc.
> > This is happening with the stock and custom kernels.
> > This is happening on three pairs of servers.
> > The serial ports are detected as:
> >   Serial: 8250/16550 driver $Revision: 1.90 $ 32 ports, IRQ sharing enabled
> >   serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
> >
> >
> > Any advice would be greatly appreciated.
>
> The most common problem with overruns is running too high a baud rate.
> Remember, 16550s only have a 16-byte buffer in them. At 38,400 baud,
> you'll fill that buffer in about 260 microseconds. 9600 baud will fill
> the buffer in a tiny bit over 1 millisecond. Flow control tries to
> prevent overflows.
>
> Without flow control and if the machine is busy, the interrupt from the
> chip may not be serviced in time and you'll miss data because you've
> filled the buffer. Dropping the baud rate down should help, and make
> sure you use hardware (RTS/CTS) flow control. Remember that software
> (XON/XOFF) flow control requires the CPU to watch the buffer and send an
> XOFF when it gets full. You're already overrunning the buffer...
> software flow control won't help.
>
> Heartbeat stuff between nodes in a cluster is NOT a place to try to
> scrimp and save money! NICs are relatively cheap after all, they have
> much bigger buffers in them and they use DMA to transfer data to the
> processor instead of one-byte-at-a-time over the I/O ports. Frankly,
> NICS are far more reliable--especially for something this critical.

At 8N1 and 38400 bits/second, that would be 3840 bytes/second, or 240
"FIFO fills" per second, or 4.17 ms to fill the entire FIFO.

It sounds like something more is broken here. My old 486 running Linux
seemed to do better than that. The HA serial data rate is pretty low and
should not be a problem.

Can minicom transfer a file between the two servers via the serial
ports?

Bob...
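
For anyone who wants to check the arithmetic, here is a rough Python sketch of the 8N1 timing. It assumes 10 bits on the wire per byte (1 start + 8 data + 1 stop), a 16-byte FIFO, and a sender transmitting back to back at full line rate. The function names are made up for illustration; nothing here comes from the heartbeat software or the kernel driver.

```python
BITS_PER_BYTE_8N1 = 10   # assumption: 1 start + 8 data + 1 stop bit
FIFO_BYTES = 16          # 16550A receive FIFO depth


def byte_time_us(baud, bits_per_byte=BITS_PER_BYTE_8N1):
    """Microseconds needed to clock one byte onto the wire."""
    return bits_per_byte / baud * 1_000_000


def fifo_fill_time_ms(baud, fifo_bytes=FIFO_BYTES,
                      bits_per_byte=BITS_PER_BYTE_8N1):
    """Milliseconds to fill the whole FIFO at continuous full line rate."""
    bytes_per_second = baud / bits_per_byte
    return fifo_bytes / bytes_per_second * 1000


if __name__ == "__main__":
    for baud in (9600, 38400):
        print(f"{baud} baud: {byte_time_us(baud):.0f} us per byte, "
              f"{fifo_fill_time_ms(baud):.2f} ms to fill the FIFO")
```

At 38400 this gives about 260 us per byte and about 4.17 ms for the full FIFO, so the 260 microsecond figure quoted above looks like the time for a single byte rather than for all 16. At 9600 it works out to roughly 1 ms per byte and 16.67 ms per FIFO. These are worst-case numbers for a saturated line; the real interrupt deadline also depends on the FIFO trigger level the driver has configured, which this sketch does not model.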