Rick Stevens wrote:
On Fri, 2006-05-26 at 13:56 -0400, Bob Chiodini wrote:
On Fri, 2006-05-26 at 10:22 -0700, Rick Stevens wrote:
On Fri, 2006-05-26 at 09:27 -0400, Randy Grimshaw wrote:
I am trying to run a Linux high availability cluster (failover pair)
using serial as one of the heartbeats.
Due to numerous serial over-runs the systems are actually crashing
periodically.
This is a very frustrating development for a system intended to provide
HA. (certainly not ha ha ha).
I have updated to the latest BIOS.
I have checked RTS DTS XON XOFF etc.
This is happening with the stock and custom kernels.
This is happening on three pairs of servers.
The serial ports are detected as:
Serial: 8250/16550 driver $Revision: 1.90 $ 32 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
Any advice would be greatly appreciated.
The most common problem with overruns is running too high a baud rate.
Remember, 16550s only have a 16-byte buffer in them. At 38,400 baud,
you'll fill that buffer in about 260 microseconds. 9600 baud will fill
the buffer in a tiny bit over 1 millisecond. Flow control tries to
prevent overflows.
Without flow control and if the machine is busy, the interrupt from the
chip may not be serviced in time and you'll miss data because you've
filled the buffer. Dropping the baud rate down should help, and make
sure you use hardware (RTS/CTS) flow control. Remember that software
(XON/XOFF) flow control requires the CPU to watch the buffer and send an
XOFF when it gets full. You're already overrunning the buffer...
software flow control won't help.
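
For anyone who wants to set this from a program rather than from the HA
configuration, here is a minimal sketch of turning on RTS/CTS and turning off
XON/XOFF with Python's termios module. The helper name
with_hardware_flow_control is made up for illustration, and /dev/ttyS0 in the
commented usage is only assumed from the ttyS0 line in the boot messages above.
It only changes the port settings; the cable still has to carry RTS and CTS.

```python
import termios


def with_hardware_flow_control(attrs):
    """Return a copy of a termios attribute list with RTS/CTS enabled
    and software (XON/XOFF) flow control disabled.

    attrs is the list returned by termios.tcgetattr():
    [iflag, oflag, cflag, lflag, ispeed, ospeed, cc]
    """
    new = list(attrs)
    # iflag: clear software flow control bits
    new[0] &= ~(termios.IXON | termios.IXOFF | termios.IXANY)
    # cflag: enable hardware (RTS/CTS) flow control
    new[2] |= termios.CRTSCTS
    return new


# Hypothetical usage on a real port (device path is an assumption):
#   import os
#   fd = os.open("/dev/ttyS0", os.O_RDWR | os.O_NOCTTY)
#   termios.tcsetattr(fd, termios.TCSANOW,
#                     with_hardware_flow_control(termios.tcgetattr(fd)))
```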
Heartbeat stuff between nodes in a cluster is NOT a place to try to
scrimp and save money! NICs are relatively cheap, after all; they have
much bigger buffers in them and they use DMA to transfer data to the
processor instead of one-byte-at-a-time over the I/O ports. Frankly,
NICs are far more reliable--especially for something this critical.
At 8N1 and 38400 bits/second, that would be 3840 bytes/second or 240
"FIFO fills" per second or 4.17 ms to fill the entire FIFO.
Oops! Yup, 260 microseconds/character. Forgot to multiply by 16.
Doh! :-( Hey! It's Friday and it's been a l-o-n-g week.
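
The arithmetic in this exchange can be checked with a few lines of Python. This
is just a sketch: it assumes 8N1 framing (10 bits on the wire per character:
start + 8 data + stop) and the 16-byte FIFO mentioned above. The function name
fifo_fill_time is made up for illustration.

```python
def fifo_fill_time(baud, fifo_bytes=16, bits_per_char=10):
    """Return (seconds per character, seconds to fill the FIFO).

    bits_per_char=10 assumes 8N1 framing: 1 start + 8 data + 1 stop bit.
    """
    char_time = bits_per_char / baud
    return char_time, char_time * fifo_bytes


for baud in (9600, 38400, 115200):
    char_s, fifo_s = fifo_fill_time(baud)
    print(f"{baud:>6} baud: {char_s * 1e6:7.1f} us/char, "
          f"{fifo_s * 1e3:6.2f} ms to fill 16 bytes")
```

At 38400 this gives roughly 260 microseconds per character and about 4.17 ms
for the full FIFO, which agrees with the corrected figures.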
It sounds like something more is broken here. My old 486 running Linux
seemed to do better than that.
I don't think so. Think how choppy things can get on a terminal
emulation when the machine gets busy. Besides, the OP mentioned
overruns and I think that's just what he's seeing--the FIFO is getting
swamped before the CPU processes the interrupt.
The HA serial data rate is pretty low and should not be a problem.
Not if it's bursty. If it's over 16 bytes and the machine doesn't
service the interrupt...bad things will happen.
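
To put rough numbers on the "bursty" case, here is a back-of-the-envelope model
in Python. It is a simplification, not a description of what the 8250 driver
actually does: it assumes the UART raises its receive interrupt when the FIFO
holds `trigger` bytes (8 is an assumed value here, not something reported in
this thread), that the CPU takes `latency_s` seconds to respond, and it ignores
the receiver shift register and the time the handler spends draining the FIFO.
The function name bytes_lost is hypothetical.

```python
def bytes_lost(burst_bytes, latency_s, baud,
               fifo_bytes=16, trigger=8, bits_per_char=10):
    """Estimate bytes dropped from one back-to-back burst.

    Model: the interrupt fires once `trigger` bytes are in the FIFO,
    and nothing is read until `latency_s` seconds later. Bytes that
    arrive in that window beyond the remaining FIFO space are lost.
    """
    # Bytes that can arrive at line rate while the interrupt is pending
    arriving = int(latency_s * baud / bits_per_char)
    # ...but never more than what is left of the burst
    arriving = min(max(burst_bytes - trigger, 0), arriving)
    headroom = fifo_bytes - trigger
    return max(0, arriving - headroom)


# A burst that fits in the FIFO never overruns, however late the CPU is
print(bytes_lost(burst_bytes=16, latency_s=1.0, baud=38400))
# A 64-byte burst with 1 ms vs 5 ms of interrupt latency
print(bytes_lost(burst_bytes=64, latency_s=0.001, baud=38400))
print(bytes_lost(burst_bytes=64, latency_s=0.005, baud=38400))
```

Under these assumptions a burst of 16 bytes or fewer is always safe, and a
longer burst starts losing data once the interrupt latency exceeds the time it
takes to fill the remaining FIFO space.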
Can minicom transfer a file between the two servers via the serial
ports?
Better, "can it do it while you're really flogging the CPU somehow?"
Shouldn't RTS/CTS flow control take care of this problem? Are the
correct pins wired in your serial cable?
And just how many bytes do you need to implement a heartbeat? Seems
like 1 a second would get the job done.
Regards,
John