Fedora Users — Re: xinetd delays in in.rshd responses (cluster problem, long)

Tim Prendergast wrote:

I have somewhat of a complex issue, hoping someone here may have some
insight.

I have a beowulf cluster of systems for a scientific application we run.
This cluster consists of 32 diskless slaves running FC4 w/ a monolithic
kernel, and a master node running FC4 with a custom kernel. The master
issues rsh commands to the slaves to grab a chunk of a job and process it,
then return it.

My issue is this -- we have an old RH9 cluster that is similar in design,
and the rsh commands (measured using `time rsh node2 uname -a`) take
around 0.050s to complete. On the FC4 system, the same command takes
0.650-1.35s to complete. I've traced the delay to xinetd or in.rshd, but
am at a loss going any further. I've run some straces and I can see the
delay occur. I've pasted the straces below for reference.

Does anyone have any idea why this delay is happening? These systems are
all wired up over gig-e (0.1-0.2ms pings round trip) and running dual 3.4ghz
Xeons w/ 2mb cache and 1gb mem in each slave, 4gb mem in the master. There
is a lot of processing power here, so I can't see a reason for the delay.

The PAM, rhosts, hosts.equiv, etc are all identical among the nodes (and
the clusters).

[snip]
========================

Here you can clearly see the delay happen:
<cut and paste section of interest from above>

16:18:55.297196 writev(3, [{"root\0", 5}, {"root\0", 5}, {"uname -a\0", 9}], 3) = 19 <0.000038>
16:18:55.297332 read(3, "\0", 1)        = 1 <0.632181>
16:18:55.929633 rt_sigprocmask(SIG_SETMASK, [], [URG], 8) = 0 <0.000039>
16:18:55.929763 setuid32(0)             = 0 <0.000039>
<end cut and paste>

It looks like it takes .63s to write the data to the socket and get the
response, which I find hard to fathom (especially since anything outside of
xinetd's realm appears to be really fast over the network).
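
For a longer trace, the slow calls can be picked out mechanically rather than by eye. This is only a minimal sketch in Python 3, assuming the trace has the timestamp and `<seconds>` layout shown above (which looks like the output of `strace -tt -T`). The function name `slow_calls` and the 0.1s threshold are illustrative choices, not part of any tool.

```python
import re

# One syscall per line: wall-clock timestamp, call name, and the elapsed
# time in angle brackets at the end of the line.
LINE_RE = re.compile(r'^(\d{2}:\d{2}:\d{2}\.\d+)\s+(\w+)\(.*<(\d+\.\d+)>$')


def slow_calls(lines, threshold=0.1):
    """Return (timestamp, syscall, seconds) for calls at or above threshold."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and float(match.group(3)) >= threshold:
            hits.append((match.group(1), match.group(2), float(match.group(3))))
    return hits


# The four lines quoted above, used as sample input.
SAMPLE = r"""
16:18:55.297196 writev(3, [{"root\0", 5}, {"root\0", 5}, {"uname -a\0", 9}], 3) = 19 <0.000038>
16:18:55.297332 read(3, "\0", 1)        = 1 <0.632181>
16:18:55.929633 rt_sigprocmask(SIG_SETMASK, [], [URG], 8) = 0 <0.000039>
16:18:55.929763 setuid32(0)             = 0 <0.000039>
"""

if __name__ == "__main__":
    for stamp, name, seconds in slow_calls(SAMPLE.splitlines()):
        print(stamp, name, seconds)
```

On the sample, only the `read` on fd 3 is reported. The `writev` itself returns in microseconds, so the 0.63s is time spent waiting for the server's one-byte reply.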

Thoughts?

-Tim

Can you strace the server side, and also capture the packets on both
interfaces to see what times they are sent/received?

The delay comes when the client sends the login information with the
username. It could be a delay in mapping a username to a userid. What
authentication mechanism is running on both systems? Is nscd running to
cache the user information on both systems, or does it require a network
lookup from NIS, LDAP etc?
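
One rough way to check the username-to-userid mapping on a slave is to time the lookup directly. The sketch below is only an illustration in Python 3. It assumes the standard `pwd` module, which on glibc systems resolves names through whatever sources nsswitch.conf lists. The function name `time_user_lookup` and the repeat count are hypothetical.

```python
import pwd
import time


def time_user_lookup(username="root", repeats=5):
    """Time repeated passwd lookups; returns a list of seconds per call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Raises KeyError if the user does not exist on this host.
        pwd.getpwnam(username)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    for index, seconds in enumerate(time_user_lookup("root"), start=1):
        print(f"lookup {index}: {seconds:.6f}s")
```

If the first call is slow and later ones are fast, some cache (nscd, for example) is probably answering the repeats. If every call takes a few hundred milliseconds, the lookup itself is a likely suspect. Fast lookups here do not rule out other causes inside in.rshd.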

--
Nigel Wade, System Administrator, Space Plasma Physics Group,
            University of Leicester, Leicester, LE1 7RH, UK
E-mail :    nmw@xxxxxxxxxxxx
Phone :     +44 (0)116 2523548, Fax : +44 (0)116 2523555