Tim Prendergast wrote:
I have a somewhat complex issue and am hoping someone here may have some
insight.
I have a Beowulf cluster of systems for a scientific application we run.
This cluster consists of 32 diskless slaves running FC4 w/ a monolithic
kernel, and a master node running FC4 with a custom kernel. The master
issues rsh commands to the slaves to grab a chunk of a job and process it,
then return it.
My issue is this -- we have an old RH9 cluster that is similar in design,
and there the rsh commands (measured using `time rsh node2 uname -a`) take
around 0.050s to complete. On the FC4 system, the same command takes around
0.650-1.35s. I've traced the delay to xinetd or in.rshd, but am at a loss
going any further. I've run some straces and I can see the delay occur.
I've pasted the straces below for reference.
Does anyone have any idea why this delay is happening? These systems are
all wired up over gig-e (0.1-0.2ms round-trip pings) and running dual 3.4GHz
Xeons w/ 2MB cache and 1GB of memory in each slave, 4GB in the master. There
is a lot of processing power here, so I can't see a reason for the delay.
The PAM, rhosts, hosts.equiv, etc. are all identical among the nodes (and
the clusters).
[snip]
========================
Here you can clearly see the delay happen:
<cut and paste section of interest from above>
16:18:55.297196 writev(3, [{"root\0", 5}, {"root\0", 5}, {"uname -a\0", 9}], 3) = 19 <0.000038>
16:18:55.297332 read(3, "\0", 1) = 1 <0.632181>
16:18:55.929633 rt_sigprocmask(SIG_SETMASK, [], [URG], 8) = 0 <0.000039>
16:18:55.929763 setuid32(0) = 0 <0.000039>
<end cut and paste>
It looks like it takes 0.63s after writing the data to the socket to get the
response back, which I find hard to fathom (especially since anything outside
of xinetd's realm appears to be really fast over the network).
Thoughts?
-Tim
Can you strace the server side, and also capture the packets on both
interfaces to see at what times they are sent and received?
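Once you have traces from both ends, a small script can pick out the slow
calls so the two sides can be lined up by timestamp. This is only a rough
sketch: it assumes the traces look like the excerpt you posted (a wall-clock
timestamp at the start of the line and the elapsed time in angle brackets at
the end, as strace prints with -tt and -T). The function name and the 0.1s
threshold are my own choices, not anything standard.

```python
import re

# Matches lines such as:
#   16:18:55.297332 read(3, "\0", 1) = 1 <0.632181>
# An optional pid prefix ("1234 " or "[pid 1234] ") is tolerated, since
# strace adds one when following child processes.
LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"
    r"(?P<stamp>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<call>\w+)\(.*<(?P<elapsed>\d+\.\d+)>\s*$"
)


def slow_calls(lines, threshold=0.1):
    """Return (timestamp, syscall, seconds) for each call at or above threshold."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # signals, unfinished calls, blank lines, etc.
        elapsed = float(match.group("elapsed"))
        if elapsed >= threshold:
            hits.append((match.group("stamp"), match.group("call"), elapsed))
    return hits


# The client-side excerpt from the original post, used as sample input.
SAMPLE = [
    r'16:18:55.297196 writev(3, [{"root\0", 5}, {"root\0", 5}, {"uname -a\0", 9}], 3) = 19 <0.000038>',
    r'16:18:55.297332 read(3, "\0", 1) = 1 <0.632181>',
    r'16:18:55.929633 rt_sigprocmask(SIG_SETMASK, [], [URG], 8) = 0 <0.000039>',
    r'16:18:55.929763 setuid32(0) = 0 <0.000039>',
]

if __name__ == "__main__":
    for stamp, call, seconds in slow_calls(SAMPLE):
        print(f"{stamp} {call} took {seconds:.6f}s")
```

On the sample, only the read() is reported, which says the client is simply
waiting; the server-side trace should show what in.rshd is doing during that
same window.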
The delay comes when the client sends the login information with the username.
It could be a delay in mapping a username to a userid. What authentication
mechanism is running on both systems? Is nscd running to cache the user
information on both systems, or does it require a network lookup from NIS,
LDAP, etc.?
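A quick way to test the lookup theory is to time the username-to-uid mapping
directly on a slave from each cluster. The sketch below uses Python's
standard pwd module, which goes through the same C library lookup and so
follows whatever nsswitch.conf says (files, NIS, LDAP, nscd). The function
name and repeat count are just illustrative, and I am assuming "root" is the
user of interest because that is what appears in your trace.

```python
import pwd
import time


def time_lookup(username, repeats=5):
    """Time repeated passwd lookups for username; return a list of seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        try:
            pwd.getpwnam(username)
        except KeyError:
            # Unknown user: the failed lookup still exercises the name service.
            pass
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    for i, seconds in enumerate(time_lookup("root"), start=1):
        print(f"lookup {i}: {seconds * 1000:.3f} ms")
```

The first lookup can be slower than the rest because the name service
modules are loaded on first use, so compare the later ones. If they take
hundreds of milliseconds on FC4 but not on RH9, the name service is the
likely culprit; if they are fast on both, the delay is somewhere else.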
--
Nigel Wade, System Administrator, Space Plasma Physics Group,
University of Leicester, Leicester, LE1 7RH, UK
E-mail : nmw@xxxxxxxxxxxx
Phone : +44 (0)116 2523548, Fax : +44 (0)116 2523555