Re: xinetd delays in in.rshd responses (cluster problem, long)

Tim Prendergast wrote:
I have a somewhat complex issue, and I'm hoping someone here may have some
insight.

I have a Beowulf cluster of systems for a scientific application we run.
The cluster consists of 32 diskless slaves running FC4 with a monolithic
kernel, and a master node running FC4 with a custom kernel. The master
issues rsh commands to the slaves to grab a chunk of a job, process it,
and return the result.

My issue is this -- we have an old RH9 cluster that is similar in design,
and the rsh commands (measured using `time rsh node2 uname -a`) take
around 0.050s to complete. On the FC4 cluster, the same command takes
around 0.650-1.35s. I've traced the delay to xinetd or in.rshd, but am at
a loss as to how to narrow it down any further. I've run some straces and
can see the delay occur; I've pasted them below for reference.
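(The timestamps and the numbers in angle brackets are per-syscall durations;
output in that format comes from strace's -tt and -T flags, e.g. something
along the lines of

  strace -f -tt -T -o rsh.trace rsh node2 uname -a

run on the master -- the exact invocation here is just a sketch.)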

Does anyone have any idea why this delay is happening? These systems are all
wired up over gig-e (0.1-0.2ms round-trip pings) and are running dual 3.4GHz
Xeons with 2MB cache and 1GB of memory in each slave, 4GB in the master.
There is a lot of processing power here, so I can't see a reason for the
delay. The PAM, rhosts, hosts.equiv, etc. configurations are all identical
among the nodes (and the clusters).

[snip]
========================

Here you can clearly see the delay happen:
<cut and paste section of interest from above>
16:18:55.297196 writev(3, [{"root\0", 5}, {"root\0", 5}, {"uname -a\0", 9}],
3) = 19 <0.000038>
16:18:55.297332 read(3, "\0", 1)        = 1 <0.632181>
16:18:55.929633 rt_sigprocmask(SIG_SETMASK, [], [URG], 8) = 0 <0.000039>
16:18:55.929763 setuid32(0)             = 0 <0.000039>
<end cut and paste>

It looks like the writev() of the login data returns almost immediately, but
the following read() on the socket then blocks for 0.63s waiting for the
response, which I find hard to fathom (especially since anything outside of
xinetd's realm appears to be really fast over the network).

Thoughts?

-Tim


Can you strace the server side, and also capture the packets on both interfaces to see at what times they are sent and received?
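Something along these lines should do it (assuming xinetd is what spawns in.rshd on the slaves, eth0 is the gig-e interface, and the file names are just placeholders):

  # on the slave: attach to xinetd, follow the in.rshd children it forks,
  # and log per-syscall timestamps and durations
  strace -f -tt -T -p $(pidof xinetd) -o /tmp/rshd.strace

  # on both the master and the slave: print the rsh (shell service, TCP 514)
  # traffic with full timestamps
  tcpdump -i eth0 -nn -tttt port 514

Run the `time rsh node2 uname -a` test again while those are running and compare where the 0.6s gap shows up -- between the client's write and the packet leaving the master, on the wire, or inside in.rshd on the slave.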

The delay comes when the client sends the login information with the username. It could be a delay in mapping the username to a user ID. What authentication mechanism is running on both systems? Is nscd running to cache the user information on both systems, or does each lookup require a network query to NIS, LDAP, etc.?
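A quick way to compare the two clusters (assuming getent and the nscd init script are present on both):

  grep '^passwd' /etc/nsswitch.conf   # do passwd lookups go to files, nis, ldap...?
  time getent passwd root             # time one lookup through the normal NSS path
  service nscd status                 # is the name service cache daemon running?

If the getent call is noticeably slower on the FC4 nodes than on the RH9 ones, that points at the lookup rather than at xinetd itself.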

--
Nigel Wade, System Administrator, Space Plasma Physics Group,
            University of Leicester, Leicester, LE1 7RH, UK
E-mail :    nmw@xxxxxxxxxxxx
Phone :     +44 (0)116 2523548, Fax : +44 (0)116 2523555

